In Spark or PySpark, you can use show(n) to get the top or first N (5, 10, 100, ...) rows of a DataFrame and display them on the console or in a log file. You can also use Spark actions such as take(), head(), and first() to get the first N rows as a list of Row (Array[Row] in Scala).
Spark actions return the result to the Spark driver, so be careful when extracting large datasets. Collecting a dataset larger than the driver's memory throws an OutOfMemoryError and the job fails.
Related: How to Get First Row of Each Group in Spark
1. Show Top N Rows in Spark/PySpark
The following actions get the top/first N rows from a Spark DataFrame. Except for show(), these actions return a list of Row in PySpark and an Array[Row] in Spark with Scala.
| Action | Description | Return |
|---|---|---|
| show() | Prints/shows the top 20 rows in tabular form | PySpark – no return; Scala – Unit |
| show(n) | Prints/shows the top N rows in tabular form | PySpark – no return; Scala – Unit |
| take(n) | Returns the top N rows | PySpark – list of Row; Scala – Array[Row] |
| takeAsList(n) (Scala only) | Returns the top N rows | Scala – java.util.List[Row] |
| first() | Returns the first row | PySpark – Row; Scala – Row |
| head() | Returns the first row | PySpark – Row; Scala – Row |
| head(n) | Returns the top N rows, similar to take(n) | PySpark – list of Row; Scala – Array[Row] |
| collect() | Returns the entire dataset | PySpark – list of Row; Scala – Array[Row] |
Note: take(), first(), and head() internally apply the limit() transformation and then call the collect() action to gather the data.
Let’s create a sample DataFrame, which I will use to explain the above functions for getting the first or top N rows from a Spark/PySpark DataFrame.
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Create SparkDataFrame
simpleData = [("James",34),("Ann",34),
("Michael",33),("Scott",53),
("Robert",37),("Chad",27)
]
columns = ["firstname","age",]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.show()
# Output to console
# +---------+---+
# |firstname|age|
# +---------+---+
# |    James| 34|
# |      Ann| 34|
# |  Michael| 33|
# |    Scott| 53|
# |   Robert| 37|
# |     Chad| 27|
# +---------+---+
2. Show Top N Rows from DataFrame Examples
2.1 Using show()
The Spark DataFrame show() method displays the first 20 rows, truncating column values at 20 characters. It renders the contents of the DataFrame in a tabular row-and-column format. Pass a numeric argument to display the top N rows.
# Get top 2 rows to console
df.show(2)
# Output:
# +---------+---+
# |firstname|age|
# +---------+---+
# |    James| 34|
# |      Ann| 34|
# +---------+---+
# only showing top 2 rows
2.2 Using take()
The take() action retrieves the first N elements from a Spark DataFrame or RDD (Resilient Distributed Dataset) as a list of Row objects, letting you inspect a small subset of the data without retrieving the entire dataset.
# Get first 2 rows using take()
print(df.take(2))
# Output:
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]
2.3 Using first()
The first() action returns the first row of the DataFrame as a Row object.
# Gets only first row
print(df.first())
# Output:
# Row(firstname='James', age=34)
2.4 Using head()
The Spark head() method returns the first N rows from the DataFrame. Called without an argument, it returns a single Row. When you pass n, it returns a list of Row objects, similar to take(n).
# Returns 1 row
print(df.head())
# Row(firstname='James', age=34)

# Returns first N rows
print(df.head(2))
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]
3. Get Top N Rows using limit()
In PySpark, limit() is a DataFrame transformation that returns a new DataFrame with the top N rows. In Spark with Scala/Java, it returns a Dataset.
| Transformation | Description | Return |
|---|---|---|
| limit(n) | Returns the top N rows | PySpark – Returns a new DataFrame; Scala – Returns a new Dataset |
Example:
# Get DataFrame with 2 rows
df.limit(2).show()
# Output:
# +---------+---+
# |firstname|age|
# +---------+---+
# | James| 34|
# | Ann| 34|
# +---------+---+
4. Get Top First N Rows to Pandas DataFrame
While working with Python and Spark, we often need to convert between Pandas and PySpark DataFrames. You can limit the number of records when converting back to Pandas; the snippet below takes the first 3 records from the DataFrame and converts them to a Pandas DataFrame.
# Convert to Pandas DataFrame
pandasDF=df.limit(3).toPandas()
print(pandasDF)
# Output Pandas DataFrame
# firstname age
# 0 James 34
# 1 Ann 34
# 2 Michael 33
5. Conclusion
In this article, you have learned that show() displays the top N records of a Spark or PySpark DataFrame, that take() and head() return the first N records as a list of Row (Array[Row] in Scala), and that limit() is a transformation that returns the top N rows as a new DataFrame/Dataset.
Related Articles
- PySpark Count Distinct from DataFrame
- PySpark Groupby Count Distinct
- PySpark – Find Count of null, None, NaN Values
- PySpark Select Top N Rows From Each Group
- PySpark Find Maximum Row per Group in DataFrame
- PySpark show() – Display DataFrame Contents in Table
Reference
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html