In Spark/PySpark, you can use the show() action to get the top/first N rows (5, 10, 100, ...) of a DataFrame and display them on the console or in a log. Spark also provides several other actions, such as take(), tail(), collect(), head(), and first(), that return the top or last N rows as a list of Row objects (Array[Row] in Scala). All of these actions bring the result back to the Spark Driver, so be careful when extracting large datasets: collecting a dataset larger than the Driver's memory raises an OutOfMemoryError and the job fails.
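Here is a minimal sketch of that driver-memory concern; the app name and row count are arbitrary illustration values, not taken from the example later in this article.
# Sketch: prefer bounded actions over collect() on large data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('driver-safety').getOrCreate()
bigDF = spark.range(1000000)  # stand-in for a large DataFrame

# Bounded: only 5 rows ever travel to the Spark Driver
print(bigDF.take(5))

# Unbounded: every row is pulled to the Driver; on data larger than
# the Driver's memory this is the call that raises OutOfMemoryError
# allRows = bigDF.collect()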
Related: How to Get First Row of Each Group in Spark
1. Show Top N Rows in Spark/PySpark
The following actions get the top/first N rows from a DataFrame. Except for show(), these actions return a list of Row objects in PySpark and an Array[Row] in Spark with Scala. In PySpark, a common pattern is to get the first N records and then convert the PySpark DataFrame to pandas (see section 5).
Action | Description | Return |
---|---|---|
show() | Prints the top 20 rows in a tabular form | PySpark: no return value; Scala: Unit |
show(n) | Prints the top N rows in a tabular form | PySpark: no return value; Scala: Unit |
take(n) | Returns the top N rows | PySpark: list of Row; Scala: Array[Row] (Scala also has takeAsList(n), which returns a java.util.List[Row]) |
first() | Returns the first row | PySpark: Row; Scala: Row |
head() | Returns the first row | PySpark: Row; Scala: Row |
head(n) | Returns the top N rows, similar to take(n) | PySpark: list of Row; Scala: Array[Row] |
collect() | Returns the entire dataset | PySpark: list of Row; Scala: Array[Row] |
Note: take(), first(), and head() internally call the limit() transformation and then the collect() action to bring the data to the Driver.
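Here is a minimal sketch of that equivalence, using a small hypothetical DataFrame (the names and ages are illustrative only).
# Sketch: take(n) behaves like limit(n) followed by collect()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('take-vs-limit').getOrCreate()
df = spark.createDataFrame([("James",34),("Ann",34),("Michael",33)], ["firstname","age"])
print(df.take(2))             # [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]
print(df.limit(2).collect())  # same rows via the explicit limit() + collect() route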
2. Show Last N Rows in Spark/PySpark
Use the tail() action to get the last N rows from a DataFrame; it returns a list of Row objects in PySpark and an Array[Row] in Spark with Scala. Remember that tail() also moves the selected rows to the Spark Driver, so request only as many rows as fit in the Driver's memory (see the sketch after the table).
Action | Description | Return |
---|---|---|
tail(n) | Returns the last N rows | PySpark: list of Row; Scala: Array[Row] |
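The sketch below shows a bounded tail() call; the DataFrame is a hypothetical stand-in with a single 'id' column.
# Sketch: keep the tail size small enough for the Driver's memory
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tail-sketch').getOrCreate()
df = spark.range(100)  # rows 0..99 in a single 'id' column
print(df.tail(3))
# [Row(id=97), Row(id=98), Row(id=99)]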
3. Return Top N Rows After Transformation
In PySpark, limit() is a DataFrame transformation that returns a new DataFrame with the top N rows; for Spark with Scala/Java it returns a Dataset (a short sketch follows the table).
Transformation | Description | Return |
---|---|---|
limit(n) | Returns the top N rows | PySpark: a new DataFrame; Scala: a new Dataset |
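Because limit() is a transformation, nothing executes until an action runs on its result. A minimal sketch, again with a hypothetical range DataFrame:
# Sketch: limit() is lazy; only the action triggers a job
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('limit-lazy').getOrCreate()
df = spark.range(100)
top5 = df.limit(5)  # transformation: returns a new DataFrame, no job yet
top5.show()         # action: the job runs here and prints 5 rows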
4. PySpark Example
Below is a PySpark example demonstrating all actions explained above.
# PySpark Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James",34),("Ann",34),
("Michael",33),("Scott",53),
("Robert",37),("Chad",27)
]
columns = ["firstname","age"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.show()
# Output to console
+---------+---+
|firstname|age|
+---------+---+
|    James| 34|
|      Ann| 34|
|  Michael| 33|
|    Scott| 53|
|   Robert| 37|
|     Chad| 27|
+---------+---+
print(df.take(2))
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]
print(df.tail(2))
# [Row(firstname='Robert', age=37), Row(firstname='Chad', age=27)]
print(df.first())
# Row(firstname='James', age=34)
print(df.head())
# Row(firstname='James', age=34)
print(df.collect())
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34), Row(firstname='Michael', age=33), Row(firstname='Scott', age=53), Row(firstname='Robert', age=37), Row(firstname='Chad', age=27)]
df.limit(3).show()
# Output:
+---------+---+
|firstname|age|
+---------+---+
| James| 34|
| Ann| 34|
| Michael| 33|
+---------+---+
5. Get First N Rows to Pandas DataFrame
While working with Python and Spark, we often need to convert a pandas DataFrame to a PySpark DataFrame and vice versa. You can limit the number of records when converting back to pandas; the snippet below gets the first 3 records from the DataFrame and converts them to a pandas DataFrame.
# Convert to Pandas DataFrame
pandasDF=df.limit(3).toPandas()
print(pandasDF)
# Output Pandas DataFrame
firstname age
0 James 34
1 Ann 34
2 Michael 33
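Note that toPandas() also collects the result to the Spark Driver, so bounding it with limit() first keeps memory predictable. As an optional aside, Spark 3.x can use Apache Arrow to speed up the conversion; this sketch assumes the pyarrow package is installed.
# Optional: Arrow-accelerated conversion (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandasDF = df.limit(3).toPandas()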
6. Conclusion
In this article, you learned that show() displays the first N rows in tabular form, take(), head(), and first() return the first N rows, tail() returns the last N rows as a list of Row objects (Array[Row] for Scala), and limit() is a transformation that returns the top N rows as a new DataFrame/Dataset.
Related Articles
- PySpark foreach() Usage with Examples
- PySpark apply Function to Column
- PySpark max() – Different Methods Explained
- PySpark sum() Columns Example
- PySpark unionByName()
- PySpark between() Example
- PySpark show() – Display DataFrame Contents in Table
- Spark show() – Display DataFrame Contents in Table
- Spark Dataframe – Show Full Column Contents?