Show First Top N Rows in Spark | PySpark

In Spark/PySpark, you can use the show() action to get the top/first N rows (5, 10, 100, ...) of a DataFrame and display them on the console or in a log. Spark also provides several actions, such as take(), tail(), collect(), head(), and first(), that return the top or last N rows as a list of Row objects (Array[Row] in Scala). Spark actions bring the result to the Spark Driver, so be careful when extracting large datasets: collecting a dataset larger than the Spark Driver's memory raises an OutOfMemoryError and the job fails.
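
For instance, here is a minimal sketch of the safe pattern (assuming an existing SparkSession and a hypothetical large DataFrame named large_df):


# Risky on big data: collect() pulls every row into the Driver's memory
# all_rows = large_df.collect()

# Safer: cap the number of rows that ever reach the Driver
top_rows = large_df.take(100)      # list of up to 100 Row objects
preview = large_df.limit(100)      # still a distributed DataFrame, no data moved yet
preview.show(10)                   # prints only 10 rows to the console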

Related: How to Get First Row of Each Group in Spark

1. Show Top N Rows in Spark/PySpark

The following actions get the top/first N rows from a DataFrame. Except for show(), these actions return a list of Row objects in PySpark and an Array[Row] in Spark with Scala. In PySpark, a common pattern is to take the first N records and then convert them to a Pandas DataFrame (covered in section 5).

Action        | Description                                 | Return
--------------|---------------------------------------------|-----------------------------------------
show()        | Prints the top 20 rows in a tabular form    | PySpark: no return (None); Scala: Unit
show(n)       | Prints the top N rows in a tabular form     | PySpark: no return (None); Scala: Unit
take(n)       | Returns the top N rows                      | PySpark: list of Row; Scala: Array[Row]
takeAsList(n) | Returns the top N rows (Scala only)         | Scala: java.util.List[Row]
first()       | Returns the first row                       | PySpark: Row; Scala: Row
head()        | Returns the first row                       | PySpark: Row; Scala: Row
head(n)       | Returns the top N rows, similar to take(n)  | PySpark: list of Row; Scala: Array[Row]
collect()     | Returns the entire dataset                  | PySpark: list of Row; Scala: Array[Row]
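
show() also accepts optional parameters that control the console output. A short sketch (using the df created in section 4 below):


# show() defaults to 20 rows and truncates long values to 20 characters
df.show()                      # top 20 rows in a tabular form
df.show(5)                     # top 5 rows
df.show(5, truncate=False)     # do not truncate long column values
df.show(3, vertical=True)      # print each row vertically, one field per line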

Note: the take(), first(), and head() actions internally call the limit() transformation and then call the collect() action to gather the data.
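
In other words, the two calls below are effectively equivalent (a sketch, using the df from section 4):


rows_a = df.take(3)               # internally limit(3) followed by collect()
rows_b = df.limit(3).collect()    # the explicit transformation + action
assert rows_a == rows_b           # both are lists of Row objects
# (row order is deterministic here because the data comes from a local list)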

2. Show Last N Rows in Spark/PySpark

Use the tail() action to get the last N rows from a DataFrame. It returns a list of Row objects in PySpark and an Array[Row] in Spark with Scala. Remember that tail() also moves the selected rows to the Spark Driver, so limit the request to an amount of data that fits in the Driver's memory.

Action  | Description             | Return
--------|-------------------------|-----------------------------------------
tail(n) | Returns the last N rows | PySpark: list of Row; Scala: Array[Row]
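
A quick sketch of the pattern (using the df from section 4; tail() is available since Spark 3.0):


last_rows = df.tail(3)                # list of at most 3 Row objects
for row in last_rows:
    print(row.firstname, row.age)     # Row fields are accessible by name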

3. Return Top N Rows After Transformation

In PySpark, limit() is a DataFrame transformation that returns a new DataFrame with the top N rows; for Spark with Scala/Java it returns a Dataset.

Transformation | Description            | Return
---------------|------------------------|-----------------------------------------------
limit(n)       | Returns the top N rows | PySpark: a new DataFrame; Scala: a new Dataset
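
Because limit() is a transformation, the result stays distributed and nothing executes until an action is called, so it can be chained with further transformations. A sketch (using the df from section 4):


top3 = df.limit(3)                 # lazy: returns a new DataFrame, no data moved
top3.select("firstname").show()    # chaining works; show() triggers execution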

4. PySpark Example

Below is a PySpark example demonstrating all actions explained above.


# PySpark Example 
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James",34),("Ann",34),
    ("Michael",33),("Scott",53),
    ("Robert",37),("Chad",27)
  ]

columns = ["firstname","age"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.show()
# Output:
+---------+---+
|firstname|age|
+---------+---+
|    James| 34|
|      Ann| 34|
|  Michael| 33|
|    Scott| 53|
|   Robert| 37|
|     Chad| 27|
+---------+---+

print(df.take(2))
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]

print(df.tail(2))
# [Row(firstname='Robert', age=37), Row(firstname='Chad', age=27)]

print(df.first())
# Row(firstname='James', age=34)

print(df.head())
# Row(firstname='James', age=34)

print(df.collect())
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34), Row(firstname='Michael', age=33), Row(firstname='Scott', age=53), Row(firstname='Robert', age=37), Row(firstname='Chad', age=27)]

df.limit(3).show()
# Output:
 +---------+---+
 |firstname|age|
 +---------+---+
 |    James| 34|
 |      Ann| 34|
 |  Michael| 33|
 +---------+---+

5. Get Top N Rows as a Pandas DataFrame

When working with Python and Spark, we often need to convert a Pandas DataFrame to a PySpark DataFrame and vice versa. You can limit the number of records before converting back to Pandas; the code snippet below gets the first 3 records from the DataFrame and converts them to a Pandas DataFrame.


# Convert to Pandas DataFrame
pandasDF=df.limit(3).toPandas()
print(pandasDF)
# Output:
  firstname  age
0     James   34
1       Ann   34
2   Michael   33
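
Alternatively, you can build the Pandas DataFrame yourself from the rows returned by take(); a sketch, assuming pandas is installed (pandasDF2 is just an illustrative name):


# Row objects behave like tuples, so Pandas can consume them directly
import pandas as pd

rows = df.take(3)                                   # list of Row objects
pandasDF2 = pd.DataFrame(rows, columns=df.columns)  # illustrative variant
print(pandasDF2)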

6. Conclusion

In this article, you learned that show() displays the top N records from a DataFrame on the console, that take(), head(), and first() return the top/first records, and that tail() returns the last N records, each as a list of Row objects (Array[Row] for Scala). You also learned that limit() is a transformation that returns the top N rows as a new DataFrame/Dataset.

Reference

https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html
