Show First Top N Rows in Spark | PySpark

In Spark or PySpark, you can use show(n) to display the top or first N rows (5, 10, 100, ...) of a DataFrame on the console or in a log file. And you can use Spark actions like take(), head(), and first() to get the first n rows as a list (Array[Row] for Scala).


Spark actions bring the result back to the Spark driver, so you have to be careful when extracting large datasets. Collecting a dataset larger than the driver's memory throws an OutOfMemoryError and the job fails.

Related: How to Get First Row of Each Group in Spark

1. Show Top N Rows in Spark/PySpark

The following actions get the top/first n rows from a Spark DataFrame. Except for show(), these actions return a list of Row in PySpark and an Array[Row] in Spark with Scala.

| Action | Description | Return |
|---|---|---|
| show() | Prints/shows the top 20 rows in tabular form | PySpark: no return; Scala: Unit |
| show(n) | Prints/shows the top N rows in tabular form | PySpark: no return; Scala: Unit |
| take(n) / takeAsList(n) (Scala only) | Returns the top N rows | PySpark: list of Row; Scala: Array[Row] (takeAsList: List[Row]) |
| first() | Returns the first row | PySpark: Row; Scala: Row |
| head() | Returns the first row | PySpark: Row; Scala: Row |
| head(n) | Returns the top N rows, similar to take(n) | PySpark: list of Row; Scala: Array[Row] |
| collect() | Returns the entire dataset | PySpark: list of Row; Scala: Array[Row] |
Spark Show Top N Rows Examples

Note: the take(), first(), and head() actions internally call the limit() transformation and finally the collect() action to collect the data.

Let's create a sample DataFrame that I will use to explain the above functions for getting the first or top N rows from a Spark/PySpark DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create SparkDataFrame
simpleData = [("James",34),("Ann",34),
    ("Michael",33),("Scott",53),
    ("Robert",37),("Chad",27)
  ]
columns = ["firstname", "age"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.show()

# Output to console
# +---------+---+
# |firstname|age|
# +---------+---+
# |    James| 34|
# |      Ann| 34|
# |  Michael| 33|
# |    Scott| 53|
# |   Robert| 37|
# |     Chad| 27|
# +---------+---+

2. Show Top N Rows from DataFrame Examples

2.1 Using show()

The Spark DataFrame show() method displays the first 20 rows, with column values truncated at 20 characters. It prints the contents of the DataFrame in a table row and column format. Pass a numeric argument to get the top N rows instead.


# Get top 2 rows to console
df.show(2)

# Output:
#+---------+---+
#|firstname|age|
#+---------+---+
#|    James| 34|
#|      Ann| 34|
#+---------+---+
#only showing top 2 rows

2.2 Using take()

The take(n) action retrieves the first n rows from a Spark DataFrame (or elements from an RDD) as a list of Row objects, letting you inspect a small subset of data without retrieving the entire dataset.


# Get first 2 rows using take()
print(df.take(2))

# Output:
# [Row(firstname='James', age=34), Row(firstname='Ann', age=34)]

2.3 Using first()

The first() action returns the first row of the DataFrame as a Row object.


# Gets only first row
print(df.first())

# Output:
# Row(firstname='James', age=34)

2.4 Using head()

The Spark head() method returns the first N rows of the DataFrame. By default it returns a single row; pass an argument n to get several. If n is greater than 1, it returns a list of Row objects; if n is 1 (or omitted), it returns a single Row.


# Returns 1 row
print(df.head())
#Row(firstname='James', age=34)

# Returns first N rows
print(df.head(2))
#[Row(firstname='James', age=34), Row(firstname='Ann', age=34)]

3. Get Top N Rows using limit()

In PySpark, limit() is a DataFrame transformation that returns a new DataFrame with the top N rows. In Spark with Scala/Java, it returns a Dataset.

| Transformation | Description | Return |
|---|---|---|
| limit(n) | Returns the top N rows | PySpark: a new DataFrame; Scala: a new Dataset |

Example:


# Get DataFrame with 2 rows
df.limit(2).show()

# Output:
# +---------+---+
# |firstname|age|
# +---------+---+
# |    James| 34|
# |      Ann| 34|
# +---------+---+

4. Get Top First N Rows to Pandas DataFrame

While working with Python and Spark, we often need to convert between Pandas and PySpark DataFrames. You can limit the number of records when converting back to pandas; the snippet below gets the first 3 records from the DataFrame and converts them to a pandas DataFrame.


# Convert to Pandas DataFrame
pandasDF=df.limit(3).toPandas()
print(pandasDF)

# Output Pandas DataFrame
#  firstname  age
# 0     James   34
# 1       Ann   34
# 2   Michael   33

5. Conclusion

In this article, you have learned that show() displays the first n records of a Spark or PySpark DataFrame, that take() and head() return the first n records as a list of Row (Array[Row] for Scala), and that limit() is a transformation that returns the top N rows as a DataFrame/Dataset.

Reference

https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html
