Print the contents of RDD in Spark & PySpark

In Spark or PySpark, you can print or show the contents of an RDD by following the steps below:

  1. First, apply the required transformations on the RDD.
  2. Make sure the resulting RDD is small enough to fit in the Spark driver’s memory.
  3. Use the collect() method to retrieve the data from the RDD. In Scala, this returns an Array.
  4. Finally, iterate over the result of collect() and print/show each element on the console.

Usually, collect() is used to retrieve the action output when you have a very small result set. Calling collect() on an RDD with a larger result set can cause an out-of-memory error, because it returns the entire dataset (from all workers) to the driver, so avoid calling collect() on large datasets.
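If the RDD might be too large to collect(), a safer way to peek at its contents is take(n), which returns only the first n elements to the driver. Below is a minimal PySpark sketch (it assumes an existing SparkSession named spark, as in the examples later in this article):


# Preview a few elements without collecting the whole RDD
rdd = spark.sparkContext.parallelize(range(1000))
for row in rdd.take(5):  # only the first 5 elements reach the driver
    print(row)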

In this article, I will explain how to print the contents of a Spark RDD to the console, with examples in both Scala and PySpark (Spark with Python).

1. Default print() Doesn’t Show RDD Contents

When you try to print an RDD variable using a print() statement in Scala or Python (PySpark), it displays something like the output below, not the actual elements of the RDD.

Scala:


print(rdd)
// Outputs:
// ParallelCollectionRDD[0] at parallelize at RDDPrint.scala:13

Python:


print(rdd)
# Outputs
# ParallelCollectionRDD[192] at readRDDFromFile at PythonRDD.scala:262

2. Printing Contents From RDD

To retrieve and print/show the values of an RDD, first collect() the data to the driver, then loop through the result and print each element of the RDD to the console.

2.1 Show Contents From Spark (Scala)


// Show Contents From Spark (Scala)
// Create an RDD from a local collection
val dept = List(("Finance",10),("Marketing",20),
    ("Sales",30),("IT",40))
val rdd = spark.sparkContext.parallelize(dept)

// collect() returns the RDD contents to the driver as an Array
val dataColl = rdd.collect()
dataColl.foreach(println)

This displays the contents of the RDD as tuples on the console.


(Finance,10)
(Marketing,20)
(Sales,30)
(IT,40)

If you want to retrieve the individual elements of each tuple, do the following.


// Access the tuple fields from the collected Array
dataColl.foreach(f => println(f._1 + "," + f._2))

// collectAsMap() returns the pair RDD as a Map on the driver
val dataCollMap = rdd.collectAsMap()
dataCollMap.foreach(f => println(f._1 + "," + f._2))

Both loops yield the output below (note that collectAsMap() does not guarantee ordering, so the second loop may print the pairs in a different order).


// Output:
Finance,10
Marketing,20
Sales,30
IT,40

2.2 Show Contents From PySpark (Python)

The example below demonstrates how to print/display/show the contents of a PySpark RDD on the console.


# Show Contents From PySpark (Python)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create an RDD from a local collection
dept = [("Finance",10),
        ("Marketing",20),
        ("Sales",30),
        ("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

# collect() returns the RDD contents to the driver as a list
dataColl = rdd.collect()
for row in dataColl:
    print(row[0] + "," + str(row[1]))

This yields the same output as above.
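Since each element is a Python tuple, it can also be unpacked directly in the loop, and pair RDDs additionally support collectAsMap(), which returns the pairs as a plain dict on the driver. A minimal sketch reusing rdd and dataColl from the example above:


# Unpack each (name, number) tuple directly in the loop
for name, num in dataColl:
    print(name + "," + str(num))

# collectAsMap() returns the pair RDD as a dict on the driver
dataCollMap = rdd.collectAsMap()
for name, num in dataCollMap.items():
    print(name + "," + str(num))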

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium