Print the contents of RDD in Spark & PySpark

In Spark or PySpark, you can print or show the contents of an RDD by following the steps below:

  1. First, apply the required transformations on the RDD.
  2. Make sure the resulting RDD is small enough to fit in the Spark driver’s memory.
  3. Use the collect() method to retrieve the data from the RDD. In Scala, this returns an Array.
  4. Finally, iterate over the result of collect() and print/show each element on the console.

Usually, collect() is used to retrieve the action output when you have a very small result set. Calling collect() on an RDD with a larger result set can cause an out-of-memory error, because it returns the entire dataset (from all workers) to the driver, so avoid calling collect() on large datasets.
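If the RDD might be too large to collect(), a safer way to peek at its contents is take(n), which returns only the first n elements to the driver. Below is a minimal PySpark sketch (it assumes an existing SparkSession named spark, as in the examples later in this article):


# Preview a few elements without collecting the whole RDD
rdd = spark.sparkContext.parallelize(range(1000))
for row in rdd.take(5):  # only the first 5 elements reach the driver
    print(row)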

In this article, I will explain how to print the contents of a Spark RDD to the console, with examples in both Scala and PySpark (Spark with Python).

1. Default print() Doesn’t Show RDD Contents

When you try to print an RDD variable using a print() statement in Scala or Python (PySpark), it displays something like the output below, not the actual elements of the RDD.

Scala:


print(rdd)
// Outputs:
// ParallelCollectionRDD[0] at parallelize at RDDPrint.scala:13

Python:


print(rdd)
# Outputs
# ParallelCollectionRDD[192] at readRDDFromFile at PythonRDD.scala:262

2. Printing Contents From RDD

To retrieve and print/show the values of an RDD, first collect() the data to the driver, then loop through the result and print each element of the RDD to the console.

2.1 Show Contents From Spark (Scala)


// Show Contents From Spark (Scala)
// Create an RDD from a local collection
val dept = List(("Finance",10),("Marketing",20),
    ("Sales",30),("IT",40))
val rdd = spark.sparkContext.parallelize(dept)

// collect() returns the RDD contents to the driver as an Array
val dataColl = rdd.collect()
dataColl.foreach(println)

This displays the contents of the RDD as tuples on the console.


(Finance,10)
(Marketing,20)
(Sales,30)
(IT,40)

If you want to retrieve the individual elements of each tuple, do the following.


// Access the tuple fields from the collected Array
dataColl.foreach(f => println(f._1 + "," + f._2))

// collectAsMap() returns the pair RDD as a Map on the driver
val dataCollMap = rdd.collectAsMap()
dataCollMap.foreach(f => println(f._1 + "," + f._2))

Both loops yield the output below (note that collectAsMap() does not guarantee ordering, so the second loop may print the pairs in a different order).


// Output:
Finance,10
Marketing,20
Sales,30
IT,40

2.2 Show Contents From PySpark (Python)

The example below demonstrates how to print/display/show the contents of a PySpark RDD on the console.


# Show Contents From PySpark (Python)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create an RDD from a local collection
dept = [("Finance",10),
        ("Marketing",20),
        ("Sales",30),
        ("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

# collect() returns the RDD contents to the driver as a list
dataColl = rdd.collect()
for row in dataColl:
    print(row[0] + "," + str(row[1]))

This yields the same output as above.
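Since each element is a Python tuple, it can also be unpacked directly in the loop, and pair RDDs additionally support collectAsMap(), which returns the pairs as a plain dict on the driver. A minimal sketch reusing rdd and dataColl from the example above:


# Unpack each (name, number) tuple directly in the loop
for name, num in dataColl:
    print(name + "," + str(num))

# collectAsMap() returns the pair RDD as a dict on the driver
dataCollMap = rdd.collectAsMap()
for name, num in dataCollMap.items():
    print(name + "," + str(num))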

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium