Print the contents of RDD in Spark & PySpark


Post author: Naveen Nelamali
Post category: Apache Spark / PySpark
Post last modified: May 6, 2024


In Spark or PySpark, you can print the contents of an RDD by collecting its elements to the driver and printing each one, as the steps below show.

1. Default print() Doesn’t Show the Contents

When you pass an RDD variable to print() in Scala or Python (PySpark), it displays a description of the RDD, like the output below, rather than the actual elements.

Scala:


print(rdd)
// Outputs:
// ParallelCollectionRDD[0] at parallelize at RDDPrint.scala:13 //RDD

Python:


print(rdd)
# Outputs
# ParallelCollectionRDD[192] at readRDDFromFile at PythonRDD.scala:262 # RDD

2. Printing Contents From RDD

To print the values of an RDD, first call collect() to bring the data to the driver, then loop through the result and print each element to the console. Because collect() returns every element to the driver, this approach is best suited to RDDs small enough to fit in the driver's memory.

2.1 Show Contents From Spark (Scala)


// Show Contents From Spark (Scala)
// Assumes `spark` is an existing SparkSession
val dept = List(("Finance", 10), ("Marketing", 20),
    ("Sales", 30), ("IT", 40))
val rdd = spark.sparkContext.parallelize(dept)
val dataColl = rdd.collect()
dataColl.foreach(println)

This prints each element of the RDD to the console as a tuple.


(Finance,10)
(Marketing,20)
(Sales,30)
(IT,40)

To access the individual fields of each tuple, do the following.


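// Print both fields of each tuple, separated by a comma
dataColl.foreach(f => println(f._1 + "," + f._2))
// Or collect the pair RDD as a Map (iteration order is not guaranteed)
val dataCollLis = rdd.collectAsMap()
dataCollLis.foreach(f => println(f._1 + "," + f._2))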

Each of the two approaches prints the following lines (the Map version may print them in a different order).


// Output:
Finance,10
Marketing,20
Sales,30
IT,40

2.2 Show Contents From PySpark (Python)

The example below shows how to print the contents of a PySpark RDD to the console.


# Show Contents From PySpark (Python)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
rdd = spark.sparkContext.parallelize(dept)
dataColl = rdd.collect()
for row in dataColl:
    print(row[0] + "," + str(row[1]))

This yields the same output as above.
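
Since collect() brings every element back to the driver, for a larger RDD you may prefer to print only the first few elements. The snippet below is a minimal sketch of that idea and is not part of the original example: show_rdd is a hypothetical helper name, and it assumes each element of the RDD is a tuple or list. It relies only on collect() and take(n), both of which are standard RDD methods.

```python
# Hypothetical helper (not a PySpark API): prints RDD rows on the driver.
# Assumes each element of the RDD is a tuple or list of fields.
def show_rdd(rdd, n=None, sep=","):
    """Print rows one per line and return the printed lines.

    n=None  -> fetch everything with collect()
    n=<int> -> fetch only the first n rows with take(n)
    """
    rows = rdd.collect() if n is None else rdd.take(n)
    # Convert every field to a string and join the fields with the separator
    lines = [sep.join(str(field) for field in row) for row in rows]
    for line in lines:
        print(line)
    return lines
```

With the rdd created in the example above, show_rdd(rdd, n=2) prints only the first two rows, while show_rdd(rdd) prints all of them in the same comma-separated form. Returning the lines as well as printing them is just a convenience for checking the result.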

Happy Learning !!