In Spark or PySpark, you can print or show the contents of an RDD by following these steps:
- First, apply the transformations on the RDD.
- Make sure the resulting RDD is small enough to fit in the Spark driver’s memory.
- Use the collect() method to retrieve the data from the RDD. This returns an Array in Scala and a list in PySpark.
- Finally, iterate over the result of collect() and print it to the console.
collect() is an action that is typically used when the result set is very small. Because it returns the entire dataset (from all workers) to the driver, calling collect() on an RDD with a large result set can cause an out-of-memory error, so avoid calling it on larger datasets.
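When an RDD may be too large to collect safely, one common alternative is to print only a few elements with take(n), which returns at most n elements to the driver. The sketch below assumes a PySpark RDD; the helper name `preview_rdd` and its default of 20 rows are illustrative choices, not part of the Spark API.

```python
def preview_rdd(rdd, n=20):
    """Print at most n elements of an RDD without collecting all of it.

    Relies only on RDD.take(n), which returns a list of up to n elements
    to the driver. The helper name and default are hypothetical.
    """
    rows = rdd.take(n)
    for row in rows:
        print(row)
    return rows
```

Calling `preview_rdd(rdd, 2)` prints at most two elements, so the amount of data sent to the driver is bounded by n rather than by the size of the RDD.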
In this article, I will explain how to print the contents of a Spark RDD to the console, with examples in Scala and PySpark (Spark with Python).
1. Default print() Doesn’t Show RDD Contents
When you print an RDD variable using a print() statement in Scala or Python (PySpark), it displays a description of the RDD like the one below, not its actual elements.
Scala:
print(rdd)
// Outputs:
// ParallelCollectionRDD[0] at parallelize at RDDPrint.scala:13 //RDD
Python:
print(rdd)
# Outputs
# ParallelCollectionRDD[192] at readRDDFromFile at PythonRDD.scala:262 # RDD
2. Printing Contents From RDD
To retrieve and print the values of an RDD, first collect() the data to the driver, then loop through the result and print each element to the console.
2.1 Show Contents From Spark (Scala)
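// Show Contents From Spark (Scala)
// `spark` is the SparkSession (predefined in spark-shell)
val dept = List(("Finance", 10), ("Marketing", 20),
  ("Sales", 30), ("IT", 40))
val rdd = spark.sparkContext.parallelize(dept)
// collect() returns an Array of tuples on the driver
val dataColl = rdd.collect()
dataColl.foreach(println)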
This prints each element of the RDD to the console as a tuple.
(Finance,10)
(Marketing,20)
(Sales,30)
(IT,40)
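To print the individual fields of each tuple, access them by position. Alternatively, collectAsMap() returns a pair RDD to the driver as a Map; note that a Map keeps only one value per key and does not guarantee iteration order.
// Print the fields of each tuple
dataColl.foreach(f => println(f._1 + "," + f._2))
// Alternatively, collect a pair RDD as a Map (key -> value)
val dataCollMap = rdd.collectAsMap()
dataCollMap.foreach(f => println(f._1 + "," + f._2))
Each loop yields the output below (the Map version may print the rows in a different order).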
// Output:
Finance,10
Marketing,20
Sales,30
IT,40
2.2 Show Contents From PySpark (Python)
The below example demonstrates how to print/display/show the PySpark RDD contents to the console.
# Show Contents From PySpark (Python)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
rdd = spark.sparkContext.parallelize(dept)
# collect() returns a Python list of tuples on the driver
dataColl = rdd.collect()
for row in dataColl:
    print(row[0] + "," + str(row[1]))
This yields the same output as above.
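The loop above hard-codes two fields per row. As a minimal sketch, a small helper can format rows of any width from the list that collect() returns. `format_row` is a hypothetical name introduced here, and the sample list stands in for the collect() result so the snippet runs without a Spark session.

```python
def format_row(row, sep=","):
    """Join every field of one collected row into a single string."""
    return sep.join(str(field) for field in row)


# Stand-in for rdd.collect(), which returns a list of tuples in PySpark
collected = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]

lines = [format_row(row) for row in collected]
for line in lines:
    print(line)
```

With a real RDD, replacing the stand-in list with `rdd.collect()` should give the same lines, since the helper only depends on each row being a sequence of fields.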
Happy Learning !!