PySpark – Loop/Iterate Through Rows in DataFrame

PySpark provides map() and mapPartitions() to loop/iterate through rows in an RDD/DataFrame and perform complex transformations; these two return the same number of records as the original DataFrame, but the number of columns can differ after adds/updates. PySpark also provides the foreach() & foreachPartition() actions to loop/iterate through each…
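
For example, a minimal sketch of this pattern (not taken from the article; the SparkSession setup and the name/salary columns are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loop-rows").getOrCreate()
df = spark.createDataFrame([("James", 3000), ("Anna", 4100)], ["name", "salary"])

# map() on the underlying RDD returns one output record per input row;
# here a derived value is added, so the column count changes but the row count does not.
df2 = df.rdd.map(lambda row: (row.name, row.salary, row.salary * 1.1)) \
            .toDF(["name", "salary", "new_salary"])
df2.show()

# foreach() is an action: the function runs on the executors for its side effects
# and nothing is returned to the driver.
df.foreach(lambda row: print(row.name, row.salary))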

Continue Reading PySpark – Loop/Iterate Through Rows in DataFrame

Spark – Extract DataFrame Column as List

Let's see how to convert/extract a Spark DataFrame column as a List (a Scala/Java collection). There are multiple ways to do this, and I will explain most of them with examples. Remember that when you use DataFrame collect() you get an Array[Row], not a List[String], hence you need to use a map() function to…
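
The article itself works in Scala; as a rough PySpark analogue of the same idea (my own sketch, with a hypothetical state column), collect() returns Row objects that still need a per-row extraction step:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-as-list").getOrCreate()
df = spark.createDataFrame([("James", "NY"), ("Anna", "CA")], ["name", "state"])

# collect() returns a list of Row objects, not plain strings,
# so an extra extraction step is needed, mirroring the map() the excerpt mentions.
states = [row.state for row in df.select("state").collect()]
print(states)  # ['NY', 'CA']

# The same result via rdd.map() before collecting.
states2 = df.select("state").rdd.map(lambda row: row.state).collect()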

Continue Reading Spark – Extract DataFrame Column as List

Collect() – Retrieve data from Spark RDD/DataFrame

Spark collect() and collectAsList() are action operations used to retrieve all the elements of an RDD/DataFrame/Dataset (from all nodes) to the driver node. We should use collect() only on smaller datasets, usually after filter(), group(), count(), etc. Retrieving a larger dataset results in an out-of-memory error. In this…
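
As a hedged illustration of the usual pattern (a PySpark-flavoured sketch rather than the article's Scala code; collectAsList() is the Java/Scala variant and has no direct PySpark counterpart):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-example").getOrCreate()
df = spark.createDataFrame(
    [("James", "Sales", 3000), ("Anna", "HR", 4100)], ["name", "dept", "salary"]
)

# collect() is an action: every row of the (ideally already filtered) DataFrame
# is pulled from the executors back to the driver as a list of Row objects.
rows = df.filter(df.salary > 3500).collect()
for row in rows:
    print(row.name, row.dept, row.salary)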

Continue Reading Collect() – Retrieve data from Spark RDD/DataFrame

PySpark Collect() – Retrieve data from DataFrame

PySpark RDD/DataFrame collect() is an action operation used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use collect() only on smaller datasets, usually after filter(), group(), etc. Retrieving larger datasets results in an OutOfMemory error. In this PySpark article, I…
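
A quick sketch of that workflow (minimal made-up data, not the article's own example): collect a small aggregated result and read each Row on the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-collect").getOrCreate()
df = spark.createDataFrame(
    [("James", "Sales"), ("Anna", "HR"), ("Robert", "Sales")], ["name", "dept"]
)

# Collect a small aggregated result to the driver; each element is a Row
# whose fields can be read by name or by index.
for row in df.groupBy("dept").count().collect():
    print(row["dept"], row["count"])

# collect() also works on RDDs.
names = df.rdd.map(lambda row: row.name).collect()
print(names)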

Continue Reading PySpark Collect() – Retrieve data from DataFrame