Collect() – Retrieve data from Spark RDD/DataFrame

Spark collect() and collectAsList() are action operations that retrieve all the elements of an RDD/DataFrame/Dataset (from all nodes) to the driver node. Use collect() only on smaller datasets, usually after operations such as filter(), groupBy(), or count(). Retrieving a larger dataset can cause the driver to run out of memory. In this…

Spark – Convert array of String to a String column

In this Spark article, I will explain how to convert an array-of-String column on a DataFrame to a single String column (with the elements separated by a comma, space, or any other delimiter character) using the Spark function concat_ws() (short for "concat with separator"), the map() transformation, and a SQL expression, with examples in Scala.…
