Spark – Convert array of String to a String column

In this Spark article, I will explain how to convert an array of String column on a DataFrame to a String column (separated or concatenated with a comma, space, or any delimiter character) using the Spark function concat_ws() (which translates to "concat with separator"), the map() transformation, and a SQL expression, with Scala examples.

When curating data on a DataFrame, we may want to flatten complex struct, array, and map datatypes into a flat structure. Here we will see how to convert an array type to a string type.

Before we start, first let’s create a DataFrame with array of string column.


  val arrayData = Seq(
    Row("James,,Smith",List("Java","Scala","C++"),"CA"),
    Row("Michael,Rose,",List("Spark","Java","C++"),"NJ"),
    Row("Robert,,Williams",List("CSharp","VB"),"NV")
  )

  val arraySchema = new StructType()
    .add("name",StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayData),arraySchema)
  df.printSchema()
  df.show()

In this example, "languagesAtSchool" is a column of type array. In the next section, we will convert it to a String. This example yields the below schema and DataFrame.


// Output:
root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)

+----------------+------------------+------------+
|            name| languagesAtSchool|currentState|
+----------------+------------------+------------+
|    James,,Smith|[Java, Scala, C++]|          CA|
|   Michael,Rose,|[Spark, Java, C++]|          NJ|
|Robert,,Williams|      [CSharp, VB]|          NV|
+----------------+------------------+------------+

1. Convert an array of String to String column using concat_ws()

In order to convert an array to a string, Spark SQL provides the built-in function concat_ws(), which takes the delimiter of your choice as the first argument and an array column (type Column) as the second argument.

Syntax


// Syntax
concat_ws(sep : scala.Predef.String, exprs : org.apache.spark.sql.Column*) 
      : org.apache.spark.sql.Column

Usage

In order to use the concat_ws() function, you need to import it with import org.apache.spark.sql.functions.concat_ws. Since this function takes the Column type as the second argument, you need to use col("languagesAtSchool").


val df2 = df.withColumn("languagesAtSchool",
   concat_ws(",",col("languagesAtSchool")))
df2.printSchema()
df2.show()

This yields the below output.


// Output:
+----------------+-----------------+------------+
|            name|languagesAtSchool|currentState|
+----------------+-----------------+------------+
|    James,,Smith|   Java,Scala,C++|          CA|
|   Michael,Rose,|   Spark,Java,C++|          NJ|
|Robert,,Williams|        CSharp,VB|          NV|
+----------------+-----------------+------------+
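
The delimiter can be any string, not just a comma. For example, the below snippet (a small variation of the above; the df2Pipe name is just for illustration) joins the array elements with a pipe character.


// Using a pipe delimiter instead of a comma
val df2Pipe = df.withColumn("languagesAtSchool",
    concat_ws("|",col("languagesAtSchool")))
df2Pipe.show()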

2. Convert an array of String to String column using map()

If you are using an older version of Spark, or in order to get more control, you can use the DataFrame map() transformation to do the same.

Syntax:


map[U](func : scala.Function1[T, U])(implicit evidence$6 : org.apache.spark.sql.Encoder[U]) : org.apache.spark.sql.Dataset[U]

Usage:


import spark.implicits._
val df3 = df.map(f=>{
    val name = f.getString(0)
    // getList() returns a java.util.List; convert it to an array and join with a comma
    val lang = f.getList(1).toArray.mkString(",")
    (name,lang,f.getString(2))
})

// The above returns columns named _1, _2 & _3.
// To change the column names, use the below snippet.
df3.toDF("Name","Languages","currentState")
   .show(false)

Using the map() transformation, you get more control to do custom transformations, for example, getting the first character of every array element and merging them into a string, as shown below.
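
Here is a minimal sketch of that custom transformation (the df4 name and the "initials" logic are my illustration, not part of the original example): it takes the first character of each language and merges them into one string.


// Take the first character of every array element and join them,
// e.g. List("Java","Scala","C++") => "J,S,C"
import spark.implicits._
val df4 = df.map(f=>{
    val initials = f.getList[String](1).toArray
      .map(_.toString.head)
      .mkString(",")
    (f.getString(0), initials, f.getString(2))
})
df4.toDF("Name","Initials","currentState").show(false)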

Note: When possible, I recommend using concat_ws(), as Spark provides optimizations for built-in functions.
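
Relatedly, if you are on Spark 2.4 or later, the built-in array_join() function joins an array column directly with a delimiter (a minimal sketch; the df5 name is just for illustration).


// array_join() joins the elements of an array column with a delimiter
import org.apache.spark.sql.functions.{array_join, col}
val df5 = df.withColumn("languagesAtSchool",
    array_join(col("languagesAtSchool"), ","))
df5.show()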

3. Using Spark SQL expression

You can also use the concat_ws() function in a SQL expression.


// Using Spark SQL expression
df.createOrReplaceTempView("ARRAY_STRING")
spark.sql("select name, concat_ws(',',languagesAtSchool) as languagesAtSchool," +
    " currentState from ARRAY_STRING")
    .show(false)
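
If you prefer not to register a temporary view, the same SQL expression also works with selectExpr() (a minimal equivalent sketch).


// selectExpr() accepts SQL expressions directly on the DataFrame
df.selectExpr("name",
    "concat_ws(',', languagesAtSchool) as languagesAtSchool",
    "currentState")
  .show(false)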

4. Complete Example

Below is the complete Spark example of converting an array of String column to a String column, written in Scala.


import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.functions.{col,concat_ws}

object ArrayOfString extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val arrayStructureData = Seq(
    Row("James,,Smith",List("Java","Scala","C++"),"CA"),
    Row("Michael,Rose,",List("Spark","Java","C++"),"NJ"),
    Row("Robert,,Williams",List("CSharp","VB"),"NV")
  )

  val arrayStructureSchema = new StructType()
    .add("name",StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)


  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
  df.printSchema()
  df.show()

  val df2 = df.withColumn("languagesAtSchool",
    concat_ws(",",col("languagesAtSchool")))
  df2.printSchema()
  df2.show()

  import spark.implicits._
  val df3 = df.map(f=>{
    val name = f.getString(0)
    val lang = f.getList(1).toArray.mkString(",")
    (name,lang,f.getString(2))
  })

  df3.toDF("Name","Languages","currentState")
     .show(false)

  df.createOrReplaceTempView("ARRAY_STRING")
  spark.sql("select name, concat_ws(',',languagesAtSchool) as languagesAtSchool," +
    " currentState from ARRAY_STRING")
    .show(false)
}

This example is also available at the Spark GitHub examples project for reference.

Hope it helps you!! Thanks for reading.

