
Spark SQL provides a split() function to convert a delimiter-separated String to an array (StringType to ArrayType) column on a DataFrame. This is done by splitting a string column on a delimiter such as a space, comma, or pipe, and converting the result into ArrayType.

In this article, I will explain the split() function syntax and usage with a Scala example. Though the example here uses Scala, you can apply the same approach with PySpark (Spark with Python).

Split() function syntax

The Spark SQL split() function is grouped under Array Functions in the Spark SQL functions class, with the syntax below.


split(str : org.apache.spark.sql.Column, pattern : scala.Predef.String) : org.apache.spark.sql.Column

The split() function takes a DataFrame column of type String as its first argument and the string delimiter you want to split on as its second. The second argument is treated as a regular-expression pattern, so you can also split on a pattern rather than a literal delimiter. This function returns an org.apache.spark.sql.Column of ArrayType.
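Because the delimiter is a regular expression, you can absorb surrounding whitespace while splitting. Below is a minimal sketch (assuming the df DataFrame with the name column created in the next step) that splits on a comma followed by optional spaces, so the resulting elements carry no leading whitespace:


import org.apache.spark.sql.functions.{col, split}

// Split on a comma followed by any number of spaces (regex ",\s*"),
// so "James, A, Smith" yields [James, A, Smith] without leading spaces
df.select(split(col("name"), ",\\s*").as("NameArray"))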

Before we look at usage, first, let's create a DataFrame with a string column whose text is separated by a comma delimiter.


val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

val data = Seq(("James, A, Smith","2018","M",3000),
    ("Michael, Rose, Jones","2010","M",4000),
    ("Robert,K,Williams","2010","M",4000),
    ("Maria,Anne,Jones","2005","F",4000),
    ("Jen,Mary,Brown","2010","",-1)
  )

import spark.implicits._
val df = data.toDF("name","dob_year","gender","salary")
df.printSchema()
df.show(false)

This yields the output below. Notice that the name column holds a first name, middle name, and last name separated by commas.


root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = false)

+--------------------+--------+------+------+
|name                |dob_year|gender|salary|
+--------------------+--------+------+------+
|James, A, Smith     |2018    |M     |3000  |
|Michael, Rose, Jones|2010    |M     |4000  |
|Robert,K,Williams   |2010    |M     |4000  |
|Maria,Anne,Jones    |2005    |F     |4000  |
|Jen,Mary,Brown      |2010    |      |-1    |
+--------------------+--------+------+------+

Split() function usage

The Spark SQL split() function is used to convert a delimiter-separated string to an array (ArrayType) column. The example snippet below splits the name column on the comma delimiter and converts it to an array.


val df2 = df.select(split(col("name"),",").as("NameArray"))
    .drop("name")

df2.printSchema()
df2.show(false)

This yields the output below. As you can see from the schema, NameArray is an array type.


root
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------------------+
|NameArray               |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+
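
Once the column is an ArrayType, you can pick out individual elements with getItem(). Here is a minimal sketch, using trim() to drop the leading spaces left behind when splitting on "," alone:


import org.apache.spark.sql.functions.{col, split, trim}

// Access array elements by index; trim() removes the leading spaces
val nameArr = split(col("name"), ",")
df.select(
    trim(nameArr.getItem(0)).as("firstname"),
    trim(nameArr.getItem(1)).as("middlename"),
    trim(nameArr.getItem(2)).as("lastname")
  ).show(false)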

Using split on SQL expression

Since Spark provides a way to execute raw SQL, let's learn how to write the same split() example using a Spark SQL expression. In order to use raw SQL, you first need to create a table using createOrReplaceTempView(). This creates a temporary view from the DataFrame, and the view is available for the lifetime of the current Spark session.


df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON")
    .show(false)

This yields the same output as the above example.
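
If you are on Spark 3.0 or later, SPLIT() also accepts an optional third limit argument that caps the number of resulting elements; a short sketch that keeps the first name and lumps the rest into a single element:


// Spark 3.0+: limit = 2 splits at most once, producing a two-element array
spark.sql("select SPLIT(name, ',', 2) as NameArray from PERSON")
    .show(false)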

Complete Example

Below is the complete example of splitting a String type column on a delimiter or pattern and converting it into an ArrayType column.

This example is also available at the spark-scala-examples GitHub project for reference.


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

object StringToArray extends App {
  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val data = Seq(("James, A, Smith","2018","M",3000),
    ("Michael, Rose, Jones","2010","M",4000),
    ("Robert,K,Williams","2010","M",4000),
    ("Maria,Anne,Jones","2005","F",4000),
    ("Jen,Mary,Brown","2010","",-1)
  )

  import spark.implicits._
  val df = data.toDF("name","dob_year","gender","salary")
  df.printSchema()
  df.show(false)

  val df2 = df.select(split(col("name"),",").as("NameArray"))
    .drop("name")
  df2.printSchema()
  df2.show(false)

  df.createOrReplaceTempView("PERSON")
  spark.sql("select SPLIT(name,',') as NameArray from PERSON")
    .show(false)
}

Conclusion

In this article, you have learned how to split a string column into an array column by splitting the string on a delimiter, and also how to use the split() function in a Spark SQL expression.

Happy Learning !!


This Post Has 2 Comments

  1. Allan

    A little late, but you can access the individual elements like any other array, e.g. to select the first name in the example:

    ```
    spark.sql("select SPLIT(name,',')[0] as FirstName from PERSON")
    ```

  2. Prajyot Naik

    Hi, Can you help me access the individual elements of the array?
