Spark – Split DataFrame single column into multiple columns


Using the Spark SQL split() function, we can split a single string column of a DataFrame into multiple columns. In this article, I will explain the syntax of the split() function and its usage in different ways with Scala examples.

Syntax


split(str : Column, pattern : String) : Column

As you can see above, the split() function takes an existing column of the DataFrame as its first argument and the pattern you want to split on (usually a delimiter) as its second argument, and it returns a column of array type.
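One detail worth noting: the pattern argument is treated as a Java regular expression, not a literal string, so regex metacharacters such as `.` or `|` must be escaped. Plain Scala's String.split uses the same Java-regex pattern syntax (edge cases such as trailing empty strings are handled slightly differently), so the splitting behavior can be sketched without Spark:

```scala
// String.split takes a Java regex, just like Spark SQL's split()
val name = "James, A, Smith"

// Splitting on a bare comma keeps the space after each comma
val raw = name.split(",").toList        // List("James", " A", " Smith")

// A regex that also consumes optional whitespace gives clean tokens
val clean = name.split(",\\s*").toList  // List("James", "A", "Smith")

// Regex metacharacters such as '.' must be escaped to split literally
val dotted = "a.b.c".split("\\.").toList // List("a", "b", "c")

println(raw)
println(clean)
println(dotted)
```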

Before we start with an example of the Spark split() function, let's first create a DataFrame; we will use one of its columns to split into multiple columns.


  val data = Seq(("James, A, Smith","2018","M",3000),
    ("Michael, Rose, Jones","2010","M",4000),
    ("Robert,K,Williams","2010","M",4000),
    ("Maria,Anne,Jones","2005","F",4000),
    ("Jen,Mary,Brown","2010","",-1)
  )
  
  import spark.implicits._
  val df = data.toDF("name","dob_year","gender","salary")
  df.printSchema()
  df.show(false)

This yields the output below.


root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = false)


+--------------------+--------+------+------+
|name                |dob_year|gender|salary|
+--------------------+--------+------+------+
|James, A, Smith     |2018    |M     |3000  |
|Michael, Rose, Jones|2010    |M     |4000  |
|Robert,K,Williams   |2010    |M     |4000  |
|Maria,Anne,Jones    |2005    |F     |4000  |
|Jen,Mary,Brown      |2010    |      |-1    |
+--------------------+--------+------+------+

Split DataFrame column to multiple columns

In the above DataFrame, the name column of type String is a combined field of first name, middle name, and last name separated by a comma delimiter. In the example below, we split this column into FirstName, MiddleName, and LastName columns.


  val df2 = df.select(split(col("name"),",").getItem(0).as("FirstName"),
    split(col("name"),",").getItem(1).as("MiddleName"),
    split(col("name"),",").getItem(2).as("LastName"))

  df2.printSchema()
  df2.show(false)

Since split() returns a column of ArrayType, we use getItem(idx) to extract the value at a given index. This yields the output below.


root
 |-- FirstName: string (nullable = true)
 |-- MiddleName: string (nullable = true)
 |-- LastName: string (nullable = true)

+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
|James    | A        | Smith  |
|Michael  | Rose     | Jones  |
|Robert   |K         |Williams|
|Maria    |Anne      |Jones   |
|Jen      |Mary      |Brown   |
+---------+----------+--------+
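Notice the leading space in MiddleName and LastName for the rows whose delimiter was ", " (comma plus space). Because the pattern is a regex, splitting on ",\\s*" consumes the optional space along with the comma; the same pattern should work in Spark as split(col("name"), ",\\s*"). A plain-Scala sketch over two of the sample names:

```scala
// The sample data mixes ", " and "," as delimiters
val names = Seq("James, A, Smith", "Robert,K,Williams")

// Splitting on the regex ",\s*" consumes the optional space too,
// so every token comes back without a leading blank
val parts = names.map(_.split(",\\s*").toList)

parts.foreach(println)
// List(James, A, Smith)
// List(Robert, K, Williams)
```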

Splitting column using withColumn

Let's see how to split a column using the DataFrame withColumn() function, which adds a new column to the existing DataFrame (or replaces an existing column of the same name).


  val splitDF = df.withColumn("FirstName",split(col("name"),",").getItem(0))
    .withColumn("MiddleName",split(col("name"),",").getItem(1))
    .withColumn("LastName",split(col("name"),",").getItem(2))
    .withColumn("NameArray",split(col("name"),","))
    .drop("name")
  splitDF.printSchema()
  splitDF.show(false)

In this example, I've also created another column, NameArray, holding the raw array returned by split().


root
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = false)
 |-- FirstName: string (nullable = true)
 |-- MiddleName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = true)


+--------+------+------+---------+----------+--------+------------------------+
|dob_year|gender|salary|FirstName|MiddleName|LastName|NameArray               |
+--------+------+------+---------+----------+--------+------------------------+
|2018    |M     |3000  |James    | A        | Smith  |[James,  A,  Smith]     |
|2010    |M     |4000  |Michael  | Rose     | Jones  |[Michael,  Rose,  Jones]|
|2010    |M     |4000  |Robert   |K         |Williams|[Robert, K, Williams]   |
|2005    |F     |4000  |Maria    |Anne      |Jones   |[Maria, Anne, Jones]    |
|2010    |      |-1    |Jen      |Mary      |Brown   |[Jen, Mary, Brown]      |
+--------+------+------+---------+----------+--------+------------------------+
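One thing to be aware of: this example evaluates split(col("name"), ",") four times. Catalyst will often collapse such duplicate expressions, but you can also make the reuse explicit by creating NameArray first and deriving the name columns from it, e.g. df.withColumn("NameArray", split(col("name"), ",")).withColumn("FirstName", col("NameArray").getItem(0)) and so on. The reuse idea, sketched in plain Scala:

```scala
// Split once and index into the result, instead of re-splitting per field
val arr = "Maria,Anne,Jones".split(",")

val firstName  = arr(0) // "Maria"
val middleName = arr(1) // "Anne"
val lastName   = arr(2) // "Jones"

println(s"$firstName $middleName $lastName")
```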

Split DataFrame column using raw Spark SQL

Here, let's run the same example using Spark SQL syntax. As you may already be aware, to use raw SQL you first need to create a temporary view from the DataFrame.


  df.createOrReplaceTempView("PERSON")
  spark.sql("select SPLIT(name,',') as NameArray from PERSON")
    .show(false)

This yields the output below.


+------------------------+
|NameArray               |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+

Complete source code for reference


package com.sparkbyexamples.spark.dataframe.functions.string

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SplitExample extends App{

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val data = Seq(("James, A, Smith","2018","M",3000),
    ("Michael, Rose, Jones","2010","M",4000),
    ("Robert,K,Williams","2010","M",4000),
    ("Maria,Anne,Jones","2005","F",4000),
    ("Jen,Mary,Brown","2010","",-1)
  )

  import spark.implicits._
  val df = data.toDF("name","dob_year","gender","salary")
  df.printSchema()
  df.show(false)

  val df2 = df.select(split(col("name"),",").getItem(0).as("FirstName"),
    split(col("name"),",").getItem(1).as("MiddleName"),
    split(col("name"),",").getItem(2).as("LastName"))

  df2.printSchema()
  df2.show(false)


  val splitDF = df.withColumn("FirstName",split(col("name"),",").getItem(0))
    .withColumn("MiddleName",split(col("name"),",").getItem(1))
    .withColumn("LastName",split(col("name"),",").getItem(2))
    .withColumn("NameArray",split(col("name"),","))
    .drop("name")
  splitDF.printSchema()
  splitDF.show(false)

  df.createOrReplaceTempView("PERSON")
  spark.sql("select SPLIT(name,',') as NameArray from PERSON")
    .show(false)
}

This complete example is also available at the GitHub project.

Conclusion

In this article, you have learned how to use the Spark SQL split() function to split one string column into multiple DataFrame columns using select(), withColumn(), and raw Spark SQL.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment using Scala and Maven.

This Post Has 8 Comments

  1. 杨勇

    “Spark – Split DataFrame single column into multiple columns”; what if I have a thousand rows? Obviously this is not a universal solution.

    1. NNK

      I agree this is not a universal solution. Here I’ve explained the concept of how to do it, and it is easily extendable to many columns.

  2. John J

    I notice that `split(col("name"),",")` keeps getting used. Is there a way to reuse the result of its first use to produce all the other derived columns?

    1. NNK

      Sorry for the late reply. Yes, you can do it. First create a DataFrame with a split() column, then use that newly created column to extract the indexes you want.

  3. singam

    Thank you NNK. It worked.

  4. singam

    Hello there, nice content and explanation, i’ll always refer your blog to learn new things in spark.
    I have a doubt, could you please clarify.

    I have a column called Contact Reason, and I have to split this column into multiple columns (up to 5); the 5th split column should contain all the remaining values.

    Example:

    input string is :
    10004 / 10035 / 10146 / 20184 / 23344 / 12445 / 134555

    df.withColumn("Contact_Reason_L1", ltrim(split(col("Contact Reason,"), "/").getItem(0)))
    .withColumn("Contact_Reason_L2", ltrim(split(col("Contact Reason,"), "/").getItem(1)))
    .withColumn("Contact_Reason_L3", ltrim(split(col("Contact Reason,"), "/").getItem(2)))
    .withColumn("Contact_Reason_L4", ltrim(split(col("Contact Reason,"), "/").getItem(3)))
    .withColumn("Contact_Reason_L5", ltrim(split(col("Contact Reason,"), "/").getItem(4)))

    Expected Output

    Contact_Reason_L1,Contact_Reason_L2 ,Contact_Reason_L3,Contact_Reason_L4,Contact_Reason_L5

    10004, 10035 , 10146 , 20184 , 23344 / 12445 / 134555

    is this possible??

    1. NNK

      Hi Singam, thanks for reading about converting a single column to multiple columns in Spark, and for your wonderful words. Glad the examples provided are helping you.

      Yes, It is possible and you can get your expected output using UDF or try the following on your last withColumn
      array_join(slice(split(col("Contact Reason,"),"/"),4,6),"/")

      Please comment on whether it works or not.

      1. Anonymous

        Thank you NNK, it worked.
