Using the Spark SQL split() function, we can split a single string column of a DataFrame into multiple columns. In this article, I will explain the syntax of the split() function and its usage in different ways, with Scala examples.
Syntax
split(str : Column, pattern : String) : Column
As you can see above, the split() function takes an existing column of the DataFrame as its first argument and the pattern you want to split on (usually a delimiter) as its second argument, and it returns a Column of ArrayType.
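Note that the pattern argument is treated as a Java regular expression, not a literal string, so regex metacharacters such as `.` or `|` must be escaped. The sketch below uses plain Scala's `String.split()`, which follows the same regex semantics, to illustrate:

```scala
// The pattern is a regex: an unescaped "." matches any character,
// so every token becomes empty and trailing empty strings are dropped.
val wrong = "2018.05.01".split(".")
println(wrong.length)              // 0

// Escaping the dot splits on the literal character as intended.
val right = "2018.05.01".split("\\.")
println(right.mkString("|"))       // 2018|05|01
```

The same applies to Spark's split(): to split on a literal dot you would write `split(col("date"), "\\.")`.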
Before we start with examples of the Spark split() function, let's first create a DataFrame; we will then split one of its columns into multiple columns.
val data = Seq(("James, A, Smith","2018","M",3000),
("Michael, Rose, Jones","2010","M",4000),
("Robert,K,Williams","2010","M",4000),
("Maria,Anne,Jones","2005","F",4000),
("Jen,Mary,Brown","2010","",-1)
)
import spark.implicits._
val df = data.toDF("name","dob_year","gender","salary")
df.printSchema()
df.show(false)
This yields the output below.
// Output:
root
|-- name: string (nullable = true)
|-- dob_year: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = false)
+--------------------+--------+------+------+
|name |dob_year|gender|salary|
+--------------------+--------+------+------+
|James, A, Smith |2018 |M |3000 |
|Michael, Rose, Jones|2010 |M |4000 |
|Robert,K,Williams |2010 |M |4000 |
|Maria,Anne,Jones |2005 |F |4000 |
|Jen,Mary,Brown |2010 | |-1 |
+--------------------+--------+------+------+
1. Split DataFrame column into multiple columns
In the above DataFrame, the name column of type String is a combined field of the first, middle, and last name, separated by a comma delimiter. In the example below, we split this column into FirstName, MiddleName, and LastName columns.
// Split DataFrame column to multiple columns
val df2 = df.select(split(col("name"),",").getItem(0).as("FirstName"),
split(col("name"),",").getItem(1).as("MiddleName"),
split(col("name"),",").getItem(2).as("LastName"))
.drop("name")
df2.printSchema()
df2.show(false)
Since the split() function returns an ArrayType column, we use getItem(idx) to get the value at the given index. This yields the output below.
// Output:
root
|-- FirstName: string (nullable = true)
|-- MiddleName: string (nullable = true)
|-- LastName: string (nullable = true)
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
|James | A | Smith |
|Michael | Rose | Jones |
|Robert |K |Williams|
|Maria |Anne |Jones |
|Jen |Mary |Brown |
+---------+----------+--------+
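Notice that the first two rows keep a leading space in MiddleName and LastName (" A", " Smith"): those input names use ", " as the separator, while we split on "," only. Because the pattern is a regex, one option is to consume the optional whitespace in the pattern itself. A minimal sketch using plain Scala's `String.split()`, which has the same regex semantics:

```scala
// Splitting on "," keeps the space that follows each comma.
val raw = "James, A, Smith".split(",")
println(raw.mkString("|"))    // James| A| Smith

// ",\\s*" consumes the comma plus any whitespace that follows it.
val clean = "James, A, Smith".split(",\\s*")
println(clean.mkString("|"))  // James|A|Smith
```

In Spark, the equivalent would be `split(col("name"), ",\\s*")`.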
2. Splitting a column using withColumn()
Let's see how to split a column using the DataFrame withColumn() function, which adds a new column to the existing DataFrame.
// Splitting column using withColumn
val splitDF = df.withColumn("FirstName",split(col("name"),",").getItem(0))
.withColumn("MiddleName",split(col("name"),",").getItem(1))
.withColumn("LastName",split(col("name"),",").getItem(2))
.withColumn("NameArray",split(col("name"),","))
.drop("name")
splitDF.printSchema()
splitDF.show(false)
In this example, I’ve also created another column, NameArray, holding the actual array returned by the split() function.
// Output:
root
|-- dob_year: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = false)
|-- FirstName: string (nullable = true)
|-- MiddleName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- NameArray: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+------+------+---------+----------+--------+------------------------+
|dob_year|gender|salary|FirstName|MiddleName|LastName|NameArray |
+--------+------+------+---------+----------+--------+------------------------+
|2018 |M |3000 |James | A | Smith |[James, A, Smith] |
|2010    |M     |4000  |Michael  | Rose     | Jones  |[Michael, Rose, Jones]  |
|2010 |M |4000 |Robert |K |Williams|[Robert, K, Williams] |
|2005 |F |4000 |Maria |Anne |Jones |[Maria, Anne, Jones] |
|2010 | |-1 |Jen |Mary |Brown |[Jen, Mary, Brown] |
+--------+------+------+---------+----------+--------+------------------------+
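One thing to be aware of: if a name had only two parts, getItem(2) would not fail; with Spark's default (non-ANSI) settings, an out-of-range index simply produces null in the resulting column. This differs from plain Scala, where indexing past the end of an array throws; the closest pure-Scala analogue is Array.lift, sketched below:

```scala
// Only two tokens, so index 2 is out of range.
val parts = "Jen,Mary".split(",")

// parts(2) would throw ArrayIndexOutOfBoundsException;
// lift returns an Option instead, mirroring Spark's null-producing getItem.
println(parts.lift(2))  // None
println(parts.lift(0))  // Some(Jen)
```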
3. Split DataFrame column using raw Spark SQL
Here, let's run the same example using Spark SQL syntax. As you might already be aware, to use raw SQL you first need to create a temporary view from the DataFrame.
df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON")
.show(false)
This yields the output below.
// Output:
+------------------------+
|NameArray |
+------------------------+
|[James, A, Smith] |
|[Michael, Rose, Jones]  |
|[Robert, K, Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
+------------------------+
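The array returned by SPLIT can also be indexed directly in SQL with the [] syntax, mirroring getItem() from the DataFrame API (with default settings, an out-of-range index yields NULL here as well). A sketch against the PERSON view created above:

```sql
SELECT SPLIT(name, ',')[0] AS FirstName,
       SPLIT(name, ',')[1] AS MiddleName,
       SPLIT(name, ',')[2] AS LastName
FROM PERSON
```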
4. Complete source code for reference
// Complete source code for reference
package com.sparkbyexamples.spark.dataframe.functions.string
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SplitExample extends App{
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val data = Seq(("James, A, Smith","2018","M",3000),
("Michael, Rose, Jones","2010","M",4000),
("Robert,K,Williams","2010","M",4000),
("Maria,Anne,Jones","2005","F",4000),
("Jen,Mary,Brown","2010","",-1)
)
import spark.implicits._
val df = data.toDF("name","dob_year","gender","salary")
df.printSchema()
df.show(false)
val df2 = df.select(split(col("name"),",").getItem(0).as("FirstName"),
split(col("name"),",").getItem(1).as("MiddleName"),
split(col("name"),",").getItem(2).as("LastName"))
.drop("name")
df2.printSchema()
df2.show(false)
val splitDF = df.withColumn("FirstName",split(col("name"),",").getItem(0))
.withColumn("MiddleName",split(col("name"),",").getItem(1))
.withColumn("LastName",split(col("name"),",").getItem(2))
.withColumn("NameArray",split(col("name"),","))
.drop("name")
splitDF.printSchema()
splitDF.show(false)
df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON")
.show(false)
}
This complete example is also available at the GitHub project.
Conclusion
In this article, you have learned how to use the Spark SQL split() function to split one string column into multiple DataFrame columns using select(), withColumn(), and raw Spark SQL.
Happy Learning !!