Spark SQL provides a split() function to convert a delimiter-separated String column into an array (StringType to ArrayType) column on a DataFrame. This is done by splitting the string column on a delimiter such as a space, comma, or pipe, and converting the result into ArrayType.
In this article, I will explain the split() function syntax and usage with a Scala example. Though the example here is in Scala, you can use the same approach with PySpark (Spark with Python).
Split() function syntax
Spark SQL split() is grouped under Array Functions in the Spark SQL functions class with the below syntax.
split(str : org.apache.spark.sql.Column, pattern : scala.Predef.String) : org.apache.spark.sql.Column
The split() function takes as its first argument a DataFrame column of type String, and as its second argument the string delimiter that you want to split on. The delimiter is interpreted as a regular expression, so you can also pass a pattern. This function returns an org.apache.spark.sql.Column of type Array.
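Because the second argument is treated as a regular expression, a pattern can serve as the delimiter. A minimal sketch (the column name name here is illustrative, matching the DataFrame created below):

```scala
import org.apache.spark.sql.functions.{col, split}

// The pattern "[,|]" is a regular expression that matches either
// a comma or a pipe, so both characters act as delimiters.
val parts = split(col("name"), "[,|]")
```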
Before we look at its usage, let's first create a DataFrame with a string column containing text separated by a comma delimiter.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val data = Seq(("James, A, Smith","2018","M",3000),
("Michael, Rose, Jones","2010","M",4000),
("Robert,K,Williams","2010","M",4000),
("Maria,Anne,Jones","2005","F",4000),
("Jen,Mary,Brown","2010","",-1)
)
import spark.implicits._
val df = data.toDF("name","dob_year","gender","salary")
df.printSchema()
df.show(false)
This yields the below output. As you can see, the name column contains the first name, middle name, and last name separated by commas.
root
|-- name: string (nullable = true)
|-- dob_year: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = false)
+--------------------+--------+------+------+
|name |dob_year|gender|salary|
+--------------------+--------+------+------+
|James, A, Smith |2018 |M |3000 |
|Michael, Rose, Jones|2010 |M |4000 |
|Robert,K,Williams |2010 |M |4000 |
|Maria,Anne,Jones |2005 |F |4000 |
|Jen,Mary,Brown |2010 | |-1 |
+--------------------+--------+------+------+
Split() function usage
The Spark SQL split() function converts the delimiter-separated string into an array (ArrayType) column. The below example splits the name column on the comma delimiter and converts it to an array.
import org.apache.spark.sql.functions.{col, split}

val df2 = df.select(split(col("name"),",").as("NameArray"))
.drop("name")
df2.printSchema()
df2.show(false)
This yields the below output. As you can see in the schema, NameArray is an array type.
root
|-- NameArray: array (nullable = true)
| |-- element: string (containsNull = true)
+------------------------+
|NameArray |
+------------------------+
|[James, A, Smith] |
|[Michael, Rose, Jones]  |
|[Robert, K, Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
+------------------------+
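Once the column is an array, individual elements can be pulled out by position with getItem() (0-based). A small sketch, referring to the df2 DataFrame created above:

```scala
import org.apache.spark.sql.functions.col

// getItem(0) selects the first element of the array column (0-based index),
// which for this data is the first name.
val firstNames = df2.select(col("NameArray").getItem(0).as("FirstName"))
firstNames.show(false)
```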
Using split on SQL expression
Since Spark provides a way to execute raw SQL, let's learn how to write the same split() example using a Spark SQL expression. In order to use raw SQL, you first need to create a table using createOrReplaceTempView(). This creates a temporary view from the DataFrame, and the view is available for the lifetime of the current Spark session.
df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON")
.show(false)
This yields the same output as the above example.
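As a side note, since Spark 3.0 the SQL SPLIT() function also accepts an optional third limit argument that caps the number of resulting elements. A sketch, assuming Spark 3.0 or later and the PERSON view created above:

```scala
// With limit = 2, the string is split at most once, so everything
// after the first comma stays together in the second element.
spark.sql("select SPLIT(name, ',', 2) as NameArray from PERSON")
  .show(false)
```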
Complete Example
Below is the complete example of splitting a String type column based on a delimiter or pattern and converting it into an ArrayType column.
This example is also available at the spark-scala-examples GitHub project for reference.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}
object StringToArray extends App {
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val data = Seq(("James, A, Smith","2018","M",3000),
("Michael, Rose, Jones","2010","M",4000),
("Robert,K,Williams","2010","M",4000),
("Maria,Anne,Jones","2005","F",4000),
("Jen,Mary,Brown","2010","",-1)
)
import spark.implicits._
val df = data.toDF("name","dob_year","gender","salary")
df.printSchema()
df.show(false)
val df2 = df.select(split(col("name"),",").as("NameArray"))
.drop("name")
df2.printSchema()
df2.show(false)
df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON")
.show(false)
}
Conclusion
In this article, you have learned how to split a string column into an array column by splitting the string on a delimiter, and how to use the split function in a Spark SQL expression.
Happy Learning !!
Hi, can you help me access the individual elements of the array?
A little late, but you can access the individual elements like any other array, e.g. to select the first name in the example:
spark.sql("select SPLIT(name,',')[0] as FirstName from PERSON")
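For those not using raw SQL, the DataFrame API equivalent can be sketched with element_at() (available since Spark 2.4; note it uses a 1-based index, unlike getItem()):

```scala
import org.apache.spark.sql.functions.{col, element_at, split}

// element_at is 1-based, so index 1 returns the first element of the array,
// i.e. the first name in this data.
df.select(element_at(split(col("name"), ","), 1).as("FirstName"))
  .show(false)
```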