PySpark Convert String to Array Column

  • Post author:
  • Post category:PySpark
  • Post last modified:February 7, 2023
  • Reading time:7 mins read

PySpark SQL provides split() function to convert delimiter separated String to an Array (StringType to ArrayType) column on DataFrame. This can be done by splitting a string column based on a delimiter like space, comma, pipe e.t.c, and converting it into ArrayType.

In this article, I will explain converting String to Array column using split() function on DataFrame and SQL query.

Split() function syntax

PySpark SQL split() is grouped under Array Functions in PySpark SQL Functions class with the below syntax.


pyspark.sql.functions.split(str, pattern, limit=-1)

The split() function takes the first argument as the DataFrame column of type String and the second argument string delimiter that you want to split on. You can also use the pattern as a delimiter. This function returns pyspark.sql.Column of type Array.

Before we start with usage, first, let’s create a DataFrame with a string column with text separated with comma delimiter


from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James, A, Smith","2018","M",3000),
            ("Michael, Rose, Jones","2010","M",4000),
            ("Robert,K,Williams","2010","M",4000),
            ("Maria,Anne,Jones","2005","F",4000),
            ("Jen,Mary,Brown","2010","",-1)
            ]

columns=["name","dob_year","gender","salary"]
df=spark.createDataFrame(data,columns)
df.printSchema()

This yields the below output. As you notice we have a name column with takens firstname, middle and lastname with comma separated.


root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = false)

PySpark Convert String to Array Column

Below PySpark example snippet splits the String column name on comma delimiter and convert it to an Array. If you do not need the original column, use drop() to remove the column.


from pyspark.sql.functions import split, col
df2 = df.select(split(col("name"),",").alias("NameArray")) \
    .drop("name")
df2.printSchema()
df2.show()

This yields below output. As you see below schema NameArray is a array type.


root
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------------------+
|NameArray               |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+

Convert String to Array Column using SQL Query

Since PySpark provides a way to execute the raw SQL, let’s learn how to write the same example using Spark SQL expression.

In order to use raw SQL, first, you need to create a table using createOrReplaceTempView(). This creates a temporary view from the Dataframe and this view is available lifetime of the current Spark context.


df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON") \
    .show()

This yields the same output as above example.

Complete Example

Below is the complete example of splitting an String type column based on a delimiter or patterns and converting into ArrayType column.

This example is also available at PySpark-Examples GitHub project for reference.


from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James, A, Smith","2018","M",3000),
            ("Michael, Rose, Jones","2010","M",4000),
            ("Robert,K,Williams","2010","M",4000),
            ("Maria,Anne,Jones","2005","F",4000),
            ("Jen,Mary,Brown","2010","",-1)
            ]

columns=["name","dob_year","gender","salary"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import split, col
df2 = df.select(split(col("name"),",").alias("NameArray")) \
    .drop("name")
df2.printSchema()
df2.show()

df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON") \
    .show()

Conclusion

In this simple article, you have learned how to Convert the string column into an array column by splitting the string by delimiter and also learned how to use the split function on PySpark SQL expression.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply