• Post author:
  • Post category:PySpark
  • Post last modified:January 9, 2024
  • Reading time:10 mins read
You are currently viewing PySpark split() Column into Multiple Columns

pyspark.sql.functions provides a function split() to split DataFrame string Column into multiple columns. In this tutorial, you will learn how to split Dataframe single column into multiple columns using withColumn() and select() and also will explain how to use regular expression (regex) on split function.

Advertisements

PySpark Split Column into multiple columns.

Following is the syntax of split() function. In order to use this first you need to import pyspark.sql.functions.split

Syntax:


pyspark.sql.functions.split(str, pattern, limit=-1)

Parameters:

  • str – a string expression to split
  • pattern – a string representing a regular expression.
  • limit –an integer that controls the number of times pattern is applied.

Note:  Spark 3.0 split() function takes an optional limit field. If not provided, the default limit value is -1.

Before we start with an example of PySpark split function, first let’s create a DataFrame and will use one of the column from this DataFrame to split into multiple columns.


# Create SparkSession and Prepare Sample Data
spark = SparkSession.builder.appName("sparkbyexamples").getOrCreate()
data=data = [('James','','Smith','1991-04-01'),
  ('Michael','Rose','','2000-05-19'),
  ('Robert','','Williams','1978-09-05'),
  ('Maria','Anne','Jones','1967-12-01'),
  ('Jen','Mary','Brown','1980-02-17')
]

columns=["firstname","middlename","lastname","dob"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)

Output is shown below for the above code.


# Output
+---------+----------+--------+----------+
|firstname|middlename|lastname|dob       |
+---------+----------+--------+----------+
|James    |          |Smith   |1991-04-01|
|Michael  |Rose      |        |2000-05-19|
|Robert   |          |Williams|1978-09-05|
|Maria    |Anne      |Jones   |1967-12-01|
|Jen      |Mary      |Brown   |1980-02-17|
+---------+----------+--------+----------+

Now, let’s start working on the Pyspark split() function to split the dob column which is a combination of year-month-day into individual columns like year, month, and day. Below are the different ways to do split() on the column.

Split Column using withColumn()

Let’s use withColumn() function of DataFame to create new columns. Below example creates a new Dataframe with Columns year, month, and the day after performing a split() function on dob Column of string type.


# split() with withColumn
df1 = df.withColumn('year', split(df['dob'], '-').getItem(0)) \
       .withColumn('month', split(df['dob'], '-').getItem(1)) \
       .withColumn('day', split(df['dob'], '-').getItem(2))
df1.show(truncate=False)

Below is the output.


# Output
+---------+----------+--------+----------+----+-----+---+
|firstname|middlename|lastname|dob       |year|month|day|
+---------+----------+--------+----------+----+-----+---+
|James    |          |Smith   |1991-04-01|1991|04   |01 |
|Michael  |Rose      |        |2000-05-19|2000|05   |19 |
|Robert   |          |Williams|1978-09-05|1978|09   |05 |
|Maria    |Anne      |Jones   |1967-12-01|1967|12   |01 |
|Jen      |Mary      |Brown   |1980-02-17|1980|02   |17 |
+---------+----------+--------+----------+----+-----+---+

Alternatively, you can do like below by creating a function variable and reusing it.


# split() with function variable
split_col = pyspark.sql.functions.split(df['dob'], '-')
df2 = df.withColumn('year', split_col.getItem(0)) \
       .withColumn('month', split_col.getItem(1)) \
       .withColumn('day', split_col.getItem(2))
df2.show(truncate=False) 

This also results in the same output as above.

Another way of doing Column split() with a select transformation of Dataframe.

Split Column using Select().


# split() with select()
split_col = pyspark.sql.functions.split(df['dob'], '-')
df3 = df.select("firstname","middlename","lastname","dob", split_col.getItem(0).alias('year'),split_col.getItem(1).alias('month'),split_col.getItem(2).alias('day'))   
df3.show(truncate=False)

This also results in same output as above.

Split with Regular Expression

Let’s take another example and split using a regular expression pattern. In this example, we are splitting a string on multiple characters A and B.


# split() with regular expression
df4 = spark.createDataFrame([('oneAtwoBthree',)], ['str',])
df4.select(split(df4.str, '[AB]').alias('str')).show()

As you know split() results in an ArrayType column, above example returns a DataFrame with ArrayType.


# Output
+-----------------+
|              str|
+-----------------+
|[one, two, three]|
+-----------------+

Split Using limit Argument

Let’s see an example using the limit option on the split.

limit > 0: The resulting array’s length will not be more than `limit`, and the resulting array’s last entry will contain all input beyond the last matched pattern.

limit <= 0 will be applied as many times as possible, and the resulting array can be of any size.


df4.select(split(df4.str, '[AB]',2).alias('str')).show()
+----------------+
|             str|
+----------------+
|[one, twoBthree]|
+----------------+

Another example using limit.


df4.select(split(df4.str, '[AB]',3).alias('str')).show()
+-----------------+
|              str|
+-----------------+
|[one, two, three]|
+-----------------+

PySpark split() Column Example


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col,substring,regexp_replace
spark=SparkSession.builder.appName("sparkbyexamples").getOrCreate()

data=data = [('James','','Smith','1991-04-01'),
  ('Michael','Rose','','2000-05-19'),
  ('Robert','','Williams','1978-09-05'),
  ('Maria','Anne','Jones','1967-12-01'),
  ('Jen','Mary','Brown','1980-02-17')
]

columns=["firstname","middlename","lastname","dob"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)
df1 = df.withColumn('year', split(df['dob'], '-').getItem(0)) \
       .withColumn('month', split(df['dob'], '-').getItem(1)) \
       .withColumn('day', split(df['dob'], '-').getItem(2))
df1.printSchema()
df1.show(truncate=False)

 # Alternatively we can do like below      
split_col = pyspark.sql.functions.split(df['dob'], '-')
df2 = df.withColumn('year', split_col.getItem(0)) \
       .withColumn('month', split_col.getItem(1)) \
       .withColumn('day', split_col.getItem(2))
df2.show(truncate=False)      

# Using split() function of Column class
split_col = pyspark.sql.functions.split(df['dob'], '-')
df3 = df.select("firstname","middlename","lastname","dob", split_col.getItem(0).alias('year'),split_col.getItem(1).alias('month'),split_col.getItem(2).alias('day'))   
df3.show(truncate=False)
"""
df4=spark.createDataFrame([("20-13-2012-monday",)], ['date',])

df4.select(split(df4.date,'^([\d]+-[\d]+-[\d])').alias('date'),
    regexp_replace(split(df4.date,'^([\d]+-[\d]+-[\d]+)').getItem(1),'-','').alias('day')).show()
    """
df4 = spark.createDataFrame([('oneAtwoBthree',)], ['str',])
df4.select(split(df4.str, '[AB]').alias('str')).show()

df4.select(split(df4.str, '[AB]',2).alias('str')).show()
df4.select(split(df4.str, '[AB]',3).alias('str')).show()

This complete example is also available at Github pyspark example project

Conclusion

This gives you a brief understanding of using pyspark.sql.functions.split() to split a string dataframe column into multiple columns. I hope you understand and keep practicing. For any queries please do comment in the comment section. Thank you!!

Leave a Reply