PySpark Concatenate Columns

pyspark.sql.functions provides two functions concat() and concat_ws() to concatenate DataFrame multiple columns into a single column. In this article, I will explain the differences between concat() and concat_ws() (concat with separator) by examples.

PySpark Concatenate Using concat()

concat() function of Pyspark SQL is used to concatenate multiple DataFrame columns into a single column. It can also be used to concatenate column types string, binary, and compatible array columns.


pyspark.sql.functions.concat(*cols)

Below is the example of using Pysaprk conat() function on select() function of Pyspark. select() is a transformation function in PySpark and returns a new DataFrame with the selected columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,col
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

In the above example, using concat() function of Pyspark SQL, I have concatenated three input string columns(firstname, middlename, lastname) into a single string column(FullName). Below is the output from the above example.


--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

PySpark concat_ws() Usage

concat_ws() function of Pyspark concatenates multiple string columns into a single column with a given separator or delimiter.


pyspark.sql.functions.concat_ws(sep,*cols)

Below is an example of concat_ws() function.


from pyspark.sql.functions import concat_ws,col
df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Using concat_ws() function of Pypsark SQL concatenated three string input columns (firstname, middlename, lastname) into a single string column (Fullname) and separated each column with “_” separator.

Below is the output for for concat_ws() funtion of Pyspark sql.


----------------+----------+------+------+
|FullName        |dob       |gender|salary|
+----------------+----------+------+------+
|James__Smith    |1991-04-01|M     |3000  |
|Michael_Rose_   |2000-05-19|M     |4000  |
|Robert__Williams|1978-09-05|M     |4000  |
|Maria_Anne_Jones|1967-12-01|F     |4000  |
|Jen_Mary_Brown  |1980-02-17|F     |-1    |
+----------------+----------+------+------+

Complete PySpark Concatenate Columns Example

Below is complete example of how to merge multiple columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,concat_ws
spark=SparkSession.builder.appName("concate").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Conclusion

You have learned Pyspark functions concat() is used to concatenate multiple columns into a single column without a separator and, concat_ws() is used to concatenate with separator. You have also learned these two functions are available in pyspark.sql.functions module. Happy learning! Do comment in the comment section.

Reference

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions.

Happy Learning !!

Leave a Reply