Post last modified: October 5, 2023

pyspark.sql.functions provides two functions, concat() and concat_ws(), to concatenate multiple DataFrame columns into a single column. In this article, I will explain the differences between concat() and concat_ws() (concat with separator) with examples.

PySpark Concatenate Using concat()

The concat() function of PySpark SQL is used to concatenate multiple DataFrame columns into a single column. It can be used on string, binary, and compatible array columns.


pyspark.sql.functions.concat(*cols)

Below is an example of using the PySpark concat() function inside select(). select() is a transformation in PySpark that returns a new DataFrame with the selected columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col

# Create a SparkSession; all examples below use this `spark` object
spark = SparkSession.builder.appName("concat").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

In the above example, using the concat() function of PySpark SQL, I have concatenated three input string columns (firstname, middlename, lastname) into a single string column (FullName). Below is the output of the above example.


+--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

PySpark concat_ws() Usage

The concat_ws() function of PySpark concatenates multiple string columns into a single column with a given separator or delimiter.


pyspark.sql.functions.concat_ws(sep,*cols)

Below is an example of the concat_ws() function.


from pyspark.sql.functions import concat_ws,col
df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Using the concat_ws() function of PySpark SQL, I have concatenated three string input columns (firstname, middlename, lastname) into a single string column (FullName) and separated each value with the "_" separator.

Below is the output of the concat_ws() function.


+----------------+----------+------+------+
|FullName        |dob       |gender|salary|
+----------------+----------+------+------+
|James__Smith    |1991-04-01|M     |3000  |
|Michael_Rose_   |2000-05-19|M     |4000  |
|Robert__Williams|1978-09-05|M     |4000  |
|Maria_Anne_Jones|1967-12-01|F     |4000  |
|Jen_Mary_Brown  |1980-02-17|F     |-1    |
+----------------+----------+------+------+

Complete PySpark Concatenate Columns Example

Below is a complete example of how to merge multiple columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws
spark = SparkSession.builder.appName("concat").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Conclusion

You have learned that the PySpark function concat() concatenates multiple columns into a single column without a separator, while concat_ws() concatenates them with a separator. You have also learned that both functions are available in the pyspark.sql.functions module.

Happy Learning !!

Prabha

Prabha is an accomplished data engineer with a wealth of experience in architecting, developing, and optimizing data pipelines and infrastructure. With a strong foundation in software engineering and a deep understanding of data systems, Prabha excels in building scalable solutions that handle diverse and large datasets efficiently. At SparkbyExamples.com Prabha writes her experience in Spark, PySpark, Python and Pandas.
