Post last modified: October 5, 2023

pyspark.sql.functions provides two functions, concat() and concat_ws(), to concatenate multiple DataFrame columns into a single column. In this article, I will explain the differences between concat() and concat_ws() (concat with separator) with examples.

PySpark Concatenate Using concat()

The concat() function of PySpark SQL is used to concatenate multiple DataFrame columns into a single column. It can be used on string, binary, and compatible array columns.


pyspark.sql.functions.concat(*cols)

Below is an example of using the PySpark concat() function inside select(). select() is a transformation in PySpark that returns a new DataFrame with the selected columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col

# Create a SparkSession; all examples below use this `spark` object
spark = SparkSession.builder.appName("concat").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

In the above example, using the concat() function of PySpark SQL, I have concatenated three input string columns (firstname, middlename, lastname) into a single string column (FullName). Below is the output of the above example.


+--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

PySpark concat_ws() Usage

The concat_ws() function of PySpark concatenates multiple string columns into a single column with a given separator or delimiter.


pyspark.sql.functions.concat_ws(sep,*cols)

Below is an example of the concat_ws() function.


from pyspark.sql.functions import concat_ws,col
df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Using the concat_ws() function of PySpark SQL, I have concatenated three string input columns (firstname, middlename, lastname) into a single string column (FullName) and separated each value with the "_" separator.

Below is the output of the concat_ws() function.


+----------------+----------+------+------+
|FullName        |dob       |gender|salary|
+----------------+----------+------+------+
|James__Smith    |1991-04-01|M     |3000  |
|Michael_Rose_   |2000-05-19|M     |4000  |
|Robert__Williams|1978-09-05|M     |4000  |
|Maria_Anne_Jones|1967-12-01|F     |4000  |
|Jen_Mary_Brown  |1980-02-17|F     |-1    |
+----------------+----------+------+------+

Complete PySpark Concatenate Columns Example

Below is a complete example of how to merge multiple columns.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws
spark = SparkSession.builder.appName("concat").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Conclusion

You have learned that the PySpark function concat() concatenates multiple columns into a single column without a separator, while concat_ws() concatenates them with a separator. You have also learned that both functions are available in the pyspark.sql.functions module.

Happy Learning !!

Prabha

Prabha is an accomplished data engineer with a wealth of experience in architecting, developing, and optimizing data pipelines and infrastructure. With a strong foundation in software engineering and a deep understanding of data systems, Prabha excels in building scalable solutions that handle diverse and large datasets efficiently. At SparkbyExamples.com Prabha writes her experience in Spark, PySpark, Python and Pandas.
