The pyspark.sql.functions module provides two functions, concat() and concat_ws(), to concatenate multiple DataFrame columns into a single column. In this article, I will explain the differences between concat() and concat_ws() (concat with separator) with examples.
PySpark Concatenate Using concat()
The concat() function of PySpark SQL concatenates multiple DataFrame columns into a single column. It works on string, binary, and compatible array columns.
pyspark.sql.functions.concat(*cols)
Below is an example of using the PySpark concat() function inside select(). select() is a transformation function in PySpark and returns a new DataFrame with the selected columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col

# Create a SparkSession (required before creating a DataFrame)
spark = SparkSession.builder.appName("ConcatExample").getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df2 = df.select(concat(df.firstname, df.middlename, df.lastname)
                .alias("FullName"), "dob", "gender", "salary")
df2.show(truncate=False)
In the above example, the concat() function of PySpark SQL concatenates three input string columns (firstname, middlename, lastname) into a single string column (FullName). Below is the output from the above example.
+--------------+----------+------+------+
|FullName |dob |gender|salary|
+--------------+----------+------+------+
|JamesSmith |1991-04-01|M |3000 |
|MichaelRose |2000-05-19|M |4000 |
|RobertWilliams|1978-09-05|M |4000 |
|MariaAnneJones|1967-12-01|F |4000 |
|JenMaryBrown |1980-02-17|F |-1 |
+--------------+----------+------+------+
PySpark concat_ws() Usage
The concat_ws() function of PySpark concatenates multiple string columns into a single column with a given separator (delimiter).
pyspark.sql.functions.concat_ws(sep,*cols)
Below is an example of the concat_ws() function, applied to the same df as above.
from pyspark.sql.functions import concat_ws, col
df3 = df.select(concat_ws('_', df.firstname, df.middlename, df.lastname)
                .alias("FullName"), "dob", "gender", "salary")
df3.show(truncate=False)
Using the concat_ws() function of PySpark SQL, the three string input columns (firstname, middlename, lastname) are concatenated into a single string column (FullName), with each column separated by the "_" separator. Below is the output of the concat_ws() example.
+----------------+----------+------+------+
|FullName |dob |gender|salary|
+----------------+----------+------+------+
|James__Smith |1991-04-01|M |3000 |
|Michael_Rose_ |2000-05-19|M |4000 |
|Robert__Williams|1978-09-05|M |4000 |
|Maria_Anne_Jones|1967-12-01|F |4000 |
|Jen_Mary_Brown |1980-02-17|F |-1 |
+----------------+----------+------+------+
Complete PySpark Concatenate Columns Example
Below is a complete example of merging multiple columns into one.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,concat_ws
spark = SparkSession.builder.appName("Concatenate").getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)

df2 = df.select(concat(df.firstname, df.middlename, df.lastname)
                .alias("FullName"), "dob", "gender", "salary")
df2.show(truncate=False)

df3 = df.select(concat_ws('_', df.firstname, df.middlename, df.lastname)
                .alias("FullName"), "dob", "gender", "salary")
df3.show(truncate=False)
Conclusion
You have learned that the PySpark concat() function concatenates multiple columns into a single column without a separator, while concat_ws() concatenates with a separator. You have also learned that both functions are available in the pyspark.sql.functions module. Happy learning! Do comment in the comment section.
Reference
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
Note: a null value is not treated as an empty string. If one of the input columns is null, concat() returns null for that row, whereas concat_ws() ignores null inputs and joins the remaining values.