
In PySpark, the concat() function is used to concatenate multiple string columns into a single column without any separator. It joins the values of two or more columns or string expressions directly, producing a new string column.


Both concat() and concat_ws() belong to the pyspark.sql.functions module and are often used for combining multiple string columns into one. However, unlike concat_ws(), the concat() function does not include a separator between values and does not skip null values automatically.

In this article, we’ll explore how the concat() function works, how it differs from concat_ws(), and several use cases such as merging multiple columns, adding fixed strings, handling null values, and using it in SQL queries.

Key Points

  • You can use concat() to merge multiple columns or string expressions into a single string column.
  • Unlike concat_ws(), it does not add any separator between values.
  • If any input value in a row is null, the concatenated result for that row is also null.
  • You can combine column values with fixed strings for formatting.
  • Works seamlessly with both DataFrame API and Spark SQL.
  • Commonly used for generating IDs, full names, or concatenated keys without separators (see the short sketch after this list).
  • To handle nulls automatically, prefer using concat_ws() instead.
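For instance, here is a minimal, self-contained sketch of the last use case, building a separator-free key from two columns. The SparkSession name, sample data, and column names below are illustrative only:


# Sketch: a separator-free key built with concat()
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col

spark = SparkSession.builder.appName("concat-key-sketch").getOrCreate()

df_key = spark.createDataFrame(
    [("James", "1991-04-01"), ("Maria", "1967-12-01")],
    ["firstname", "dob"]
)

# Join the two columns directly, with no separator
df_key.select(concat(col("firstname"), col("dob")).alias("key")).show(truncate=False)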

PySpark concat() Function

The concat() function merges multiple input string columns into a single string column without any separator. It returns a column containing the concatenated values in order.

Syntax

Following is the syntax of the concat() function.


# Syntax of concat()
concat(*cols)

Parameters

  • *cols : (string or Column)
    One or more column names or column expressions to concatenate.

Return Value

Returns a single string column that joins all specified input columns or string expressions without any separator. If any input value in a row is null, the result for that row becomes null.

We’ll use the following sample DataFrame throughout the examples:


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, col

# Create SparkSession
spark = SparkSession.builder.appName("sparkbyexamples").getOrCreate()

# Sample Data
data = [
    ('James','','Smith','1991-04-01','M',3000),
    ('Michael','Rose','','2000-05-19','M',4000),
    ('Robert','','Williams','1978-09-05','M',4000),
    ('Maria','Anne','Jones','1967-12-01','F',4000),
    ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

Yields the output below.


# Output:
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob       |gender|salary|
+---------+----------+--------+----------+------+------+
|James    |          |Smith   |1991-04-01|M     |3000  |
|Michael  |Rose      |        |2000-05-19|M     |4000  |
|Robert   |          |Williams|1978-09-05|M     |4000  |
|Maria    |Anne      |Jones   |1967-12-01|F     |4000  |
|Jen      |Mary      |Brown   |1980-02-17|F     |-1    |
+---------+----------+--------+----------+------+------+

Concatenate Multiple Columns using concat()

You can use the concat() function to join multiple columns directly into one string without any separator.


from pyspark.sql.functions import concat

# Concatenate multiple columns
df_concat = df.select(
    concat(df.firstname, df.middlename, df.lastname).alias("FullName"),
    "dob", "gender", "salary"
)
df_concat.show(truncate=False)

Yields the output below.


# Output:
+--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

Null Values with concat()

By default, if any value involved in the concatenation is null, concat() returns null for that row.


# Null Values with concat()
data = [
    ("James", None, "Smith"),
    ("Michael", "Rose", None),
    ("Robert", None, None)
]
columns = ["firstname", "middlename", "lastname"]

df_null = spark.createDataFrame(data, columns)

df_null_concat = df_null.select(
    concat("firstname", "middlename", "lastname").alias("FullName")
)
df_null_concat.show(truncate=False)

Yields the output below.


# Output:
+--------+
|FullName|
+--------+
|NULL    |
|NULL    |
|NULL    |
+--------+

To ignore nulls during concatenation, use concat_ws() instead.
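For example, here is a brief sketch of two common workarounds, reusing the df_null DataFrame above: switch to concat_ws(), or keep concat() and replace nulls first with coalesce() (the empty-string fallback shown here is just one possible choice):


# Sketch: two ways to avoid null results (reuses df_null from above)
from pyspark.sql.functions import concat, concat_ws, coalesce, col, lit

df_null_safe = df_null.select(
    # concat_ws() skips null values automatically
    concat_ws("", "firstname", "middlename", "lastname").alias("with_concat_ws"),
    # concat() also works if nulls are replaced first, e.g. with coalesce()
    concat(
        coalesce(col("firstname"), lit("")),
        coalesce(col("middlename"), lit("")),
        coalesce(col("lastname"), lit(""))
    ).alias("with_coalesce")
)
df_null_safe.show(truncate=False)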

Add Fixed Strings using concat()

You can include fixed strings (like separators, prefixes, or suffixes) inside the concat() function by wrapping them with lit() from pyspark.sql.functions.


from pyspark.sql.functions import lit

# Add fixed string between columns
df_fixed = df.select(
    concat(col("firstname"), lit(" "), col("lastname")).alias("FullNameWithSpace"),
    "gender"
)
df_fixed.show(truncate=False)

Yields the output below.


# Output:
+-----------------+------+
|FullNameWithSpace|gender|
+-----------------+------+
|James Smith      |M     |
|Michael          |M     |
|Robert Williams  |M     |
|Maria Jones      |F     |
|Jen Brown        |F     |
+-----------------+------+

Use concat() in SQL Queries

You can register your DataFrame as a temporary SQL view and use concat directly in SQL SELECT statements to merge multiple columns into one string.


# Use concat() in SQL Queries
# Register DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Use concat() in SQL
df_sql = spark.sql("""
    SELECT concat(firstname, middlename, lastname) AS FullName, dob, gender, salary
    FROM people
""")
df_sql.show(truncate=False)

Yields the output below.


# Output:
+--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

PySpark concat() vs concat_ws()

Both functions are used for concatenating string columns, but they differ in handling separators and null values.

Feature       | concat()                           | concat_ws()
--------------|------------------------------------|--------------------------------------------------------
Separator     | No separator                       | Adds specified separator
Null Handling | Returns null if any column is null | Ignores null values
Use Case      | When joining raw strings           | When you need a delimiter like space, comma, or dash

For example:


# concat() vs concat_ws()
from pyspark.sql.functions import concat_ws

df_compare = df.select(
    concat(df.firstname, df.middlename, df.lastname).alias("concat_output"),
    concat_ws(" ", df.firstname, df.middlename, df.lastname).alias("concat_ws_output")
)
df_compare.show(truncate=False)

Yields the output below.


# Output:
+--------------+----------------+
|concat_output |concat_ws_output|
+--------------+----------------+
|JamesSmith    |James  Smith    |
|MichaelRose   |Michael Rose    |
|RobertWilliams|Robert  Williams|
|MariaAnneJones|Maria Anne Jones|
|JenMaryBrown  |Jen Mary Brown  |
+--------------+----------------+

Note that the empty middlename values in the sample data are empty strings, not nulls, so concat_ws() keeps them; that is why rows such as James Smith and Robert Williams show a double space in the concat_ws_output column.

Frequently Asked Questions of PySpark concat()

What is the concat function in PySpark?

The concat function in PySpark is used to combine multiple string columns or expressions into a single column. It merges values directly without adding any separator between them.

What happens if one of the columns has a null value?

If any of the values involved in the concatenation is null, the result for that row will be null. This is a key difference between concat and concat_ws, as concat_ws automatically skips nulls.

What type of data can be concatenated using concat?

You can concatenate string columns, string literals, or column expressions. For numeric columns, you should first cast them to string before using concat.
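As a quick sketch, reusing the df DataFrame created earlier (whose salary column is numeric):


# Sketch: cast a numeric column to string before concatenating (reuses df)
from pyspark.sql.functions import concat, col, lit

df_cast = df.select(
    concat(col("firstname"), lit("-"), col("salary").cast("string")).alias("name_salary")
)
df_cast.show(truncate=False)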

How does concat differ from concat_ws in PySpark?

concat merges multiple columns directly, whereas concat_ws allows you to include a custom separator such as space, comma, or hyphen between values. Additionally, concat_ws automatically ignores null values.

How can you include fixed strings like spaces or commas when using concat?

You can include fixed strings by wrapping them with the lit() function. For example, to add a space between two names, you can use lit(" ") inside concat.

How can concat be used inside SQL queries?

You can register your DataFrame as a temporary SQL view and use concat directly in SQL SELECT statements to merge multiple columns into one string.

Conclusion

In this article, you learned how to concatenate multiple columns into a single string using PySpark’s concat() function.
While it joins columns directly without separators, it does not handle null values automatically. To manage separators and nulls effectively, you can use concat_ws() instead.

By combining concat() with lit() and conditional expressions, you can easily format and generate clean string outputs for IDs, labels, or full names.
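As a final illustrative sketch, here is concat() combined with lit() and a when()/otherwise() conditional, reusing the df created earlier (the label format itself is arbitrary):


# Sketch: concat() with lit() and a conditional expression (reuses df)
from pyspark.sql.functions import concat, lit, when, col

df_label = df.select(
    concat(
        col("firstname"), lit(" "), col("lastname"),
        when(col("salary") >= 4000, lit(" (senior)")).otherwise(lit(" (junior)"))
    ).alias("label")
)
df_label.show(truncate=False)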

Happy Learning!!
