Spark – How to Concatenate DataFrame columns

Using the concat() or concat_ws() Spark SQL functions, we can concatenate one or more DataFrame columns into a single column. In this article, you will learn how to use these functions, and also how to concatenate columns using raw SQL, with Scala examples.

Related: Concatenate PySpark (Python) DataFrame column

1. Preparing Data & DataFrame


// Preparing Data & DataFrame
  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.show(false)

Note that we need to import the implicits from the “spark” object, which is an instance of SparkSession, in order to use toDF() on a Seq collection. df.show() yields the below output.


// Output:
+-------+-----+--------+--------+------+------+
|fname  |mname|lname   |dob_year|gender|salary|
+-------+-----+--------+--------+------+------+
|James  |A    |Smith   |2018    |M     |3000  |
|Michael|Rose |Jones   |2010    |M     |4000  |
|Robert |K    |Williams|2010    |M     |4000  |
|Maria  |Anne |Jones   |2005    |F     |4000  |
|Jen    |Mary |Brown   |2010    |      |-1    |
+-------+-----+--------+--------+------+------+

2. Using concat() Function to Concatenate DataFrame Columns

Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column.

Syntax


// Using concat() Function to Concatenate DataFrame Columns 
concat(exprs: Column*): Column

It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, Boolean, and also array columns.


  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

This statement creates a “FullName” column by concatenating the fname, mname, and lname columns, separated by a comma delimiter. To add the delimiter, we used the lit() function. This yields output with just the concatenated column.


+------------------+
|FullName          |
+------------------+
|James,A,Smith     |
|Michael,Rose,Jones|
|Robert,K,Williams |
|Maria,Anne,Jones  |
|Jen,Mary,Brown    |
+------------------+
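As mentioned above, concat() is not limited to String columns. Below is a minimal sketch using the example schema; the explicit cast is a precaution, since Spark usually coerces the Int column to a string automatically:

```scala
// Sketch: concatenate a String column with an Int column.
// "fname" and "salary" are columns from the example DataFrame above.
df.select(concat(col("fname"), lit(" earns "), col("salary").cast("string"))
    .as("Summary"))
  .show(false)
// e.g. the first row becomes "James earns 3000"
```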

3. Using concat() Function on withColumn()

Let’s see another example using the concat() function with withColumn(); here we will add a new column FullName by concatenating the name columns.


// Using concat() Function on withColumn()
  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .show(false)

The above snippet also keeps the individual name columns; if you do not need them, you can drop them using the below statement.


  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

This yields the below output.


// Output:
+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+
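One behavior worth noting (standard Spark SQL semantics, not visible in the output above): concat() returns null if any of its inputs is null. If a middle name could be null, a coalesce() guard keeps the row usable; a sketch:

```scala
// coalesce() substitutes an empty string when mname is null,
// so the whole FullName does not collapse to null.
df.withColumn("FullName",
    concat(col("fname"), lit(','), coalesce(col("mname"), lit("")), lit(','), col("lname")))
  .show(false)
```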

4. Using concat_ws() Function to Concatenate with Delimiter

Adding a delimiter while concatenating DataFrame columns can be easily done using another function concat_ws().

Syntax


// Using concat_ws() Function to Concatenate with Delimiter
concat_ws(sep: String, exprs: Column*): Column

The concat_ws() function takes the delimiter as its first argument, followed by the columns that need to be concatenated.


  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

This yields the below output.


// Output:
+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+
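Unlike concat(), concat_ws() skips null inputs instead of propagating them (standard Spark SQL behavior), so no extra guard is needed; a sketch assuming mname may be null:

```scala
// With concat_ws, a null mname is simply dropped:
// ("Jen", null, "Brown") -> "Jen,Brown", not null and no doubled delimiter.
df.withColumn("FullName", concat_ws(",", col("fname"), col("mname"), col("lname")))
  .show(false)
```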

5. Using Raw SQL

If you are coming from a SQL background, don’t be disappointed: Spark SQL also provides a way to concatenate using raw SQL syntax. But in order to use it, you first need to create a temporary view using df.createOrReplaceTempView("EMP"). This creates a temporary view named "EMP".


// Using Raw SQL
  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

As the snippet shows, the CONCAT() function can also be used in raw SQL statements. This yields the below output.


+------------------+
|FullName          |
+------------------+
|James Smith A     |
|Michael Jones Rose|
|Robert Williams K |
|Maria Jones Anne  |
|Jen Brown Mary    |
+------------------+

So far we have been using the concat() function; now let’s see another way to concatenate strings, using the “||” operator.


  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)

This yields the same output as the above statement.
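If you would rather not create a temporary view at all, the same SQL expression can be passed to selectExpr() directly on the DataFrame; a minimal sketch:

```scala
// selectExpr accepts SQL expression strings directly on the DataFrame,
// so no temporary view is required.
df.selectExpr("fname || ' ' || lname || ' ' || mname as FullName")
  .show(false)
```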

6. Complete Example


package com.sparkbyexamples.spark.dataframe.functions.string

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, _}
object ConcatExample extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.printSchema()
  df.show(false)

  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)
}

This example is also available on GitHub for reference.

7. Conclusion

In this article, you have learned different ways to concatenate two or more string DataFrame columns into a single column using the Spark SQL concat() and concat_ws() functions, and finally how to concatenate by leveraging raw SQL syntax, along with several Scala examples. Hope you like it.

For more Spark SQL functions, please refer SQL Functions.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium