Spark – How to Concatenate DataFrame columns

Using concat() or concat_ws() Spark SQL functions we can concatenate one or more DataFrame columns into a single column, In this article, you will learn using these functions and also using raw SQL to concatenate columns with Scala example.

Related: Concatenate PySpark (Python) DataFrame column

1. Preparing Data & DataFrame


  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.sqlContext.implicits._
  val df = data.toDF(columns:_*)
  df.show(false)

Note that we need to import implicits on “spark” object which is an instance of SparkSession in order to use toDF() on Seq collection and df.show() yields below output.


+-------+-----+--------+--------+------+------+
|fname  |mname|lname   |dob_year|gender|salary|
+-------+-----+--------+--------+------+------+
|James  |A    |Smith   |2018    |M     |3000  |
|Michael|Rose |Jones   |2010    |M     |4000  |
|Robert |K    |Williams|2010    |M     |4000  |
|Maria  |Anne |Jones   |2005    |F     |4000  |
|Jen    |Mary |Brown   |2010    |      |-1    |
+-------+-----+--------+--------+------+------+

2. Using concat() Function to Concatenate DataFrame Columns

Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column.

Syntax


concat(exprs: Column*): Column

It can also take columns of different Data Types and concatenate them into a single column. for example, it supports String, Int, Boolean and also arrays.


  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

This statement creates “FullName” column by concatenating columns fname, mname, lname separating by delimiter comma. To add a delimiter, we have used lit() function. This yields output with just a concatenated column.


+------------------+
|FullName          |
+------------------+
|James,A,Smith     |
|Michael,Rose,Jones|
|Robert,K,Williams |
|Maria,Anne,Jones  |
|Jen,Mary,Brown    |
+------------------+

3. Using concat() Function on withColumn()

Let’s see another example using concat() function on withColumn(), here we will add a new column FullName by concatenating columns names.


  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .show(false)

The above snippet also keeps the individual names, if you do not need it you can drop them using the below statement.


  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

This yields the below output.


+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+

4. Using concat_ws() Function to Concatenate with Delimiter

Adding a delimiter while concatenating DataFrame columns can be easily done using another function concat_ws().

syntax


concat_ws(sep: String, exprs: Column*): Column

concat_ws() function takes the first argument as delimiter following with columns that need to concatenate.


  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

This yields the below output.


+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+

5. Using Raw SQL

If you are coming from SQL background, don’t get disappointed, Spark SQL also provides a way to concatenate using Raw SQL syntax. But In order to use this first you need to create a temporary view using df.createOrReplaceTempView("EMP"). This creates a temporary table "EMP".


  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

We can also use concat() function on the raw SQL statements.


+------------------+
|FullName          |
+------------------+
|James Smith A     |
|Michael Jones Rose|
|Robert Williams K |
|Maria Jones Anne  |
|Jen Brown Mary    |
+------------------+

So far we have been using concat() function, now let’s see another way to concatenate string using operator “||”.


  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)

This yields the same output as the above statement.

6. Complete Example


package com.sparkbyexamples.spark.dataframe.functions.string

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, _}
object ConcatExample extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.sqlContext.implicits._
  val df = data.toDF(columns:_*)
  df.printSchema()
  df.show(false)

  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)
}

This example is also available at GitHub for reference

Conclusion

In this article, you have learned different ways to concatenate two or more string Dataframe columns into a single column using Spark SQL concat() and concat_ws() functions and finally learned to concatenate by leveraging RAW SQL syntax along with several Scala examples. Hope you like it.

For more Spark SQL functions, please refer SQL Functions.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply