Spark – Add New Column & Multiple Columns to DataFrame

You can add a new column or multiple columns to a Spark DataFrame using the withColumn(), select(), and map() methods. In this article, I will explain how to derive a new column from an existing column, add a constant or literal value, and finally add a list column to a DataFrame.

First, let’s create a simple DataFrame to work with.


  import spark.sqlContext.implicits._

  val data = Seq(("111",50000),("222",60000),("333",40000))
  val df = data.toDF("EmpId","Salary")
  df.show(false)

This yields the below output.


+-----+------+
|EmpId|Salary|
+-----+------+
|111  |50000 |
|222  |60000 |
|333  |40000 |
+-----+------+

Using withColumn() to Add a New Column

withColumn() is used to add a new column or update an existing column on a DataFrame. Here, I will explain how to add a new column derived from an existing one. The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column type.


  //Derive a new column from existing
  df.withColumn("CopiedColumn",col("salary")* -1)
    .show(false)

Here, we have added a new column CopiedColumn by multiplying the existing column Salary by -1. This yields the below output.


+-----+------+------------+
|EmpId|Salary|CopiedColumn|
+-----+------+------------+
|111  |50000 |-50000      |
|222  |60000 |-60000      |
|333  |40000 |-40000      |
+-----+------+------------+

You can also add columns based on conditions; please refer to the Spark Case When and When Otherwise examples.

Using Select to Add Column

The above statement can also be written using select() as shown below, and it yields the same output as above. You can also add multiple columns using select().


  //Using select
  df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
    .show(false)

You can chain withColumn() to add multiple columns to DataFrame.
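For example, a minimal sketch of chaining withColumn() calls (the Bonus and TotalPay column names and the 10% figure are illustrative assumptions, not from the original example):

```scala
import org.apache.spark.sql.functions.col

// Each withColumn() call returns a new DataFrame, so calls compose naturally.
// Hypothetical derived columns: a 10% bonus and the resulting total pay.
val df5 = df
  .withColumn("Bonus", col("Salary") * 0.1)
  .withColumn("TotalPay", col("Salary") + col("Salary") * 0.1)
df5.show(false)
```

Note that every withColumn() call re-projects the whole DataFrame, so when adding many columns a single select() is usually the more efficient choice.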

Adding a Constant Column to DataFrame

Let's create a new column with a constant value using the lit() SQL function. In the below snippet, we are creating a new column by adding the literal '1' to the Spark DataFrame.


val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
df2.show()

+-----+------+----------+
|EmpId|Salary|lit_value1|
+-----+------+----------+
|  111| 50000|         1|
|  222| 60000|         1|
|  333| 40000|         1|
+-----+------+----------+

Adding the same constant literal to all records in a DataFrame may not be very useful in practice, so let's see another example.


val df3 = df2.withColumn("lit_value2",
    when(col("Salary") >=40000 && col("Salary") <= 50000,
       lit("100").cast(IntegerType))
      .otherwise(lit("200").cast(IntegerType))
  )
  df3.show()

Adding a List Column to DataFrame

The following example shows how to create a new column with a collection using the typedLit() SQL function. In the below snippet, we are adding new columns containing Seq(1, 2, 3), Map("a" -> 1, "b" -> 2), and the structure ("a", 2, 1.0) to the Spark DataFrame.


val df4 = df3.withColumn("typedLit_seq",typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map",typedLit(Map("a" -> 1, "b" -> 2)))
    .withColumn("typedLit_struct",typedLit(("a", 2, 1.0)))
df4.printSchema()
df4.show()

df4.printSchema() outputs the below schema.


root
 |-- EmpId: string (nullable = true)
 |-- Salary: integer (nullable = false)
 |-- lit_value1: string (nullable = false)
 |-- lit_value2: integer (nullable = true)
 |-- typedLit_seq: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- typedLit_map: map (nullable = false)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- typedLit_struct: struct (nullable = false)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: double (nullable = false)

df4.show() outputs the below DataFrame.


+-----+------+----------+----------+------------+----------------+--------------+
|EmpId|Salary|lit_value1|lit_value2|typedLit_seq|    typedLit_map|typedLit_struct|
+-----+------+----------+----------+------------+----------------+--------------+
|  111| 50000|         1|       200|   [1, 2, 3]|[a -> 1, b -> 2]|   [a, 2, 1.0]|
|  222| 60000|         1|       200|   [1, 2, 3]|[a -> 1, b -> 2]|   [a, 2, 1.0]|
|  333| 40000|         1|       200|   [1, 2, 3]|[a -> 1, b -> 2]|   [a, 2, 1.0]|
+-----+------+----------+----------+------------+----------------+--------------+

Add Multiple Columns using Map

You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, you can easily do so by chaining withColumn() or using select(). However, sometimes you may need to add multiple columns after applying transformations; in that case, you can use either map() or foldLeft(). Let's see an example with map().

I don't have a real-time scenario for adding multiple columns this way; below is just a skeleton of how to use it. I will update this once I have a complete Scala example.


//Let's assume df has just 3 columns: c1, c2, c3
val df2 = df.map(row => {
  //apply transformations on these columns, derive multiple new columns,
  //and store the derived column values into c4, c5, c6, and so on
  (c1, c2, c3, c4, c5, c6)
})

If you look closely at the above snippet, the DataFrame has 3 columns and we are deriving multiple columns dynamically from the existing ones by applying transformations; this can be a split() function or any custom UDF, and finally you can drop an existing column.
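Since foldLeft() was mentioned as an alternative, here is a minimal sketch of it (the columnsToNegate list and the "_negated" suffix are illustrative assumptions, not from the original example):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Add one derived column per name in a runtime-known list by folding over it.
// The DataFrame returned by each withColumn() call becomes the accumulator
// for the next step, so all columns are added in a single expression.
val columnsToNegate = Seq("Salary")   // hypothetical: columns to transform
val dfNegated: DataFrame = columnsToNegate.foldLeft(df) { (acc, c) =>
  acc.withColumn(c + "_negated", col(c) * -1)
}
dfNegated.show(false)
```

foldLeft() is handy when the set of columns to add is only known at runtime, where chaining withColumn() calls by hand is not possible.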

Source Code to Add Multiple Columns


package com.sparkbyexamples.spark.dataframe.functions

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, typedLit, when}
import org.apache.spark.sql.types.IntegerType

object AddColumn extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  import spark.sqlContext.implicits._

  val data = Seq(("111",50000),("222",60000),("333",40000))
  val df = data.toDF("EmpId","Salary")
  df.show(false)

  //Derive a new column from existing
  df.withColumn("CopiedColumn",df("salary")* -1)
    .show(false)

  //Using select
  df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
    .show(false)

  //Adding a literal
  val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
  df2.show()
  
  val df3 = df2.withColumn("lit_value2",
    when(col("Salary") >=40000 && col("Salary") <= 50000, lit("100").cast(IntegerType))
      .otherwise(lit("200").cast(IntegerType))
  )
  df3.show(false)

  //Adding a list column
  val df4 = df3.withColumn("typedLit_seq",typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map",typedLit(Map("a" -> 1, "b" -> 2)))
    .withColumn("typedLit_struct",typedLit(("a", 2, 1.0)))

  df4.printSchema()
  df4.show()
}

The complete source code is available at GitHub project

Conclusion

In this article, you have learned how to add a new column and multiple columns to a Spark DataFrame using the withColumn(), select(), lit(), typedLit(), and map() functions, with working Scala examples.

Happy Learning !!

NNK
