Spark – Add New Column & Multiple Columns to DataFrame

You can add a new column or multiple columns to a Spark DataFrame using the withColumn(), select(), and map() methods. In this article, I will explain how to add a new column derived from an existing column, add a constant or literal value, and finally add a list column to a DataFrame.

First, let’s create a simple DataFrame to work with.


  import spark.sqlContext.implicits._

  val data = Seq(("111",50000),("222",60000),("333",40000))
  val df = data.toDF("EmpId","Salary")
  df.show(false)

This yields the output below.


// Output:
+-----+------+
|EmpId|Salary|
+-----+------+
|111  |50000 |
|222  |60000 |
|333  |40000 |
+-----+------+

1. Using withColumn() to Add a New Column

withColumn() is used to add a new column or update an existing column on a DataFrame. Here, I will explain how to add a new column derived from an existing one. withColumn() takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column type.


  // Derive a new column from existing
  df.withColumn("CopiedColumn",col("salary")* -1)
    .show(false)

Here, we added a new column CopiedColumn by multiplying the existing Salary column by -1. This yields the output below.


// Output: 
+-----+------+------------+
|EmpId|Salary|CopiedColumn|
+-----+------+------------+
|111  |50000 |-50000      |
|222  |60000 |-60000      |
|333  |40000 |-40000      |
+-----+------+------------+
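
withColumn() can also update an existing column: pass a column name that already exists and Spark replaces its values instead of adding a new column. A minimal sketch (the 10% raise is just an illustrative expression):


  // Update the existing Salary column in place
  df.withColumn("Salary", col("Salary") * 1.1)
    .show(false)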

You can also add columns based on conditions; please refer to the Spark Case When and When Otherwise examples.

2. Using select() to Add a Column

The statement above can also be written using select() as shown below, and it yields the same output. You can also add multiple columns with a single select().


  // Using select
  df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
    .show(false)

You can also chain withColumn() calls to add multiple columns to a DataFrame, as shown below.
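
A minimal sketch (the column names NegSalary and DoubleSalary are just illustrative):


  // Chain withColumn() to add two columns in one statement
  df.withColumn("NegSalary", col("Salary") * -1)
    .withColumn("DoubleSalary", col("Salary") * 2)
    .show(false)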

3. Adding a Constant Column to DataFrame

Let’s create a new column with a constant value using the lit() SQL function. In the snippet below, we create a new column by adding the literal ‘1’ to the Spark DataFrame.


// Adding a Constant Column to DataFrame
// Note: lit("1") adds the string "1"; use lit(1) for an integer literal
val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
df2.show()

+-----+------+----------+
|EmpId|Salary|lit_value1|
+-----+------+----------+
|  111| 50000|         1|
|  222| 60000|         1|
|  333| 40000|         1|
+-----+------+----------+

Adding the same constant literal to every record in a DataFrame is rarely useful by itself, so let’s look at a conditional example.


// Add a column conditionally using when() and otherwise()
val df3 = df2.withColumn("lit_value2",
    when(col("Salary") >= 40000 && col("Salary") <= 50000,
       lit("100").cast(IntegerType))
      .otherwise(lit("200").cast(IntegerType))
  )
df3.show()
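
Since salaries 40000 and 50000 fall inside the 40000–50000 range, this should yield 100 for those rows and 200 only for the 60000 row:


// Output:
+-----+------+----------+----------+
|EmpId|Salary|lit_value1|lit_value2|
+-----+------+----------+----------+
|  111| 50000|         1|       100|
|  222| 60000|         1|       200|
|  333| 40000|         1|       100|
+-----+------+----------+----------+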

4. Adding a List Column to DataFrame

The following example shows how to create new columns with collection values using the typedLit() SQL function; unlike lit(), typedLit() supports Scala collection types. In the snippet below, we add new columns containing Seq(1, 2, 3), Map("a" -> 1, "b" -> 2), and the structure ("a", 2, 1.0) to the Spark DataFrame.


val df4 = df3.withColumn("typedLit_seq",typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map",typedLit(Map("a" -> 1, "b" -> 2)))
    .withColumn("typedLit_struct",typedLit(("a", 2, 1.0)))
df4.printSchema()
df4.show()

df4.printSchema() outputs the schema below.


root
 |-- EmpId: string (nullable = true)
 |-- Salary: integer (nullable = false)
 |-- lit_value1: string (nullable = false)
 |-- lit_value2: integer (nullable = true)
 |-- typedLit_seq: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- typedLit_map: map (nullable = false)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- typedLit_struct: struct (nullable = false)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: double (nullable = false)

df4.show() outputs the DataFrame below.


// Output:
+-----+------+----------+----------+------------+----------------+---------------+
|EmpId|Salary|lit_value1|lit_value2|typedLit_seq|    typedLit_map|typedLit_struct|
+-----+------+----------+----------+------------+----------------+---------------+
|  111| 50000|         1|       100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  222| 60000|         1|       200|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  333| 40000|         1|       100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
+-----+------+----------+----------+------------+----------------+---------------+

5. Add Multiple Columns using Map

You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, you can do so easily by chaining withColumn() or using select(). However, sometimes you may need to derive multiple columns after applying transformations; in that case you can use either map() or foldLeft(). Let’s see an example with map() first; a foldLeft() sketch follows it.

I don’t have a real-world scenario for adding multiple columns this way, so below is just a skeleton of how to use it. I will update this once I have a real-world example.


// Add multiple columns using map()
// Let's assume the DataFrame has just 3 string columns: c1, c2, c3
val dfMapped = df.map(row => {
  // Read the existing column values from the Row
  val c1 = row.getString(0)
  val c2 = row.getString(1)
  val c3 = row.getString(2)
  // Apply transformations to derive the new column values
  val c4 = c1 + "_" + c2   // e.g. a concatenation
  val c5 = c3.toUpperCase  // e.g. a string transformation
  (c1, c2, c3, c4, c5)
}).toDF("c1", "c2", "c3", "c4", "c5")

If you look closely at the above snippet, the DataFrame has 3 columns, and we derive multiple columns from the existing ones by applying transformations; the transformation can be a split() function or any custom UDF. You can also drop an existing column by simply leaving it out of the returned tuple.
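
As an alternative to map(), below is a minimal foldLeft() sketch, assuming the new columns can be expressed as (name, expression) pairs up front; the names SalaryNegated and SalaryDouble are purely illustrative:


// Add multiple columns by folding over (name, expression) pairs
val newCols = Seq(
  "SalaryNegated" -> (col("Salary") * -1),
  "SalaryDouble"  -> (col("Salary") * 2)
)
val dfFolded = newCols.foldLeft(df) { case (tmpDf, (name, expr)) =>
  tmpDf.withColumn(name, expr)
}
dfFolded.show(false)

Because foldLeft() stays on the untyped Column API, it avoids the Encoder and the full row rewrite that map() requires.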

6. Source Code to Add Multiple Columns


// Source Code to Add Multiple Columns
package com.sparkbyexamples.spark.dataframe.functions
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, typedLit, when}
import org.apache.spark.sql.types.IntegerType
object AddColumn extends App {
  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()
  import spark.sqlContext.implicits._
  val data = Seq(("111",50000),("222",60000),("333",40000))
  val df = data.toDF("EmpId","Salary")
  df.show(false)
  // Derive a new column from existing
  df.withColumn("CopiedColumn",df("salary")* -1)
    .show(false)
  // Using select
  df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
    .show(false)
  // Adding a literal
  val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
  df2.show()
  
  val df3 = df2.withColumn("lit_value2",
    when(col("Salary") >= 40000 && col("Salary") <= 50000,
      lit("100").cast(IntegerType))
      .otherwise(lit("200").cast(IntegerType))
  )
  df3.show()

  val df4 = df3.withColumn("typedLit_seq", typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map", typedLit(Map("a" -> 1, "b" -> 2)))
    .withColumn("typedLit_struct", typedLit(("a", 2, 1.0)))
  df4.printSchema()
  df4.show()
}

The complete source code is available at the GitHub project.

7. Conclusion

In this article, you have learned how to add a new column and multiple columns to a Spark DataFrame using the withColumn(), select(), lit(), typedLit(), map(), and foldLeft() functions, with working Scala examples.

Happy Learning !!
