Adding a new column or multiple columns to a Spark DataFrame can be done using the withColumn(), select(), and map() methods of DataFrame. In this article, I will explain how to add a new column derived from an existing column, add a column with a constant or literal value, and finally add a list column to a DataFrame.
First, let’s create a simple DataFrame to work with.
import spark.sqlContext.implicits._
val data = Seq(("111",50000),("222",60000),("333",40000))
val df = data.toDF("EmpId","Salary")
df.show(false)
This yields the output below.
+-----+------+
|EmpId|Salary|
+-----+------+
|111 |50000 |
|222 |60000 |
|333 |40000 |
+-----+------+
Using withColumn() to Add a New Column
withColumn() is used to add a new column or update an existing column on a DataFrame; here, I will just explain how to add a new column derived from an existing one. The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column type.
//Derive a new column from existing
df.withColumn("CopiedColumn",col("salary")* -1)
.show(false)
Here, we have added a new column CopiedColumn by multiplying the existing column Salary by -1. This yields the below output.
+-----+------+------------+
|EmpId|Salary|CopiedColumn|
+-----+------+------------+
|111 |50000 |-50000 |
|222 |60000 |-60000 |
|333 |40000 |-40000 |
+-----+------+------------+
You can also add columns based on conditions using when() and otherwise(); see the lit_value2 example below, and refer to Spark Case When and When Otherwise examples for more.
Using select() to Add a Column
The above statement can also be written using select() as shown below, and it yields the same output as above. You can also add multiple columns in a single select(), as shown after this snippet.
//Using select
df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
.show(false)
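Since select() accepts any number of column expressions, you can derive several columns in one call. Below is a minimal sketch; the Bonus column and the 0.1 multiplier are just for illustration.
//Add multiple derived columns in a single select()
df.select($"EmpId", $"Salary",
    ($"Salary" * -1).as("CopiedColumn"),
    ($"Salary" * 0.1).as("Bonus"))
  .show(false)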
You can also chain withColumn() to add multiple columns to a DataFrame, as shown below.
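A minimal sketch of chaining; the SalaryDoubled column is just for illustration.
//Chain withColumn() calls to add more than one column
df.withColumn("CopiedColumn", col("Salary") * -1)
  .withColumn("SalaryDoubled", col("Salary") * 2)
  .show(false)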
Adding a Constant Column to DataFrame
Let’s create a new column with a constant value using the lit() SQL function. On the below snippet, we are creating a new column by adding the literal ‘1’ to the Spark DataFrame.
val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
df2.show()
+-----+------+----------+
|EmpId|Salary|lit_value1|
+-----+------+----------+
| 111| 50000| 1|
| 222| 60000| 1|
| 333| 40000| 1|
+-----+------+----------+
Adding the same constant literal to all records in a DataFrame may not be very useful in real scenarios, so let’s see another example.
val df3 = df2.withColumn("lit_value2",
when(col("Salary") >=40000 && col("Salary") <= 50000,
lit("100").cast(IntegerType))
.otherwise(lit("200").cast(IntegerType))
)
df3.show()
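For the sample data, the Salary values 50000 and 40000 fall within the 40000 to 50000 range and get 100, while 60000 gets 200:
+-----+------+----------+----------+
|EmpId|Salary|lit_value1|lit_value2|
+-----+------+----------+----------+
|  111| 50000|         1|       100|
|  222| 60000|         1|       200|
|  333| 40000|         1|       100|
+-----+------+----------+----------+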
Adding a List Column to DataFrame
The following example shows how to create a new column with a collection using the typedLit() SQL function. On the below snippet, we are adding new columns from Seq(1, 2, 3), Map("a" -> 1, "b" -> 2), and the structure ("a", 2, 1.0) to the Spark DataFrame.
val df4 = df3.withColumn("typedLit_seq",typedLit(Seq(1, 2, 3)))
.withColumn("typedLit_map",typedLit(Map("a" -> 1, "b" -> 2)))
.withColumn("typedLit_struct",typedLit(("a", 2, 1.0)))
df4.printSchema()
df4.show()
df4.printSchema() outputs the schema below.
root
|-- EmpId: string (nullable = true)
|-- Salary: integer (nullable = false)
|-- lit_value1: string (nullable = false)
|-- lit_value2: integer (nullable = true)
|-- typedLit_seq: array (nullable = false)
| |-- element: integer (containsNull = false)
|-- typedLit_map: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- typedLit_struct: struct (nullable = false)
| |-- _1: string (nullable = true)
| |-- _2: integer (nullable = false)
| |-- _3: double (nullable = false)
df4.show() outputs the DataFrame below.
+-----+------+----------+----------+------------+----------------+---------------+
|EmpId|Salary|lit_value1|lit_value2|typedLit_seq|    typedLit_map|typedLit_struct|
+-----+------+----------+----------+------------+----------------+---------------+
|  111| 50000|         1|       100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  222| 60000|         1|       200|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  333| 40000|         1|       100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
+-----+------+----------+----------+------------+----------------+---------------+
Add Multiple Columns Using map()
You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, you can easily do so by chaining withColumn() or using select(). However, sometimes you may need to add multiple columns after applying some transformations; in that case, you can use either map() or foldLeft(). Let’s see an example with map() first; a foldLeft() sketch follows it.
I don’t have a real-world scenario for adding multiple columns this way, so below is just a skeleton of how to use it. I will update this once I have a complete Scala example.
//Let's assume df has just 3 string columns c1, c2, c3
val df2 = df.map(row => {
  val (c1, c2, c3) = (row.getString(0), row.getString(1), row.getString(2))
  //apply transformations on these columns and derive new column values
  val c4 = c1 + "_" + c2      //for example, concatenation
  val c5 = c3.toUpperCase     //for example, a case change
  (c1, c2, c3, c4, c5)
}).toDF("c1", "c2", "c3", "c4", "c5")
If you look closely at the above snippet, the DataFrame has 3 columns and we derive additional columns dynamically from the existing ones by applying transformations; these can be split() calls, custom UDFs, or any other logic, and you can finally drop the existing columns you no longer need.
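Alternatively, you can use foldLeft() to fold a list of (column name, expression) pairs over the DataFrame. Below is a minimal sketch using the same sample data; the SalaryNegated and SalaryDoubled columns are just for illustration.
//Add several derived columns by folding over (name, expression) pairs
val newCols = Seq(
  ("SalaryNegated", col("Salary") * -1),
  ("SalaryDoubled", col("Salary") * 2)
)
val df5 = newCols.foldLeft(df) { case (tmpDf, (name, expr)) =>
  tmpDf.withColumn(name, expr)
}
df5.show(false)
Because foldLeft() threads the intermediate DataFrame through each step, it keeps the code compact when the list of new columns is built dynamically.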
Source Code to Add Multiple Columns
package com.sparkbyexamples.spark.dataframe.functions
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, typedLit, when}
import org.apache.spark.sql.types.IntegerType
object AddColumn extends App {
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
import spark.sqlContext.implicits._
val data = Seq(("111",50000),("222",60000),("333",40000))
val df = data.toDF("EmpId","Salary")
df.show(false)
//Derive a new column from existing
df.withColumn("CopiedColumn",df("salary")* -1)
.show(false)
//Using select
df.select($"EmpId",$"Salary", ($"salary"* -1).as("CopiedColumn") )
.show(false)
//Adding a literal
val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
df2.show()
val df3 = df2.withColumn("lit_value2",
when(col("Salary") >=40000 && col("Salary") <= 50000,
lit("100").cast(IntegerType))
.otherwise(lit("200").cast(IntegerType))
)
df3.show()
//Adding a list column
val df4 = df3.withColumn("typedLit_seq",typedLit(Seq(1, 2, 3)))
.withColumn("typedLit_map",typedLit(Map("a" -> 1, "b" -> 2)))
.withColumn("typedLit_struct",typedLit(("a", 2, 1.0)))
df4.printSchema()
df4.show()
}
The complete source code is available at the GitHub project.
Conclusion
In this article, you have learned how to add one or more columns to a Spark DataFrame using withColumn(), select(), lit(), typedLit(), and map(), with working Scala examples.
Happy Learning !!