In Spark, you can update a DataFrame column using the withColumn() transformation. In this article, I will explain how to update or change a DataFrame column, and also how to update a column based on a condition.
First, let’s create a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.functions.{col, when}

// Sample data with a nested name struct; note that salary is a String here
val data = Seq(Row(Row("James ","","Smith"),"36636","M","3000"),
  Row(Row("Michael ","Rose",""),"40288","M","4000"),
  Row(Row("Robert ","","Williams"),"42114","M","4000"),
  Row(Row("Maria ","Anne","Jones"),"39192","F","4000"),
  Row(Row("Jen","Mary","Brown"),"","F","-1")
)

// Schema with a nested struct for name; all fields are StringType
val schema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("dob", StringType)
  .add("gender", StringType)
  .add("salary", StringType)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
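To confirm the DataFrame was created as expected, you can print its schema and a few rows. This is just a quick sanity check on the df created above:

// Inspect the schema and the data before making any updates
df.printSchema()
df.show(false)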
1. Update the column value
Spark withColumn() is a DataFrame transformation used to update the value of a column. withColumn() takes two arguments: the name of the column you want to update, and a Column expression with the new value. Since DataFrames are immutable, withColumn() does not change the original DataFrame; it returns a new one.
// Update the column value
df.withColumn("salary",col("salary")*100)
If the specified column name does not exist, withColumn() creates a new column with the given value.
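Because withColumn() returns a new DataFrame rather than modifying df in place, capture the result if you want to work with the updated values. A minimal sketch, using the df created above (dfUpdated is just an illustrative name):

// Assign the returned DataFrame and verify the updated salary values
val dfUpdated = df.withColumn("salary", col("salary") * 100)
dfUpdated.select("salary").show(false)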
2. Update the column type
Changing the data type of a DataFrame column can be done using the cast() function.
// Update the column type
df.withColumn("salary",col("salary").cast("Integer"))
3. Update based on condition
Here, we use the when() and otherwise() functions to set the column value based on a condition.
// Update based on condition
val df2 = df.withColumn("new_gender", when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown"))
Happy Learning !!