Spark – How to update the DataFrame column?

In Spark, you can update a DataFrame column with the withColumn() transformation. In this article, I will explain how to update or change a DataFrame column.

I will also explain how to update a column based on a condition.

First, let’s create a DataFrame:


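// Imports needed for Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType}

// Sample rows: a nested name struct, followed by dob, gender and salary (all strings)
val data = Seq(Row(Row("James ","","Smith"),"36636","M","3000"),
      Row(Row("Michael ","Rose",""),"40288","M","4000"),
      Row(Row("Robert ","","Williams"),"42114","M","4000"),
      Row(Row("Maria ","Anne","Jones"),"39192","F","4000"),
      Row(Row("Jen","Mary","Brown"),"","F","-1")
)

// Schema: "name" is a nested struct with three string fields
val schema = new StructType()
      .add("name",new StructType()
            .add("firstname",StringType)
            .add("middlename",StringType)
            .add("lastname",StringType))
      .add("dob",StringType)
      .add("gender",StringType)
      .add("salary",StringType)

// Assumes an active SparkSession named spark (as provided by spark-shell)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)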

1. Update the column value

The withColumn() function of a DataFrame is used to update the value of a column. It takes two arguments: first, the name of the column you want to update, and second, a Column expression for the value you want to update it with.


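// Update the column value
// withColumn() returns a new DataFrame; df itself is not modified
import org.apache.spark.sql.functions.col
df.withColumn("salary",col("salary")*100)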

If the specified column name is not found, withColumn() creates a new column with the specified value.
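
The snippet below is a minimal sketch of both behaviors: replacing an existing column and adding a new one. It assumes a local SparkSession and uses a small hypothetical DataFrame (the names people, bonus, updated and extended are illustrative and not part of the example above). In spark-shell, the SparkSession line can be skipped since spark already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session, for illustration only
val spark = SparkSession.builder().master("local[1]").appName("withColumn-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; salary is numeric here
val people = Seq(("James", 3000), ("Maria", 4000)).toDF("name", "salary")

// "salary" already exists, so its values are replaced
val updated = people.withColumn("salary", col("salary") * 100)

// "bonus" does not exist, so a new column is appended at the end
val extended = people.withColumn("bonus", lit(500))

updated.show()
extended.show()

// people still holds the original values, because DataFrames are immutable
people.show()
```

Since withColumn() never changes the DataFrame it is called on, the result has to be assigned to a new value (or chained into the next transformation) to be of any use.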

2. Update the column type

Changing the data type of a DataFrame column can be done using the cast() function together with withColumn().


// Update the column type
df.withColumn("salary",col("salary").cast("Integer"))
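
To confirm that a cast took effect, you can inspect the schema of the result. The following is a rough sketch, assuming a local SparkSession and a hypothetical two-column DataFrame named raw. It also shows that cast() accepts either a type name string or a DataType object.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session, for illustration only
val spark = SparkSession.builder().master("local[1]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: salary is stored as strings
val raw = Seq(("James", "3000"), ("Jen", "-1")).toDF("name", "salary")

// Cast using a type name string
val byName = raw.withColumn("salary", col("salary").cast("int"))

// Cast using a DataType object; equivalent to the line above
val byType = raw.withColumn("salary", col("salary").cast(IntegerType))

// Both schemas now report salary as integer
byName.printSchema()
byType.printSchema()
```

Passing a DataType object such as IntegerType lets the compiler catch a misspelled type, whereas a misspelled type name string only fails at runtime.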

3. Update based on condition

Here, we use the when() and otherwise() functions together to derive a column value based on a condition.


// Update based on condition
import org.apache.spark.sql.functions.{col, when}
val df2 = df.withColumn("new_gender", when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown"))
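
The example above writes the result to a new column. To update an existing column in place, one option is to reuse the column name and pass the column itself to otherwise(), so that rows matching no condition keep their current value. This is a minimal sketch under the assumption of a local SparkSession; the DataFrame staff and its data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session, for illustration only
val spark = SparkSession.builder().master("local[1]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; the last row has an empty gender
val staff = Seq(("James", "M"), ("Maria", "F"), ("Sam", "")).toDF("name", "gender")

// Reusing the name "gender" overwrites that column.
// otherwise(col("gender")) leaves unmatched rows as they are.
val relabeled = staff.withColumn("gender",
  when(col("gender") === "M", "Male")
    .when(col("gender") === "F", "Female")
    .otherwise(col("gender")))

relabeled.show()
```

If otherwise() is left out entirely, rows that match none of the when() conditions get null.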

Happy Learning !!

Naveen (NNK)
