PySpark Update a Column with Value

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Because DataFrames are distributed, immutable collections, you can’t actually change column values in place; instead, withColumn() and the other approaches return a new DataFrame with the updated values. In this article, I will explain how to update or change a DataFrame column using Python examples.

Let’s create a simple DataFrame to demonstrate the update.


from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
  ('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+

PySpark Update Column Examples

The PySpark code below updates the salary column by multiplying each value by 3. Note that withColumn() either updates or adds a column: when the name you pass as the first argument matches an existing column, that column is replaced; otherwise a new column is created.


df2=df.withColumn("salary", df.salary*3)
df2.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  9000|
|     Anna|    Rose|     F| 12300|
|   Robert|Williams|     M| 18600|
+---------+--------+------+------+

Update Column Based on Condition

Let’s see how to update a column value based on a condition using when() and otherwise(). The example below updates the gender column to Male for M and Female for F, and keeps the existing value for anything else.


from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
      .when(df.gender == "F","Female") \
      .otherwise(df.gender))
df3.show()

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|  Male|  3000|
|     Anna|    Rose|Female|  4100|
|   Robert|Williams|  Male|  6200|
+---------+--------+------+------+

Update DataFrame Column Data Type

You can also change the data type of a column using withColumn() together with the cast() function of the PySpark Column class. The code below changes the salary column to String type.


df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: string (nullable = true)

PySpark SQL Update

You can also update a column with a SQL expression: register the DataFrame as a temporary view using createOrReplaceTempView() and run a SELECT statement through spark.sql(). Like the other approaches, this returns a new DataFrame. Note that the query below selects only firstname, gender, and salary, so lastname does not appear in the result.


df.createOrReplaceTempView("PER")
df5=spark.sql("select firstname,gender,salary*3 as salary from PER")
df5.show()

Conclusion

Here, I have covered how to update PySpark DataFrame column values, update values based on a condition, change a column’s data type, and perform updates using a SQL expression.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
