PySpark Update a Column with Value

You can do update a PySpark DataFrame Column using withColum(), select() and sql(), since DataFrame’s are distributed immutable collection you can’t really change the column values however when you change the value using withColumn() or any approach, PySpark returns a new Dataframe with updated values. In this article, I will explain how to update or change the DataFrame column by using Python examples.

Let’s create a simple DataFrame to demonstrate the update.


from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
  ('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+

PySpark Update Column Examples

Below PySpark code update salary column value of DataFrame by multiplying salary by 3 times. Note that withColumn() is used to update or add a new column to the DataFrame, when you pass the existing column name to the first argument to withColumn() operation it updates, if the value is new then it creates a new column.


df2=df.withColumn("salary", df.salary*3)
df2.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  9000|
|     Anna|    Rose|     F| 12300|
|   Robert|Williams|     M| 18600|
+---------+--------+------+------+

Update Column Based on Condition

Let’s see how to update a column value based on a condition by using When Otherwise. below example updates gender column with value Male for M, Female for F and keep the same value for others.


from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
      .when(df.gender == "F","Female") \
      .otherwise(df.gender))
df3.show()

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|  Male|  3000|
|     Anna|    Rose|Female|  4100|
|   Robert|Williams|  Male|  6200|
+---------+--------+------+------+

Update DataFrame Column Data Type

You can also update a Data Type of column using withColumn() but additionally, you have to use cast() function of PySpark Column class. Below code updates salary column to String type.


df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: string (nullable = true)

PySpark SQL Update


df.createOrReplaceTempView("PER")
df5=spark.sql("select firstname,gender,salary*3 as salary from PER")
df5.show()

Conclusion

Here, I have covered updating a PySpark DataFrame Column values, update values based on condition, change the data type, and updates using SQL expression.

Happy Learning !!

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply