You can update a PySpark DataFrame column using the withColumn() transformation, select(), or a SQL expression. Since DataFrames are distributed, immutable collections, you can't change column values in place; instead, withColumn() (or any other approach) returns a new DataFrame with the updated values. In this article, I will explain how to update or change DataFrame column values with Python examples.
Syntax
# Syntax
DataFrame.withColumn(colName, col)
Parameters:
colName: str – name of the new (or existing) column.
col: Column – a Column expression for the new column.
withColumn returns a new DataFrame by adding a column or replacing the existing one with the same name.
Note: The column expression must refer to a column of the same DataFrame. Adding a column that references some other DataFrame will raise an error.
Let’s create a simple DataFrame to demonstrate the update.
# Imports
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 3000|
| Anna| Rose| F| 4100|
| Robert|Williams| M| 6200|
+---------+--------+------+------+
PySpark Update Column Examples
The PySpark code below updates the salary column of the DataFrame by multiplying salary by three. Note that withColumn() is used to update or add a column: when you pass an existing column name as the first argument, withColumn() updates that column; when the name is new, it creates a new column.
df2=df.withColumn("salary", df.salary*3)
df2.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 9000|
| Anna| Rose| F| 12300|
| Robert|Williams| M| 18600|
+---------+--------+------+------+
Update Column Based on Condition
Let’s see how to update a column value based on a condition by using when().otherwise(). The example below updates the gender column with the value Male for M, Female for F, and keeps the original value for everything else.
from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
.when(df.gender == "F","Female") \
.otherwise(df.gender))
df3.show()
# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| Male| 3000|
| Anna| Rose|Female| 4100|
| Robert|Williams| Male| 6200|
+---------+--------+------+------+
Update DataFrame Column Data Type
You can also update the data type of a column using withColumn(), but additionally you have to use the cast() function of the PySpark Column class. The code below changes the salary column to String type.
df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: string (nullable = true)
Update DataFrame Column with a constant
The withColumn() method is also used to create a new DataFrame column with a constant value using the lit() function. Below is an example.
from pyspark.sql.functions import lit

df.withColumn("state", lit("CA")).show()
Yields below output.
# Output
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|state|
+---------+--------+------+------+-----+
| James| Smith| M| 3000| CA|
| Anna| Rose| F| 4100| CA|
| Robert|Williams| M| 6200| CA|
+---------+--------+------+------+-----+
PySpark SQL Update
To update a column with a SQL expression, first register the DataFrame as a temporary view with createOrReplaceTempView(), then run the query with spark.sql().

df.createOrReplaceTempView("PER")
df5 = spark.sql("select firstname,gender,salary*3 as salary from PER")
df5.show()
Frequently Asked Questions
How is withColumn different from select in PySpark?
Both withColumn and select are used for DataFrame transformations. withColumn is used for adding or updating columns, whereas select is used to select existing columns or apply transformations to them.
Can I use withColumn to update multiple columns simultaneously?
Each withColumn call adds or updates a single column, but you can update multiple columns by chaining multiple calls to the method.
How do I perform conditional updates with withColumn?
Conditional updates can be achieved by using PySpark's when and otherwise functions within withColumn.
For Example:
from pyspark.sql.functions import col, when
df_updated = df.withColumn("new_col", when(col("old_col") > 10, "High").otherwise("Low"))
Can I use withColumn to drop a column from a DataFrame?
withColumn is not used for dropping columns. To drop a column, use the DataFrame's drop method.
For example:
df_dropped = df.drop("column_to_drop")
Can I use withColumn to rename a column?
withColumn is not used for renaming columns. To rename a column, use the withColumnRenamed method or the alias method.
For example:
df_renamed = df.withColumnRenamed("name to update", "new_column")
Conclusion
Here, I have covered updating a PySpark DataFrame Column values, updating values based on condition, changing the data type, and updating using SQL expression.
Happy Learning !!
Related Articles
- PySpark Groupby Agg (aggregate) – Explained
- PySpark Groupby on Multiple Columns
- PySpark Column alias after groupBy() Example
- PySpark DataFrame groupBy and Sort by Descending Order
- PySpark Groupby Count Distinct
- PySpark Column Class | Operators & Functions
- PySpark Get Number of Rows and Columns