You can update a PySpark DataFrame column using the withColumn() transformation, select(), or a SQL expression. Since DataFrames are distributed, immutable collections, you can't change column values in place; instead, withColumn() (or any other approach) returns a new DataFrame with the updated values. In this article, I will explain how to update or change DataFrame column values with Python examples.
Syntax
# Syntax
DataFrame.withColumn(colName, col)
Parameters:
colName: str – name of the new (or existing) column.
col: Column – a Column expression for the new column.
withColumn returns a new DataFrame by adding a column or replacing the existing one with the same name.
Note: The column expression must refer to a column of the same DataFrame. Adding a column that references some other DataFrame will raise an error.
Let’s create a simple DataFrame to demonstrate the update.
# Imports
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 3000|
| Anna| Rose| F| 4100|
| Robert|Williams| M| 6200|
+---------+--------+------+------+
PySpark Update Column Examples
The PySpark code below updates the salary column of the DataFrame by multiplying salary by three. Note that withColumn() is used to update or add a column: when you pass an existing column name as the first argument, withColumn() updates that column; when the name is new, it creates a new column.
df2=df.withColumn("salary", df.salary*3)
df2.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 9000|
| Anna| Rose| F| 12300|
| Robert|Williams| M| 18600|
+---------+--------+------+------+
Update Column Based on Condition
Let’s see how to update a column value based on a condition by using when().otherwise(). The example below updates the gender column with the value Male for M, Female for F, and keeps the original value for everything else.
from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
.when(df.gender == "F","Female") \
.otherwise(df.gender))
df3.show()
# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| Male| 3000|
| Anna| Rose|Female| 4100|
| Robert|Williams| Male| 6200|
+---------+--------+------+------+
Update DataFrame Column Data Type
You can also update the data type of a column using withColumn(), but additionally you have to use the cast() function of the PySpark Column class. The code below changes the salary column to String type.
df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: string (nullable = true)
Update DataFrame Column with a constant
The withColumn() method is also used to create a new DataFrame column with a constant value using the lit() function. Below is an example.
from pyspark.sql.functions import lit

df.withColumn("state", lit("CA")).show()
Yields below output.
# Output
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|state|
+---------+--------+------+------+-----+
| James| Smith| M| 3000| CA|
| Anna| Rose| F| 4100| CA|
| Robert|Williams| M| 6200| CA|
+---------+--------+------+------+-----+
PySpark SQL Update
To update a column with a SQL expression, first register the DataFrame as a temporary view with createOrReplaceTempView(), then run the query with spark.sql().

df.createOrReplaceTempView("PER")
df5 = spark.sql("select firstname,gender,salary*3 as salary from PER")
df5.show()
Frequently Asked Questions
How is withColumn different from select in PySpark?
Both withColumn and select are used for DataFrame transformations. withColumn is used for adding or updating columns, whereas select is used to select existing columns or apply transformations to them.
Can I use withColumn to update multiple columns simultaneously?
Each withColumn call adds or updates a single column, but you can update multiple columns by chaining multiple calls to the method.
How do I perform conditional updates with withColumn?
Conditional updates can be achieved by using PySpark's when and otherwise functions within withColumn.
For Example:
from pyspark.sql.functions import col, when
df_updated = df.withColumn("new_col", when(col("old_col") > 10, "High").otherwise("Low"))
Can I use withColumn to drop a column from a DataFrame?
withColumn is not used for dropping columns. To drop a column, use the DataFrame's drop method.
For example:
df_dropped = df.drop("column_to_drop")
Can I use withColumn to rename a column?
withColumn is not used for renaming columns. To rename a column, use the withColumnRenamed method or the alias method.
For example:
df_renamed = df.withColumnRenamed("name to update", "new_column")
Conclusion
Here, I have covered updating a PySpark DataFrame Column values, updating values based on condition, changing the data type, and updating using SQL expression.
Happy Learning !!
Related Articles
- PySpark Groupby Agg (aggregate) – Explained
- PySpark Groupby on Multiple Columns
- PySpark Column alias after groupBy() Example
- PySpark DataFrame groupBy and Sort by Descending Order
- PySpark Groupby Count Distinct
- PySpark Column Class | Operators & Functions
- PySpark Get Number of Rows and Columns