You can update a PySpark DataFrame column using withColumn(), select(), and sql(). Because DataFrames are distributed, immutable collections, you can't change column values in place; instead, withColumn() and the other approaches return a new DataFrame with the updated values. In this article, I will explain how to update or change a DataFrame column using Python examples.
Let’s create a simple DataFrame to demonstrate the update.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
        ('Robert','Williams','M',6200)
       ]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+
PySpark Update Column Examples
The PySpark code below updates the salary column of the DataFrame by multiplying each value by 3. Note that withColumn() is used either to update an existing column or to add a new one: when the first argument is the name of an existing column, that column is replaced; if the name is new, a new column is created.
df2=df.withColumn("salary", df.salary*3)
df2.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  9000|
|     Anna|    Rose|     F| 12300|
|   Robert|Williams|     M| 18600|
+---------+--------+------+------+
Update Column Based on Condition
Let’s see how to update a column value based on a condition by using when() and otherwise(). The example below updates the gender column with the value Male for M and Female for F, and keeps the existing value for anything else.
from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
                              .when(df.gender == "F","Female") \
                              .otherwise(df.gender))
df3.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|  Male|  3000|
|     Anna|    Rose|Female|  4100|
|   Robert|Williams|  Male|  6200|
+---------+--------+------+------+
Update DataFrame Column Data Type
You can also update the data type of a column by using withColumn() together with the cast() function of the PySpark Column class. The code below changes the salary column to string type.
df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: string (nullable = true)
PySpark SQL Update
You can also update a column by using a SQL expression. First register the DataFrame as a temporary view with createOrReplaceTempView(), then query the view with spark.sql(). The query below multiplies salary by 3 and returns only the columns named in the select list.
df.createOrReplaceTempView("PER")
df5=spark.sql("select firstname,gender,salary*3 as salary from PER")
df5.show()
Conclusion
Here, I have covered how to update PySpark DataFrame column values, update values based on a condition, change a column's data type, and perform updates using a SQL expression.
Happy Learning !!