In a PySpark DataFrame, use the when().otherwise() SQL functions to check whether a column has an empty value, and use the withColumn() transformation to replace the value of an existing column. In this article, I will explain how to replace an empty value with None/null
on a single column, on all columns, and on a selected list of columns of a DataFrame, with Python examples.
Note: In a PySpark DataFrame, None values are displayed as null.
Related: How to get Count of NULL, Empty String Values in PySpark DataFrame
Let’s create a PySpark DataFrame with empty values on some rows.
#Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Create DataFrame with empty string values on some rows
data = [("","CA"), ("Julia",""),("Robert",""),("","NJ")]
df = spark.createDataFrame(data,["name","state"])
df.show()
#+------+-----+
#| name|state|
#+------+-----+
#| | CA|
#| Julia| |
#|Robert| |
#| | NJ|
#+------+-----+
PySpark Replace Empty Value with None
To replace an empty value with None/null on a single DataFrame column, use withColumn() together with the when().otherwise() functions.
#Replace empty string with None value
from pyspark.sql.functions import col, when
df.withColumn("name", \
    when(col("name") == "", None) \
    .otherwise(col("name"))) \
    .show()
#+------+-----+
#| name|state|
#+------+-----+
#| null| CA|
#| Julia| |
#|Robert| |
#| null| NJ|
#+------+-----+
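Alternatively, depending on your PySpark version, the DataFrame replace() function (also available as df.na.replace()) accepts None as the replacement value, so the same result may be achievable without when().otherwise(). A minimal sketch, assuming a recent Spark release:

#Alternative: replace empty string with None using na.replace()
#(assumes a PySpark version that accepts None as the replacement value)
df.na.replace("", None, subset=["name"]).show()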
Replace Empty Value with None on All DataFrame Columns
To replace an empty value with None/null on all DataFrame columns, use df.columns to get the list of column names and loop through it, applying the same condition to each column in a list comprehension.
#Replace empty string with None for all columns
from pyspark.sql.functions import col, when
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns])
df2.show()
#+------+-----+
#| name|state|
#+------+-----+
#| null| CA|
#| Julia| null|
#|Robert| null|
#| null| NJ|
#+------+-----+
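If your DataFrame also contains non-string columns, comparing them to an empty string is usually not meaningful. One possible refinement, shown here only as a sketch (string_cols and df3 are illustrative names, not part of the original example), is to apply the condition only to string columns identified through df.dtypes:

#Apply the empty-string check only to string columns (sketch)
string_cols = [c for c, t in df.dtypes if t == "string"]
df3 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c)
                 if c in string_cols else col(c)
                 for c in df.columns])
df3.show()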
Replace Empty Value with None on Selected Columns
Similarly, you can also replace values on just a selected list of columns: specify the columns you want to replace in a list and use it in the same expression shown above.
#Replace empty string with None on selected columns
from pyspark.sql.functions import col, when
replaceCols = ["name","state"]
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in replaceCols])
df2.show()
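Note that select() returns only the columns listed in replaceCols. If you also want to keep the remaining columns unchanged, you can iterate over df.columns and apply the condition only to the selected ones; a minimal sketch of that variant:

#Replace on selected columns while keeping the rest unchanged (sketch)
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c)
                 if c in replaceCols else col(c)
                 for c in df.columns])
df2.show()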
Complete Example
Following is a complete example of replacing empty values with None.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Create DataFrame with empty string values
data = [("","CA"), ("Julia",""),("Robert",""),("","NJ")]
df = spark.createDataFrame(data,["name","state"])
df.show()

#Replace empty string with None on a single column
df.withColumn("name", \
    when(col("name") == "", None) \
    .otherwise(col("name"))) \
    .show()

#Replace empty string with None for all columns
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns])
df2.show()

#Replace empty string with None on selected columns
replaceCols = ["name","state"]
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in replaceCols])
df2.show()
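As an optional check, you can verify the replacement by counting the null values in each column of the result; one common idiom combines count() with when() and isNull():

#Optional check: count null values in each column of the result
from pyspark.sql.functions import count
df2.select([count(when(col(c).isNull(), c)).alias(c) for c in df2.columns]).show()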
Conclusion
In summary, you have learned how to replace empty string values with None/null on a single column, on all columns, and on selected columns of a PySpark DataFrame, using Python examples.