PySpark Replace Empty Value With None/null on DataFrame

In a PySpark DataFrame, use the when().otherwise() SQL functions to check whether a column contains an empty value, and the withColumn() transformation to replace the value of an existing column. In this article, I will explain how to replace an empty value with None/null on a single column, on all columns, and on a selected list of columns of a DataFrame, with Python examples.

Note: In a PySpark DataFrame, None values are shown as null.

Related: How to get Count of NULL, Empty String Values in PySpark DataFrame

Let’s create a PySpark DataFrame with empty values on some rows.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("","CA"), ("Julia",""), ("Robert",""), ("","NJ")]
df = spark.createDataFrame(data, ["name","state"])
df.show()

#+------+-----+
#|  name|state|
#+------+-----+
#|      |   CA|
#| Julia|     |
#|Robert|     |
#|      |   NJ|
#+------+-----+

PySpark Replace Empty Value with None

In order to replace an empty value with None/null on a single DataFrame column, use withColumn() together with the when().otherwise() functions.


#Replace empty string with None value
from pyspark.sql.functions import col, when
df.withColumn("name", \
       when(col("name") == "", None) \
          .otherwise(col("name"))) \
  .show()

#+------+-----+
#|  name|state|
#+------+-----+
#|  null|   CA|
#| Julia|     |
#|Robert|     |
#|  null|   NJ|
#+------+-----+

Replace Empty Value with None on All DataFrame Columns

To replace an empty value with None/null on all DataFrame columns, get the column names from df.columns and apply the same when().otherwise() condition to each column inside a list comprehension.


#Replace empty string with None for all columns
from pyspark.sql.functions import col, when
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns])
df2.show()

#+------+-----+
#|  name|state|
#+------+-----+
#|  null|   CA|
#| Julia| null|
#|Robert| null|
#|  null|   NJ|
#+------+-----+

Replace Empty Value with None on Selected Columns

Similarly, you can replace only a selected list of columns: specify the columns you want to replace in a list and use it in the same expression as above.


#Replace empty string with None on selected columns
from pyspark.sql.functions import col, when
replaceCols = ["name","state"]
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in replaceCols])
df2.show()

Complete Example

The following is a complete example of replacing empty values with None.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("","CA"), ("Julia",""), ("Robert",""), ("","NJ")]
df = spark.createDataFrame(data, ["name","state"])
df.show()

#Replace empty string with None
from pyspark.sql.functions import col, when
df.withColumn("name", \
       when(col("name") == "", None) \
          .otherwise(col("name"))) \
  .show()

#Replace empty string with None for all columns
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns])
df2.show()

#Replace empty string with None on selected columns
replaceCols = ["name","state"]
df2 = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in replaceCols])
df2.show()

Conclusion

In summary, you have learned how to replace empty string values with None/null on single, all, and selected PySpark DataFrame columns, with Python examples.
