Problem: I have a PySpark DataFrame and I would like to check if a column exists in the DataFrame schema. Could you please explain how to do it? I also need to check whether DataFrame columns are present in a list of strings.
1. Solution: PySpark Check if Column Exists in DataFrame
PySpark DataFrame has an attribute columns
that returns all column names as a list, so you can use plain Python to check whether a column exists.
listColumns = df.columns
"column_name" in listColumns
2. Case-insensitive Check
Let’s check if a column exists in a case-insensitive way. Here I convert both the column name you want to check and all DataFrame column names to upper case before comparing.
"column_name".upper() in (name.upper() for name in df.columns)
3. Check if Column exists in Nested Struct DataFrame
df.columns
doesn’t return columns from nested structs, so if you have a DataFrame with nested struct columns, you can check whether a column exists inside the nested struct by getting the schema as a string using df.schema.simpleString()
df.schema.simpleString().find("column_name:")
#or
"column_name:" in df.schema.simpleString()
4. Check if a Field Exists in a DataFrame
If you want to check whether a column exists with the same data type, use the PySpark schema functions df.schema.fieldNames() or df.schema.
from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)
5. Complete Example of How to Check if a Column Exists in a PySpark DataFrame
from pyspark.sql import Row
data=[Row(name="James",prop=Row(hair="black",eye="blue")),
Row(name="Ann",prop=Row(hair="grey",eye="black"))]
df=spark.createDataFrame(data)
df.printSchema()
#root
# |-- name: string (nullable = true)
# |-- prop: struct (nullable = true)
# | |-- hair: string (nullable = true)
# | |-- eye: string (nullable = true)
# check if column exists
print(df.columns)
#['name', 'prop']
print("name" in df.columns)
# True
# case-insensitive check
print("name".upper() in (name.upper() for name in df.columns))
# True
# check nested columns
print(df.schema.simpleString())
#struct<name:string,prop:struct<hair:string,eye:string>>
print(df.schema.simpleString().find('hair:'))
#31
print('hair:' in df.schema.simpleString())
#True
from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)
Conclusion
In this article, you have learned how to check whether a column exists in a DataFrame's columns, including nested struct columns and case-insensitive matches.
Happy Learning !!
Related Articles
- PySpark Column Class | Operators & Functions
- PySpark Column alias after groupBy() Example
- How to Convert PySpark Column to List?
- PySpark Get Number of Rows and Columns
- PySpark Groupby on Multiple Columns
- PySpark alias() Column & DataFrame Examples
- PySpark Retrieve DataType & Column Names of DataFrame