PySpark Check Column Exists in DataFrame

You can directly use the df.columns list to check if the column name exists. In PySpark, df.columns is an attribute of a DataFrame that returns a list of the column names in the DataFrame. This attribute provides a straightforward way to access and inspect the names of all columns.


Since df.columns returns a Python list with column names, you can use standard list operations to check for the presence of a specific column, iterate over column names, or perform other list-related tasks. Using df.columns is a simple and efficient method to verify if a particular column exists in a DataFrame.

1. Checking Column Existence Using df.columns

Use the columns attribute of a PySpark DataFrame to check whether a column exists. DataFrame.columns returns all column names as a Python list, so you can verify a column's existence with Python's in operator inside an if statement.


# Using df.columns
if "column_name" in df.columns:
    print("Column exists in DataFrame")
else:
    print("Column does not exist in DataFrame")

2. Checking Column Existence Case-Insensitively

To check if a column exists in a PySpark DataFrame in a case-insensitive manner, convert both the column name and the DataFrame’s column names to a consistent case (e.g., uppercase) before comparing. Use the following approach:


# Case insensitive
column_to_check = "column_name"
exists = column_to_check.upper() in (name.upper() for name in df.columns)

Here, column_to_check.upper() converts the column name to uppercase. The generator expression (name.upper() for name in df.columns) converts each column name in the DataFrame to uppercase. The in operator then checks for the presence of the column.
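
If you repeat this check in several places, wrapping it in a small helper keeps the call sites readable. This is a minimal sketch; the has_column name is illustrative, not a PySpark API.


# Hypothetical helper for a reusable case-insensitive check
def has_column(df, col_name):
    return col_name.upper() in (name.upper() for name in df.columns)

exists = has_column(df, "COLUMN_NAME")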

3. Checking Column Existence Using Schema

You can check if a column exists in a PySpark DataFrame using the schema attribute, which contains the DataFrame’s schema information. By examining the schema, you can verify the presence of a column by checking for its name. The schema attribute provides a StructType object, which contains a list of StructField objects representing each column.


from pyspark.sql.types import StructField, StringType

# Check by field name
print("name" in df.schema.fieldNames())
# Check by the full StructField (name, type, and nullability must all match)
print(StructField("name", StringType(), True) in df.schema)
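
Once you know a field exists, you can also retrieve its StructField by name, since StructType supports dictionary-style lookup. A short sketch, assuming the example schema from section 5:


# Look up the StructField by name to inspect its type and nullability
if "name" in df.schema.fieldNames():
    field = df.schema["name"]
    print(field.name, field.dataType, field.nullable)
    # e.g. name StringType() True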

4. Check if a Column Exists in a Nested Column

df.columns doesn't return columns nested inside a struct, so if your DataFrame contains nested struct columns, you can check whether a nested column exists by serializing the schema to a string with df.schema.simpleString() and searching that string. Note that this is a plain substring match: searching for "hair:" would also match a column named chair, so for exact matching prefer the recursive approach shown further below.


# Check column exists in the nested schema string
df.schema.simpleString().find("column_name:")  # returns -1 when not found
# or, as a boolean check
"column_name:" in df.schema.simpleString()

Alternatively, you can recursively traverse the schema with a helper function, which avoids the substring pitfalls of string matching.


from pyspark.sql.types import StructType

def nested_column_exists(schema, col_name):
    # Check every field at this level of the schema
    for field in schema.fields:
        if field.name == col_name:
            return True
        # Recurse into nested struct columns
        if isinstance(field.dataType, StructType):
            if nested_column_exists(field.dataType, col_name):
                return True
    return False

exists = nested_column_exists(df.schema, "nested_column_name")
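
With the example DataFrame built in section 5 below, this helper finds both top-level and nested names; salary is just an arbitrary name that is absent from the schema.


# Usage with the example DataFrame from section 5
print(nested_column_exists(df.schema, "hair"))    # True (nested inside prop)
print(nested_column_exists(df.schema, "salary"))  # False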

5. Complete Example


from pyspark.sql import Row
data = [Row(name="James", prop=Row(hair="black", eye="blue")),
        Row(name="Ann", prop=Row(hair="grey", eye="black"))]
df = spark.createDataFrame(data)
df.printSchema()
#root
# |-- name: string (nullable = true)
# |-- prop: struct (nullable = true)
# |    |-- hair: string (nullable = true)
# |    |-- eye: string (nullable = true)

# check if column exists
print(df.columns)
#['name', 'prop']
print("name" in df.columns)
# True

# Case-insensitive check
print("name".upper() in (name.upper() for name in df.columns))
# True

# To check if you have nested columns
print(df.schema.simpleString())
#struct<name:string,prop:struct<hair:string,eye:string>>

print(df.schema.simpleString().find('hair:'))
#31

print('hair:' in df.schema.simpleString())
#True

from pyspark.sql.types import StructField, StringType
print("name" in df.schema.fieldNames())
# True
print(StructField("name", StringType(), True) in df.schema)
# True


Conclusion

Checking if a column exists in a PySpark DataFrame is crucial for ensuring data integrity and avoiding errors in data processing. For flat schemas, the df.columns attribute offers a simple and efficient check, with case-insensitive comparisons achievable through consistent casing. For nested structures, recursively traversing the schema with a small helper function is necessary. These techniques provide robust ways to validate column presence, supporting reliable and error-free data workflows in PySpark.

Happy Learning !!
