  • Post last modified:March 27, 2024
PySpark Check Column Exists in DataFrame

Problem: I have a PySpark DataFrame and would like to check whether a column exists in its schema. How can I do that? I also need to check whether the DataFrame's columns are present in a list of strings.

1. Solution: PySpark Check if Column Exists in DataFrame

A PySpark DataFrame has a columns attribute (not a method, so no parentheses) that returns all top-level column names as a Python list, so you can use the in operator to check whether a column exists.


listColumns = df.columns
"column_name" in listColumns
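
The problem statement also asks about checking a whole list of names at once. Here is a minimal sketch of one way to do that, assuming the existing names come from df.columns. The helper missing_columns is a hypothetical name used for illustration, not part of PySpark, and a plain list stands in for df.columns so the snippet runs without a Spark session.

```python
def missing_columns(existing, required):
    """Return the names in `required` that are absent from `existing`, in order."""
    existing_set = set(existing)  # set membership keeps each lookup cheap
    return [name for name in required if name not in existing_set]


# Stand-in for df.columns (assumed DataFrame with columns "name" and "prop")
columns = ["name", "prop"]

print(missing_columns(columns, ["name", "prop"]))  # []
print(missing_columns(columns, ["name", "age"]))   # ['age']
```

An empty result means every required column is present, and a non-empty result tells you exactly which names to report as missing.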

2. Check Case-Insensitively

Let's check whether a column exists regardless of case. Here I convert both the column name you want to check and all DataFrame column names to upper case before comparing.


"column_name".upper() in (name.upper() for name in df.columns)
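
A True/False answer is often not enough, because you usually need the column's actual spelling to select it afterwards. The sketch below returns the matching name instead. It assumes the input is the list from df.columns, and find_column_ci is a hypothetical helper name, not a PySpark function.

```python
def find_column_ci(columns, name):
    """Return the column whose name matches `name` ignoring case, or None."""
    wanted = name.casefold()  # casefold() is a more aggressive lower() for comparisons
    for col in columns:
        if col.casefold() == wanted:
            return col
    return None


# Stand-in for df.columns (assumed mixed-case column names)
columns = ["Name", "prop"]

print(find_column_ci(columns, "NAME"))  # Name
print(find_column_ci(columns, "age"))   # None
```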

3. Check if Column exists in Nested Struct DataFrame

df.columns doesn't return fields from a nested struct, so if you have a DataFrame with nested struct columns, you can check whether a nested field exists by getting the schema as a string with df.schema.simpleString().


# find() returns the index of the match, or -1 if the text is absent
df.schema.simpleString().find("column_name:")
# or
"column_name:" in df.schema.simpleString()
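
One caveat with the string search is that it can match more than intended: "name:" is also found inside "first_name:". As a hedged alternative, the sketch below walks the schema structure and collects dotted field paths. It assumes the input is the dictionary returned by df.schema.jsonValue(). A literal dictionary stands in for it here so the snippet runs without Spark, and field_paths is a hypothetical helper name. It only descends into structs, not into arrays or maps.

```python
def field_paths(schema_json, prefix=""):
    """Yield dotted paths for every field in a schema dict, including nested struct fields."""
    for field in schema_json["fields"]:
        path = prefix + field["name"]
        yield path
        field_type = field["type"]
        # Nested structs are dicts with type == "struct"; simple types are plain strings
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            yield from field_paths(field_type, prefix=path + ".")


# Stand-in for df.schema.jsonValue() of a DataFrame with a nested "prop" struct
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        {"name": "prop", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "hair", "type": "string", "nullable": True, "metadata": {}},
                {"name": "eye", "type": "string", "nullable": True, "metadata": {}},
            ],
        }},
    ],
}

paths = list(field_paths(schema_json))
print(paths)                 # ['name', 'prop', 'prop.hair', 'prop.eye']
print("prop.hair" in paths)  # True
print("hair" in paths)       # False
```

Because the check is against full paths, a nested field only matches when you name its parent as well, which avoids accidental substring hits.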

4. Check if a Field Exists in a DataFrame

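If you want to check that a column exists with a specific data type, use the DataFrame schema. df.schema.fieldNames() returns the field names only, while df.schema itself can be tested for a StructField that matches on name, data type, and nullability.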


from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)
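
Note that the StructField comparison also takes the nullable flag into account, so a field with the right name and type but different nullability does not match. If only the name and type matter, one option is to compare against df.dtypes, which returns (name, type string) pairs. The sketch below assumes that shape of input. A literal list stands in for df.dtypes, and has_column_of_type is a hypothetical helper name.

```python
def has_column_of_type(dtypes, name, type_name):
    """Check for a (name, type) pair in a list shaped like df.dtypes."""
    return (name, type_name) in dtypes


# Stand-in for df.dtypes (assumed DataFrame with a string column and a struct column)
dtypes = [("name", "string"), ("prop", "struct<hair:string,eye:string>")]

print(has_column_of_type(dtypes, "name", "string"))  # True
print(has_column_of_type(dtypes, "name", "int"))     # False
```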

5. Complete Example of Checking Whether a Column Exists in a PySpark DataFrame


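from pyspark.sql import Row, SparkSession

# Create a SparkSession (skip this if `spark` already exists, e.g. in a notebook);
# the app name is an arbitrary placeholder
spark = SparkSession.builder.appName("column-exists").getOrCreate()

data=[Row(name="James",prop=Row(hair="black",eye="blue")),
      Row(name="Ann",prop=Row(hair="grey",eye="black"))]
df=spark.createDataFrame(data)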
df.printSchema()
#root
# |-- name: string (nullable = true)
# |-- prop: struct (nullable = true)
# |    |-- hair: string (nullable = true)
# |    |-- eye: string (nullable = true)

# check if column exists
print(df.columns)
#['name', 'prop']
print("name" in df.columns)
# True

# case-insensitive check
print("name".upper() in (name.upper() for name in df.columns))
# True

# check for a field inside a nested struct
print(df.schema.simpleString())
#struct<name:string,prop:struct<hair:string,eye:string>>

print(df.schema.simpleString().find('hair:'))
#31

print('hair:' in df.schema.simpleString())
#True

from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)


Conclusion

In this article, you have learned how to check whether a column exists in DataFrame columns and in nested struct columns, and how to make the check case-insensitive.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.