Problem: I have a PySpark DataFrame and I would like to check if a column exists in the DataFrame schema. Could you please explain how to do it? I also need to check whether DataFrame columns are present in a list of strings.
1. Solution: PySpark Check if Column Exists in DataFrame
PySpark DataFrame has an attribute columns
that returns all column names as a list, so you can use plain Python to check whether a column exists.
listColumns = df.columns
"column_name" in listColumns
2. Case-insensitive Check
Let’s check if a column exists in a case-insensitive way. Here I convert both the column name you want to check and all DataFrame column names to upper case before comparing.
"column_name".upper() in (name.upper() for name in df.columns)
3. Check if Column exists in Nested Struct DataFrame
df.columns
doesn’t return columns from nested structs, so if you have a DataFrame with nested struct columns, you can check whether a column exists inside the nested struct by getting the schema as a string using df.schema.simpleString()
df.schema.simpleString().find("column_name:")
#or
"column_name:" in df.schema.simpleString()
4. Check if a Field Exists in a DataFrame
If you want to check whether a column exists with the same data type, use the PySpark schema functions df.schema.fieldNames() or df.schema.
from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)
5. Complete Example of How to Check if a Column Exists in a PySpark DataFrame
from pyspark.sql import Row
data=[Row(name="James",prop=Row(hair="black",eye="blue")),
Row(name="Ann",prop=Row(hair="grey",eye="black"))]
df=spark.createDataFrame(data)
df.printSchema()
#root
# |-- name: string (nullable = true)
# |-- prop: struct (nullable = true)
# | |-- hair: string (nullable = true)
# | |-- eye: string (nullable = true)
# check if column exists
print(df.columns)
#['name', 'prop']
print("name" in df.columns)
# True
# case-insensitive check
print("name".upper() in (name.upper() for name in df.columns))
# True
# check nested columns
print(df.schema.simpleString())
#struct<name:string,prop:struct<hair:string,eye:string>>
print(df.schema.simpleString().find('hair:'))
#31
print('hair:' in df.schema.simpleString())
#True
from pyspark.sql.types import StructField,StringType
print("name" in df.schema.fieldNames())
print(StructField("name",StringType(),True) in df.schema)
Conclusion
In this article, you have learned how to check whether a column exists in a DataFrame's columns, including nested struct columns and case-insensitive matches.
Happy Learning !!
Related Articles
- PySpark Column Class | Operators & Functions
- PySpark Column alias after groupBy() Example
- How to Convert PySpark Column to List?
- PySpark Get Number of Rows and Columns
- PySpark Groupby on Multiple Columns
- PySpark alias() Column & DataFrame Examples
- PySpark Retrieve DataType & Column Names of DataFrame