You can get all columns of a Spark DataFrame as an Array[String] through its columns attribute, and then use standard Scala Array functions to check whether a column/field is present in the DataFrame. In this article, I will also cover how to check whether a column exists in a nested struct column and how to do the check case-insensitively.
1. Spark Check if Column Exists in DataFrame
Spark DataFrame has a columns attribute that returns all column names as an Array[String]. Once you have the columns, you can use the array function contains() to check whether a column is present. Note that df.columns returns only top-level columns, not nested struct columns.
// Spark check if column exists in DataFrame
val columnNameToCheck="name"
if (df.columns.contains(columnNameToCheck))
  println("column exists")
else
  println("column does not exist")
2. Check Column Exists by Case insensitive
Let’s check whether a column exists case-insensitively. First convert the column name you want to check to upper case, convert all DataFrame column names to upper case, and compare them. This is a general trick for case-insensitive comparison; converting both sides to lower case works just as well.
// Check Column Exists by Case insensitive
df.columns.map(_.toUpperCase).contains(columnNameToCheck.toUpperCase)
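The same comparison can be wrapped in a small helper. Note this helper is my own sketch, not part of the Spark API; it uses Scala's equalsIgnoreCase so neither side needs to be converted first:

```scala
// Hypothetical helper (not part of Spark): case-insensitive lookup over
// the Array[String] that df.columns returns.
def containsIgnoreCase(columns: Array[String], name: String): Boolean =
  columns.exists(_.equalsIgnoreCase(name))

// Example with the column names from the DataFrame used in this article:
val cols = Array("name", "prop")
println(containsIgnoreCase(cols, "NAME")) // true
println(containsIgnoreCase(cols, "age"))  // false
```

With a real DataFrame you would call it as containsIgnoreCase(df.columns, "NAME").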
3. Check Column exists in Nested Struct DataFrame
df.columns does not return columns from nested structs, so if you have a DataFrame with nested struct columns, you can check whether a column is present in the nested struct by getting the schema as a string with df.schema.simpleString() and searching it.
// Check Column exists in Nested Struct DataFrame
df.schema.simpleString.contains("hair:")
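The substring check on simpleString is quick but can match unintended text (for example, another field whose name merely contains the one you are looking for). A more precise approach is to walk df.schema recursively, since a StructType is a collection of StructFields whose dataType may itself be a StructType. The sketch below illustrates that recursion with simplified stand-in types of my own (so it runs without Spark on the classpath); the real version would pattern-match on org.apache.spark.sql.types.StructType in the same shape:

```scala
// Simplified stand-ins for Spark's StructType/StructField, used here only
// so this sketch runs without a Spark dependency.
sealed trait DType
case object Leaf extends DType
case class Struct(fields: Seq[Field]) extends DType
case class Field(name: String, dataType: DType)

// Recursively check whether a field name appears anywhere in the schema tree.
def fieldExists(schema: Struct, name: String): Boolean =
  schema.fields.exists { f =>
    f.name == name || (f.dataType match {
      case s: Struct => fieldExists(s, name)
      case _         => false
    })
  }

// Schema mirroring the example DataFrame: name, prop(hair, eye)
val schema = Struct(Seq(
  Field("name", Leaf),
  Field("prop", Struct(Seq(Field("hair", Leaf), Field("eye", Leaf))))
))

println(fieldExists(schema, "hair")) // true
println(fieldExists(schema, "age"))  // false
```

The design point is that the recursion only descends into struct-typed fields, so it finds "hair" inside "prop" without any string matching on the rendered schema.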
4. Check if a Field Exists in a DataFrame
If you want to check that a column exists with a specific data type, use the Spark schema functions df.schema.fieldNames or df.schema.contains().
// Check if a Field Exists in a DataFrame
import org.apache.spark.sql.types.{StringType, StructField}
df.schema.fieldNames.contains("name")
df.schema.contains(StructField("name",StringType,true))
5. Complete Example of How to Check if a Column Exists in a Spark DataFrame
Following is a complete example for reference.
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object ColumnExists extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(
    Row("James", Row("black", "blue")),
    Row("Ann", Row("grey", "black"))
  )

  val schema = new StructType()
    .add("name", StringType)
    .add("prop", new StructType()
      .add("hair", StringType)
      .add("eye", StringType)
    )

  val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  df.printSchema()

  val columnNameToCheck = "name"
  if (df.columns.contains(columnNameToCheck))
    println("column exists")
  else
    println("column does not exist")
  // column exists

  println(df.columns.map(_.toUpperCase).contains(columnNameToCheck.toUpperCase))
  // true

  println(df.schema.simpleString.contains("hair:"))
  // true

  println(df.schema.fieldNames.contains("name"))
  // true

  println(df.schema.contains(StructField("name", StringType, true)))
  // true
}
Refer to the Spark GitHub Examples for more Spark examples.
Conclusion
In this Spark article, you have learned how to check whether a column exists in a DataFrame, whether it is present in a nested struct column, how to check case-insensitively, and finally how to check that it exists with a specific data type.
Happy Learning !!
References
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html