Spark Check Column Present in DataFrame

You can get all columns of a Spark DataFrame as an Array[String] using the columns attribute, and then use Scala's Array functions to check whether a column or field is present in the DataFrame. In this article, I will also cover how to check for a column inside a nested struct and how to check case-insensitively.

1. Spark Check if Column Exists in DataFrame

A Spark DataFrame has an attribute columns that returns all column names as an Array[String]. Once you have the column names, you can use the array function contains() to check whether the column is present.

Note that df.columns returns only top-level columns, not nested struct columns.
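To illustrate, here is a minimal sketch using a plain Array[String] standing in for what df.columns would return on the example schema used later in this article (a top-level name column plus a nested prop struct):

```scala
// Stand-in for what df.columns returns on the example schema
// (top-level "name" column plus a nested "prop" struct):
val cols = Array("name", "prop")

// Top-level columns are found...
println(cols.contains("name")) // true
// ...but fields nested inside "prop", such as "hair", are not:
println(cols.contains("hair")) // false
```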


// Spark check if column exists in DataFrame
val columnNameToCheck = "name"
if (df.columns.contains(columnNameToCheck))
  println("column exists")
else
  println("column does not exist")
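The same idea extends to validating several required columns at once with forall. A hypothetical sketch, again with a plain Array standing in for df.columns:

```scala
// "cols" stands in for df.columns on the example DataFrame
val cols = Array("name", "prop")

// Hypothetical extension: require several columns at once.
// forall returns false as soon as one column is missing.
val required = Seq("name", "prop")
val allPresent = required.forall(cols.contains)
println(allPresent) // true
```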

2. Check Column Exists Case-Insensitively

To check whether a column exists case-insensitively, convert the column name you want to check and all DataFrame column names to upper case, then compare. This is a general trick for case-insensitive comparison; converting both sides to lower case works just as well.


// Check column exists, case-insensitively
df.columns.map(_.toUpperCase).contains(columnNameToCheck.toUpperCase)
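Alternatively, a small hypothetical helper can compare each name with String.equalsIgnoreCase, avoiding the intermediate upper-cased array; "cols" below stands in for df.columns:

```scala
// Hypothetical helper: case-insensitive column check using
// String.equalsIgnoreCase instead of upper-casing every name.
def hasColumn(cols: Array[String], name: String): Boolean =
  cols.exists(_.equalsIgnoreCase(name))

// "cols" stands in for df.columns
val cols = Array("Name", "prop")
println(hasColumn(cols, "NAME")) // true
println(hasColumn(cols, "hair")) // false
```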

3. Check Column Exists in Nested Struct DataFrame

df.columns doesn't return columns from nested structs, so if you have a DataFrame with nested struct columns, you can check whether a column is present in a nested struct by getting the schema as a string using df.schema.simpleString and searching for the column name.


// Check Column exists in Nested Struct DataFrame
df.schema.simpleString.contains("hair:")
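Keep in mind that this is a plain substring match, so choose the search string carefully. A sketch using a hand-written stand-in for what df.schema.simpleString produces on the example schema (the string below is an assumption for illustration):

```scala
// Hand-written stand-in for df.schema.simpleString on the
// example schema (an assumption for illustration):
val simple = "struct<name:string,prop:struct<hair:string,eye:string>>"

println(simple.contains("hair:")) // true
// Substring matching is coarse: "air:" also matches even though
// no column is named "air", so pick the search string carefully.
println(simple.contains("air:")) // true
```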

4. Check if a Field Exists in a DataFrame

If you want to check whether a column exists with a specific data type, use the Spark schema functions: df.schema.fieldNames checks by name only, while df.schema.contains() with a StructField also matches the data type and nullability.


// Check if a Field Exists in a DataFrame
import org.apache.spark.sql.types.{StringType, StructField}
df.schema.fieldNames.contains("name")
df.schema.contains(StructField("name",StringType,true))

5. Complete Example of Checking if a Column Exists in a Spark DataFrame

Following is a complete example for reference.


import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object ColumnExists extends App {

  val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(Row("James",Row("black","blue")),
    Row("Ann",Row("grey","black"))
  )

  val schema = new StructType()
    .add("name",StringType)
    .add("prop",new StructType()
      .add("hair",StringType)
      .add("eye",StringType)
      )

  val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
  df.printSchema()

  val columnNameToCheck = "name"
  if (df.columns.contains(columnNameToCheck))
    println("column exists")
  else
    println("column does not exist")
  // column exists

  println(df.columns.map(_.toUpperCase).contains(columnNameToCheck.toUpperCase))
  // true

  println(df.schema.simpleString.contains("hair:"))
  // true

  println(df.schema.fieldNames.contains("name"))
  // true 

  println(df.schema.contains(StructField("name",StringType,true)))
  // true
}

Refer to Spark GitHub Examples for more Spark examples.

Conclusion

In this Spark article, you have learned how to check whether a column exists in a DataFrame, how to check for it in nested struct columns, how to match case-insensitively, and finally how to verify that it exists with the same data type.

Happy Learning !!

References

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
