Spark Check Column Data Type is Integer or String

When dealing with large datasets that contain different data types (DataType) in Spark, you often need to check the data type of a DataFrame column, and sometimes you need to get all columns of integer or string type to perform certain operations.

Related: Different Ways to Get All Column Names & Data Types in Spark

Below is a quick snippet showing how to check whether a DataFrame column's data type is Integer (int) or String in Spark. If you want to get all column names of integer, string, or any other specific type, read through the complete article for examples.


// Check if the 'name' column type is string
if (df.schema("name").dataType.typeName == "string")
   println(" name is 'string' column")

// Check if the 'id' column type is integer
if (df.schema("id").dataType.typeName == "integer")
   println(" id is 'integer' column")

Let’s see this with a detailed example. First, let’s create a DataFrame.


import spark.implicits._
val data = Seq((1,"Jeff","2012-04-14",2.34),
    (2,"Ram","2012-04-14",4.55),(3,"Scott","2012-04-14",4.56))
val df = data.toDF("id","name","dob","grade")
df.printSchema()

// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- dob: string (nullable = true)
// |-- grade: double (nullable = false)

1. Check Data Type of DataFrame Column

To check the data type of a specific DataFrame column, use df.schema, which returns all column names and their types; then look up the column by name to get its type. Also refer to Spark Convert DataFrame Column Data Type.


// Check Data Type of DataFrame Column
if (df.schema("name").dataType.typeName == "string")
   println(" name is 'string' column")
  
if (df.schema("id").dataType.typeName == "integer")
   println(" id is 'integer' column")

2. Select All Column Names of String Type

Sometimes you may want to replace all string type columns with a specific value, for example, replace an empty string with a null value in Spark. To do so, you can use df.schema.fields to get all DataFrame columns and apply a filter to keep only the string columns; a sketch of the actual replacement follows the output below.


// Select All Column Names of String Type
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
val stringColumns = df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
df.select(stringColumns.map(x=>col(x.name)):_*).show()

// +-----+----------+
// | name|       dob|
// +-----+----------+
// | Jeff|2012-04-14|
// |  Ram|2012-04-14|
// |Scott|2012-04-14|
// +-----+----------+
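To complete the use case mentioned above, here is a minimal sketch that replaces empty strings with null on just the string columns found above. The when()/otherwise() approach is one possible way to do this, and replacedDf is a name used here for illustration.


// Sketch: replace empty strings with null on all string columns
import org.apache.spark.sql.functions.{col, when}
val replacedDf = stringColumns.foldLeft(df) { (tmpDf, field) =>
  tmpDf.withColumn(field.name,
    when(col(field.name) === "", null).otherwise(col(field.name)))
}
replacedDf.show()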

Alternatively, you can also get the string columns by comparing the type name.


// Get All String Columns
val stringColumns1=df.schema.fields
      .filter(_.dataType.typeName == "string")
df.select(stringColumns1.map(x=>col(x.name)):_*).show()

Another way to get all columns of string type is to use df.dtypes.


val stringColumns2=df.dtypes.filter(_._2 == "StringType")
df.select(stringColumns2.map(x=>col(x._1)):_*).show()

3. Select All Column Names of Integer Type

If you want to get all column names of Integer type, use the below example. You would typically need this when you want to replace all integer columns with specific values, etc.; a sketch of such a replacement follows the example below.


// Get All Integer Columns
import org.apache.spark.sql.types.IntegerType
val integerColumns = df.schema.fields
   .filter(_.dataType.isInstanceOf[IntegerType])
df.select(integerColumns.map(x=>col(x.name)):_*).show()

// +---+
// | id|
// +---+
// |  1|
// |  2|
// |  3|
// +---+
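As a sketch of the replacement use case mentioned above, you can fill a default value on just the integer columns with na.fill(). The value 0 here is arbitrary, and since the sample data has no nulls this is a no-op on this particular DataFrame.


// Sketch: fill a default value (0, arbitrary) on only the integer columns
val intColNames = integerColumns.map(_.name)
val filledDf = df.na.fill(0, intColNames)
filledDf.show()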

Conclusion

In this article, you have learned how to check whether the data type of a column is string, integer, or any other type, and how to select all string or integer columns using Spark with Scala examples. You would typically need to find all string/integer columns in order to replace values for a specific type of column, for example, replacing all empty strings with null on all string columns.

Happy Learning !!
