Spark Check Column Data Type is Integer or String

When you are dealing with large datasets that contain columns of different data types (DataType) in Spark, you often need to check the data type of a DataFrame column, and sometimes you need to get all integer or string type columns to perform certain operations.

Related: Different Ways to Get All Column Names & Data Types in Spark

Below is a quick snippet showing how to check whether a DataFrame column data type is Integer (int) or String in Spark. If you want to get all column names of integer, string, or any other specific type, read through the complete article for examples.


// Check if the 'name' column type is string
if (df.schema("name").dataType.typeName == "string")
   println("name is a 'string' column")

// Check if the 'id' column type is integer
if (df.schema("id").dataType.typeName == "integer")
   println("id is an 'integer' column")

Let’s see this with a detailed example. First, let’s create a DataFrame.


import spark.implicits._
val data = Seq((1,"Jeff","2012-04-14",2.34),
    (2,"Ram","2012-04-14",4.55),(3,"Scott","2012-04-14",4.56))
val df = data.toDF("id","name","dob","grade")
df.printSchema()

//root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- dob: string (nullable = true)
// |-- grade: double (nullable = false)

Check Data Type of DataFrame Column

To check the type of a specific DataFrame column, use df.schema, which returns all column names and their types; then look up the column by name to get its type. Also refer to Spark Convert DataFrame Column Data Type.


if (df.schema("name").dataType.typeName == "string")
   println("name is a 'string' column")

if (df.schema("id").dataType.typeName == "integer")
   println("id is an 'integer' column")
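Instead of comparing the typeName string, you can also compare the column’s DataType against the type case objects directly. A minimal sketch, assuming the df created in this example:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType}

// Compare against the DataType case objects instead of type-name strings
if (df.schema("name").dataType == StringType)
  println("name is a 'string' column")

if (df.schema("id").dataType == IntegerType)
  println("id is an 'integer' column")
```

This avoids typos in the type-name string, since a wrong case object is a compile-time error rather than a silently false comparison.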

Select All Column Names of String Type

Sometimes you may want to replace all string type columns with a specific value, for example, replacing an empty string with a null value in Spark. To do so, you can use df.schema.fields to get all DataFrame columns and apply a filter to keep only the string columns.


import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val stringColumns = df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
df.select(stringColumns.map(x => col(x.name)): _*).show()

//+-----+----------+
//| name|       dob|
//+-----+----------+
//| Jeff|2012-04-14|
//|  Ram|2012-04-14|
//|Scott|2012-04-14|
//+-----+----------+

Alternatively, you can get the same result by comparing the type name.


// Get all string columns by type name
val stringColumns1 = df.schema.fields
  .filter(_.dataType.typeName == "string")
df.select(stringColumns1.map(x => col(x.name)): _*).show()

And another way to get all columns of string type is using df.dtypes, which returns an array of (columnName, typeName) pairs.


val stringColumns2 = df.dtypes.filter(_._2 == "StringType")
df.select(stringColumns2.map(x => col(x._1)): _*).show()
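Once you have the string columns, the empty-string-to-null replacement mentioned above can be sketched as follows. This assumes the df from this example; withNulls is a name introduced here for illustration:

```scala
import org.apache.spark.sql.functions.{col, when}

// Replace empty strings with null on every string-typed column
val withNulls = df.schema.fields
  .filter(_.dataType.typeName == "string")
  .foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name,
      when(col(field.name) === "", null).otherwise(col(field.name)))
  }
withNulls.show()
```

foldLeft rewrites one column at a time, so non-string columns pass through untouched.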

Select All Column Names of Integer Type

If you want to get all column names of Integer type, use the example below. You would typically need this if you wanted to replace all integer columns with specific values, etc.


// Get all integer columns
import org.apache.spark.sql.types.IntegerType

val integerColumns = df.schema.fields
  .filter(_.dataType.isInstanceOf[IntegerType])
df.select(integerColumns.map(x => col(x.name)): _*).show()

//+---+
//| id|
//+---+
//|  1|
//|  2|
//|  3|
//+---+
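The same filter works for any type, so you could wrap it in a small helper. selectColumnsOfType below is a hypothetical name, sketched against the df from this example:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType}

// Hypothetical helper: select every column whose type matches the given DataType
def selectColumnsOfType(src: DataFrame, dt: DataType): DataFrame = {
  val cols = src.schema.fields.filter(_.dataType == dt).map(f => col(f.name))
  src.select(cols: _*)
}

// With the df above, this selects the 'grade' column
selectColumnsOfType(df, DoubleType).show()
```

Because Spark’s simple DataTypes are case objects, the equality check `_.dataType == dt` works the same way for StringType, IntegerType, DoubleType, and so on.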

Conclusion

In this article, you have learned how to check whether the data type of a column is string, integer, or any other type, and how to select all string or integer columns, using Spark with Scala examples. You would typically need to find all string or integer columns in order to replace values for a specific column type, for example, replacing empty strings with null on all string columns.

Happy Learning !!

NNK