When you are dealing with large datasets that contain different data types (DataType) in Spark, you often need to check the data type of a DataFrame column, and sometimes you need to get all columns of a specific type, such as integer or string, to perform certain operations.
Related: Different Ways to Get All Column Names & Data Types in Spark
Below is a quick snippet showing how to check if a DataFrame column data type is Integer (int) or String in Spark. If you want to get all column names of integer, string, or any other specific type, read through the complete article for examples.
// Check if the 'name' column type is string
if (df.schema("name").dataType.typeName == "string")
  println(" name is 'string' column")

// Check if the 'id' column type is integer
if (df.schema("id").dataType.typeName == "integer")
  println(" id is 'integer' column")
Let’s see this with a detailed example. First, let’s create a DataFrame.
import spark.implicits._
val data = Seq((1, "Jeff", "2012-04-14", 2.34),
  (2, "Ram", "2012-04-14", 4.55),
  (3, "Scott", "2012-04-14", 4.56))
val df = data.toDF("id", "name", "dob", "grade")
df.printSchema()
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- dob: string (nullable = true)
// |-- grade: double (nullable = false)
1. Check Data Type of DataFrame Column
To check the data type of a specific DataFrame column, use df.schema, which returns the schema containing all column names and their types; then look up the column by name to get its DataType. Refer to Spark Convert DataFrame Column Data Type for changing a column's type.
// Check Data Type of DataFrame Column
if (df.schema("name").dataType.typeName == "string")
println(" name is 'string' column")
if (df.schema("id").dataType.typeName == "integer")
println(" id is 'integer' column")
2. Select All Column Names of String Type
Sometimes you may want to replace all string type columns with a specific value, for example, to replace an empty string with a null value in Spark. To do so, you can use df.schema.fields
to get all DataFrame columns and apply a filter to keep only the string columns.
// Select All Column Names of String Type
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val stringColumns = df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
df.select(stringColumns.map(x => col(x.name)): _*).show()
// +-----+----------+
// | name| dob|
// +-----+----------+
// | Jeff|2012-04-14|
// | Ram|2012-04-14|
// |Scott|2012-04-14|
// +-----+----------+
Alternatively, you can also get the string columns by comparing the typeName of each field.
// Get All String Columns
val stringColumns1 = df.schema.fields
  .filter(_.dataType.typeName == "string")
df.select(stringColumns1.map(x => col(x.name)): _*).show()
Another way to get all columns of string type is by using df.dtypes, which returns the column names and their data types as strings.
val stringColumns2 = df.dtypes.filter(_._2 == "StringType")
df.select(stringColumns2.map(x => col(x._1)): _*).show()
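Once you have the string columns, the empty-string-to-null replacement mentioned at the beginning of this section can be done by folding over them. Below is a minimal sketch, assuming the stringColumns value from the first example; the withNulls name and the when/otherwise logic are just one way to illustrate it.
// A sketch: replace empty strings with null on every string column
import org.apache.spark.sql.functions.{col, when}

val withNulls = stringColumns.map(_.name).foldLeft(df) { (tempDf, colName) =>
  tempDf.withColumn(colName, when(col(colName) === "", null).otherwise(col(colName)))
}
withNulls.show()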
3. Select All Column Names of Integer Type
If you want to know all column names of integer type, use the below example. You would typically need this if you wanted to replace all integer columns with specific values, etc.
// Get All Integer Columns
import org.apache.spark.sql.types.IntegerType

val integerColumns = df.schema.fields
  .filter(_.dataType.isInstanceOf[IntegerType])
df.select(integerColumns.map(x => col(x.name)): _*).show()
// +---+
// | id|
// +---+
// | 1|
// | 2|
// | 3|
// +---+
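Having the integer column names also makes it easy to replace their values; for example, you can fill every integer column with a default value using na.fill. Below is a minimal sketch, assuming the integerColumns value from the example above; the fill value 0 is an arbitrary choice for illustration.
// A sketch: fill all integer columns with a default value
val filled = df.na.fill(0, integerColumns.map(_.name))
filled.show()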
Conclusion
In this article, you have learned how to check whether the data type of a column is string, integer, or any other type, and how to select all string or integer columns using Spark with Scala examples. You would typically need to find all string or integer columns when replacing values for a specific type of column, for example, replacing all empty strings with null on all string columns.
Happy Learning !!
Related Articles
- Spark Check String Column Has Numeric Values
- Spark Check Column Present in DataFrame
- How to Check Spark Version
- Apache Spark Interview Questions
- Spark SQL – Select Columns From DataFrame
- Spark DataFrame Cache and Persist Explained