To change the Spark DataFrame column type from one data type to another data type can be done using “withColumn()“, “cast function”, “selectExpr”, and SQL expression. Note that the type which you want to convert to should be a subclass of DataType class.
In Spark, we can change or cast DataFrame columns to only the following types as these are the subclasses of DataType class.
ArrayType, BinaryType, BooleanType, CalendarIntervalType, DateType, HiveStringType, MapType, NullType, NumericType, ObjectType, StringType, StructType, TimestampType
Let’s see some examples here using Scala snippet, the same approach can also apply with PySpark.
First, let’s create DataFrame.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
val simpleData = Seq(Row("James",34,"2006-01-01","true","M",3000.60),
Row("Michael",33,"1980-01-10","true","F",3300.80),
Row("Robert",37,"06-01-1992","false","M",5000.50)
)
val simpleSchema = StructType(Array(
StructField("firstName",StringType,true),
StructField("age",IntegerType,true),
StructField("jobStartDate",StringType,true),
StructField("isGraduated", StringType, true),
StructField("gender", StringType, true),
StructField("salary", DoubleType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema)
df.printSchema()
df.show(false)
Outputs:
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = true)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = true)
+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James |34 |2006-01-01 |true |M |3000.6|
|Michael |33 |1980-01-10 |true |F |3300.8|
|Robert |37 |06-01-1992 |false |M |5000.5|
+---------+---+------------+-----------+------+------+
Change column type using withColumn and cast
To convert the data type of a DataFrame column, Use withColumn() with the original column name as a first argument and for the second argument apply the casting method cast()
with DataType on the column.
Below Spark, snippet changes DataFrame column, ‘age’ from Integer to String (StringType) , ‘isGraduated’ column from String to Boolean (BooleanType) and ‘jobStartDate‘ column from String to DateType.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("age",col("age").cast(StringType))
.withColumn("isGraduated",col("isGraduated").cast(BooleanType))
.withColumn("jobStartDate",col("jobStartDate").cast(DateType))
df2.printSchema()
Outputs:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
Cast column using Select
Below example cast’s selected columns using select() transformation. When you have many columns on DataFrame and wanted to cast selected columns this comes in handy.
val cast_df = df.select(df.columns.map {
case [email protected]"age" =>
col(column).cast("String").as(column)
case [email protected]"salary" =>
col(column).cast("String").as(column)
case column =>
col(column)
}: _*)
cast_df.printSchema()
col(column).cast("String").as(column))
will be converted as CAST(YEAR AS STRING) AS YEAR
. Yields below output.
root
|-- firstName: string (nullable = true)
|-- age: string (nullable = true)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: string (nullable = true)
This example has been shared by @sriramrimmalapudi9gmail-com
Change Column type using selectExpr
Using selectExpr
we can convert spark DataFrame column “age” from String to integer, “isGraduated” from boolean to string and “jobStartDate” from date to String.
val df3 = df2.selectExpr("cast(age as int) age",
"cast(isGraduated as string) isGraduated",
"cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(false)
root
|-- age: integer (nullable = true)
|-- isGraduated: string (nullable = true)
|-- jobStartDate: string (nullable = true)
Cast using SQL expression
We can also use SQL expression to change the spark DataFram column type.
df3.createOrReplaceTempView("CastExample")
val df4 = spark.sql("SELECT STRING(age),BOOLEAN(isGraduated),
DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(false)
Outputs:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
The complete example of changing DataFrame column type
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
object CastColumnType extends App{
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
val simpleData = Seq(Row("James",34,"2006-01-01","true","M",3000.60),
Row("Michael",33,"1980-01-10","true","F",3300.80),
Row("Robert",37,"06-01-1992","false","M",5000.50)
)
val simpleSchema = StructType(Array(
StructField("firstName",StringType,true),
StructField("age",IntegerType,true),
StructField("jobStartDate",StringType,true),
StructField("isGraduated", StringType, true),
StructField("gender", StringType, true),
StructField("salary", DoubleType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema)
df.printSchema()
df.show(false)
val df2 = df.withColumn("age",col("age").cast(StringType))
.withColumn("isGraduated",col("isGraduated").cast(BooleanType))
.withColumn("jobStartDate",col("jobStartDate").cast(DateType))
df2.printSchema()
val df3 = df2.selectExpr("cast(age as int) age",
"cast(isGraduated as string) isGraduated",
"cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(false)
df3.createOrReplaceTempView("CastExample")
val df4 = spark.sql("SELECT STRING(age),BOOLEAN(isGraduated), " +
"DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(false)
val cast_df = df.select(df.columns.map {
case [email protected]"age" =>
col(column).cast("String").as(column)
case [email protected]"salary" =>
col(column).cast("String").as(column)
case column =>
col(column)
}: _*)
cast_df.printSchema()
}
This example is also available at GitHub for reference.
Happy Learning !!
I have use .selectExpr, it worked however this omits my other column in DF.
You should add all required columns into selectExpr(), This function results in a new DataFrame with just columns mentioned in it.
It always shows like name ‘NumericType’ is not defined, I don’t what I just missed from your article? I am so confused
Have you tried running complete example mentioned at the end of the post? just FYI, I’ve copied the code and tried it now and got the expected issue without any error. If you are still facing an error, please provide with a complete error stack trace.
You have not imported types – import org.apache.spark.sql.types._
Thanks Emil. I will add it.