Spark – How to change column type?

The Spark DataFrame column type can be changed from one data type to another using “withColumn()”, the “cast()” function, “selectExpr()”, or a SQL expression. Note that the type you want to convert to should be a subclass of the DataType class.

In Spark, we can change or cast DataFrame columns only to the following types, as these are the subclasses of the DataType class.

ArrayType, BinaryType, BooleanType, CalendarIntervalType, DateType, HiveStringType, MapType, NullType, NumericType, ObjectType, StringType, StructType, TimestampType
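
All of these types live in the org.apache.spark.sql.types package. As a minimal sketch (not part of the original example), importing the package brings them into scope so they can be passed to cast():


  import org.apache.spark.sql.types._

  // Illustrative only: any of the listed subclasses can be used as a target type
  val targetType: DataType = StringType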

Let’s see some examples here using Scala snippets; the same approach also applies to PySpark.

First, let’s create a DataFrame.


  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types._

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val simpleData = Seq(Row("James",34,"2006-01-01","true","M",3000.60),
    Row("Michael",33,"1980-01-10","true","F",3300.80),
    Row("Robert",37,"06-01-1992","false","M",5000.50)
  )

  val simpleSchema = StructType(Array(
    StructField("firstName",StringType,true),
    StructField("age",IntegerType,true),
    StructField("jobStartDate",StringType,true),
    StructField("isGraduated", StringType, true),
    StructField("gender", StringType, true),
    StructField("salary", DoubleType, true)
  ))

  val df = spark.createDataFrame(
     spark.sparkContext.parallelize(simpleData),simpleSchema)
  df.printSchema()
  df.show(false)

Outputs:


root
 |-- firstName: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- jobStartDate: string (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James    |34 |2006-01-01  |true       |M     |3000.6|
|Michael  |33 |1980-01-10  |true       |F     |3300.8|
|Robert   |37 |06-01-1992  |false      |M     |5000.5|
+---------+---+------------+-----------+------+------+
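
Before casting, it can help to confirm the current column types. The following is a small sketch (not part of the original example) that uses the standard dtypes and schema APIs on the DataFrame created above:


  // Print each column name with its current data type
  df.dtypes.foreach { case (name, dataType) => println(s"$name -> $dataType") }
  // e.g. age -> IntegerType, jobStartDate -> StringType

  // Or look up a single field in the schema
  println(df.schema("age").dataType)   // IntegerType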

Change column type using withColumn and cast

To convert the data type of a DataFrame column, use “withColumn()” with the original column name as the first argument and, as the second argument, the column with the “cast()” method applied, passing the target DataType.

The Spark snippet below changes the ‘age’ column from Integer to String (StringType), the ‘isGraduated’ column from String to Boolean (BooleanType), and the ‘jobStartDate’ column from String to DateType.


  val df2 = df.withColumn("age",col("age").cast(StringType))
    .withColumn("isGraduated",col("isGraduated").cast(BooleanType))
    .withColumn("jobStartDate",col("jobStartDate").cast(DateType))
  df2.printSchema()

Outputs:


root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)
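
Note that cast() also accepts the target type as a string name instead of a DataType object. A minimal sketch of the same conversion written that way (df2b is just an illustrative name):


  // Same casts as above, using canonical type names with cast()
  val df2b = df.withColumn("age", col("age").cast("string"))
    .withColumn("isGraduated", col("isGraduated").cast("boolean"))
    .withColumn("jobStartDate", col("jobStartDate").cast("date"))
  df2b.printSchema()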

Change Column type using selectExpr

Using selectExpr(), we can convert the Spark DataFrame column “age” from String back to Integer, “isGraduated” from Boolean to String, and “jobStartDate” from Date to String.


  val df3 = df2.selectExpr("cast(age as int) age",
    "cast(isGraduated as string) isGraduated",
    "cast(jobStartDate as string) jobStartDate")
  df3.printSchema()
  df3.show(false)

Outputs:

root
 |-- age: integer (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- jobStartDate: string (nullable = true)
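
The same casts can also be written with select() and expr() from org.apache.spark.sql.functions. A short equivalent sketch (df3b is just an illustrative name):


  // Equivalent to the selectExpr() version above
  val df3b = df2.select(
    expr("cast(age as int) as age"),
    expr("cast(isGraduated as string) as isGraduated"),
    expr("cast(jobStartDate as string) as jobStartDate"))
  df3b.printSchema()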

Cast using SQL expression

We can also use a SQL expression to change the Spark DataFrame column type.


  df3.createOrReplaceTempView("CastExample")
  val df4 = spark.sql("SELECT STRING(age), BOOLEAN(isGraduated), " +
    "DATE(jobStartDate) FROM CastExample")
  df4.printSchema()
  df4.show(false)

Outputs:


root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)
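
Spark SQL also supports the standard CAST(column AS type) syntax, which is equivalent to the STRING()/BOOLEAN()/DATE() shorthands used above. A minimal sketch (df5 is just an illustrative name):


  // Same result as df4, using ANSI CAST syntax
  val df5 = spark.sql("SELECT CAST(age AS STRING) age, " +
    "CAST(isGraduated AS BOOLEAN) isGraduated, " +
    "CAST(jobStartDate AS DATE) jobStartDate FROM CastExample")
  df5.printSchema()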

The complete example of changing DataFrame column type


package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object CastColumnType extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val simpleData = Seq(Row("James",34,"2006-01-01","true","M",3000.60),
    Row("Michael",33,"1980-01-10","true","F",3300.80),
    Row("Robert",37,"06-01-1992","false","M",5000.50)
  )

  val simpleSchema = StructType(Array(
    StructField("firstName",StringType,true),
    StructField("age",IntegerType,true),
    StructField("jobStartDate",StringType,true),
    StructField("isGraduated", StringType, true),
    StructField("gender", StringType, true),
    StructField("salary", DoubleType, true)
  ))

  val df = spark.createDataFrame(
     spark.sparkContext.parallelize(simpleData),simpleSchema)
  df.printSchema()
  df.show(false)

  val df2 = df.withColumn("age",col("age").cast(StringType))
    .withColumn("isGraduated",col("isGraduated").cast(BooleanType))
    .withColumn("jobStartDate",col("jobStartDate").cast(DateType))
  df2.printSchema()


  val df3 = df2.selectExpr("cast(age as int) age",
    "cast(isGraduated as string) isGraduated",
    "cast(jobStartDate as string) jobStartDate")
  df3.printSchema()
  df3.show(false)

  df3.createOrReplaceTempView("CastExample")
  val df4 = spark.sql("SELECT STRING(age), BOOLEAN(isGraduated), " +
    "DATE(jobStartDate) FROM CastExample")
  df4.printSchema()
  df4.show(false)

}

Happy Learning !!
