PySpark – Cast Column Type With Examples

In PySpark, you can cast or change a DataFrame column's data type using the cast() function of the Column class. In this article, I will be using withColumn(), selectExpr(), and SQL expressions to cast a column from String to Int (IntegerType), String to Boolean (BooleanType), and so on, with PySpark examples.


Note that the type you want to convert to must be a subclass of the DataType class.

Below are the subclasses of DataType in PySpark; DataFrame columns can be changed or cast only to these types:

ArrayType, BinaryType, BooleanType, CalendarIntervalType, DateType, HiveStringType, MapType, NullType, NumericType, ObjectType, StringType, StructType, TimestampType
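
For illustration, here is a minimal (hypothetical) snippet confirming that these cast targets are subclasses of DataType; cast() also accepts the equivalent SQL type name as a string.


from pyspark.sql.types import DataType, IntegerType, StringType

# Every cast target is an instance of a DataType subclass
print(isinstance(IntegerType(), DataType))   # True
print(isinstance(StringType(), DataType))    # True

# cast() also accepts the SQL type name as a string, e.g.
# col("age").cast("int") is equivalent to col("age").cast(IntegerType())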

1. Cast Column Type With Example

Below are some examples that convert String Type to Integer Type (int)


from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType,BooleanType,DateType

# Convert String to Integer Type
df.withColumn("age",df.age.cast(IntegerType()))
df.withColumn("age",df.age.cast('int'))
df.withColumn("age",df.age.cast('integer'))

# Using select
df.select(col("age").cast('int').alias("age"))

#Using selectExpr()
df.selectExpr("cast(age as int) age")

#Using with spark.sql()
# Note: register the DataFrame as a temp view first,
# e.g. df.createOrReplaceTempView("CastExample") as shown in section 4
spark.sql("SELECT INT(age),BOOLEAN(isGraduated),DATE(jobStartDate) from CastExample")

Let’s run through an example. First, create a simple DataFrame with different data types.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com') \
                    .getOrCreate()

simpleData = [("James",34,"2006-01-01","true","M",3000.60),
    ("Michael",33,"1980-01-10","true","F",3300.80),
    ("Robert",37,"06-01-1992","false","M",5000.50)
  ]

columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

Outputs:


root
 |-- firstname: string (nullable = true)
 |-- age: long (nullable = true)
 |-- jobStartDate: string (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstname|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James    |34 |2006-01-01  |true       |M     |3000.6|
|Michael  |33 |1980-01-10  |true       |F     |3300.8|
|Robert   |37 |06-01-1992  |false      |M     |5000.5|
+---------+---+------------+-----------+------+------+

2. withColumn() – Change Column Type

Use withColumn() to convert the data type of a DataFrame column. This function takes the column name you want to convert as the first argument; for the second argument, apply the cast() method with the target DataType on the column.

The below PySpark snippet changes the DataFrame column age from Long to String (StringType), the isGraduated column from String to Boolean (BooleanType), and the jobStartDate column from String to DateType.


from pyspark.sql.functions import col
from pyspark.sql.types import StringType,BooleanType,DateType
df2 = df.withColumn("age",col("age").cast(StringType())) \
    .withColumn("isGraduated",col("isGraduated").cast(BooleanType())) \
    .withColumn("jobStartDate",col("jobStartDate").cast(DateType()))
df2.printSchema()

Outputs:


root
 |-- firstname: string (nullable = true)
 |-- age: string (nullable = true)
 |-- jobStartDate: date (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)
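
If you need to cast several columns at once, you can also build the cast expressions in a loop over df.columns. Below is a minimal sketch, assuming a hypothetical target_types mapping that is not part of the original example.


from pyspark.sql.functions import col

# Hypothetical mapping of column name -> target type name
target_types = {"age": "string", "isGraduated": "boolean", "jobStartDate": "date"}

# Cast only the mapped columns, keep the rest unchanged
df2 = df.select([
    col(c).cast(target_types[c]).alias(c) if c in target_types else col(c)
    for c in df.columns
])
df2.printSchema()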

3. selectExpr() – Change Column Type

selectExpr() is a DataFrame function that accepts SQL-like cast expressions; here we use it to convert the column age from String back to Integer, isGraduated from Boolean to String, and jobStartDate from Date to String.


df3 = df2.selectExpr("cast(age as int) age",
    "cast(isGraduated as string) isGraduated",
    "cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(truncate=False)

root
 |-- age: integer (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- jobStartDate: string (nullable = true)

+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true       |2006-01-01  |
|33 |true       |1980-01-10  |
|37 |false      |null        |
+---+-----------+------------+
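
Note that jobStartDate is null for the last row: the value "06-01-1992" does not match the default yyyy-MM-dd pattern, so the earlier cast to DateType produced null, and it stays null after casting back to String. When a column's dates use a different pattern, parse them with to_date() and an explicit format instead of a plain cast. Below is a minimal sketch using a hypothetical single-row DataFrame, assuming every value follows MM-dd-yyyy.


from pyspark.sql.functions import to_date, col

# Hypothetical data where every date uses the MM-dd-yyyy pattern
data = [("Robert", "06-01-1992")]
df_dates = spark.createDataFrame(data, ["firstname", "jobStartDate"])

# Parse with an explicit pattern instead of a plain cast to DateType
df_dates = df_dates.withColumn("jobStartDate", to_date(col("jobStartDate"), "MM-dd-yyyy"))
df_dates.printSchema()
df_dates.show()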

4. SQL – Cast Using SQL Expression

We can also use a PySpark SQL expression to change/cast a DataFrame column type. To use SQL, first register the DataFrame as a temporary view using createOrReplaceTempView(). In the SQL statement, just wrap the column with the desired type.


df3.createOrReplaceTempView("CastExample")
df4 = spark.sql("SELECT STRING(age),BOOLEAN(isGraduated),DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(truncate=False)

This yields the below output.


root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)
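
The STRING(), BOOLEAN(), and DATE() functions used above act as casts; the same result can also be written with the standard SQL CAST syntax. A small sketch (df5 is just an illustrative variable name):


df5 = spark.sql("SELECT CAST(age AS STRING) age, " +
    "CAST(isGraduated AS BOOLEAN) isGraduated, " +
    "CAST(jobStartDate AS DATE) jobStartDate FROM CastExample")
df5.printSchema()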

5. Complete Example of Casting PySpark Column

Below is a complete working example of how to convert the data types of DataFrame columns.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

simpleData = [("James",34,"2006-01-01","true","M",3000.60),
    ("Michael",33,"1980-01-10","true","F",3300.80),
    ("Robert",37,"06-01-1992","false","M",5000.50)
  ]

columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import col
from pyspark.sql.types import StringType,BooleanType,DateType

df2 = df.withColumn("age",col("age").cast(StringType())) \
    .withColumn("isGraduated",col("isGraduated").cast(BooleanType())) \
    .withColumn("jobStartDate",col("jobStartDate").cast(DateType()))
df2.printSchema()

df3 = df2.selectExpr("cast(age as int) age",
    "cast(isGraduated as string) isGraduated",
    "cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(truncate=False)

df3.createOrReplaceTempView("CastExample")
df4 = spark.sql("SELECT STRING(age),BOOLEAN(isGraduated),DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(truncate=False)

This example is also available at GitHub for reference.

6. Conclusion

In this PySpark article, you have learned how to cast or change a DataFrame column's data type from one type to another using withColumn(), selectExpr(), and SQL expressions.

Happy Learning !!
