PySpark SQL Types (DataType) with Examples

PySpark SQL Types are the classes that represent all data types in PySpark; they are defined in the pyspark.sql.types package and all derive from the base class DataType. These types are used to create a DataFrame with a specific schema. In this article, you will learn about the different data types and their utility methods with Python examples.

1. DataType – Base Class of all PySpark SQL Types

  • All data types from the below table are supported in PySpark SQL.
  • DataType class is a base class for all PySpark Types.
  • Some types like IntegerType, DecimalType, ByteType, etc. are subclasses of NumericType, which is in turn a subclass of DataType.
StringType             ShortType
ArrayType              IntegerType
MapType                LongType
StructType             FloatType
DateType               DoubleType
TimestampType          DecimalType
BooleanType            ByteType
CalendarIntervalType   HiveStringType
BinaryType             ObjectType
NumericType            NullType

PySpark SQL Data Types

1.1 PySpark DataType Common Methods

All PySpark SQL data types extend the DataType class and contain the following methods.

  • jsonValue() – Returns the JSON representation of the data type as a Python dict.
  • simpleString() – Returns the data type as a simple string. For collections, it also includes the type of the values the collection holds.
  • typeName() – Returns just the name of the data type.
  • fromJson() – Creates a data type from its JSON representation.
  • json() – Returns the JSON representation of the data type as a string.
  • needConversion() – Returns whether this type needs conversion between a Python object and an internal SQL object.
  • toInternal() – Converts a Python object into an internal SQL object.
  • fromInternal() – Converts an internal SQL object into a native Python object.

Below is an example that uses some of these methods.


from pyspark.sql.types import ArrayType, IntegerType

arrayType = ArrayType(IntegerType(), False)
print(arrayType.jsonValue())
print(arrayType.simpleString())
print(arrayType.typeName())

Yields below output.


{'type': 'array', 'elementType': 'integer', 'containsNull': False}
array<int>
array
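
The json(), fromJson(), needConversion(), toInternal(), and fromInternal() methods work similarly. Below is a minimal sketch (an illustration added here, not part of the original example) that round-trips a type through its JSON string and converts a Python date to and from Spark's internal representation, which stores dates as days since the Unix epoch.


import json
import datetime
from pyspark.sql.types import ArrayType, IntegerType, DateType

arrayType = ArrayType(IntegerType(), False)

# json() returns a JSON string; fromJson() rebuilds the type from the parsed dict
rebuilt = ArrayType.fromJson(json.loads(arrayType.json()))
print(rebuilt == arrayType)  # True

# DateType needs conversion between Python objects and internal SQL objects
dateType = DateType()
print(dateType.needConversion())  # True
internal = dateType.toInternal(datetime.date(2023, 1, 1))
print(internal)  # 19358 (days since 1970-01-01)
print(dateType.fromInternal(internal))  # 2023-01-01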

2. StringType

StringType (pyspark.sql.types.StringType) is used to represent string values. To create a string type, use StringType().


from pyspark.sql.types import StringType
strType = StringType()
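
As a quick illustrative sketch (assuming an existing SparkSession named spark), a StringType can be used inside a DataFrame schema:


from pyspark.sql.types import StructType, StructField, StringType

# Assumes a SparkSession named `spark` is already available
schema = StructType([StructField("name", StringType(), True)])
df = spark.createDataFrame([("James",), ("Maria",)], schema)
df.printSchema()  # name: string (nullable = true)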

3. ArrayType

Use ArrayType to represent arrays in a DataFrame and use ArrayType() to get an array object of a specific type.

On an ArrayType object you can access all the methods defined in section 1.1; additionally, it provides containsNull and elementType, to name a few.


from pyspark.sql.types import ArrayType, IntegerType

arrayType = ArrayType(IntegerType(), False)
print(arrayType.containsNull)
print(arrayType.elementType)

Yields below output.


False
IntegerType
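
For illustration, here is a minimal sketch (assuming an existing SparkSession named spark) that declares an ArrayType column in a DataFrame schema:


from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Assumes a SparkSession named `spark` is already available
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType()), True)
])
df = spark.createDataFrame([("James", ["Java", "Scala"])], schema)
df.printSchema()  # languages: array (element: string)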

For more examples and usage, please refer to Using ArrayType on DataFrame.

4. MapType

Use MapType to represent key-value pairs in a DataFrame, and use MapType() to get a map object of a specific key and value type.

On a MapType object you can access all the methods defined in section 1.1; additionally, it provides keyType, valueType, and valueContainsNull, to name a few.


from pyspark.sql.types import MapType, StringType, IntegerType

mapType = MapType(StringType(), IntegerType())

print(mapType.keyType)
print(mapType.valueType)
print(mapType.valueContainsNull)

Yields below output.


StringType
IntegerType
True
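
As an illustrative sketch (again assuming an existing SparkSession named spark), a MapType column can be declared in a schema and populated from a Python dict:


from pyspark.sql.types import StructType, StructField, StringType, MapType

# Assumes a SparkSession named `spark` is already available
schema = StructType([
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True)
])
df = spark.createDataFrame([("James", {"hair": "black", "eye": "brown"})], schema)
df.printSchema()  # properties: map (key: string, value: string)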

For more examples and usage, please refer to Using MapType on DataFrame.

5. DateType

Use DateType (pyspark.sql.types.DateType) to represent dates on a DataFrame, and use DateType() to get a date object.

On a DateType object you can access all the methods defined in section 1.1.

DateType accepts values in the format yyyy-MM-dd.
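
For example, a string column in that format can be cast to DateType. Below is a minimal sketch assuming an existing SparkSession named spark; the column name event_date is only for illustration.


from pyspark.sql.functions import col
from pyspark.sql.types import DateType

# Cast a yyyy-MM-dd string column to a date column
df = spark.createDataFrame([("2023-01-15",)], ["event_date"])
df = df.withColumn("event_date", col("event_date").cast(DateType()))
df.printSchema()  # event_date: date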

6. TimestampType

Use TimestampType (pyspark.sql.types.TimestampType) to represent timestamps on a DataFrame. Use TimestampType() to get a timestamp object.

On a TimestampType object you can access all the methods defined in section 1.1.

TimestampType accepts values in the format yyyy-MM-dd HH:mm:ss.SSSS.
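
Likewise, a string column in that format can be cast to TimestampType — a minimal sketch assuming an existing SparkSession named spark, with the column name event_time used only for illustration.


from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

# Cast a yyyy-MM-dd HH:mm:ss.SSSS string column to a timestamp column
df = spark.createDataFrame([("2023-01-15 12:30:45.0001",)], ["event_time"])
df = df.withColumn("event_time", col("event_time").cast(TimestampType()))
df.printSchema()  # event_time: timestamp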

7. StructType

Use StructType (pyspark.sql.types.StructType) to define the nested structure or schema of a DataFrame, and use the StructType() constructor to get a struct object.

A StructType object provides functions and attributes such as fields and fieldNames(), to name a few; a short usage sketch follows the example below.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession (any application name works)
spark = SparkSession.builder.appName("PySparkSQLTypes").getOrCreate()

data = [("James", "", "Smith", "36", "M", 3000),
    ("Michael", "Rose", "", "40", "M", 4000),
    ("Robert", "", "Williams", "42", "M", 4000),
    ("Maria", "Anne", "Jones", "39", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1)
  ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("age", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
  ])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
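
With the schema defined above, the fieldNames() method and the fields attribute can be used to inspect it, as sketched below.


print(schema.fieldNames())  # ['firstname', 'middlename', 'lastname', 'age', 'gender', 'salary']
print(schema.fields[0])     # the StructField for 'firstname'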

For more examples and usage, please refer to PySpark StructType & StructField.

8. Other Remaining PySpark SQL Data Types

Similar to the types described above, the rest of the data types use their constructors to create an object of the desired type, and all the common methods described in section 1.1 are available on these types.
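
For instance, here is a short sketch constructing a few of the remaining types and printing their simple strings:


from pyspark.sql.types import ShortType, DecimalType, BooleanType, BinaryType

print(ShortType().simpleString())         # smallint
print(DecimalType(10, 2).simpleString())  # decimal(10,2)
print(BooleanType().simpleString())       # boolean
print(BinaryType().simpleString())        # binary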

Conclusion

In this article, you have learned about the different PySpark SQL types, the DataType base class, and their methods using Python examples. For more details, refer to Types.

Thanks for reading. If you like the article, please share it using the social links below; any comments or suggestions are welcome in the comments section!

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love designing, optimizing, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge in Apache Spark, Hive, PySpark, R, etc.
