  • Post category:PySpark
  • Post last modified:March 1, 2024
PySpark SQL Types (DataType) with Examples

DataType is the base class of all PySpark SQL data types. It is defined in the pyspark.sql.types module, and its subclasses are used to create DataFrame columns of a specific type. In this article, you will learn the different data types and their utility methods with Python examples.


Related: PySpark SQL and PySpark SQL Functions

1. DataType – Base Class of all PySpark SQL Types

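  • Most of the data types in the table below are available in pyspark.sql.types. HiveStringType and ObjectType belong to Spark's Scala API and are not exposed in PySpark.
  • The DataType class is the base class for all PySpark types.
  • Some types, such as IntegerType, DecimalType, and ByteType, are subclasses of NumericType, which is itself a subclass of DataType.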
StringType              ShortType
ArrayType               IntegerType
MapType                 LongType
StructType              FloatType
DateType                DoubleType
TimestampType           DecimalType
BooleanType             ByteType
CalendarIntervalType    HiveStringType
BinaryType              ObjectType
NumericType             NullType
Table: PySpark SQL Data Types

1.1 PySpark DataType Common Methods

All PySpark SQL data types extend the DataType class and provide the following methods.

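  • jsonValue() – Returns the JSON representation of the data type as a Python value (a dict or a string).
  • simpleString() – Returns the data type as a compact string. For collections, it includes the type of the values the collection holds, for example array<int>.
  • typeName() – Returns just the name of the data type.
  • fromJson() – Creates a data type from its parsed JSON representation (the value returned by jsonValue()). It is defined on complex types such as ArrayType, MapType, and StructType.
  • json() – Returns the JSON representation of the data type as a string.
  • needConversion() – Indicates whether this type needs conversion between a Python object and an internal SQL object.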
  • toInternal() – Converts a Python object into an internal SQL object.
  • fromInternal() – Converts an internal SQL object into a native Python object.

Below is the usage of some of these.


from pyspark.sql.types import ArrayType,IntegerType
arrayType = ArrayType(IntegerType(),False)
print(arrayType.jsonValue()) 
print(arrayType.simpleString())
print(arrayType.typeName()) 

Yields below output.


{'type': 'array', 'elementType': 'integer', 'containsNull': False}
array<int>
array

2. StringType

StringType (pyspark.sql.types.StringType) is used to represent string values. To create a string type, use StringType().


from pyspark.sql.types import StringType
strType = StringType()

3. ArrayType

Use ArrayType to represent arrays in a DataFrame and use ArrayType() to get an array object of a specific type.

On an ArrayType object you can access all the methods defined in section 1.1. It additionally provides attributes such as containsNull and elementType.


from pyspark.sql.types import ArrayType,IntegerType
arrayType = ArrayType(IntegerType(),False)
print(arrayType.containsNull)
print(arrayType.elementType)

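Yields below output. Spark 3.3 and later print the element type as IntegerType(); earlier versions print IntegerType.


False
IntegerType()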

For more examples and usage, please refer to Using ArrayType on DataFrame.

4. MapType

Use MapType to represent key-value pairs in a DataFrame. Use MapType() to get a map object with a specific key and value type.

On a MapType object you can access all the methods defined in section 1.1. It additionally provides attributes such as keyType, valueType, and valueContainsNull.


from pyspark.sql.types import MapType,StringType,IntegerType
mapType = MapType(StringType(),IntegerType())
 
print(mapType.keyType)
print(mapType.valueType)
print(mapType.valueContainsNull)

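Yields below output. Spark 3.3 and later print the key and value types as StringType() and IntegerType(); earlier versions omit the parentheses.


StringType()
IntegerType()
True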

For more examples and usage, please refer to Using MapType on DataFrame.

5. DateType

Use DateType (pyspark.sql.types.DateType) to represent a date on a DataFrame. Use DateType() to get a date type object.

On the DateType object, you can access all the methods defined in section 1.1.

DateType accepts values in the format yyyy-MM-dd.

6. TimestampType

Use TimestampType (pyspark.sql.types.TimestampType) to represent a timestamp on a DataFrame. Use TimestampType() to get a timestamp type object.

On the TimestampType object, you can access all the methods defined in section 1.1.

TimestampType accepts values in the format yyyy-MM-dd HH:mm:ss.SSSS.

7. StructType

Use StructType (pyspark.sql.types.StructType) to define the nested structure or schema of a DataFrame. Use the StructType() constructor to get a struct object.

A StructType object provides attributes and methods such as fields and fieldNames(), to name a few.


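from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuse the active Spark session or create one if none exists
spark = SparkSession.builder.getOrCreate()

data = [("James","","Smith","36","M",3000),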
    ("Michael","Rose","","40","M",4000),
    ("Robert","","Williams","42","M",4000),
    ("Maria","Anne","Jones","39","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("age", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

For more examples and usage, please refer to PySpark StructType & StructField.

8. Other Remaining PySpark SQL Data Types

As with the types described above, the remaining data types are created by calling their constructors, and all the common methods described in section 1.1 are available on them.

Conclusion

In this article, you have learned the different PySpark SQL types, the DataType base class, and their methods using Python examples. For more details refer to Types.


Happy Learning !!
