The DataType class is the base class of all data types in PySpark, defined in the pyspark.sql.types module, and these types are used to create a DataFrame with a specific schema. In this article, you will learn about the different data types and their utility methods, with Python examples.
Related: PySpark SQL and PySpark SQL Functions
1. DataType – Base Class of all PySpark SQL Types
- All data types in the table below are supported in PySpark.
- DataType is the base class for all PySpark SQL types.
- Some types, such as IntegerType, DecimalType, and ByteType, are subclasses of NumericType, which in turn is a subclass of DataType.
| StringType           | ShortType      |
| ArrayType            | IntegerType    |
| MapType              | LongType       |
| StructType           | FloatType      |
| DateType             | DoubleType     |
| TimestampType        | DecimalType    |
| BooleanType          | ByteType       |
| CalendarIntervalType | HiveStringType |
| BinaryType           | ObjectType     |
| NumericType          | NullType       |
1.1 PySpark DataType Common Methods
All PySpark SQL data types extend the DataType class and contain the following methods.
- jsonValue() – Returns a JSON representation of the data type.
- simpleString() – Returns the data type as a simple string. For collections, it also returns the type of the values the collection holds.
- typeName() – Returns just the type name.
- fromJson() – Creates a data type from a JSON string.
- json() – Returns a JSON representation of the data type as a string.
- needConversion() – Returns whether this type needs conversion between a Python object and an internal SQL object.
- toInternal() – Converts a Python object into an internal SQL object.
- fromInternal() – Converts an internal SQL object into a native Python object.
Below is the usage of some of these.
from pyspark.sql.types import ArrayType, IntegerType
arrayType = ArrayType(IntegerType(), False)
print(arrayType.jsonValue())
print(arrayType.simpleString())
print(arrayType.typeName())
Yields below output.
{'type': 'array', 'elementType': 'integer', 'containsNull': False}
array<int>
array
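The remaining methods work the same way. Below is a minimal sketch (reusing the arrayType object from above) that round-trips the type through its JSON form; fromJson() is the classmethod defined on ArrayType.
from pyspark.sql.types import ArrayType, IntegerType
arrayType = ArrayType(IntegerType(), False)
# json() returns the JSON representation as a string
print(arrayType.json())
# fromJson() rebuilds the data type from its jsonValue() dictionary
rebuilt = ArrayType.fromJson(arrayType.jsonValue())
print(rebuilt == arrayType)          # True
# ArrayType of IntegerType needs no conversion; types like DateType return True
print(arrayType.needConversion())    # False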
2. StringType
StringType (pyspark.sql.types.StringType) is used to represent string values. To create a string type, use StringType().
from pyspark.sql.types import StringType
strType = StringType()
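As a quick illustration (a minimal sketch that assumes an existing SparkSession named spark), StringType is commonly used when casting a column:
from pyspark.sql.types import StringType

# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([(1,), (2,)], ["id"])
df.withColumn("id_str", df.id.cast(StringType())).printSchema()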
3. ArrayType
Use ArrayType to represent arrays in a DataFrame, and use the ArrayType() constructor to get an array object of a specific type.
On an ArrayType object you can access all the methods defined in section 1.1; additionally, it provides containsNull and elementType, to name a few.
from pyspark.sql.types import ArrayType, IntegerType
arrayType = ArrayType(IntegerType(), False)
print(arrayType.containsNull)
print(arrayType.elementType)
Yields below output.
False
IntegerType
For more examples and usage, please refer to Using ArrayType on DataFrame.
4. MapType
Use MapType to represent key-value pairs in a DataFrame, and use the MapType() constructor to get a map object of a specific key and value type.
On a MapType object you can access all the methods defined in section 1.1; additionally, it provides keyType, valueType, and valueContainsNull, to name a few.
from pyspark.sql.types import MapType, StringType, IntegerType
mapType = MapType(StringType(), IntegerType())
print(mapType.keyType)
print(mapType.valueType)
print(mapType.valueContainsNull)
Yields below output.
StringType
IntegerType
True
For more examples and usage, please refer to Using MapType on DataFrame.
5. DateType
Use DateType (pyspark.sql.types.DateType) to represent a date on a DataFrame, and use DateType() to get a date object.
On a DateType object, you can access all the methods defined in section 1.1.
DateType accepts values in the format yyyy-MM-dd.
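As a minimal sketch (again assuming an existing SparkSession named spark), a DateType column can be built from Python date objects:
from datetime import date
from pyspark.sql.types import StructType, StructField, DateType

# assumes an existing SparkSession named `spark`
schema = StructType([StructField("event_date", DateType(), True)])
df = spark.createDataFrame([(date(2024, 1, 15),)], schema)
df.printSchema()   # event_date: date
df.show()          # displays 2024-01-15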
6. TimestampType
Use TimestampType (pyspark.sql.types.TimestampType) to represent a timestamp on a DataFrame, and use TimestampType() to get a timestamp object.
On a TimestampType object, you can access all the methods defined in section 1.1.
TimestampType accepts values in the format yyyy-MM-dd HH:mm:ss.SSSS.
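Similarly, a minimal sketch (assuming an existing SparkSession named spark) that builds a TimestampType column from Python datetime objects:
from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

# assumes an existing SparkSession named `spark`
schema = StructType([StructField("event_time", TimestampType(), True)])
df = spark.createDataFrame([(datetime(2024, 1, 15, 10, 30, 0),)], schema)
df.printSchema()           # event_time: timestamp
df.show(truncate=False)    # displays 2024-01-15 10:30:00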
7. StructType
Use StructType (pyspark.sql.types.StructType) to define the nested structure or schema of a DataFrame, and use the StructType() constructor to get a struct object.
A StructType object provides members such as the fields attribute and the fieldNames() method, to name a few.
data = [("James","","Smith","36","M",3000),
("Michael","Rose","","40","M",4000),
("Robert","","Williams","42","M",4000),
("Maria","Anne","Jones","39","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("age", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
For more examples and usage, please refer to PySpark StructType & StructField.
8. Other Remaining PySpark SQL Data Types
Similar to the types described above, the remaining data types use their constructors to create an object of the desired type, and all the common methods described in section 1.1 are available on these types.
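For instance, here is a minimal sketch that constructs a few of the remaining types and prints their simple string representations:
from pyspark.sql.types import DecimalType, BooleanType, LongType, BinaryType

print(DecimalType(10, 2).simpleString())   # decimal(10,2)
print(BooleanType().simpleString())        # boolean
print(LongType().simpleString())           # bigint
print(BinaryType().simpleString())         # binary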
Conclusion
In this article, you have learned about all the different PySpark SQL types, the DataType base class, and their methods using Python examples. For more details, refer to Types.
Thanks for reading. If you liked it, please share the article using the social links below; any comments or suggestions are welcome in the comments section!
Happy Learning !!