Spark SQL Data Types with Examples

The Spark SQL DataType class is the base class of all data types in Spark and is defined in the package org.apache.spark.sql.types. Data types are primarily used while working with DataFrames. In this article, you will learn the different data types and their utility methods, with Scala examples.

1. Spark SQL DataType – base class of all Data Types

All data types listed below are supported in Spark SQL, and the DataType class is the base class for all of them. Some types, like IntegerType, DecimalType, ByteType, etc., are subclasses of NumericType, which in turn is a subclass of DataType.

StringType, ArrayType, MapType, StructType, DateType, TimestampType, BooleanType, CalendarIntervalType, BinaryType, NumericType,
ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType, ByteType, HiveStringType, ObjectType, NullType

1.1 DataType common methods

All Spark SQL data types extend the DataType class and provide implementations of the methods shown in the example below.


import org.apache.spark.sql.types._

val arrayType = ArrayType(StringType, true)
println("json() : " + arrayType.json)                   // JSON string of the data type
println("prettyJson() : " + arrayType.prettyJson)       // JSON in pretty format
println("simpleString() : " + arrayType.simpleString)   // simple string
println("sql() : " + arrayType.sql)                     // SQL format
println("typeName() : " + arrayType.typeName)           // type name
println("catalogString() : " + arrayType.catalogString) // catalog string
println("defaultSize() : " + arrayType.defaultSize)     // default size

Yields below output.


json() : {"type":"array","elementType":"string","containsNull":true}
prettyJson() : {
  "type" : "array",
  "elementType" : "string",
  "containsNull" : true
}
simpleString() : array<string>
sql() : ARRAY<STRING>
typeName() : array
catalogString() : array<string>
defaultSize() : 20

Besides these, the DataType class has the following static methods.

1.2 DataType.fromJson()

If you have a JSON string and want to convert it to a DataType, use fromJson(). For example, you may want to convert a JSON schema string to a StructType.


val typeFromJson = DataType.fromJson(
    """{"type":"array",
      |"elementType":"string","containsNull":false}""".stripMargin)
println(typeFromJson.getClass)
val typeFromJson2 = DataType.fromJson("\"string\"")
println(typeFromJson2.getClass)

//This prints
class org.apache.spark.sql.types.ArrayType
class org.apache.spark.sql.types.StringType$

1.3 DataType.fromDDL()

Similar to loading a structure from a JSON string, we can also create one from a DDL string using fromDDL().


val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING," +
    "`middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = DataType.fromDDL(ddlSchemaStr)
println(ddlSchema.getClass)
// This prints
class org.apache.spark.sql.types.StructType

1.4 DataType.canWrite()
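At a high level, canWrite() checks whether values of one data type can safely be written into a column of another data type (for example, when writing a DataFrame to an existing table). It is used mostly by Spark internally during write resolution, so you will rarely call it directly.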

1.5 DataType.equalsStructurally()
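equalsStructurally() returns true when two data types have the same structure, that is, the same types in the same order, even if the field names differ. Below is a minimal sketch, assuming the method is callable on the DataType object as listed above; the field names used here are made up for illustration.


import org.apache.spark.sql.types._

val schemaA = new StructType()
  .add("first", StringType)
  .add("age", IntegerType)
val schemaB = new StructType()
  .add("fname", StringType)
  .add("years", IntegerType)

// Same types in the same order, so this is expected to print true
println(DataType.equalsStructurally(schemaA, schemaB))
// Different shape (only one field), so this is expected to print false
println(DataType.equalsStructurally(schemaA, new StructType().add("first", StringType)))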

2. Use Spark SQL DataTypes class to get a type object

To get or create a specific data type, use the objects and factory methods provided by the org.apache.spark.sql.types.DataTypes class. For example, use the object DataTypes.StringType to get a StringType, and the factory method DataTypes.createArrayType(StringType) to get an ArrayType of string.


//Below are some examples  
val strType = DataTypes.StringType
val arrayType = DataTypes.createArrayType(StringType)
val structType = DataTypes.createStructType(
    Array(DataTypes.createStructField("fieldName",StringType,true)))

3. StringType

StringType “org.apache.spark.sql.types.StringType” is used to represent string values. To create a string type, use either DataTypes.StringType or the StringType object; both return a StringType instance.


  val strType = DataTypes.StringType
  println("json : "+strType.json)
  println("prettyJson : "+strType.prettyJson)
  println("simpleString : "+strType.simpleString)
  println("sql : "+strType.sql)
  println("typeName : "+strType.typeName)
  println("catalogString : "+strType.catalogString)
  println("defaultSize : "+strType.defaultSize)

Outputs


json : "string"
prettyJson : "string"
simpleString : string
sql : STRING
typeName : string
catalogString : string
defaultSize : 20

4. ArrayType

Use ArrayType to represent arrays in a DataFrame, and use either the factory method DataTypes.createArrayType() or the ArrayType() constructor to get an array object of a specific element type.

On an ArrayType object, you can access all methods defined in section 1.1; additionally, it provides containsNull(), elementType(), and productElement(), to name a few.


val arr = ArrayType(IntegerType,false)
val arrayType = DataTypes.createArrayType(StringType,true)
println("containsNull : "+arrayType.containsNull)
println("elementType : "+arrayType.elementType)
println("productElement : "+arrayType.productElement(0))

Yields below output.


containsNull : true
elementType : StringType
productElement : StringType

For more examples and usage, please refer to Using ArrayType on DataFrame.

5. MapType

Use MapType to represent maps with key-value pairs in a DataFrame, and use either the factory method DataTypes.createMapType() or the MapType() constructor to get a map object of a specific key and value type.

On a MapType object, you can access all methods defined in section 1.1; additionally, it provides keyType(), valueType(), valueContainsNull(), and productElement(), to name a few.


val mapType1 = MapType(StringType,IntegerType)
val mapType = DataTypes.createMapType(StringType,IntegerType)
println("keyType() : "+mapType.keyType)
println("valueType() : "+mapType.valueType)
println("valueContainsNull() : "+mapType.valueContainsNull)
println("productElement(1) : "+mapType.productElement(1))

Yields below output.


keyType() : StringType
valueType() : IntegerType
valueContainsNull() : true
productElement(1) : IntegerType

For more examples and usage, please refer to Using MapType on DataFrame.

6. DateType

Use DateType “org.apache.spark.sql.types.DateType” to represent date values on a DataFrame, and use either DataTypes.DateType or the DateType object to get a date type.

On a Date type object, you can access all methods defined in section 1.1.
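For example, a short sketch using the common methods on DateType:


  val dateType = DataTypes.DateType
  println("json : " + dateType.json)               // expected: "date"
  println("typeName : " + dateType.typeName)       // expected: date
  println("defaultSize : " + dateType.defaultSize) // expected: 4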

7. TimestampType

Use TimestampType “org.apache.spark.sql.types.TimestampType” to represent timestamp (date and time) values on a DataFrame, and use either DataTypes.TimestampType or the TimestampType object to get a timestamp type.

On a Timestamp type object, you can access all methods defined in section 1.1.
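Similarly, a short sketch using the common methods on TimestampType:


  val timestampType = DataTypes.TimestampType
  println("json : " + timestampType.json)               // expected: "timestamp"
  println("sql : " + timestampType.sql)                 // expected: TIMESTAMP
  println("defaultSize : " + timestampType.defaultSize) // expected: 8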

8. StructType

Use StructType “org.apache.spark.sql.types.StructType” to define the nested structure or schema of a DataFrame, and use either DataTypes.createStructType() or the StructType() constructor to get a struct object.

The StructType object provides a lot of functions, like toDDL(), fields(), fieldNames(), and length(), to name a few.


  //StructType
  val structType = DataTypes.createStructType(
    Array(DataTypes.createStructField("fieldName",StringType,true)))

  val simpleSchema = StructType(Array(
    StructField("name",StringType,true),
    StructField("id", IntegerType, true),
    StructField("gender", StringType, true),
    StructField("salary", DoubleType, true)
  ))

  val anotherSchema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("lastname",StringType))
    .add("id",IntegerType)
    .add("salary",DoubleType)

For more examples and usage, please refer to StructType.

9. All other remaining Spark SQL Data Types

Similar to the types described above, for the rest of the data types use the appropriate method on the DataTypes class or the data type's constructor to create an object of the desired type. All common methods described in section 1.1 are available on these types as well.
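For instance, a few of the remaining types can be created as in this short sketch:


  val decimalType = DataTypes.createDecimalType(10, 2)
  println("sql : " + decimalType.sql)                    // expected: DECIMAL(10,2)
  val booleanType = DataTypes.BooleanType
  println("json : " + booleanType.json)                  // expected: "boolean"
  val binaryType = DataTypes.BinaryType
  println("catalogString : " + binaryType.catalogString) // expected: binary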

Conclusion

In this article, you have learned the different Spark SQL data types, the DataType and DataTypes classes, and their methods using Scala examples. I would recommend referring to the DataType and DataTypes API documentation for more details.

Thanks for reading!

Happy Learning !!
