Spark printSchema() Example

org.apache.spark.sql.Dataset.printSchema() prints the schema of a DataFrame or Dataset in tree format, showing each column's name, data type, and nullability. If the DataFrame/Dataset has a nested structure, the schema is displayed as a nested tree.

1. printSchema() Syntax

Following is the syntax of the printSchema() method. It has two signatures: one without arguments and another that takes an integer argument. Both print the schema of the DataFrame to the console (standard output).


// printSchema() Syntax
printSchema(): Unit
printSchema(level: Int): Unit
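Since printSchema() returns Unit and only prints, it can help to have the same tree as a String when you want to send it to a logger instead. A minimal sketch, assuming a local SparkSession; the app name, the sample row, and the names sketchDf and schemaTree are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch (app name is hypothetical)
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("PrintSchemaSketch")
    .getOrCreate()
import spark.implicits._

// Small illustrative DataFrame with made-up data
val sketchDf = Seq(("Java", "20000")).toDF("language", "fee")

// Prints the schema tree to standard output and returns Unit
sketchDf.printSchema()

// The same tree as a String, for example to hand to a logger
val schemaTree: String = sketchDf.schema.treeString
println(schemaTree)
```

Here treeString is called on the StructType returned by df.schema, so nothing is printed until you decide where the text should go.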

2. Spark printSchema() Example

First, let’s create a Spark DataFrame with column names.


// Example 1 - DataFrame printSchema()
// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate();

// Create DataFrame
val columns = Seq("language","fee")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
import spark.implicits._
val df = data.toDF(columns:_*)

// Print Schema
df.printSchema()

The above example creates a DataFrame with two columns, language and fee. Since we have not specified the data types, Spark derives them from the Scala types of the values; both are strings here, so both columns are of type string. Now let's use printSchema(), which displays the schema of the DataFrame on the console.


# Output
root
 |-- language: string (nullable = true)
 |-- fee: string (nullable = true)
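To see how the derived types follow the Scala types, here is a small sketch in which fee is a Scala Int instead of a String. The app name and the names typedData and typedDf are assumptions made for this illustration:

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch (app name is hypothetical)
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("InferTypesSketch")
    .getOrCreate()
import spark.implicits._

// fee is a Scala Int here rather than a String
val typedData = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val typedDf = typedData.toDF("language", "fee")

typedDf.printSchema()
// root
//  |-- language: string (nullable = true)
//  |-- fee: integer (nullable = false)
```

The fee column comes out as integer, and it is marked non-nullable because a Scala Int cannot hold null.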

Now let’s assign a data type to each column by using Spark StructType and StructField.


// Example 2 - Create StructType Schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType, IntegerType}
val schema = StructType( Array(
    StructField("language", StringType,true),
    StructField("fee", IntegerType,true)
  ))
  
//Create DataFrame
// fee values are Ints so that they match IntegerType in the schema
val data2 = List(Row("Java", 20000), Row("Python", 100000), Row("Scala", 3000))
val df2 = spark.createDataFrame(
    spark.sparkContext.parallelize(data2),schema)
df2.printSchema()

This prints the schema below, in which fee is now an integer. To display the contents of the DataFrame, use the show() method.


# Output
root
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)

3. Print Schema with Nested Structure

When working with DataFrames we often need nested struct columns, which can be defined using StructType. In the example below, the name column has the data type StructType, so it is nested.

The printSchema() method shows StructType columns as struct.


// Example 3 - Nested structure
// Create Nested Structure
val schema_nest = StructType( Array(
    StructField("name",StructType( Array(
      StructField("firstname", StringType,true),
      StructField("middlename", StringType,true),
      StructField("lastname", StringType,true)
    ))),
    StructField("language", StringType,true),
    StructField("fee", IntegerType,true)
  ))

//Create DataFrame
val data3 = List(
    Row(Row("James","","Smith"),"Java", 20000),
    Row(Row("Michael","Rose",""),"Python", 100000)
    )
val df3 = spark.createDataFrame(
    spark.sparkContext.parallelize(data3),schema_nest)
df3.printSchema()

This prints the schema below to the console.


# Output
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)
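The nested tree also tells you how to address the inner fields: a field shown under name can be selected with dot notation. A minimal sketch, assuming a reduced version of the nested schema; the app name and the names nestedSchema, nestedDf, and firstNames are made up for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch (app name is hypothetical)
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("NestedSelectSketch")
    .getOrCreate()

// Reduced nested schema: name is a struct with two string fields
val nestedSchema = StructType(Array(
    StructField("name", StructType(Array(
      StructField("firstname", StringType, true),
      StructField("lastname", StringType, true)
    ))),
    StructField("language", StringType, true)
  ))

val nestedRows = List(Row(Row("James", "Smith"), "Java"))
val nestedDf = spark.createDataFrame(
    spark.sparkContext.parallelize(nestedRows), nestedSchema)

// Select one nested field using dot notation
val firstNames = nestedDf.select("name.firstname")
firstNames.printSchema()
// root
//  |-- firstname: string (nullable = true)

// Expand every field of the struct into top-level columns
nestedDf.select("name.*").printSchema()
```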

4. Print Schema with Level

The printSchema() method also takes an optional level parameter of type Int. It controls how many levels of the schema are printed when you have a multi-level nested schema.

For example, printSchema(1) displays just the first level of the schema.


// Print first level of Schema
df3.printSchema(1)

This prints just the first level of the schema; the fields nested inside name are omitted. Compare this with the full schema above.


# Output
root
 |-- name: struct (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)
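With deeper nesting, the level argument cuts the tree at the requested depth. The sketch below assumes a Spark version that has printSchema(level) (3.0 or later, to my knowledge) and uses a made-up three-level schema on an empty DataFrame; the app name and the names deepSchema and deepDf are hypothetical:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch (app name is hypothetical)
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SchemaLevelSketch")
    .getOrCreate()

// Three levels: name -> aliases -> nickname
val deepSchema = StructType(Array(
    StructField("name", StructType(Array(
      StructField("first", StringType, true),
      StructField("aliases", StructType(Array(
        StructField("nickname", StringType, true)
      )), true)
    )), true),
    StructField("language", StringType, true)
  ))

// An empty DataFrame is enough, since only the schema is printed
val deepDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], deepSchema)

// Two levels: aliases is listed, but its nickname field is not
deepDf.printSchema(2)
// root
//  |-- name: struct (nullable = true)
//  |    |-- first: string (nullable = true)
//  |    |-- aliases: struct (nullable = true)
//  |-- language: string (nullable = true)

// No argument: the full tree, including nickname
deepDf.printSchema()
```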

5. Print Schema for ArrayType and MapType

StructType also supports ArrayType and MapType to define DataFrame columns for array and map collections, respectively. In the example below, the languages column is defined as ArrayType(StringType) and properties as MapType(StringType,StringType), meaning both key and value are strings.


//Use ArrayType & MapType
import org.apache.spark.sql.types.{ArrayType,MapType}
val schema_col = StructType( Array(
    StructField("name", StringType,true),
    StructField("languages", ArrayType(StringType),true),
    StructField("properties", MapType(StringType,StringType),true)
  ))

//Create DataFrame
val data4 = List(
    Row("James",List("Java","Scala"), Map("hair"->"black","eye"->"brown")),
    Row("Michael",List("Python","PHP"), Map("hair"->"brown","eye"->"black"))
  )
val df4 = spark.createDataFrame(
    spark.sparkContext.parallelize(data4),schema_col)
df4.printSchema()

This outputs the schema below. Note that the languages field is of array type and properties is of map type.


# Output
root
 |-- name: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
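Array elements do not have to be simple types. As a sketch, assuming a hypothetical addresses column whose elements are structs, printSchema() nests the struct fields under the element line; the app name, the data, and the names addressType, arrSchema, and arrDf are made up for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Local session for the sketch (app name is hypothetical)
val spark = SparkSession.builder()
    .master("local[1]")
    .appName("ArrayOfStructSketch")
    .getOrCreate()

// Element type of the array: a struct with two string fields
val addressType = StructType(Array(
    StructField("city", StringType, true),
    StructField("zip", StringType, true)
  ))

val arrSchema = StructType(Array(
    StructField("name", StringType, true),
    StructField("addresses", ArrayType(addressType), true)
  ))

// Illustrative data: one person with one address
val arrRows = List(Row("James", Seq(Row("Newark", "07102"))))
val arrDf = spark.createDataFrame(
    spark.sparkContext.parallelize(arrRows), arrSchema)

arrDf.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- addresses: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- city: string (nullable = true)
//  |    |    |-- zip: string (nullable = true)
```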

6. Complete Example of Spark Print Schema


// Complete Example
import org.apache.spark.sql.{Row, SparkSession}
object SparkPrintSchema extends App {
  // Create SparkSession
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate();

  // Create DataFrame
  val columns = Seq("language","fee")
  val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.printSchema()

  // Create StructType Schema
  import org.apache.spark.sql.types.{StringType, StructField, StructType, IntegerType}
  val schema = StructType( Array(
    StructField("language", StringType,true),
    StructField("fee", IntegerType,true)
  ))

  //Create DataFrame
  // fee values are Ints so that they match IntegerType in the schema
  val data2 = List(Row("Java", 20000), Row("Python", 100000), Row("Scala", 3000))
  val df2 = spark.createDataFrame(
    spark.sparkContext.parallelize(data2),schema)
  df2.printSchema()

  //Create Nested Structure
  val schema_nest = StructType( Array(
    StructField("name",StructType( Array(
      StructField("firstname", StringType,true),
      StructField("middlename", StringType,true),
      StructField("lastname", StringType,true)
    ))),
    StructField("language", StringType,true),
    StructField("fee", IntegerType,true)
  ))

  //Create DataFrame
  val data3 = List(
    Row(Row("James","","Smith"),"Java", 20000),
    Row(Row("Michael","Rose",""),"Python", 100000)
    )
  val df3 = spark.createDataFrame(
    spark.sparkContext.parallelize(data3),schema_nest)
  df3.printSchema()

  //Use ArrayType & MapType
  import org.apache.spark.sql.types.{ArrayType,MapType}
  val schema_col = StructType( Array(
    StructField("name", StringType,true),
    StructField("languages", ArrayType(StringType),true),
    StructField("properties", MapType(StringType,StringType),true)
  ))

  //Create DataFrame
  val data4 = List(
    Row("James",List("Java","Scala"), Map("hair"->"black","eye"->"brown")),
    Row("Michael",List("Python","PHP"), Map("hair"->"brown","eye"->"black"))
  )
  val df4 = spark.createDataFrame(
    spark.sparkContext.parallelize(data4),schema_col)
  df4.printSchema()
}

7. Conclusion

In this article, you have learned the syntax and usage of the Spark printSchema() method through several examples, including how it prints the schema of a DataFrame that has nested struct, array, and map types.

Happy Learning !!

NNK
