org.apache.spark.sql.Dataset.printSchema()
is used to print the schema of a DataFrame or Dataset to the console in a tree format, along with each column name and data type. If the DataFrame/Dataset has a nested structure, it displays the schema as a nested tree.
1. printSchema() Syntax
Following is the syntax of the printSchema() method. It has two signatures: one without arguments and one that takes an integer argument. Both print the schema of the DataFrame to the console or log.
// printSchema() Syntax
printSchema(): Unit
printSchema(level: Int): Unit
2. Spark printSchema() Example
First, let’s create a Spark DataFrame with column names.
// Example 1 - DataFrame printSchema()
// Import
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
// Create DataFrame
val columns = Seq("language","fee")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
import spark.implicits._
val df = data.toDF(columns:_*)
// Print Schema
df.printSchema()
The above example creates a DataFrame with two columns, language and fee. Since we have not specified data types, Spark infers the type of each column from the column values (data). Now let’s use printSchema(), which displays the schema of the DataFrame on the console or in the logs.
# Output
root
|-- language: string (nullable = true)
|-- fee: string (nullable = true)
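Under the hood, printSchema() simply prints the tree form of df.schema, which returns the schema as a StructType value you can inspect programmatically. A minimal sketch (standalone, so the StructType below re-declares what Spark would infer for the DataFrame above):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The schema Spark infers for df above: both columns are strings
val inferred = StructType(Seq(
  StructField("language", StringType, nullable = true),
  StructField("fee", StringType, nullable = true)
))

// treeString is the same tree text that printSchema() writes to the console
println(inferred.treeString)

// Field names and types can also be read programmatically
println(inferred.fieldNames.mkString(", "))   // language, fee
println(inferred("fee").dataType.typeName)    // string
```

This is handy when you want the schema as a String (for logging) instead of printed output.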
Now let’s assign a data type to each column by using Spark StructType and StructField.
// Example 2 - Create StructType Schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType, IntegerType}
val schema = StructType( Array(
StructField("language", StringType,true),
StructField("fee", IntegerType,true)
))
//Create DataFrame
val data2 = List(Row("Java", 20000), Row("Python", 100000), Row("Scala", 3000))
val df2 = spark.createDataFrame(
spark.sparkContext.parallelize(data2),schema)
df2.printSchema()
This yields output similar to the above, except that fee is now an integer. To display the contents of the Spark DataFrame, use the show() method.
# Output
root
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
3. Print Schema with Nested Structure
While working with DataFrames we often need to handle nested struct columns, which can be defined using StructType. In the below example, the data type of the name column is StructType, i.e. nested. The printSchema() method on a Spark DataFrame shows StructType columns as struct.
// Example 3 - Nested structure
// Create Nested Structure
val schema_nest = StructType( Array(
StructField("name",StructType( Array(
StructField("firstname", StringType,true),
StructField("middlename", StringType,true),
StructField("lastname", StringType,true)
))),
StructField("language", StringType,true),
StructField("fee", IntegerType,true)
))
//Create DataFrame
val data3 = List(
Row(Row("James","","Smith"),"Java", 20000),
Row(Row("Michael","Rose",""),"Python", 100000)
)
val df3 = spark.createDataFrame(
spark.sparkContext.parallelize(data3),schema_nest)
df3.printSchema()
This prints the below schema to the console.
# Output
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
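Nested StructType columns can also be inspected programmatically: indexing into the schema with a field name returns a StructField, and for a struct column its dataType can be cast back to StructType to reach the inner fields. A small sketch using the same nested schema as above (re-declared here so the snippet is self-contained):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema_nest = StructType(Array(
  StructField("name", StructType(Array(
    StructField("firstname", StringType, nullable = true),
    StructField("middlename", StringType, nullable = true),
    StructField("lastname", StringType, nullable = true)
  ))),
  StructField("language", StringType, nullable = true),
  StructField("fee", IntegerType, nullable = true)
))

// The name column is itself a StructType; cast its dataType to reach the inner fields
val nameStruct = schema_nest("name").dataType.asInstanceOf[StructType]
println(nameStruct.fieldNames.mkString(", "))   // firstname, middlename, lastname
```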
4. Print Schema with Level
The Spark DataFrame printSchema() method also takes an optional level parameter of type Int, which lets you choose how many levels of a multi-nested schema to print.
For example, printSchema(1) displays just the first level of the schema.
// Print first level of Schema
df3.printSchema(1)
This prints only the first level of the schema; compare it with the full schema above.
# Output
root
|-- name: struct (nullable = true)
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
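If you need the depth-limited tree as a String rather than printed output, StructType.treeString also accepts a maxDepth argument in Spark 3.x (the same mechanism printSchema(level) uses internally). A sketch with the nested schema re-declared so it runs standalone:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema_nest = StructType(Array(
  StructField("name", StructType(Array(
    StructField("firstname", StringType),
    StructField("middlename", StringType),
    StructField("lastname", StringType)
  ))),
  StructField("language", StringType),
  StructField("fee", IntegerType)
))

// Depth-limited tree as a String; printSchema(1) prints the same text
val firstLevel = schema_nest.treeString(1)
println(firstLevel)
```

Because the depth is capped at 1, the inner firstname/middlename/lastname fields are omitted from the result.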
5. Print Schema for ArrayType and MapType
StructType also supports ArrayType and MapType to define DataFrame columns for array and map collections, respectively. In the below example, the column languages is defined as ArrayType(StringType) and properties as MapType(StringType, StringType), meaning both keys and values are strings.
//Use ArrayType & MapType
import org.apache.spark.sql.types.{ArrayType,MapType}
val schema_col = StructType( Array(
StructField("name", StringType,true),
StructField("languages", ArrayType(StringType),true),
StructField("properties", MapType(StringType,StringType),true)
))
//Create DataFrame
val data4 = List(
Row("James",List("Java","Scala"), Map("hair"->"black","eye"->"brown")),
Row("Michael",List("Python","PHP"), Map("hair"->"brown","eye"->"black"))
)
val df4 = spark.createDataFrame(
spark.sparkContext.parallelize(data4),schema_col)
df4.printSchema()
This outputs the below schema. Note that the field languages is an array type and properties is a map type.
# Output
root
|-- name: string (nullable = true)
|-- languages: array (nullable = true)
| |-- element: string (containsNull = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
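Besides the tree format, a StructType can describe itself in other forms that are handy for logging or persisting schemas: simpleString gives a compact one-line description, and json serializes the schema so it can later be restored with DataType.fromJson. A quick sketch with the array/map schema re-declared so it runs standalone:

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StringType, StructField, StructType}

val schema_col = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("languages", ArrayType(StringType), nullable = true),
  StructField("properties", MapType(StringType, StringType), nullable = true)
))

// Compact one-line description, e.g. struct<name:string,languages:array<string>,...>
println(schema_col.simpleString)

// Round-trip the schema through JSON; StructType equality is structural
val restored = DataType.fromJson(schema_col.json).asInstanceOf[StructType]
println(restored == schema_col)   // true
```

The JSON round-trip is useful when a schema must be stored alongside data files and re-applied when reading them back.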
6. Complete Example of Spark Print Schema
// Complete Example
import org.apache.spark.sql.{Row, SparkSession}
object SparkPrintSchema extends App {
// Create SparkSession
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
// Create DataFrame
val columns = Seq("language","fee")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
import spark.implicits._
val df = data.toDF(columns:_*)
df.printSchema()
// Create StructType Schema
import org.apache.spark.sql.types.{StringType, StructField, StructType, IntegerType}
val schema = StructType( Array(
StructField("language", StringType,true),
StructField("fee", IntegerType,true)
))
//Create DataFrame
val data2 = List(Row("Java", 20000), Row("Python", 100000), Row("Scala", 3000))
val df2 = spark.createDataFrame(
spark.sparkContext.parallelize(data2),schema)
df2.printSchema()
//Create Nested Structure
val schema_nest = StructType( Array(
StructField("name",StructType( Array(
StructField("firstname", StringType,true),
StructField("middlename", StringType,true),
StructField("lastname", StringType,true)
))),
StructField("language", StringType,true),
StructField("fee", IntegerType,true)
))
//Create DataFrame
val data3 = List(
Row(Row("James","","Smith"),"Java", 20000),
Row(Row("Michael","Rose",""),"Python", 100000)
)
val df3 = spark.createDataFrame(
spark.sparkContext.parallelize(data3),schema_nest)
df3.printSchema()
//Use ArrayType & MapType
import org.apache.spark.sql.types.{ArrayType,MapType}
val schema_col = StructType( Array(
StructField("name", StringType,true),
StructField("languages", ArrayType(StringType),true),
StructField("properties", MapType(StringType,StringType),true)
))
//Create DataFrame
val data4 = List(
Row("James",List("Java","Scala"), Map("hair"->"black","eye"->"brown")),
Row("Michael",List("Python","PHP"), Map("hair"->"brown","eye"->"black"))
)
val df4 = spark.createDataFrame(
spark.sparkContext.parallelize(data4),schema_col)
df4.printSchema()
}
7. Conclusion
In this article, you learned the syntax and usage of the Spark printSchema() method through several examples, including how it prints the schema of a DataFrame with nested structures, arrays, and maps.
Happy Learning !!
Related Articles
- Spark Schema – Explained with Examples
- Calculate Size of Spark DataFrame & RDD
- Spark Read Multiple CSV Files
- Spark Convert a Row into Case Class
- Spark Word Count Explained with Example
- How to Check Spark Version
- Spark Shell Command Usage with Examples
- Spark Merge Two DataFrames with Different Columns or Schema