PySpark Select Nested struct Columns

Using the PySpark select() transformation, you can select nested struct columns from a DataFrame. When working with semi-structured files like JSON or structured files like Avro, Parquet, and ORC, we often have to deal with complex nested structures. When you read these files into a DataFrame, all nested structure elements are converted into…
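As a minimal sketch of nested struct selection, the helpers below assume `df` is a PySpark DataFrame with a hypothetical struct column `name` that has `firstname` and `lastname` fields. These names are illustrative only and are not taken from the post.

```python
def select_nested_fields(df):
    """Pick two fields out of a struct column.

    Assumption: `df` is a pyspark.sql.DataFrame with a hypothetical
    struct column `name` holding `firstname` and `lastname`.
    """
    # Dot notation reaches into the struct; each selected field
    # becomes a top-level column in the returned DataFrame.
    return df.select("name.firstname", "name.lastname")


def select_all_nested_fields(df):
    """Expand every field of the hypothetical `name` struct."""
    # "name.*" turns each field of the struct into its own column.
    return df.select("name.*")
```

Both helpers return a new DataFrame and leave the input unchanged, since select() is a transformation.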


Spark SQL – Select Columns From DataFrame

In Spark SQL, the select() function is used to select one or multiple columns, nested columns, columns by index, all columns, columns from a list, or columns matching a regular expression from a DataFrame. select() is a transformation in Spark and returns a new DataFrame with the selected columns. You can also alias column…


PySpark Select Columns From DataFrame

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. PySpark select() is a transformation, hence it returns a new DataFrame with the selected columns. Select a Single & Multiple Columns from PySpark; Select All…
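The selection styles listed above can be sketched as follows. The column names `firstname` and `lastname` are assumptions for illustration, and `df` is assumed to be a PySpark DataFrame with at least two columns.

```python
def select_examples(df):
    """Return several selections from `df` (assumed pyspark.sql.DataFrame).

    Column names `firstname` and `lastname` are hypothetical.
    """
    single = df.select("firstname")                   # one column
    multiple = df.select("firstname", "lastname")     # several columns
    from_list = df.select(["firstname", "lastname"])  # names held in a list
    by_index = df.select(df.columns[:2])              # first two columns by position
    everything = df.select("*")                       # all columns
    return single, multiple, from_list, by_index, everything
```

Each call returns a new DataFrame; nothing is computed until an action such as show() or collect() runs.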


PySpark StructType & StructField Explained with Examples

PySpark StructType & StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns like nested struct, array, and map columns. StructType is a collection of StructFields, each of which defines a column name, a column data type, a boolean specifying whether the field can be nullable or not…


PySpark withColumnRenamed to Rename Column on DataFrame

Use PySpark withColumnRenamed() to rename a DataFrame column. We often need to rename one column or multiple (or all) columns on a PySpark DataFrame, and you can do this in several ways. When columns are nested, it becomes complicated. Since DataFrames are an immutable collection, you can't rename or update a column…
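One possible way to rename several top-level columns is to chain withColumnRenamed() calls, as in the sketch below. The helper name `rename_columns` and the mapping passed to it are hypothetical, and nested fields are not handled here.

```python
def rename_columns(df, renames):
    """Rename top-level columns of `df` (assumed pyspark.sql.DataFrame).

    `renames` is a hypothetical dict of {old_name: new_name}.
    """
    for old, new in renames.items():
        # Each call returns a new DataFrame; the original is never modified.
        df = df.withColumnRenamed(old, new)
    return df
```

For example, `rename_columns(df, {"dob": "date_of_birth"})` would return a DataFrame with the renamed column while `df` itself keeps its original column names.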


Spark read JSON with or without schema

By default, Spark SQL infers the schema while reading a JSON file, but we can override this and read JSON with a user-defined schema using the spark.read.schema("schema") method. What is Spark Schema? A Spark schema defines the structure of the data (column name, datatype, nested columns, nullable, etc.), and when it is specified while reading…
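A minimal sketch of both reading modes, assuming `spark` is an existing SparkSession. The helper name `read_json`, the file path, and the example schema are hypothetical.

```python
def read_json(spark, path, schema=None):
    """Read a JSON file, optionally with a user-defined schema.

    Assumptions: `spark` is a SparkSession; `schema` is either a
    StructType or a DDL string such as "name STRING, age INT"
    (a made-up example schema).
    """
    reader = spark.read
    if schema is not None:
        # With an explicit schema, Spark skips inference for this read.
        reader = reader.schema(schema)
    return reader.json(path)
```

Calling `read_json(spark, path)` lets Spark infer the schema, while `read_json(spark, path, "name STRING, age INT")` applies the given one.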


Spark Convert case class to Schema

Spark SQL provides Encoders to convert a case class to the Spark schema (a StructType object). If you are using older versions of Spark, you can create the Spark schema from a case class using a Scala hack. Both options are explained here with examples. First, let's create a case class "Name" &…


Spark Schema – Explained with Examples

A Spark schema defines the structure of the DataFrame, which you can inspect by calling the printSchema() method on the DataFrame object. Spark SQL provides StructType & StructField classes to programmatically specify the schema. By default, Spark infers the schema from the data; however, sometimes we may need to define our own…


Spark – Create a DataFrame with Array of Struct column

Problem: How to create a Spark DataFrame with an Array of Struct column using Spark and Scala? Using the StructType and ArrayType classes, we can create a DataFrame with an Array of Struct column (ArrayType(StructType)). In the example below, the column "booksInterested" is an array of StructType which holds "name", "author" and the…
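The post uses Scala with StructType and ArrayType; as a rough PySpark equivalent, the sketch below describes the same shape with a DDL schema string instead. Only the "name" and "author" fields of the struct are included because the excerpt is truncated, and the top-level `name` column and all sample rows are made up for illustration.

```python
# DDL form of: name STRING, booksInterested ArrayType(StructType(name, author)).
# The top-level `name` column is an assumption, not taken from the post.
BOOKS_SCHEMA = (
    "name STRING, "
    "booksInterested ARRAY<STRUCT<name: STRING, author: STRING>>"
)


def create_books_df(spark):
    """Build a small DataFrame with an array-of-struct column.

    Assumption: `spark` is an existing SparkSession. The rows are
    placeholder sample data.
    """
    data = [
        # Each inner tuple maps positionally onto the struct fields.
        ("James", [("Java", "XX"), ("Scala", "XA")]),
        ("Michael", [("Python", "XY")]),
    ]
    return spark.createDataFrame(data, BOOKS_SCHEMA)
```

Calling printSchema() on the result should show `booksInterested` as an array whose element type is a struct.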


Spark – explode Array of Struct to rows

Problem: How to explode Array of StructType DataFrame columns to rows using Spark. Solution: The Spark explode function can be used to explode Array of Struct (ArrayType(StructType)) columns to rows on a Spark DataFrame, shown here with a Scala example. Before we start, let's create a DataFrame with a Struct column in an array. From…
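The post works in Scala; a comparable PySpark sketch is shown below, using the SQL explode function through selectExpr(). The column names `name` and `booksInterested` and the alias `book` are assumptions for illustration.

```python
def explode_books(df):
    """Turn each element of an array-of-struct column into its own row.

    Assumption: `df` is a pyspark.sql.DataFrame with a hypothetical
    `name` column and an array-of-struct column `booksInterested`.
    """
    # explode() emits one output row per array element; the struct
    # itself lands in the aliased column `book`.
    return df.selectExpr("name", "explode(booksInterested) AS book")
```

Fields of the exploded struct can then be reached with dot notation on the `book` column in a follow-up select().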
