
Spark provides several read options that help you read data from different sources. spark.read() reads data from sources such as CSV, JSON, Parquet, Avro, ORC, JDBC, and many more, and returns a DataFrame or Dataset depending on the API used. In this article, we discuss the different Spark read options and their configurations, with examples.

1. Introduction

Since spark.read() can read from many different data sources, let's first look at how to read a few of them before diving into the available read options.

Here’s an example of how to read different files using spark.read():


// Imports
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark = SparkSession.builder()
        .appName("Creating DataFrame")
        .master("local[*]")
        .getOrCreate()

// Reading a CSV file
val csvDf = spark.read
  .csv("path/to/file.csv")

// Reading a JSON file
val jsonDf = spark.read
  .json("path/to/file.json")

// Reading a text file
val textDf = spark.read
  .text("path/to/file.txt")

// Reading a Parquet file
val parquetDf = spark.read
  .parquet("path/to/file.parquet")

// Reading a JDBC table with a custom query
// Note: the "query" and "dbtable" options are mutually exclusive
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("user", "myuser")
  .option("password", "mypassword")
  .option("query", "SELECT * FROM mytable WHERE column1 > 100")
  .load()

You can also specify a custom schema by using the schema method:


// Imports
import org.apache.spark.sql.types._

val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("gender", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(customSchema)
  .csv("path/to/file.csv")

Note: spark.read() is a lazy operation, which means that it won’t actually read the data until an action is performed on the DataFrame.
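
For example, here is a minimal sketch (assuming a CSV file exists at the placeholder path) of how the read is deferred until an action runs:

// Defining the DataFrame does not scan the data yet
// (Spark may read a small sample up front to resolve the header or schema)
val lazyDf = spark.read
  .option("header", "true")
  .csv("path/to/file.csv")

// Actions trigger the actual read
lazyDf.show(5)          // reads the file and displays the first 5 rows
println(lazyDf.count()) // reads the file and prints the total row count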

2. Spark read() options

Spark provides several read options that allow you to customize how data is read from the sources that are explained above. Here are some of the commonly used Spark read options:

2.1 Syntax of Spark read() options:

You can use option() from DataFrameReader to set options.


// Using read options
val df = spark.read.format("data_source_format")
                   .option("option1", "value1")
                   .option("option2", "value2")
                   .load("path/to/data")

Alternatively, you can set multiple options at once using options():


// Using read options
val df = spark.read.format("data_source_format")
                   .options(Map("option1" -> "value1", "option2" -> "value2"))
                   .load("path/to/data")

2.2 Available options

  1. header: Specifies whether the input file has a header row or not. This option can be set to true or false. For example, header=true indicates that the input file has a header row.
  2. inferSchema: Specifies whether to infer the schema of the input data. If set to true, Spark scans the data and tries to infer each column's type. If set to false, Spark uses the default schema for the data source (for CSV, every column is read as a string). For example, inferSchema=true indicates that Spark should try to infer the schema of the input data.
  3. delimiter: Specifies the delimiter used to separate fields in the input file. For example, delimiter=',' specifies that the input file uses a comma as the delimiter.
  4. encoding: Specifies the character encoding of the input file. For example, encoding='UTF-8' specifies that the input file is encoded using UTF-8.
  5. quote: Specifies the character used to enclose fields in the input file. For example, quote='"' specifies that the input file uses double quotes to enclose fields.
  6. escape: Specifies the character used to escape special characters in the input file. For example, escape='\\' specifies that the input file uses a backslash to escape special characters.
  7. multiLine: Specifies whether the input file has multiline records. If set to true, Spark will read multiline records as a single record. If set to false, Spark will read multiline records as separate records. For example, multiLine=true indicates that the input file has multiline records.
  8. ignoreLeadingWhiteSpace: Specifies whether to ignore leading whitespaces in fields. If set to true, Spark will ignore leading whitespaces. If set to false, Spark will consider leading whitespaces as part of the field value. For example, ignoreLeadingWhiteSpace=true indicates that Spark should ignore leading whitespaces.
  9. ignoreTrailingWhiteSpace: Specifies whether to ignore trailing whitespaces in fields. If set to true, Spark will ignore trailing whitespaces. If set to false, Spark will consider trailing whitespaces as part of the field value. For example, ignoreTrailingWhiteSpace=true indicates that Spark should ignore trailing whitespaces.

These are some of the commonly used read options in Spark. There are many other options available depending on the input data source.
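
For example, several of the CSV options above can be combined in a single read. The snippet below is a minimal sketch; the path, delimiter, and other values are placeholders to adapt to your data:

// Combining several common CSV read options
val df = spark.read
  .option("header", "true")                   // first row contains column names
  .option("inferSchema", "true")              // infer column types from the data
  .option("delimiter", "|")                   // fields are separated by a pipe
  .option("encoding", "UTF-8")                // character encoding of the input file
  .option("quote", "\"")                      // fields may be enclosed in double quotes
  .option("escape", "\\")                     // backslash escapes special characters
  .option("multiLine", "true")                // a record may span multiple lines
  .option("ignoreLeadingWhiteSpace", "true")  // trim leading whitespace in fields
  .option("ignoreTrailingWhiteSpace", "true") // trim trailing whitespace in fields
  .csv("path/to/file.csv")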

3. Spark Read Options with Examples

Here are some examples of how to configure Spark read options:

3.1. Configuring the number of partitions


val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "mytable")
  .option("numPartitions", 10)
  .load()

This caps the JDBC read at a maximum of 10 partitions (and therefore 10 concurrent JDBC connections). Note that numPartitions is a JDBC source option; to actually split the read into parallel partitions, combine it with partitionColumn, lowerBound, and upperBound as shown in section 3.5. For file sources such as CSV, the number of partitions is determined by the input file splits instead.

3.2. Configuring the schema


import org.apache.spark.sql.types._

val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("gender", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(customSchema)
  .csv("path/to/file.csv")

This configures the Spark read options with a custom schema for the data when reading a CSV file.

3.3. Configuring the sampling ratio


val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("samplingRatio", 0.5)
  .csv("path/to/file.csv")

This configures a sampling ratio of 0.5, so Spark infers the schema from roughly half of the input rows when reading a CSV file.

3.4. Configuring the corrupt record column name


val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("path/to/file.csv")

This configures the name of the column that stores corrupt records to _corrupt_record when reading a CSV file with inferred schema and no header row.

3.5. Configuring the partition column


val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "mytable")
  .option("user", "myuser")
  .option("password", "mypassword")
  .option("partitionColumn", "date")
  .option("lowerBound", "2020-01-01")
  .option("upperBound", "2020-12-31")
  .option("numPartitions", 12)
  .load()

This partitions the JDBC read on the date column, with a lower bound of 2020-01-01, an upper bound of 2020-12-31, and 12 partitions. Note that partitionColumn, lowerBound, upperBound, and numPartitions are JDBC source options; they have no effect on file sources such as CSV.

These are just a few examples of how to configure Spark read options. There are many more options available depending on the data source and format.

4. Conclusion

In conclusion, Spark read options are an essential feature for reading and processing data in Spark. These options allow users to specify various parameters when reading data from different data sources, such as file formats, compression, partitioning, schema inference, and many more.
