Spark/PySpark Application Configuration

In this Spark article, I will explain how to read Spark/PySpark application configurations, or any other configurations and properties, from external sources. But why provide them externally instead of hardcoding them in the codebase? In real-world Spark applications, configurations, properties, passwords, and similar values are not hardcoded inside the application. They are passed in externally so that they can be updated without changing the code.

1. Using the application.properties file

Use this approach when you have a set of unrelated configurations that you need to bundle in a single file (this file may be environment-specific, i.e. stage/dev/prod).

Below is a sample application.properties file. Here we specify the configurations as a flat set of key-value pairs, i.e. as a set of properties.


#Environmental properties
appEnv=PROD/UAT/SI/TEST
dev.executionmode=local
appName = ConfigTest

#Constants
hive_mrkt_table_name = MARKET_TABLE
hive_prdc_table_name = PRODUCT_TABLE
prdc_file = src/main/resources/Periods-2015-_2022.csv

#Snowflake properties
sfURL=DF12835.east-us-2.azure.snowflakecomputing.com
sfUser=sfuser
sfPassword=Sf_user@123
sfDatabase=SNOWFLAKE_SAMPLE_DATA
sfSchema=PCDS_SF100TCL
sfWarehouse=COMPUTE_WH

These Spark application configurations can be read with the following snippet:


// PropertyReader program
// Reads and loads the properties
package main.scala.org.example.config

import java.io.FileNotFoundException
import java.util.Properties
import scala.io.Source

object PropertyReader {
  def propertyReader(path: String): Properties = {
    // Look up the file on the classpath; getResource returns null when it is missing
    val url = getClass.getResource(path)
    if (url == null) {
      throw new FileNotFoundException(s"Properties file cannot be loaded: $path")
    }
    val source = Source.fromURL(url)
    try {
      val properties = new Properties()
      properties.load(source.bufferedReader())
      properties
    } finally {
      source.close() // release the underlying stream
    }
  }
}

In the above snippet, the propertyReader method takes the classpath location of the application.properties file as a parameter and returns a Properties object. You can import this method in another class and use the properties.

Here is an example of its usage.

In the main class, we call the propertyReader method discussed earlier with the path of the properties file as input, and populate the values for appName and the product data file path from the loaded properties using their keys.
These properties are then used further down in the same class.
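A minimal sketch of this kind of usage follows. To stay self-contained, it loads the properties from an inline string rather than from the classpath. The `PropertyUsageSketch` object and its `loadFromString` helper are hypothetical names used only for illustration, and the keys mirror the sample file above.

```scala
import java.io.StringReader
import java.util.Properties

object PropertyUsageSketch {
  // Hypothetical helper: parses properties text held in memory.
  // In the article's setup this role is played by PropertyReader.propertyReader(path).
  def loadFromString(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Inline stand-in for a few entries of the sample application.properties.
  val sample: String =
    """appName = ConfigTest
      |dev.executionmode=local
      |prdc_file = src/main/resources/Periods-2015-_2022.csv
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    val props = loadFromString(sample)
    val appName = props.getProperty("appName")
    val prdcFile = props.getProperty("prdc_file")
    // The two-argument form returns the given default when the key is absent.
    val master = props.getProperty("master", "local[*]")
    println(appName)
    println(prdcFile)
    println(master)
  }
}
```

The values read this way can then be handed to whatever needs them, for example the application name of a SparkSession builder.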

The output of the print statements:

[Figure: Spark application configuration, output of the property reader]

In this way, using java.util.Properties, we can read the key-value pairs from any external properties file, use them in the Spark application configuration, and avoid hardcoding.
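Since the properties file may be environment-specific, one option is to derive its name from the target environment. The sketch below assumes a naming scheme of `/application-<env>.properties` and an `APP_ENV` environment variable. Both are illustrative assumptions, not something the sample project defines.

```scala
object EnvPropertiesPath {
  // Environments accepted by this sketch (an assumption for illustration).
  val knownEnvs: Set[String] = Set("dev", "stage", "prod")

  // Builds the classpath resource name for the given environment,
  // e.g. "prod" gives "/application-prod.properties".
  def resourceFor(env: String): String = {
    val normalized = env.trim.toLowerCase
    require(knownEnvs.contains(normalized), s"Unknown environment: $env")
    s"/application-$normalized.properties"
  }

  def main(args: Array[String]): Unit = {
    // APP_ENV is a hypothetical variable name; fall back to dev when it is unset.
    val env = sys.env.getOrElse("APP_ENV", "dev")
    println(resourceFor(env))
  }
}
```

The resulting resource name could then be passed to a reader such as PropertyReader.propertyReader.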

2. Using the JSON file type

Use this approach when you have to specify multiple interrelated configurations. All you need to do is group these configurations under different headers.

Consider the following sample application.conf file, written in HOCON, the JSON superset used by the Typesafe Config library:


configs {
  spark {
    app-name = "my-app"
    master = "local[*]"
    log-level = "INFO"
  }
  snowflake {
    account = "account"
    username = "username"
    password = "password"
  }
  sql_queries {
    create_database = "create database if not exists ${db_Name}"
    drop_database = "drop database if exists ${db_Name}"
  }
  path {
    prdc_file = "src/main/resources/Periods-2015-_2022.csv"
  }
}

In the above config file, the configurations related to Spark, Snowflake, SQL queries, and paths are grouped under their respective headers to improve readability. You can also nest structures to any depth using this approach.
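Note that HOCON does not expand `${...}` substitutions inside quoted strings, so the `${db_Name}` text in the `sql_queries` entries reaches the application as a literal placeholder. One way to fill it in is a plain string replacement after the value has been read. A minimal sketch, with `fillDbName` as a hypothetical helper name and `sales_db` as a made-up database name:

```scala
object QueryTemplateSketch {
  // The literal placeholder as it appears in the config values.
  // This is a plain (non-interpolated) Scala string, so the text is kept as-is.
  val placeholder: String = "${db_Name}"

  // Replaces every occurrence of the placeholder with the given database name.
  def fillDbName(template: String, dbName: String): String =
    template.replace(placeholder, dbName)

  def main(args: Array[String]): Unit = {
    // Stand-in for the value read from sql_queries.create_database.
    val createTemplate = "create database if not exists ${db_Name}"
    println(fillDbName(createTemplate, "sales_db"))
  }
}
```

For values that come from untrusted input, a plain replacement like this would need validation of the database name before the query is run.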

So, let us see how to read these configurations:

Add the following dependency to your pom.xml


<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.3.4</version>
</dependency>

Typesafe

Typesafe Config supports Java properties, JSON, and a human-friendly JSON superset (HOCON). We can use the ConfigFactory.load() method to load the available configurations.

According to the official documentation, the standard behavior loads the following sources (first-listed are higher priority):

system properties
application.conf (all resources on the classpath with this name)
application.json (all resources on the classpath with this name)
application.properties (all resources on the classpath with this name)
reference.conf (all resources on the classpath with this name)
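The priority order above can be sketched with two in-memory layers combined through `withFallback`. This is an illustration only: the two inline strings stand in for application.conf and reference.conf, the `FallbackSketch` name is hypothetical, and the first line is a scala-cli directive that assumes the snippet is run with scala-cli (with Maven, the pom.xml dependency above plays that role).

```scala
//> using dep "com.typesafe:config:1.3.4"
import com.typesafe.config.{Config, ConfigFactory}

object FallbackSketch {
  // Stand-in for values that would come from application.conf.
  val appLayer: Config = ConfigFactory.parseString(
    """spark.app-name = "my-app"
      |""".stripMargin)

  // Stand-in for defaults that would come from reference.conf.
  val referenceLayer: Config = ConfigFactory.parseString(
    """spark.app-name = "default-app"
      |spark.log-level = "WARN"
      |""".stripMargin)

  // Keys found in appLayer win; missing keys are looked up in referenceLayer.
  val merged: Config = appLayer.withFallback(referenceLayer)

  def main(args: Array[String]): Unit = {
    println(merged.getString("spark.app-name"))
    println(merged.getString("spark.log-level"))
  }
}
```

Here `spark.app-name` resolves from the higher-priority layer, while `spark.log-level` is only defined in the fallback layer and is taken from there.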

Use the following lines of code to read the config parameters:


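package main.scala.org.example.config

import com.typesafe.config.{Config, ConfigFactory}

object configReader {
  // Loads the named classpath resource (for example "application.conf"),
  // layered between system properties and reference.conf, and resolves it
  def configReader(filePath: String): Config =
    ConfigFactory.load(filePath)
}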

In the above snippet, the configReader method takes the name of the application.conf file as a parameter and returns a Config. You can import this method in another class and use the configurations.

Here is an example of its usage.


// hiveTest Scala program
// Reads the config file and
// creates a SparkSession with the configs
package org.example.hive

import main.scala.org.example.config.configReader
import org.apache.spark.sql.SparkSession

object hiveTest {

  def appMain(args: Array[String]): Unit = {

    // Read the application.conf file using configReader
    val configs = configReader.configReader("application.conf").getConfig("configs")
    // Set variables using values from configs
    val applicationName = configs.getString("spark.app-name")
    val master_url = configs.getString("spark.master")
    val prdc_file_path = configs.getString("path.prdc_file")

    println(applicationName)
    println(master_url)
    println(prdc_file_path)

    // Initialize the Spark session
    val spark = SparkSession.builder
      .appName(applicationName)
      .master(master_url)
      .enableHiveSupport()
      .getOrCreate()

    // Read the prdc file and create a DataFrame
    val productDataFrame = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(prdc_file_path)

  }

  def main(args: Array[String]): Unit = {
    appMain(args)
  }
}

In the above snippet, we import the configReader object and call it from the main class, passing the application.conf file name.

First, we call the configReader method discussed earlier with the name of the config file as input, and extract the values for the application name, the Spark master, and the product data file path from configs. We can then use these variables directly in our application.

The output of the print statements:

[Figure: Spark/PySpark application configuration, output of the print statements]

In this way, using the Typesafe Config library, we can read properties from an external config file, use them in the application, and avoid hardcoding.

3. Conclusion

Storing Spark configurations and properties in an external file reduces the code changes needed when those values have to be updated frequently: we can simply update the external file. These methods reduce the need to move code between environments for configuration changes, and they keep values such as passwords out of the source code.
