In this Spark article, I will explain how to read Spark/PySpark application configuration or any other configurations and properties from external sources. But why do we need to provide them externally? Can't we hardcode them in the codebase? In real-world Spark applications, configurations, properties, passwords, etc. are not hardcoded inside the application. They are passed externally for the following reasons:
- A configuration change does not require changing and redeploying the application code. We can simply update the parameters in the config files.
- For security reasons, hardcoding passwords in the codebase is not a good practice.
There are multiple ways to read configuration files in Scala, but here are my two preferred approaches, depending on the structure of the configurations.
1. Using the application.properties file
Use this approach when you have a set of unrelated configurations that you need to bundle in a single file (this file may be environment-specific, i.e. stage/dev/prod).
Below we have a sample application.properties file. Here we specify the configurations simply as a key-value map i.e. as a set of properties.
#Environmental properties
appEnv=PROD/UAT/SI/TEST
dev.executionmode=local
appName = ConfigTest
#Constants
hive_mrkt_table_name = MARKET_TABLE
hive_prdc_table_name = PRODUCT_TABLE
prdc_file = src/main/resources/Periods-2015-_2022.csv
#Snowflake properties
sfURL=DF12835.east-us-2.azure.snowflakecomputing.com
sfUser=sfuser
sfPassword=Sf_user@123
sfDatabase=SNOWFLAKE_SAMPLE_DATA
sfSchema=PCDS_SF100TCL
sfWarehouse=COMPUTE_WH
These Spark application configurations can be read using the following snippet:
// PropertyReader program
// Reads and loads the properties
package main.scala.org.example.config

import java.io.FileNotFoundException
import java.util.Properties
import scala.io.Source

object PropertyReader {
  // Loads a properties file from the classpath, e.g. "/application.properties"
  def propertyReader(path: String): Properties = {
    val url = getClass.getResource(path)
    if (url == null) {
      throw new FileNotFoundException(s"Properties file cannot be loaded: $path")
    }
    val source = Source.fromURL(url)
    try {
      val properties = new Properties()
      properties.load(source.bufferedReader())
      properties
    } finally {
      // Close the underlying stream once the properties are loaded
      source.close()
    }
  }
}
In the above snippet, the propertyReader method takes the path of the application.properties file as a parameter and returns a Properties object. Because the file is located with getClass.getResource, the path is resolved against the classpath. You can import this method in another class and use the properties.
Here is an example of its usage.
In the main class, we call the propertyReader function discussed earlier with the path of the property file as input, and populate the values for appName and the product data file path from the configs using their keys.
Further down in the main class, these properties are used wherever the application needs them.
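A minimal sketch of such a caller, assuming the same keys as the sample file above. To stay self-contained it loads the properties from an in-memory string instead of going through PropertyReader; the object name PropertiesUsageSketch, the logLevel key, and its default value are illustrative assumptions, not part of the original application.

```scala
import java.io.StringReader
import java.util.Properties

object PropertiesUsageSketch {
  // Inline copy of a few entries from the sample application.properties
  val sample: String =
    """appName = ConfigTest
      |dev.executionmode=local
      |prdc_file = src/main/resources/Periods-2015-_2022.csv
      |""".stripMargin

  // Parses key-value pairs from the given text
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  def main(args: Array[String]): Unit = {
    val props = load(sample)
    val appName = props.getProperty("appName")
    val prdcFile = props.getProperty("prdc_file")
    // getProperty(key, default) avoids a null when a key is absent
    // ("logLevel" is a hypothetical key that the sample file does not define)
    val logLevel = props.getProperty("logLevel", "INFO")
    println(s"$appName, $prdcFile, $logLevel")
  }
}
```

Whitespace around the `=` separator is ignored by Properties.load, so `appName = ConfigTest` and `appName=ConfigTest` yield the same value.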
In this way, using java.util.Properties, we can read key-value pairs from any external property file, use them in the Spark application configuration, and avoid hardcoding.
2. Using the JSON file type
Use this approach when you have to specify multiple interrelated configurations. All you need to do is bucket these configurations under different headers.
Consider the following sample application.conf file, written in HOCON (the human-friendly JSON superset supported by Typesafe Config):
configs {
  spark {
    app-name = "my-app"
    master = "local[*]"
    log-level = "INFO"
  }
  snowflake {
    account = "account"
    username = "username"
    password = "password"
  }
  sql_queries {
    create_database = "create database if not exists ${db_Name}"
    drop_database = "drop database if exists ${db_Name}"
  }
  path {
    prdc_file = "src/main/resources/Periods-2015-_2022.csv"
  }
}
In the above config file, the configurations related to Spark, Snowflake, SQL queries, and paths are bucketed under their respective headers to improve readability. You can also nest structures to any depth using this approach.
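As a rough sketch of how nested headers are addressed, the following parses a shortened copy of the file from a string and reads values both by full dotted path and through a nested Config. It assumes the com.typesafe:config library is on the classpath; the object name NestedConfigSketch and the database name sales_db are illustrative assumptions.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object NestedConfigSketch {
  // Shortened inline copy of the sample application.conf
  val hocon: String =
    """configs {
      |  spark {
      |    app-name = "my-app"
      |    master = "local[*]"
      |  }
      |  sql_queries {
      |    create_database = "create database if not exists ${db_Name}"
      |  }
      |}
      |""".stripMargin

  val root: Config = ConfigFactory.parseString(hocon)

  // A full dotted path and a nested getConfig reach the same tree
  val masterByPath: String = root.getString("configs.spark.master")
  val sparkConf: Config = root.getConfig("configs.spark")
  val appName: String = sparkConf.getString("app-name")

  // Inside double quotes HOCON leaves ${...} untouched, so the placeholder
  // is filled in by application code ("sales_db" is a made-up value)
  val createDb: String = root
    .getString("configs.sql_queries.create_database")
    .replace("${db_Name}", "sales_db")

  def main(args: Array[String]): Unit = {
    println(masterByPath)
    println(appName)
    println(createDb)
  }
}
```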
So, let us see how to read these configurations:
- Add the following dependency to your pom.xml
<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.3.4</version>
</dependency>
Typesafe
Typesafe Config supports Java properties, JSON, and a human-friendly JSON superset (HOCON). We can use the ConfigFactory.load() method to load the available configurations.
According to the official documentation, the standard behavior loads the following type of files (first-listed are higher priority):
- system properties
- application.conf (all resources on the classpath with this name)
- application.json (all resources on the classpath with this name)
- application.properties (all resources on the classpath with this name)
- reference.conf (all resources on the classpath with this name)
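The priority order can be illustrated with a small sketch. Instead of relying on an application.conf being present on the classpath, it parses an inline stand-in and layers system properties on top with withFallback, which mirrors the first two entries of the list. The object name PrioritySketch and the override value "yarn" are illustrative assumptions.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object PrioritySketch {
  // Stand-in for the contents of application.conf
  val fileConfig: Config = ConfigFactory.parseString(
    """configs.spark.master = "local[*]"
      |configs.spark.log-level = "INFO"
      |""".stripMargin)

  // System properties win; the file stand-in fills in everything else
  def effective(): Config = {
    // systemProperties() is cached, so refresh it after changing a property
    ConfigFactory.invalidateCaches()
    ConfigFactory.systemProperties().withFallback(fileConfig).resolve()
  }

  def main(args: Array[String]): Unit = {
    // Equivalent to passing -Dconfigs.spark.master=yarn to the JVM
    System.setProperty("configs.spark.master", "yarn")
    val conf = effective()
    println(conf.getString("configs.spark.master"))    // value from the system property
    println(conf.getString("configs.spark.log-level")) // value from the file stand-in
  }
}
```

This is why a single value can be overridden at launch time without editing the config file.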
Use the following lines of code to read the config parameters:
package main.scala.org.example.config

import com.typesafe.config.{Config, ConfigFactory}

object configReader {
  // Loads the named resource (e.g. "application.conf") from the classpath
  def configReader(filePath: String): Config = {
    ConfigFactory.load(filePath)
  }
}
In the above snippet, we have the configReader method, which takes the name of the application.conf file as its parameter and returns a Config. ConfigFactory.load looks this name up on the classpath. You can import this method in another class and use the properties.
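When the config file lives outside the application jar, one option is ConfigFactory.parseFile, which reads from the local file system. The sketch below writes a temporary file only so that it can run on its own; the object name ExternalConfigSketch and the helper fromFile are hypothetical names, and in a real job the path would typically come from a command-line argument.

```scala
import java.io.File
import java.nio.file.Files
import com.typesafe.config.{Config, ConfigFactory}

object ExternalConfigSketch {
  // Parses a file from the local file system, then layers it over
  // whatever ConfigFactory.load() finds on the classpath
  def fromFile(file: File): Config =
    ConfigFactory.parseFile(file).withFallback(ConfigFactory.load()).resolve()

  def main(args: Array[String]): Unit = {
    // Temporary file standing in for an externally managed application.conf
    val tmp = Files.createTempFile("application", ".conf")
    Files.write(tmp, "configs.spark.app-name = \"my-app\"".getBytes("UTF-8"))

    val conf = fromFile(tmp.toFile)
    println(conf.getString("configs.spark.app-name"))
  }
}
```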
Here is an example of its usage.
// hiveTest Scala program
// Reads the config file and
// creates a SparkSession with the configs
package org.example.hive

import main.scala.org.example.config.configReader
import org.apache.spark.sql.SparkSession

object hiveTest {
  def appMain(args: Array[String]): Unit = {
    // Read the application.conf file using configReader
    val configs = configReader.configReader("application.conf").getConfig("configs")

    // Set variables using values from configs
    val applicationName = configs.getString("spark.app-name")
    val master_url = configs.getString("spark.master")
    val prdc_file_path = configs.getString("path.prdc_file")
    println(applicationName)
    println(master_url)
    println(prdc_file_path)

    // Initialize the Spark session
    val spark = SparkSession.builder
      .appName(applicationName)
      .master(master_url)
      .enableHiveSupport()
      .getOrCreate()

    // Read the prdc file and create a DataFrame
    val productDataFrame = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(prdc_file_path)
  }

  def main(args: Array[String]): Unit = {
    appMain(args)
  }
}
In the above snippet, we import the configReader object and call it from appMain with the application.conf file name.
First, we call the configReader function discussed earlier with the name of the config file as input, and extract the values for the application name, Spark master, and product data file path from the configs. We can use these variables directly in our application.
In this way, using the Typesafe Config library, we can read properties from an external JSON or HOCON file and use them in the application without hardcoding.
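For secrets such as the Snowflake password, the config file itself does not have to contain the real value. One possible sketch uses HOCON's optional substitution, which falls back to environment variables when the config is resolved. The variable name SF_PASSWORD and the object name SecretFromEnvSketch are hypothetical, and "placeholder" is only a dummy default.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object SecretFromEnvSketch {
  // The second assignment only takes effect when the (hypothetical)
  // environment variable SF_PASSWORD is set; otherwise the first one stays
  val hocon: String =
    """snowflake {
      |  username = "username"
      |  password = "placeholder"
      |  password = ${?SF_PASSWORD}
      |}
      |""".stripMargin

  // resolve() is what triggers the lookup of ${?SF_PASSWORD}
  val conf: Config = ConfigFactory.parseString(hocon).resolve()
  val password: String = conf.getString("snowflake.password")

  def main(args: Array[String]): Unit = {
    // Avoid printing the secret itself; report only where it came from
    val source = if (sys.env.contains("SF_PASSWORD")) "environment" else "file default"
    println(s"snowflake.password taken from: $source")
  }
}
```

With this layout the file checked into source control carries only a dummy value, while the deployment environment supplies the real one.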
3. Conclusion
Storing Spark configuration and properties in an external file reduces the need for code changes when values have to be updated frequently: we can simply update the external file. These methods avoid redeploying code for configuration changes and improve the security of your applications.