How to use the -D parameter or environment variable in a Spark job?

The -D parameter is used to set Java system properties (environment-style settings) for a Spark job submitted with spark-submit. Alternatively, you can also set these values by using the --conf option. These variables are commonly referred to as Spark configuration properties or Spark settings.

In this article, we shall discuss what the -D parameter (or environment variable) is in a Spark job and the different ways to pass it to a Spark job.

1. What is -D flag?

In Java, the -D flag is used to set system properties when running a Java application from the command line. System properties are key-value pairs that configure various aspects of the Java runtime environment. You can use the -D flag followed by the property name and value to set a system property.


# Usage -D with Java
java -Dpropertyname=propertyvalue YourJavaClass
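
Inside the application, a property set with -D can be read back with System.getProperty (or sys.props in Scala). Below is a minimal Scala sketch that reuses the hypothetical propertyname from the command above; the object name is illustrative only:

// Minimal sketch: read a system property that was set with -Dpropertyname=propertyvalue
object ReadSystemProperty {
  def main(args: Array[String]): Unit = {
    // System.getProperty returns null when the property is not set, so supply a default
    val value = Option(System.getProperty("propertyname")).getOrElse("not-set")
    println(s"propertyname = $value")
  }
}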

2. How to use -D to set environment variables in Spark?

In Spark, you can’t use -D directly with the spark-submit command to set environment variables. However, Spark provides a way to pass -D options to both executors and the driver by using spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.

By using these properties you can provide extra Java options, such as -D system properties and JVM memory settings, to the Spark executors and the Spark driver.

The usage of these properties in spark-submit is as follows.


# Usage of -D with spark-submit
spark-submit \
  --conf 'spark.executor.extraJavaOptions=-Dproperty=value -Dproperty2=value' \
  --conf 'spark.driver.extraJavaOptions=-Dproperty=value -Dproperty2=value' \
  ----- \
  ----- \
  YourJavaClass

Here, “property” represents the name of the property and “value” represents the desired value for that property. Multiple properties can be specified by providing multiple “-D” entries.

These configurations help customize the behavior of the Spark application according to the specific requirements of your job. They can control various aspects such as memory allocation, parallelism, serialization, logging, and more. Some commonly used Spark configuration properties include:

  • spark.executor.memory: Sets the amount of memory per executor.
  • spark.executor.cores: Sets the number of cores per executor.
  • spark.driver.memory: Sets the amount of memory allocated to the driver.
  • spark.default.parallelism: Sets the default parallelism for RDD operations.
  • spark.serializer: Specifies the serializer used for data serialization.

By using the “-D” parameter or environment variables, you can easily modify these properties without modifying the source code of your Spark application. This flexibility allows you to experiment with different configurations and optimize the performance of your Spark jobs.

Note that the --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app' option may not take effect for the driver when spark-submit runs in client mode, because the driver JVM may already be started by the time the setting is applied. Use --driver-java-options "-Dproperty=value" instead.


# Usage in client mode
spark-submit \
  --driver-java-options "-Dproperty=value" \
  ----- \
  ----- \
  YourJavaClass

Usage of -D with Examples

Following are a few examples of how to use the -D flag to set environment variables for the Spark job (executor and driver).


# spark-submit example
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/log4j.properties' \
  --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/log4j.properties' \
  ----- \
  ----- \
  YourJavaClass

Another example


# spark-submit example
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Denv=dev -Dkey1=value -Dkey2=value' \
  --conf 'spark.executor.extraJavaOptions=-Denv=dev -Dkey1=value -Dkey2=value' \
  ----- \
  ----- \
  YourJavaClass
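
Once the job is running, anything passed through extraJavaOptions is visible as a JVM system property on the corresponding process: driver options on the driver JVM and executor options on the executor JVMs. The following is a minimal Scala sketch that reads back the -Denv value from the example above; the object name and the small dummy RDD are illustrative, not part of the original example:

import org.apache.spark.sql.SparkSession

object ReadJavaOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadJavaOptions").getOrCreate()

    // Driver side: set via spark.driver.extraJavaOptions (or --driver-java-options)
    val driverEnv = sys.props.getOrElse("env", "not-set")
    println(s"Driver sees -Denv=$driverEnv")

    // Executor side: set via spark.executor.extraJavaOptions.
    // The lookup runs inside a task, so it happens on the executor JVMs.
    val executorValues = spark.sparkContext
      .parallelize(1 to 2)
      .map(_ => sys.props.getOrElse("env", "not-set"))
      .collect()
    println(s"Executors see -Denv=${executorValues.mkString(", ")}")

    spark.stop()
  }
}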

3. Using a Configuration File

To pass environment variables to a Spark job using a configuration file, you can follow these steps:

1. Create a configuration file, typically named spark-defaults.conf. You can place this file in the Spark configuration directory (e.g., conf/ within your Spark installation directory) or in a directory specified by the SPARK_CONF_DIR environment variable.

2. Inside the configuration file, specify the desired configuration properties as a key and a value separated by whitespace (an equals sign also works). Each property should be on a separate line.
Example:


spark.executor.memory  4g
spark.driver.memory    2g

In the example above, two properties are set: spark.executor.memory with a value of 4g and spark.driver.memory with a value of 2g. These properties determine the memory allocation for the executor and driver, respectively.

3. Run the spark-submit command, which will automatically read the configuration properties from the spark-defaults.conf file.


spark-submit --class com.example.YourSparkApp --master yarn --deploy-mode cluster your-spark-app.jar

In the above example,

  • the spark-submit command will execute your Spark application (your-spark-app.jar) in cluster mode using the YARN resource manager (--master yarn). The application class com.example.YourSparkApp should be replaced with the appropriate class name for your Spark application.
  • The Spark job will start with the configuration properties specified in the spark-defaults.conf file. Note that values set explicitly with --conf flags or programmatically in the application code take precedence over the defaults file.

Using a configuration file allows you to define and manage the Spark configuration properties in a separate file, making it easier to maintain and modify the properties without modifying the spark-submit command each time. It provides a more organized and reusable approach to configure your Spark jobs.
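
To confirm which values actually took effect (whether they came from spark-defaults.conf, --conf flags, or code), you can read them back from the running application. A minimal Scala sketch, assuming the two properties from the example above; the object name is illustrative:

import org.apache.spark.sql.SparkSession

object ShowEffectiveConf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShowEffectiveConf").getOrCreate()

    // Values resolved from spark-defaults.conf, --conf flags, or code
    println("spark.executor.memory = " + spark.conf.getOption("spark.executor.memory").getOrElse("<default>"))
    println("spark.driver.memory   = " + spark.conf.getOption("spark.driver.memory").getOrElse("<default>"))

    spark.stop()
  }
}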

4. Programmatically within Spark code

To pass environment variables to a Spark job programmatically within your Spark code, you can use the SparkConf object to set the desired configuration properties. Here’s how you can do it:


// Import the SparkConf class in your Spark application code.
import org.apache.spark.SparkConf

// Create an instance of SparkConf.
val conf = new SparkConf()

// Use the set() method of the SparkConf object to set 
// the desired configuration properties.
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "2g")

// Pass the SparkConf object to the SparkSession or SparkContext constructor 
// when creating the Spark session or context.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config(conf)
  .appName("YourSparkApp")
  .getOrCreate()

In the above example,

  • We pass the conf object to the config() method of SparkSession.builder() to configure the Spark session with the desired properties.
  • You can replace "YourSparkApp" with the desired name for your Spark application.
  • By setting the configuration properties programmatically within your Spark code, you can dynamically adjust the properties based on your application logic.
  • This approach is useful when you need fine-grained control over the configuration properties and want to customize them based on runtime conditions or external factors.

Note that configuration properties set programmatically within Spark code will override any default settings or properties specified through other methods, such as command-line arguments or configuration files. The exception is properties that affect how the driver JVM is launched (for example, spark.driver.memory in client mode), since the driver has already started by the time your code runs.
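
If you prefer not to create a separate SparkConf object, the same properties can also be set directly on the builder. A minimal sketch of that variant, using the same placeholder application name as above:

import org.apache.spark.sql.SparkSession

// Equivalent variant: pass the properties straight to the builder
val spark = SparkSession.builder()
  .appName("YourSparkApp")
  .config("spark.executor.memory", "4g")
  .config("spark.driver.memory", "2g")
  .getOrCreate()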

Conclusion

In conclusion, the “-D” parameter or environment variable in a Spark job is a flexible mechanism for configuring and customizing various aspects of the Spark application’s behavior. It allows you to set configuration properties at runtime without modifying the source code, providing greater flexibility and adaptability to different environments and requirements.

Spark provides a way to use -D and set environment variables for both executors and the driver by using spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
