The -D parameter is used with spark-submit to pass Java system properties (often loosely called environment variables) to a Spark job. Alternatively, you can set Spark's own configuration properties, also known as Spark settings, by using --conf.
In this article, we discuss what the -D parameter is in a Spark job and the different ways to pass such settings to a Spark job.
1. What is the -D flag?
In Java, the -D flag is used to set system properties when running a Java application from the command line. System properties are key-value pairs that configure various aspects of the Java runtime environment. You can use the -D flag followed by the property name and value to set a system property.
# Usage of -D with Java
java -Dpropertyname=propertyvalue YourJavaClass
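To see what the command above does from the application's side, here is a minimal sketch of a class that reads the property back. The class name SystemPropertyDemo and the helper readProperty are hypothetical names chosen for illustration; System.getProperty is the standard JDK call.

```java
public class SystemPropertyDemo {
    // Returns the value of a JVM system property, or the fallback when the
    // property was not supplied with -D (or set programmatically).
    static String readProperty(String name, String fallback) {
        return System.getProperty(name, fallback);
    }

    public static void main(String[] args) {
        // "propertyname" mirrors the placeholder used in the command above.
        System.out.println(readProperty("propertyname", "<not set>"));
    }
}
```

Running it as java -Dpropertyname=propertyvalue SystemPropertyDemo prints propertyvalue; without the -D option it prints the fallback.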
2. How to use -D to set environment variables in Spark?
In Spark, you can’t pass -D directly to the spark-submit command. However, Spark lets you pass -D options to both the executor and driver JVMs through the spark.executor.extraJavaOptions and spark.driver.extraJavaOptions properties.
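By using these properties, you can provide extra JVM options, such as system properties and GC settings, to the Spark executors and the Spark driver. Note that the maximum heap size (-Xmx) cannot be set this way; use spark.executor.memory and spark.driver.memory instead.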
The usage of these properties in spark-submit is as follows.
# Usage of -D with spark-submit
spark-submit -----
--conf 'spark.executor.extraJavaOptions=-Dproperty=value -Dproperty2=value'
--conf 'spark.driver.extraJavaOptions=-Dproperty=value -Dproperty2=value'
------
------
YourJavaClass
Here, “property” represents the name of a Java system property, and “value” represents the desired value for that property. Multiple properties can be specified by providing multiple “-D” options, separated by spaces, inside the same quoted value.
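When a job needs many properties, the quoted value can be assembled from a map instead of typed by hand. The following is only a sketch: JavaOptionsBuilder and toJavaOptions are hypothetical names, and it assumes keys and values contain no spaces or quotes (those would need extra shell quoting).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class JavaOptionsBuilder {
    // Formats each entry as -Dkey=value and joins them with single spaces.
    static String toJavaOptions(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> "-D" + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("property", "value");
        props.put("property2", "value");

        String opts = toJavaOptions(props);
        System.out.println("--conf 'spark.executor.extraJavaOptions=" + opts + "'");
        System.out.println("--conf 'spark.driver.extraJavaOptions=" + opts + "'");
    }
}
```

The two printed lines match the --conf arguments shown in the spark-submit usage above.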
In addition to JVM system properties, you can pass Spark configuration properties with --conf property=value. These configurations help customize the behavior of the Spark application according to the specific requirements of your job. They can control various aspects such as memory allocation, parallelism, serialization, logging, and more. Some commonly used Spark configuration properties include:
- spark.executor.memory: Sets the amount of memory per executor.
- spark.executor.cores: Sets the number of cores per executor.
- spark.driver.memory: Sets the amount of memory allocated to the driver.
- spark.default.parallelism: Sets the default parallelism for RDD operations.
- spark.serializer: Specifies the serializer used for data serialization.
By using “--conf” and the “-D” parameter, you can easily modify these settings without modifying the source code of your Spark application. This flexibility allows you to experiment with different configurations and optimize the performance of your Spark jobs.
Note that spark.executor.extraJavaOptions applies only to the executor JVMs, so --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' will not make the property visible to the driver. When the driver runs in client mode, use --driver-java-options "-Dproperty=value" instead.
# Usage in client mode
spark-submit -----
--driver-java-options "-Dproperty=value"
------
------
YourJavaClass
Usage of -D with Example
Following are a few examples of how to use the -D flag to pass system properties to the Spark job (executor and driver).
# spark-submit example
spark-submit -----
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=/path/log4j.properties'
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=/path/log4j.properties'
-----
-----
YourJavaClass
Another example
# spark-submit example
spark-submit -----
--conf 'spark.driver.extraJavaOptions=-Denv=dev -Dkey1=value -Dkey2=value'
--conf 'spark.executor.extraJavaOptions=-Denv=dev -Dkey1=value -Dkey2=value'
-----
-----
YourJavaClass
3. Using a Configuration File
To pass configuration properties to a Spark job using a configuration file, you can follow these steps:
1. Create a configuration file, typically named spark-defaults.conf. You can place this file in the Spark configuration directory (e.g., conf/ within your Spark installation directory) or in a directory specified by the SPARK_CONF_DIR environment variable.
2. Inside the configuration file, specify each configuration property as a key followed by its value, separated by whitespace. Each property should be on a separate line.
Example:
spark.executor.memory 4g
spark.driver.memory 2g
In the example above, two properties are set: spark.executor.memory with a value of 4g and spark.driver.memory with a value of 2g. These properties determine the memory allocation for the executor and driver, respectively.
3. Run the spark-submit command, which will automatically read the configuration properties from the spark-defaults.conf file.
spark-submit --class com.example.YourSparkApp --master yarn --deploy-mode cluster your-spark-app.jar
In the above example,
- The spark-submit command will execute your Spark application (your-spark-app.jar) in cluster mode using the YARN resource manager (--master yarn). The application class com.example.YourSparkApp should be replaced with the appropriate class name for your Spark application.
- The Spark job will start with the configuration properties specified in the spark-defaults.conf file, which override Spark's built-in defaults. Values passed on the spark-submit command line or set in SparkConf take precedence over the file.
Using a configuration file allows you to define and manage the Spark configuration properties in a separate file, making it easier to maintain and modify the properties without modifying the spark-submit command each time. It provides a more organized and reusable approach to configure your Spark jobs.
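To check how lines in the whitespace-separated format shown above split into keys and values, you can load the same text with the JDK's java.util.Properties, which accepts whitespace as a key-value separator. This is a sketch for inspecting a file's contents, not a description of Spark's own loader; DefaultsFileDemo and parse are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DefaultsFileDemo {
    // Parses text in "key value" (or "key=value") form, one property per line.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected for an in-memory string; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Same two lines as the spark-defaults.conf example above.
        String text = "spark.executor.memory 4g\nspark.driver.memory 2g\n";
        Properties props = parse(text);
        System.out.println(props.getProperty("spark.executor.memory")); // 4g
        System.out.println(props.getProperty("spark.driver.memory"));   // 2g
    }
}
```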
4. Programmatically within Spark Code
To set configuration properties programmatically within your Spark code, you can use the SparkConf object. Here’s how you can do it:
// Import the SparkConf class in your Spark application code.
import org.apache.spark.SparkConf
// Create an instance of SparkConf.
val conf = new SparkConf()
// Use the set() method of the SparkConf object to set
// the desired configuration properties.
conf.set("spark.executor.memory", "4g")
// Caution: in client mode the driver JVM has already started by this point,
// so spark.driver.memory set here does not take effect; pass it to
// spark-submit or put it in spark-defaults.conf instead.
conf.set("spark.driver.memory", "2g")
// Pass the SparkConf object to the SparkSession or SparkContext constructor
// when creating the Spark session or context.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .config(conf)
  .appName("YourSparkApp")
  .getOrCreate()
In the above example,
- We pass the conf object to the config() method of SparkSession.builder() to configure the Spark session with the desired properties.
- You can replace "YourSparkApp" with the desired name for your Spark application.
- By setting the configuration properties programmatically within your Spark code, you can dynamically adjust the properties based on your application logic.
- This approach is useful when you need fine-grained control over the configuration properties and want to customize them based on runtime conditions or external factors.
Note that configuration properties set programmatically within Spark code take precedence over the same properties specified through other methods: values set on SparkConf come first, then flags passed to spark-submit, then entries in spark-defaults.conf.
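The precedence rule can be pictured as a lookup that checks the highest-priority source first. The sketch below only illustrates that idea with plain maps; it is not Spark's implementation, and ConfigPrecedenceDemo, resolve, and the sample values are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ConfigPrecedenceDemo {
    // Returns the value from the first source that defines the key.
    // Sources must be ordered from highest to lowest priority.
    static Optional<String> resolve(String key, List<Map<String, String>> sourcesByPriority) {
        for (Map<String, String> source : sourcesByPriority) {
            String value = source.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative values only: code first, then submit flags, then the defaults file.
        Map<String, String> inCode = Map.of("spark.executor.memory", "4g");
        Map<String, String> submitFlags = Map.of(
                "spark.executor.memory", "8g",
                "spark.executor.cores", "2");
        Map<String, String> defaultsFile = Map.of("spark.executor.cores", "4");
        List<Map<String, String>> sources = List.of(inCode, submitFlags, defaultsFile);

        System.out.println(resolve("spark.executor.memory", sources).orElse("<unset>")); // 4g
        System.out.println(resolve("spark.executor.cores", sources).orElse("<unset>"));  // 2
        System.out.println(resolve("spark.serializer", sources).orElse("<unset>"));      // <unset>
    }
}
```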
Conclusion
In conclusion, the “-D” parameter in a Spark job is a flexible mechanism for passing Java system properties that configure and customize various aspects of the Spark application’s behavior. It allows you to set properties at runtime without modifying the source code, providing greater flexibility and adaptability to different environments and requirements.
Spark provides a way to pass -D options to both executors and drivers by using spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.