Configuring Spark Executor extraJavaOptions
Configuring spark.executor.extraJavaOptions is a pivotal part of optimizing Apache Spark applications. In my experience, this parameter allows fine-tuning of Java Virtual Machine (JVM) settings for Spark executors, covering critical factors such as memory allocation, garbage collection strategy, and system properties. By customizing these options, Spark engineers can improve performance, manage resources efficiently, and meet specific application requirements.
You can find all the configurations set on Spark executors in the Spark Web UI (or read them back programmatically, as sketched after this list):
- Go to Spark UI
- Select Environment
- Select Spark Properties
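If you prefer to verify this programmatically, here is a minimal Scala sketch (the app name is arbitrary) that reads the value back from the session's runtime configuration:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").appName("check-conf").getOrCreate()
// Read the option back; the second argument is the default returned when the key is unset
println(spark.conf.get("spark.executor.extraJavaOptions", ""))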
Using spark.executor.extraJavaOptions with spark-submit
Whether you use Spark with Scala or PySpark, you can use extraJavaOptions to set JVM options for the driver and executors.
# Using spark submit
spark-submit --master yarn \
--deploy-mode cluster \
--name my-app \
--conf 'spark.executor.extraJavaOptions=-DenvVar1=var1Value -DenvVar2=var2Value' \
--conf 'spark.driver.extraJavaOptions=-DenvVar1=var1Value -DenvVar2=var2Value'
........
........
Using SparkConf
You can also set the JVM options for the driver and executors when creating the SparkSession. The example below uses Scala; you can achieve the same in PySpark.
import org.apache.spark.sql.SparkSession
// Create SparkSession in Spark 2.x or later (note the builder method is config(), not conf())
val spark = SparkSession.builder().master("local[*]")
  .appName("SparkByExamples.com")
  .config("spark.driver.extraJavaOptions", "-DenvVar1=var1Value")
  .config("spark.executor.extraJavaOptions", "-DenvVar1=var1Value")
  .getOrCreate()
Note that when you submit your Spark or PySpark application in client mode, the Spark driver runs on the machine where you submit the application.
Note: In client mode, spark.driver.extraJavaOptions must not be set through SparkConf (using .config()) directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-java-options command-line option or in your default properties file, as in the sketch below.
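For example, a client-mode submission might look like this (a sketch; the option values are placeholders):
spark-submit --master yarn \
--deploy-mode client \
--driver-java-options "-DenvVar1=var1Value" \
--conf 'spark.executor.extraJavaOptions=-DenvVar1=var1Value' \
........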
Purpose of spark.executor.extraJavaOptions
Memory Configuration:
Users often need to fine-tune memory settings for Spark executors to optimize performance. Note that Spark does not allow setting the maximum heap size (-Xmx) through spark.executor.extraJavaOptions; the executor heap is controlled by spark.executor.memory. The property can still carry other memory-related JVM options, such as Metaspace or off-heap settings.
--conf spark.executor.memory=4g \
--conf spark.executor.extraJavaOptions="-XX:MaxMetaspaceSize=512m"
This example sets the executor heap to 4 gigabytes (spark.executor.memory=4g) and the maximum Metaspace size to 512 megabytes (-XX:MaxMetaspaceSize=512m).
Garbage Collection Configuration:
Garbage collection settings are critical for managing memory efficiently. Users can use spark.executor.extraJavaOptions to specify garbage collection algorithms, tuning parameters, and collector behavior.
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:MaxGCPauseMillis=100"
This example selects the G1 garbage collector (-XX:+UseG1GC) and sets a maximum pause-time target of 100 milliseconds (-XX:MaxGCPauseMillis=100).
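To confirm which collector is active and observe pause times, you can additionally enable GC logging. A sketch using JDK 9+ unified logging (this assumes the executors run on JDK 9 or later, and the log path is a placeholder):
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Xlog:gc*:file=/tmp/executor-gc.log"
The GC log is then written on each executor node at the given path.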
Java System Properties:
Users can pass Java system properties to Spark executors using spark.executor.extraJavaOptions. This is useful for setting properties that affect the behavior of the Java runtime environment.
--conf spark.executor.extraJavaOptions="-Dproperty=value"
This example sets a Java system property (-Dproperty=value) that can be accessed within Spark tasks.
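As a quick check, here is a minimal Scala sketch (reusing the property name property from the example above and a SparkSession named spark) that reads the value inside executor tasks:
// The map runs on executors, so it reads the executor JVM's -D options, not the driver's
val values = spark.sparkContext
  .parallelize(1 to 2)
  .map(_ => System.getProperty("property", "unset"))
  .collect()
values.foreach(println)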
Logging Configuration:
Logging settings for Spark executors can be customized using spark.executor.extraJavaOptions. This includes specifying log levels, log file locations, and log format.
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/path/to/log4j.properties"
Here, a custom log4j configuration file is specified using a Java system property. Note that Spark 3.3 and later ship with Log4j 2, which reads its configuration from -Dlog4j2.configurationFile instead.
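In cluster mode the configuration file must be readable on every executor node; one common approach is to ship it with --files and reference it by file name (a sketch, Log4j 1.x-style):
spark-submit --master yarn \
--deploy-mode cluster \
--files /path/to/log4j.properties \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties' \
........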
JVM Debugging:
Enabling JVM debugging is facilitated by spark.executor.extraJavaOptions. Users can set options to enable remote debugging and specify the port for connecting a debugger.
--conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
This example starts the JDWP debug agent listening on port 5005 (-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005). Keep in mind that with multiple executors on one node a fixed port can conflict, and suspend=y would make each executor JVM wait for a debugger before starting.
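Once an executor JVM is listening, you can attach any JDWP-capable debugger, for example the JDK's jdb tool (assuming network access to the executor host; the host name here is a placeholder):
jdb -attach executor-host:5005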
External Library Paths:
Applications that rely on native (JNI) libraries may need those libraries visible to Spark executors. spark.executor.extraJavaOptions can set java.library.path for this purpose; note that this affects native libraries only, while JVM classpath entries are controlled by the separate spark.executor.extraClassPath property.
--conf spark.executor.extraJavaOptions="-Djava.library.path=/path/to/libs"
This sets the java.library.path system property to include the specified directory.
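For comparison, a sketch showing the two properties side by side (paths are placeholders):
# Native (JNI) libraries: extend java.library.path via extraJavaOptions
--conf spark.executor.extraJavaOptions="-Djava.library.path=/path/to/native/libs"
# JVM classpath entries: use the dedicated classpath property instead
--conf spark.executor.extraClassPath=/path/to/extra/jars/*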
Conclusion
In summary, spark.executor.extraJavaOptions is a versatile configuration property in Apache Spark that lets users tailor the Java runtime environment of Spark executors. It is useful for memory tuning, garbage collection configuration, system properties, logging customization, JVM debugging, and managing native library paths. Choose and configure these options based on your application's requirements and the characteristics of your Spark cluster; applying them without understanding their effects can have side effects, including degraded performance.
Related Articles
- Spark Set Environment Variable to Executors
- Spark Set JVM Options to Driver & Executors
- Difference Between Spark Driver vs Executor
- Difference Between Spark Worker vs Executor
- How to Set Apache Spark Executor Memory
- Tune Spark Executor Number, Cores, and Memory