Usage of Spark Executor extraJavaOptions

Configuring Spark Executor extraJavaOptions is a pivotal aspect of optimizing Apache Spark applications. In my experience, this parameter allows fine-tuning of Java Virtual Machine (JVM) settings for Spark Executors, addressing critical factors such as memory allocation, garbage collection strategies, and system properties. By customizing these options, Spark engineers can enhance performance, manage resources efficiently, and cater to specific application requirements.


You can find all the configurations set on Spark executors in the Spark Web UI:

  1. Go to Spark UI
  2. Select Environment
  3. Select Spark Properties
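
Alternatively, you can read these properties back programmatically. Below is a minimal sketch in Scala, assuming an active SparkSession named spark:


// Read the executor JVM options from the active session's configuration;
// the second argument is the default returned if the property was never set.
val execOpts = spark.conf.get("spark.executor.extraJavaOptions", "<not set>")
println(s"spark.executor.extraJavaOptions = $execOpts")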

Using spark.executor.extraJavaOptions with spark-submit

Regardless of whether you use Spark with Scala or PySpark, you can use extraJavaOptions to set JVM options for the driver and executors.


# Using spark-submit
spark-submit --master yarn \
    --deploy-mode cluster \
    --name my-app \
    --conf 'spark.executor.extraJavaOptions=-DenvVar1=var1Value -DenvVar2=var2Value' \
    --conf 'spark.driver.extraJavaOptions=-DenvVar1=var1Value -DenvVar2=var2Value' \
    ........
    ........

Using SparkConf

You can also set JVM options for the driver and executors when creating the SparkSession. The example below demonstrates this in Scala; you can achieve the same in PySpark.


import org.apache.spark.sql.SparkSession

// Create SparkSession (Spark 2.x or later)
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkByExamples.com")
    .config("spark.driver.extraJavaOptions", "-DenvVar1=var1Value")
    .config("spark.executor.extraJavaOptions", "-DenvVar1=var1Value")
    .getOrCreate()

Note that when you submit your Spark or PySpark application in client mode, the Spark driver runs on the machine from which you submit the application.

Note: In client mode, the spark.driver.extraJavaOptions config must not be set through SparkConf (using .config()) directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-java-options command-line option or in your default properties file.
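
For example, a client-mode submission might pass the driver options on the command line like this (a sketch; the -D values are placeholders carried over from the earlier example):


# Client mode: the driver JVM starts immediately, so pass its options here
spark-submit --master yarn \
    --deploy-mode client \
    --driver-java-options "-DenvVar1=var1Value" \
    --conf 'spark.executor.extraJavaOptions=-DenvVar1=var1Value' \
    ........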

Purpose of spark.executor.extraJavaOptions:

Memory Configuration:

Users often need to fine-tune memory settings for Spark Executors to optimize performance. Note that Spark does not allow the maximum heap size (-Xmx) to be set through spark.executor.extraJavaOptions; the executor heap is controlled by spark.executor.memory instead. Other memory-related JVM flags, such as Metaspace or direct (off-heap) memory limits, can be passed here.


--conf spark.executor.memory=4g \
--conf spark.executor.extraJavaOptions="-XX:MaxMetaspaceSize=512m"

This example sets the executor heap to 4 gigabytes (spark.executor.memory=4g) and the maximum Metaspace size to 512 megabytes (-XX:MaxMetaspaceSize=512m).

Garbage Collection Configuration:

Garbage collection settings are critical for managing memory efficiently. Users can use spark.executor.extraJavaOptions to specify garbage collection algorithms, tuning parameters, and collector behavior.


--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:MaxGCPauseMillis=100"

This example sets the G1 garbage collector (-XX:+UseG1GC) and configures the maximum pause time to 100 milliseconds (-XX:MaxGCPauseMillis=100).
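
To see how these settings behave, GC logging can be enabled through the same property. A sketch using the JDK 8 flags (on JDK 9 and later, -Xlog:gc* replaces them):


--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

The GC output appears in each executor's stdout log, which you can view from the Executors tab of the Spark UI.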

Java System Properties:

Users can pass Java system properties to Spark Executors using spark.executor.extraJavaOptions. This is useful for setting properties that affect the behavior of the Java runtime environment.


--conf spark.executor.extraJavaOptions="-Dproperty=value"

This example sets a Java system property (-Dproperty=value) that can be accessed within Spark tasks.
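
As a minimal sketch of reading such a property from inside a task (assuming an active SparkSession named spark and the placeholder property name from the example above):


// Each task runs inside an executor JVM, so System.getProperty returns the
// value passed via spark.executor.extraJavaOptions, not the driver's value.
val values = spark.sparkContext
  .parallelize(1 to 3)
  .map(_ => System.getProperty("property", "not set"))
  .collect()
values.foreach(println)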

Logging Configuration:

Logging settings for Spark Executors can be customized using spark.executor.extraJavaOptions. This includes specifying log levels, log file locations, and log format.


--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/path/to/log4j.properties"

Here, a custom Log4j configuration file is specified using a Java system property. Note that -Dlog4j.configuration applies to Log4j 1.x; Spark 3.3 and later use Log4j 2, which reads -Dlog4j2.configurationFile instead.
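
A file referenced by an absolute path this way must exist at that path on every executor node. A common alternative, sketched below for YARN with placeholder paths, is to ship the file with the application via --files and reference the local copy by name:


# --files copies log4j.properties into each executor's working directory,
# and the -D option points the executor JVMs at that local copy
spark-submit --master yarn \
    --files /path/to/log4j.properties \
    --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties' \
    ........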

JVM Debugging:

spark.executor.extraJavaOptions also facilitates JVM debugging. Users can set options to enable the remote debug agent and specify the port a debugger attaches to.


--conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"

This example enables the JDWP debug agent on port 5005 without pausing JVM startup (suspend=n). Two caveats: on JDK 9 and later the agent binds only to localhost unless you write address=*:5005, and multiple executor JVMs on the same host will conflict over the port, so this is most practical with a single executor per node.

External Library Paths:

Users sometimes need to make native libraries (for example, .so or .dll files) visible to Spark Executors. spark.executor.extraJavaOptions can set java.library.path for this purpose. Note that this property affects native library lookup only; to add JARs to the executor JVM classpath, use spark.executor.extraClassPath instead.


--conf spark.executor.extraJavaOptions="-Djava.library.path=/path/to/libs"

This sets the java.library.path system property so the JVM searches the specified directory when loading native libraries.
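
As a minimal sketch, executor-side code could then load a native library from that directory (mynative is a hypothetical library, i.e. libmynative.so on Linux, expected under /path/to/libs):


// System.loadLibrary searches the directories on java.library.path, so this
// only succeeds if the hypothetical libmynative.so exists under /path/to/libs
spark.sparkContext.parallelize(1 to 1).foreach { _ =>
  System.loadLibrary("mynative")
}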

Conclusion

In summary, spark.executor.extraJavaOptions is a versatile configuration property in Apache Spark that lets users tailor the Java runtime environment for Spark Executors. It is useful for memory tuning, garbage collection configuration, system properties, logging customization, JVM debugging, and native library paths. Choose and configure these options based on your specific application requirements and the characteristics of your Spark cluster; applying them without understanding their effects can degrade performance rather than improve it.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium