In this article, we discuss how to use different Spark configurations while creating a PySpark session, and how to validate those configurations. The Spark Session is the entry point to any Spark functionality.
1. Create Spark Session With Configuration
Spark Session provides a unified interface for interacting with different Spark APIs and allows applications to run on a Spark cluster. Spark Session was introduced in Spark 2.0 as a unified entry point that subsumes the earlier SQLContext and HiveContext APIs and wraps the underlying SparkContext.
To create a Spark Session in PySpark, you can use the SparkSession builder. Here is an example of how to create a Spark Session in PySpark:
# Imports
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder \
.appName("MyApp") \
.master("local[2]") \
.config("spark.executor.memory", "2g") \
.getOrCreate()
In this example, we set the Spark master URL to “local[2]” to run Spark locally with two cores, and we set the amount of executor memory to “2g” in the Spark Session configuration. You can customize these options as per your requirements.
2. Configuring Spark using SparkConf in Pyspark
To change the Spark Session configuration in PySpark, you can use the SparkConf() class to set the configuration properties and then pass this SparkConf object while creating the SparkSession object.
Here’s an example:
# Imports
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Create a SparkConf object
conf = SparkConf().setAppName("MyApp") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g")
# Create a SparkSession object
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Now, you can use the SparkSession object to perform various Spark operations.
In this example, we change the Spark Session configuration in PySpark by setting three configuration properties on the SparkConf object.
- The first property, setAppName(), sets the name of the application.
- The second property, setMaster(), specifies the Spark cluster manager to connect to. Here, we run in local mode with two cores.
- The third property, set("spark.executor.memory", "2g"), sets the amount of memory used by each executor in the Spark cluster.
Finally, we pass the SparkConf object to the config() method of the SparkSession builder and create a SparkSession object. You can change the configuration properties as per your requirements; just make sure to set them before creating the SparkSession object.
3. Validate Spark Session Configuration
To validate the Spark Session configuration in PySpark, you can use the getOrCreate() method of the SparkSession builder to get the current SparkSession, and then use the SparkContext object’s getConf() method to retrieve the configuration settings.
# Imports
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Create a SparkConf object
conf = SparkConf().setAppName("MyApp") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g")
# Create a SparkSession object
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Retrieve the SparkConf object from the SparkContext
conf = spark.sparkContext.getConf()
# Print the configuration settings
print("spark.app.name = ", conf.get("spark.app.name"))
print("spark.master = ", conf.get("spark.master"))
print("spark.executor.memory = ", conf.get("spark.executor.memory"))
# Output
spark.app.name = MyApp
spark.master = local[2]
spark.executor.memory = 2g
In this example, we retrieve the SparkConf object from the SparkContext and print the values of three configuration properties: spark.app.name, spark.master, and spark.executor.memory. You can add or remove configuration properties to validate their values.
You can run this code after setting your Spark Session configuration properties to see the values of those properties. If the printed values match your configuration, it means that your configuration has been successfully applied to the Spark Session.
4. Using SparkContext
You can also pass the SparkConf object directly to SparkContext when creating it.
# Imports
from pyspark import SparkContext
sc = SparkContext(conf=conf)
5. Using spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
# Using spark-defaults
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
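As a concrete sketch, a spark-defaults.conf file (read from $SPARK_HOME/conf by default) might contain the same settings used in the examples above; the property names are standard Spark settings, while the values are illustrative:

```
spark.master             local[2]
spark.app.name           MyApp
spark.executor.memory    2g
```

Properties set here act as defaults and are overridden by values set programmatically through SparkConf or passed via spark-submit’s --conf flag.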
6. Using spark-submit
Finally, you can also set configuration properties while submitting a Spark application using spark-submit (pyspark). For more examples, refer to spark-submit.
# Using spark-submit
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "sparkbyexamples.com" \
--conf <key>=<value> \
--py-files path/to/your/pyspark_files.zip \
path/to/your/pyspark_main.py
7. Conclusion
In conclusion, the Spark Session in PySpark can be configured using the config() method of the SparkSession builder. You can set various configuration properties, such as the application name, the Spark master URL, and the executor memory, to customize the behavior of your Spark application.
Related Articles
- PySpark – What is SparkSession?
- Spark/Pyspark Application Configuration
- Spark Internal Execution plan
- What is DAG in Spark or PySpark
- PySpark SQL Functions
- PySpark collect_list() and collect_set() functions