In this article, we discuss how to use different Spark configurations while creating a PySpark session, and how to validate those configurations. The Spark Session is the entry point to any Spark functionality.
1. Create Spark Session With Configuration
Spark Session provides a unified interface for interacting with different Spark APIs and allows applications to run on a Spark cluster. Spark Session was introduced in Spark 2.0 as a unified entry point that subsumes the earlier SQLContext and HiveContext APIs and wraps the underlying SparkContext.
To create a Spark Session in PySpark, you can use the SparkSession builder. Here is an example of how to create a Spark Session in PySpark:
# Imports
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder \
.appName("MyApp") \
.master("local[2]") \
.config("spark.executor.memory", "2g") \
.getOrCreate()
In this example, we set the Spark master URL to “local[2]” to run Spark locally with two cores, and we set the amount of executor memory to “2g” in the Spark Session configuration. You can customize these options as per your requirements.
2. Configuring Spark using SparkConf in Pyspark
To change the Spark Session configuration in PySpark, you can use the SparkConf() class to set the configuration properties and then pass this SparkConf object while creating the SparkSession object.
Here’s an example:
# Imports
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Create a SparkConf object
conf = SparkConf().setAppName("MyApp") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g")
# Create a SparkSession object
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Now, you can use the SparkSession object to perform various Spark operations.
In this example, we change the Spark Session configuration in PySpark by setting three configuration properties on the SparkConf object.
- The first property, setAppName(), sets the name of the application.
- The second property, setMaster(), specifies the Spark cluster manager to connect to. Here, we run in local mode with two cores.
- The third property, set("spark.executor.memory", "2g"), sets the amount of memory used by each executor in the Spark cluster.
Finally, we pass the SparkConf object to the config() method of the SparkSession builder and create a SparkSession object. You can change the configuration properties as per your requirements; just make sure to set them before creating the SparkSession object.
3. Validate Spark Session Configuration
To validate the Spark Session configuration in PySpark, you can use the getOrCreate() method of the SparkSession builder to get the current SparkSession, and then use the SparkContext object’s getConf() method to retrieve the configuration settings.
# Imports
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Create a SparkConf object
conf = SparkConf().setAppName("MyApp") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g")
# Create a SparkSession object
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Retrieve the SparkConf object from the SparkContext
conf = spark.sparkContext.getConf()
# Print the configuration settings
print("spark.app.name = ", conf.get("spark.app.name"))
print("spark.master = ", conf.get("spark.master"))
print("spark.executor.memory = ", conf.get("spark.executor.memory"))
# Output
spark.app.name = MyApp
spark.master = local[2]
spark.executor.memory = 2g
In this example, we retrieve the SparkConf object from the SparkContext and print the values of three configuration properties: spark.app.name, spark.master, and spark.executor.memory. You can add or remove configuration properties to validate their values.
You can run this code after setting your Spark Session configuration properties to see the values of those properties. If the printed values match your configuration, it means that your configuration has been successfully applied to the Spark Session.
4. Using SparkContext
You can also pass the SparkConf object directly to SparkContext when creating it.
# Imports
from pyspark import SparkContext
sc = SparkContext(conf=conf)
5. Using spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
# Using spark-defaults
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
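As a concrete sketch, a spark-defaults.conf file (read from $SPARK_HOME/conf by default) might contain the same settings used in the examples above; the property names are standard Spark settings, while the values are illustrative:

```
spark.master             local[2]
spark.app.name           MyApp
spark.executor.memory    2g
```

Properties set here act as defaults and are overridden by values set programmatically through SparkConf or passed via spark-submit’s --conf flag.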
6. Using spark-submit
Finally, you can also set configuration properties while submitting a Spark application using spark-submit (pyspark). For more examples, refer to spark-submit.
# Using spark-submit
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "sparkbyexamples.com" \
--conf <key>=<value> \
--py-files path/to/your/pyspark_files.zip \
path/to/your/pyspark_main.py
7. Conclusion
In conclusion, the Spark Session in PySpark can be configured using the config() method of the SparkSession builder. You can set various configuration properties, such as the application name, the Spark master URL, and the executor memory, to customize the behavior of your Spark application.
Related Articles
- PySpark – What is SparkSession?
- Spark/Pyspark Application Configuration
- Spark Internal Execution plan
- What is DAG in Spark or PySpark
- PySpark SQL Functions
- PySpark collect_list() and collect_set() functions