In this article, I will explain how to add multiple jars to a PySpark application classpath when running with spark-submit, the pyspark shell, and from an IDE.
1. Add Multiple Jars to PySpark spark-submit
There are multiple ways to add jars to PySpark application with spark-submit.
1.1 Adding jars to the classpath
You can add jars using the spark-submit option --jars. With this option you can add a single jar or multiple jars separated by commas. The specified jars are added to the classpath of the driver and all executors.
spark-submit --jars /path/first.jar,/path/second.jar,/path/third.jar \
   ......
   ......
   your-application.py
Alternatively, you can also use SparkContext.addJar(); note that this method is part of the Scala/Java SparkContext API and is not exposed as a public method in PySpark.
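A minimal sketch of the programmatic route is shown below. Since PySpark does not expose addJar() publicly, the snippet reaches the underlying Java SparkContext through the internal _jsc handle, which is a private attribute and may change between Spark versions; /path/extra.jar is a placeholder path.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

# addJar() lives on the Java SparkContext; _jsc is PySpark's internal handle to it
spark.sparkContext._jsc.addJar("/path/extra.jar")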
1.2 Adding all jars from a folder to classpath
If you have many jars, listing them all in a comma-separated string quickly becomes hard to maintain, especially when the jar versions change. You can use the below snippet to add all jars from a folder automatically; the $(echo /path/*.jar | tr ' ' ',') expression builds a comma-separated string from all jar file names in the folder.
spark-submit --jars $(echo /path/*.jar | tr ' ' ',') \
your-application.py
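If you create the SparkSession yourself instead of going through spark-submit, the same idea can be expressed in Python. The sketch below is one way to do it, assuming your jars live under /path; it uses the standard glob module to build the comma-separated list.

import glob
from pyspark.sql import SparkSession

# Build a comma-separated string of all jar files in the folder
jars = ",".join(glob.glob("/path/*.jar"))

spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .config("spark.jars", jars) \
    .getOrCreate()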
1.3 Add Jars to Driver
If you need a jar only on the driver node, then use --conf spark.driver.extraClassPath or --driver-class-path.
spark-submit --jars file1.jar,file2.jar \
--driver-class-path file3.jar \
your-application.py
2. Add Jar to PySpark Shell
The pyspark shell supports the same options as spark-submit, so you can use the options described above to add one or more jars to PySpark.
pyspark --jars file1.jar,file2.jar
3. Create SparkSession with Jar dependency
You can also add multiple jars to the driver and executor classpaths while creating a SparkSession in PySpark, as shown below. Properties set this way take the highest precedence over the other approaches.
# Create SparkSession with jar dependencies
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars", "file1.jar,file2.jar") \
    .config("spark.driver.extraClassPath", "file3.jar") \
    .appName('SparkByExamples.com') \
    .getOrCreate()
Here, file1.jar and file2.jar are added to both the driver and executor classpaths, and file3.jar is added only to the driver classpath.
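To confirm that the jars were actually registered, you can read the values back from the session's configuration; the snippet below assumes the SparkSession created in the previous example.

# Print the jar-related settings picked up by the session
conf = spark.sparkContext.getConf()
print(conf.get("spark.jars"))
print(conf.get("spark.driver.extraClassPath"))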
Conclusion
In this article, you have learned how to add multiple jars to a PySpark application running with the pyspark shell, spark-submit, and when running from PyCharm, Spyder, and notebooks.
Related Articles
- Spark Set Environment Variable to Executors
- SOLVED Can’t assign requested address: Service ‘sparkDriver’
- Spark Set JVM Options to Driver & Executors
- Read JDBC in Parallel using PySpark
- PySpark Read and Write MySQL Database Table
- PySpark Read JDBC Table to DataFrame
- PySpark SparkContext Explained