How to add Multiple Jars to PySpark

  • Post category: PySpark
  • Post last modified: December 12, 2022

In this article, I will explain how to add multiple jars to a PySpark application classpath when running with spark-submit, the pyspark shell, and from an IDE.

1. Add Multiple Jars to PySpark spark-submit

There are multiple ways to add jars to PySpark application with spark-submit.

1.1 Adding jars to the classpath

You can add jars using the spark-submit option --jars; this option accepts a single jar or multiple jars as a comma-separated list. The specified jars are added to the classpath of the driver and all executors.


spark-submit --jars /path/first.jar,/path/second.jar,/path/third.jar
             ......
             ......
             your-application.py 

Alternatively, you can also use SparkContext.addJar() to add a jar dependency from code.

1.2 Adding all jars from a folder to classpath

If you have many jars, listing them all in a comma-separated string quickly becomes unwieldy, and maintaining that list whenever jar versions change is a nightmare.

You can use the snippet below to add all jars from a folder automatically; the $(echo /path/*.jar | tr ' ' ',') command substitution builds a comma-separated string from all jar file names in the folder.


spark-submit --jars $(echo /path/*.jar | tr ' ' ',') \ 
             your-application.py 
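To see what that command substitution produces on its own, here is a small standalone demo; the folder and jar names are placeholders created just for illustration:

```shell
# Create a throwaway folder with a few placeholder jar files
dir=$(mktemp -d)
touch "$dir/first.jar" "$dir/second.jar" "$dir/third.jar"

# Same expansion as in the spark-submit command above:
# the glob lists the jars space-separated, tr replaces spaces with commas
jars=$(echo $dir/*.jar | tr ' ' ',')
echo "$jars"
```

One caveat: because tr replaces every space, this trick breaks if any jar path itself contains a space, so keep jar folders space-free.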

1.3 Add Jars to Driver

If you need a jar only on the driver node, use --conf spark.driver.extraClassPath or its shorthand --driver-class-path.


spark-submit --jars file1.jar,file2.jar \ 
    --driver-class-path file3.jar \ 
    your-application.py

2. Add Jar to PySpark Shell

The pyspark shell accepts the same options as spark-submit, so you can use the options shown above to add one or multiple jars to PySpark.


pyspark --jars file1.jar,file2.jar

3. Create SparkSession with Jar dependency

You can also add multiple jars to the driver and executor classpaths while creating the SparkSession in PySpark, as shown below. Properties set programmatically like this take the highest precedence, overriding values passed to spark-submit or set in spark-defaults.conf. Note, however, that in client mode spark.driver.extraClassPath may not take effect when set this way, because the driver JVM has already started by the time the configuration is read; in that case use --driver-class-path on the command line instead.


# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession with jar dependencies
spark = SparkSession.builder \
          .config("spark.jars", "file1.jar,file2.jar") \
          .config("spark.driver.extraClassPath", "file3.jar") \
          .appName('SparkByExamples.com') \
          .getOrCreate()

Here, file1.jar and file2.jar are added to both the driver and executors, while file3.jar is added only to the driver classpath.
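If you prefer to build the comma-separated jar list in Python rather than in the shell, a small helper like the one below can produce the value for spark.jars. The helper name and folder path are illustrative, not part of the PySpark API:

```python
import glob
import os

def jar_list(folder):
    """Return a comma-separated string of all .jar files in a folder,
    suitable as the value of the spark.jars configuration."""
    return ",".join(sorted(glob.glob(os.path.join(folder, "*.jar"))))
```

You could then write .config("spark.jars", jar_list("/path/to/jars")) in the builder chain above instead of hard-coding the jar names.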

Conclusion

In this article, you have learned how to add multiple jars to a PySpark application running with the pyspark shell, with spark-submit, and from IDEs such as PyCharm and Spyder, as well as from notebooks.
