How to add Multiple Jars to PySpark

In this article, I will explain how to add multiple jars to a PySpark application classpath when running with spark-submit, the pyspark shell, and from an IDE.

1. Add Multiple Jars to PySpark spark-submit

There are multiple ways to add jars to a PySpark application with spark-submit.

1.1 Adding jars to the classpath

You can add jars using the spark-submit option --jars. With this option, you can add a single jar or multiple jars as a comma-separated list. This option adds the specified jars to the classpath of the driver and all executors.


spark-submit --jars /path/first.jar,/path/second.jar,/path/third.jar
             ......
             ......
             your-application.py 

Alternatively, you can also use SparkContext.addJar() to add a jar dependency at runtime.
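Note that addJar() is a method on the JVM SparkContext and is not exposed in the Python API, so from PySpark it is typically reached through the underlying Java context. A minimal sketch, assuming a running session named spark and a placeholder jar path (_jsc is an internal handle and may change between Spark versions):


# Reach the JVM SparkContext through PySpark's internal Java wrapper.
# "/path/extra.jar" is a placeholder; the jar must be reachable from the driver.
spark.sparkContext._jsc.addJar("/path/extra.jar")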

1.2 Adding all jars from a folder to classpath

If you have many jars, listing them all in a comma-separated string becomes tedious, and keeping the list up to date every time a jar version changes quickly turns into a maintenance nightmare.

You can use the below snippet to add all jars from a folder automatically. The $(echo /path/*.jar | tr ' ' ',') command substitution expands every jar name in the folder and joins them into a comma-separated string.


spark-submit --jars $(echo /path/*.jar | tr ' ' ',') \ 
             your-application.py 
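If you are building the jar list in Python instead, for example to pass it to spark.jars when creating the SparkSession (see section 3 below), the same expansion can be done with glob. A minimal sketch, where /path is a placeholder folder:


# Build a comma-separated jar list from all jars in a folder
import glob
jar_list = ",".join(glob.glob("/path/*.jar"))
print(jar_list)  # e.g. /path/first.jar,/path/second.jar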

1.3 Add Jars to Driver

If you need a jar only on the driver node, use --conf spark.driver.extraClassPath or --driver-class-path. Note that, unlike --jars, these options only modify the classpath and do not copy the file, so the jar must already be present on the driver node.


spark-submit --jars file1.jar,file2.jar \ 
    --driver-class-path file3.jar \ 
    your-application.py

2. Add Jar to PySpark Shell

The pyspark shell accepts the same options as spark-submit, so you can use the options described above to add one or more jars when launching the shell.


pyspark --jars file1.jar,file2.jar
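Once the shell is up, you can verify which jars were registered by inspecting the session configuration. A minimal check, run at the shell prompt:


# Inside the pyspark shell: print the jars registered with this session
print(spark.sparkContext.getConf().get("spark.jars", ""))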

3. Create SparkSession with Jar dependency

You can also add multiple jars to the driver and executor classpaths while creating the SparkSession in PySpark, as shown below. Properties set directly on the SparkConf this way take the highest precedence over values passed to spark-submit or set in spark-defaults.conf.


# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession with jar dependencies
spark = SparkSession.builder \
          .config("spark.jars", "file1.jar,file2.jar") \
          .config("spark.driver.extraClassPath", "file3.jar") \
          .appName('SparkByExamples.com') \
          .getOrCreate()

Here, file1.jar and file2.jar are added to both the driver and executor classpaths, while file3.jar is added only to the driver classpath. Be aware that in client mode spark.driver.extraClassPath cannot be set from within the application, because the driver JVM has already started by the time this code runs; in that case, set it through --driver-class-path or in spark-defaults.conf instead.
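To see the dependency in use, here is a hedged end-to-end sketch assuming file1.jar ships a JDBC driver; the URL, table, and driver class below are placeholders rather than values from this article:


# Hypothetical usage: the JDBC driver class comes from one of the added jars
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
    .option("dbtable", "public.sales") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.show()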

Conclusion

In this article, you have learned how to add multiple jars to a PySpark application running with the pyspark shell, with spark-submit, and from IDEs and notebooks such as PyCharm and Spyder.
