How to Run a PySpark Script from Python?

Question: How do you run/submit (spark-submit) a PySpark application from another Python script as a subprocess and get the status of the job?

Solution: Run PySpark Application as a Python process

Generally, a PySpark (Spark with Python) application is run using the spark-submit script from a shell, or through a workflow tool such as Airflow, Oozie, or Luigi. Sometimes, however, you may need to run a PySpark application from another Python program and get the status of the job. You can do this using the Python subprocess module.

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.

docs.python.org

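import subprocess

# Command line handed to the shell; wordByExample.py is the PySpark application.
spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"

# Start spark-submit as a child process and capture its output as text.
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True, shell=True)
# Wait for the process to finish and collect everything it wrote.
stdout, stderr = process.communicate()

# A non-zero return code means spark-submit reported a failure.
if process.returncode != 0:
    print(stderr)
print(stdout)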

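The above example runs spark-submit for the PySpark application wordByExample.py on a cluster as a Python subprocess. Whatever spark-submit writes to standard output is captured in stdout, and whatever it writes to standard error is captured in stderr; process.returncode holds the exit status once communicate() returns. With Spark's default logging configuration, most log messages go to standard error, so check stderr as well as stdout.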

  • subprocess.PIPE: A special value that can be passed as the stdin, stdout, or stderr argument to Popen; it indicates that a pipe to the standard stream should be opened.
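As a variation on the example above, here is a minimal sketch that passes the command as a list of arguments (so no shell is involved) and returns the job status to the caller. The helper name run_job, its timeout parameter, and the SPARK_SUBMIT_CMD constant are assumptions made for illustration; they are not part of Spark or of the original example.

```python
import subprocess

# Hypothetical command; assumes spark-submit is on the PATH.
SPARK_SUBMIT_CMD = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "wordByExample.py",
]


def run_job(cmd, timeout=None):
    """Run cmd (a list of arguments) and return (returncode, stdout, stderr)."""
    # A list with the default shell=False avoids shell quoting problems.
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return completed.returncode, completed.stdout, completed.stderr
```

Calling run_job(SPARK_SUBMIT_CMD) returns the exit code together with the captured output, so the calling script can decide what to do when the code is non-zero.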

Run PySpark Application from spark-submit

If you want to run a PySpark application using spark-submit from a shell, use the example below.

Specify the .py file you want to run. You can also pass .py, .egg, or .zip files to the spark-submit command with the --py-files option for any dependencies.


./bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   wordByExample.py

Below are some of the options & configurations that are used with spark-submit.

spark-submit Configuration                Description
--deploy-mode                             Either client or cluster.
--master                                  Supports local, yarn, mesos://HOST:PORT, spark://HOST:PORT, k8s://HOST:PORT, k8s://https://HOST:PORT.
--driver-memory                           Memory to be used by the Spark driver.
--driver-cores                            CPU cores to be used by the Spark driver.
--num-executors                           The total number of executors to use.
--executor-memory                         Amount of memory to use for the executor process.
--executor-cores                          Number of CPU cores to use for the executor process.
--total-executor-cores                    The total number of executor cores to use.
--py-files                                Use --py-files to add .py, .zip, or .egg files.
--conf spark.executor.pyspark.memory      The amount of memory to be used by PySpark for each executor.
--conf spark.pyspark.driver.python        Python binary executable to use for PySpark in the driver.
--conf spark.pyspark.python               Python binary executable to use for PySpark in both driver and executors.

Note: Files specified with --py-files are uploaded to the cluster before it runs the application. Besides these, Spark also supports many more configurations.
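Since --py-files accepts .zip archives, one way to ship a local package is to zip it first. The sketch below is only an illustration: zip_package is a hypothetical helper (not part of Spark), and the package directory and archive names you pass to it are up to you.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files under package_dir, keeping the package folder as the top-level entry."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.abspath(os.path.join(root, name))
                    # Store paths relative to the parent directory so the
                    # package name stays at the top level of the archive.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path
```

For example, zip_package("mypkg", "deps.zip") would produce an archive that could then be passed as --py-files deps.zip (both names are placeholders).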

Below is another example of spark-submit with a few more options.


./bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2  \
   --conf "spark.sql.shuffle.partitions=20000" \
   --conf "spark.executor.memoryOverhead=5244" \
   --conf "spark.memory.fraction=0.8" \
   --conf "spark.memory.storageFraction=0.2" \
   --py-files file1.py,file2.py \
   wordByExample.py
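When the submission is driven from Python, the same command line can be assembled as a list of arguments instead of a single string. The following is a rough sketch under that assumption; build_spark_submit_cmd and its parameter names are hypothetical and only mirror the options shown in this article.

```python
def build_spark_submit_cmd(app, master="yarn", deploy_mode="cluster",
                           options=None, conf=None, py_files=None):
    """Return a spark-submit argument list for the given application file."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Plain options such as {"driver-memory": "8g"} become "--driver-memory 8g".
    for flag, value in (options or {}).items():
        cmd += ["--" + flag, str(value)]
    # Each configuration entry becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", "{}={}".format(key, value)]
    # --py-files takes a single comma-separated list.
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    # The application file goes last, after all options.
    cmd.append(app)
    return cmd
```

The resulting list can be handed to subprocess.Popen or subprocess.run without shell=True, which avoids having to quote values for the shell.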

Conclusion

In this PySpark article, you have learned how to run a PySpark application (spark-submit) from another Python script and how to submit a PySpark application from a shell.

Happy Learning !!

NNK

