Question: How do you run (spark-submit) a PySpark application from another Python script as a subprocess and get the status of the job?
Solution: Run PySpark Application as a Python process
Generally, a PySpark (Spark with Python) application is run with the spark-submit script from a shell, or through a workflow tool such as Airflow, Oozie, or Luigi. Sometimes, however, you need to launch a PySpark application from another Python program and get the status of the job. You can do this with Python's subprocess module.
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. (docs.python.org)
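import subprocess

# Full spark-submit command line, run through the shell (shell=True)
spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True, shell=True)
# Wait for spark-submit to finish and collect both output streams as text
stdout, stderr = process.communicate()
# A non-zero return code means spark-submit reported a failure
if process.returncode != 0:
    print(stderr)
print(stdout)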
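The above example runs spark-submit for the PySpark application wordByExample.py on a cluster as a Python subprocess. Whatever spark-submit writes to its standard output is captured in stdout, and whatever it writes to its standard error is captured in stderr. Note that with Spark's default logging configuration the console log is typically written to standard error, so most of the spark-submit log ends up in stderr.
subprocess.PIPE: Passed as the stdin, stdout, or stderr argument to Popen; it indicates that a pipe to the standard stream should be opened.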
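As a variation on the example above, here is a minimal sketch that passes the command as an argument list (so no shell is involved) and turns the return code into a simple status string. The names run_command and job_status are hypothetical helpers, not part of Spark or of the original example, and the status mapping assumes spark-submit waits for the application to finish and exits with a non-zero code on failure.

```python
import subprocess


def run_command(args, timeout=None):
    """Run a command given as a list of arguments.

    Returns (returncode, stdout, stderr) with both streams decoded as text.
    Raises subprocess.TimeoutExpired if `timeout` (seconds) is exceeded.
    """
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr


def job_status(returncode):
    """Map a process return code to a status label (assumption: 0 means success)."""
    return "SUCCEEDED" if returncode == 0 else "FAILED"


# Same command as the example above, written as an argument list
spark_submit_args = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "wordByExample.py",
]
```

Calling run_command(spark_submit_args) and passing the returned code to job_status gives the calling script a single value to branch on, while stdout and stderr remain available for logging.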
Run PySpark Application from spark-submit
To run a PySpark application with spark-submit from a shell, use the example below.
Specify the .py file you want to run. You can also pass .py, .egg, or .zip files to the spark-submit command with the --py-files option for any dependencies.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  wordByExample.py
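Since --py-files accepts a .zip file, one way to ship a local package is to bundle it first. The following is a rough sketch using Python's zipfile module; bundle_package is a hypothetical helper, and the names mylib and deps.zip used below are placeholders rather than anything from the original example.

```python
import zipfile
from pathlib import Path


def bundle_package(package_dir, zip_path):
    """Zip every .py file under package_dir so it can be passed to --py-files.

    The package folder name is kept as the top-level entry in the archive,
    so `import <package>` can resolve once the zip is on the Python path.
    Returns the list of archive member names that were written.
    """
    root = Path(package_dir).resolve()
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*.py")):
            # Path inside the archive, e.g. "mylib/util.py"
            arcname = path.relative_to(root.parent).as_posix()
            zf.write(path, arcname)
            added.append(arcname)
    return added
```

For example, bundle_package("mylib", "deps.zip") would produce an archive that could then be supplied as --py-files deps.zip.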
Below are some of the options & configurations that are used with spark-submit.
|Option|Description|
|---|---|
|--driver-memory|Memory to be used by the Spark driver.|
|--driver-cores|CPU cores to be used by the Spark driver.|
|--num-executors|The total number of executors to use.|
|--executor-memory|Amount of memory to use for the executor process.|
|--executor-cores|Number of CPU cores to use for the executor process.|
|--total-executor-cores|The total number of executor cores to use.|
|--conf spark.executor.pyspark.memory|The amount of memory to be used by PySpark for each executor.|
|--conf spark.pyspark.driver.python|Python binary executable to use for PySpark in driver.|
|--conf spark.pyspark.python|Python binary executable to use for PySpark in both driver and executors.|
Note: Files specified with --py-files are uploaded to the cluster before the application runs. Besides these, Spark also supports many more configurations.
Below is another example of spark-submit with a few more options.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 2 \
  --conf "spark.sql.shuffle.partitions=20000" \
  --conf "spark.executor.memoryOverhead=5244" \
  --conf "spark.memory.fraction=0.8" \
  --conf "spark.memory.storageFraction=0.2" \
  --py-files file1.py,file2.py \
  wordByExample.py
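When a command like the one above is launched from Python, it can be easier to assemble the argument list from dictionaries than to maintain one long string. The sketch below is one possible way to do that; build_spark_submit_args is a hypothetical helper, not a Spark API, and it only concatenates the flags already shown in this article.

```python
def build_spark_submit_args(app, master="yarn", deploy_mode="cluster",
                            options=None, conf=None, py_files=None):
    """Assemble a spark-submit argument list suitable for subprocess.

    options:  dict of flag name -> value, e.g. {"driver-memory": "8g"}
    conf:     dict of Spark property -> value, each emitted as --conf key=value
    py_files: list of dependency files, joined with commas for --py-files
    """
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for flag, value in (options or {}).items():
        args += ["--" + flag, str(value)]
    for key, value in (conf or {}).items():
        args += ["--conf", "{}={}".format(key, value)]
    if py_files:
        args += ["--py-files", ",".join(py_files)]
    # The application file comes last, after all options
    args.append(app)
    return args
```

Because each value is a separate list element, no shell quoting is needed, and the resulting list can be passed directly to subprocess.Popen or subprocess.run without shell=True.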
In this PySpark article, you have learned how to run a PySpark application (spark-submit) from another Python script, and how to submit a PySpark application from a shell script.
Happy Learning !!