Question: How do you run/submit (spark-submit) a PySpark application from another Python script as a subprocess and get the status of the job?
Solution: Run PySpark Application as a Python process
Generally, a PySpark (Spark with Python) application should be run with the spark-submit script from a shell, or through a workflow tool such as Airflow, Oozie, or Luigi. However, sometimes you may need to run a PySpark application from another Python program and get the status of the job; you can do this with the Python subprocess module.
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes (docs.python.org).
import subprocess

spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"

# Launch spark-submit as a child process and capture its console output
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, universal_newlines=True, shell=True)
stdout, stderr = process.communicate()

# A non-zero return code means spark-submit (and hence the job) failed
if process.returncode != 0:
    print(stderr)
print(stdout)
The above example does a spark-submit for the PySpark application wordByExample.py on a cluster as a Python subprocess; the spark-submit console log is sent to stdout and errors to stderr.
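Note that communicate() blocks until spark-submit finishes, so the log only appears once the job is done. If you want to follow the job while it runs, one option is to read the process output line by line, as in the sketch below (it assumes the same wordByExample.py as above; spark-submit typically writes its progress log to stderr, so stderr is merged into stdout here):

import subprocess

spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"

# Merge stderr into stdout so the whole spark-submit log arrives on one pipe
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT, universal_newlines=True, shell=True)

# Print the log as it is produced instead of waiting for the job to finish
for line in process.stdout:
    print(line, end="")

process.wait()
print("spark-submit finished with return code", process.returncode)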
subprocess.PIPE: used to pipe stdin, stdout, or stderr to Popen; it indicates that a pipe to the standard stream should be opened.
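The examples above assume a PySpark application named wordByExample.py but do not show its contents. As a rough sketch, a minimal word-count application along these lines could be submitted this way (the input path and the exact logic are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordByExample").getOrCreate()

# Hypothetical input path; replace it with a file that exists on your cluster
lines = spark.read.text("/tmp/words.txt")

# Split each line into words and count how often each word occurs
words = lines.select(explode(split(lines.value, " ")).alias("word"))
words.groupBy("word").count().show()

spark.stop()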
Run PySpark Application from spark-submit
If you want to run a PySpark application using spark-submit from a shell, use the below example. Specify the .py file you want to run, and you can also pass .py, .egg, or .zip files to the spark-submit command using the --py-files option for any dependencies.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
wordByExample.py
Below are some of the options & configurations that are used with spark-submit.
| spark-submit Configurations | Description |
| --- | --- |
| --deploy-mode | Either client or cluster |
| --master | Supports local, yarn, mesos://HOST:PORT, spark://HOST:PORT, k8s://HOST:PORT, k8s://https://HOST:PORT |
| --driver-memory | Memory to be used by the Spark driver. |
| --driver-cores | CPU cores to be used by the Spark driver. |
| --num-executors | The total number of executors to use. |
| --executor-memory | Amount of memory to use for each executor process. |
| --executor-cores | Number of CPU cores to use for each executor process. |
| --total-executor-cores | The total number of executor cores to use. |
| --py-files | Use --py-files to add .py, .zip or .egg files. |
| --conf spark.executor.pyspark.memory | The amount of memory to be used by PySpark for each executor. |
| --conf spark.pyspark.driver.python | Python binary executable to use for PySpark in the driver. |
| --conf spark.pyspark.python | Python binary executable to use for PySpark in both driver and executors. |
Note: Files specified with --py-files are uploaded to the cluster before it runs the application. Besides these, Spark also supports many more configurations.
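For reference, dependencies shipped with --py-files become importable inside the application. As a small sketch (not part of the original example), a dependency can also be distributed at runtime with SparkContext.addPyFile; here file1.py stands in for a hypothetical helper module sitting next to the driver script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordByExample").getOrCreate()

# Ship a hypothetical helper file to the executors at runtime; functions defined
# in it can then be imported inside tasks, much like with --py-files.
spark.sparkContext.addPyFile("file1.py")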
Below is another example of spark-submit with a few more options.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8g \
--executor-memory 16g \
--executor-cores 2 \
--conf "spark.sql.shuffle.partitions=20000" \
--conf "spark.executor.memoryOverhead=5244" \
--conf "spark.memory.fraction=0.8" \
--conf "spark.memory.storageFraction=0.2" \
--py-files file1.py,file2.py \
wordByExample.py
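Bringing both parts together, a command like the one above can also be launched from another Python script. The helper below is a sketch that builds the same spark-submit invocation as a list of arguments (avoiding shell=True) and returns the exit code, using subprocess.run (Python 3.7+); the option values are simply the ones from the example above:

import subprocess

def submit_word_by_example():
    # Same options as the shell example above, expressed as an argument list
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--driver-memory", "8g",
        "--executor-memory", "16g",
        "--executor-cores", "2",
        "--conf", "spark.sql.shuffle.partitions=20000",
        "--conf", "spark.executor.memoryOverhead=5244",
        "--conf", "spark.memory.fraction=0.8",
        "--conf", "spark.memory.storageFraction=0.2",
        "--py-files", "file1.py,file2.py",
        "wordByExample.py",
    ]
    # capture_output collects stdout/stderr; the return code reports success or failure
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode

if __name__ == "__main__":
    print("spark-submit finished with return code", submit_word_by_example())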
Conclusion
In this PySpark article, you have learned how to run a PySpark application (spark-submit) from another Python script, and also how to submit a PySpark application from a shell script.
Happy Learning !!