Question: How do you run/submit (spark-submit) a PySpark application from another Python script as a subprocess and get the status of the job?
Solution: Run PySpark Application as a Python process
Generally, a PySpark (Spark with Python) application should be run with the spark-submit script from a shell, or through a workflow tool such as Airflow, Oozie, or Luigi. However, sometimes you may need to run a PySpark application from another Python program and get the status of the job; you can do this with the Python subprocess module.
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes (docs.python.org).
import subprocess

spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"

# Launch spark-submit as a child process and capture its console output
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, universal_newlines=True, shell=True)
stdout, stderr = process.communicate()

# A non-zero return code means spark-submit (and hence the job) failed
if process.returncode != 0:
    print(stderr)
print(stdout)
The above example does a spark-submit for the PySpark application wordByExample.py on a cluster as a Python subprocess; the spark-submit console log is sent to stdout and errors to stderr.
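Note that communicate() blocks until spark-submit finishes, so the log only appears once the job is done. If you want to follow the job while it runs, one option is to read the process output line by line, as in the sketch below (it assumes the same wordByExample.py as above; spark-submit typically writes its progress log to stderr, so stderr is merged into stdout here):

import subprocess

spark_submit_str = "spark-submit --master yarn --deploy-mode cluster wordByExample.py"

# Merge stderr into stdout so the whole spark-submit log arrives on one pipe
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT, universal_newlines=True, shell=True)

# Print the log as it is produced instead of waiting for the job to finish
for line in process.stdout:
    print(line, end="")

process.wait()
print("spark-submit finished with return code", process.returncode)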
subprocess.PIPE: used to pipe stdin, stdout, or stderr to Popen; it indicates that a pipe to the standard stream should be opened.
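The examples above assume a PySpark application named wordByExample.py but do not show its contents. As a rough sketch, a minimal word-count application along these lines could be submitted this way (the input path and the exact logic are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordByExample").getOrCreate()

# Hypothetical input path; replace it with a file that exists on your cluster
lines = spark.read.text("/tmp/words.txt")

# Split each line into words and count how often each word occurs
words = lines.select(explode(split(lines.value, " ")).alias("word"))
words.groupBy("word").count().show()

spark.stop()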
Run PySpark Application from spark-submit
If you want to run a PySpark application using spark-submit from a shell, use the below example. Specify the .py file you want to run, and you can also pass .py, .egg, or .zip files to the spark-submit command using the --py-files option for any dependencies.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
wordByExample.py
Below are some of the options & configurations that are used with spark-submit.
| spark-submit Configurations | Description |
| --- | --- |
| --deploy-mode | Either client or cluster |
| --master | Supports local, yarn, mesos://HOST:PORT, spark://HOST:PORT, k8s://HOST:PORT, k8s://https://HOST:PORT |
| --driver-memory | Memory to be used by the Spark driver. |
| --driver-cores | CPU cores to be used by the Spark driver. |
| --num-executors | The total number of executors to use. |
| --executor-memory | Amount of memory to use for each executor process. |
| --executor-cores | Number of CPU cores to use for each executor process. |
| --total-executor-cores | The total number of executor cores to use. |
| --py-files | Use --py-files to add .py, .zip or .egg files. |
| --conf spark.executor.pyspark.memory | The amount of memory to be used by PySpark for each executor. |
| --conf spark.pyspark.driver.python | Python binary executable to use for PySpark in the driver. |
| --conf spark.pyspark.python | Python binary executable to use for PySpark in both driver and executors. |
Note: Files specified with --py-files are uploaded to the cluster before it runs the application. Besides these, Spark also supports many more configurations.
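For reference, dependencies shipped with --py-files become importable inside the application. As a small sketch (not part of the original example), a dependency can also be distributed at runtime with SparkContext.addPyFile; here file1.py stands in for a hypothetical helper module sitting next to the driver script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordByExample").getOrCreate()

# Ship a hypothetical helper file to the executors at runtime; functions defined
# in it can then be imported inside tasks, much like with --py-files.
spark.sparkContext.addPyFile("file1.py")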
Below is another example of spark-submit with a few more options.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8g \
--executor-memory 16g \
--executor-cores 2 \
--conf "spark.sql.shuffle.partitions=20000" \
--conf "spark.executor.memoryOverhead=5244" \
--conf "spark.memory.fraction=0.8" \
--conf "spark.memory.storageFraction=0.2" \
--py-files file1.py,file2.py \
wordByExample.py
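Bringing both parts together, a command like the one above can also be launched from another Python script. The helper below is a sketch that builds the same spark-submit invocation as a list of arguments (avoiding shell=True) and returns the exit code, using subprocess.run (Python 3.7+); the option values are simply the ones from the example above:

import subprocess

def submit_word_by_example():
    # Same options as the shell example above, expressed as an argument list
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--driver-memory", "8g",
        "--executor-memory", "16g",
        "--executor-cores", "2",
        "--conf", "spark.sql.shuffle.partitions=20000",
        "--conf", "spark.executor.memoryOverhead=5244",
        "--conf", "spark.memory.fraction=0.8",
        "--conf", "spark.memory.storageFraction=0.2",
        "--py-files", "file1.py,file2.py",
        "wordByExample.py",
    ]
    # capture_output collects stdout/stderr; the return code reports success or failure
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode

if __name__ == "__main__":
    print("spark-submit finished with return code", submit_word_by_example())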
Conclusion
In this PySpark article, you have learned how to run a PySpark application (spark-submit) from another Python script, and also how to submit a PySpark application from a shell script.
Happy Learning !!