When you are learning Spark, you may wonder why both the spark-submit and pyspark commands exist. Let me take a moment to explain the differences between the two.
- pyspark is a REPL, similar to spark-shell, for the Python language.
- spark-submit is used to submit a Spark application to a cluster.
spark-submit vs pyspark commands
You can run Spark statements using either the spark-submit or the pyspark command. Both commands live in the $SPARK_HOME/bin directory, and each comes in two variants: a shell script for Linux/macOS and a *.cmd batch file for Windows.
pyspark and pyspark.cmd commands
The pyspark command starts a REPL (read–eval–print loop), an interactive shell where you can test and run individual PySpark statements. It is mostly used to quickly try out a few commands during development.
For Windows use pyspark.cmd
For Linux/macOS use pyspark
When you run the pyspark utility, regardless of the OS, you get an interactive shell prompt. This is similar to the spark-shell command used by Scala developers.
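Once the shell is up, the session objects spark (SparkSession) and sc (SparkContext) are already created for you, and you can test individual statements interactively. The sample data and column names below are just an illustration:

>>> data = [("James", 3000), ("Anna", 4000)]
>>> df = spark.createDataFrame(data, ["name", "salary"])   # spark is pre-created by the shell
>>> df.show()
>>> df.count()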
spark-submit and spark-submit.cmd commands
The spark-submit command is used to run a Spark application on a cluster. The deploy modes, client vs cluster, specify whether you want to run the Spark driver locally or inside the cluster. During development we usually run Spark programs from IDEs such as IntelliJ/Eclipse for Scala and Java, or PyCharm/Spyder for PySpark (Python); these submit Spark applications in client mode by default.
For Windows use spark-submit.cmd
For Linux/macOS use spark-submit
Using this script you can run programs:
- in either client or cluster deploy mode, and
- on different cluster managers (a small example follows below).
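As a small sketch, assume you have a minimal PySpark script; the file name app.py and its contents below are just an example, not anything spark-submit requires:

# app.py - a minimal PySpark application (hypothetical example)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
spark.range(10).show()   # prints the numbers 0..9 as a one-column DataFrame
spark.stop()

You could then submit it, for instance, locally in client mode or to a YARN cluster in cluster mode:

spark-submit --master local[2] app.py
spark-submit --master yarn --deploy-mode cluster app.py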
Note: The spark-submit utility eventually calls the Scala class below.
org.apache.spark.deploy.SparkSubmit
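For instance, on Linux/macOS the spark-submit script is essentially a thin wrapper whose final line delegates to spark-class with that class (paraphrased here; check the script shipped with your own Spark distribution):

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"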
Similarities
Both commands have some similarities; let's look at them.
- Both create a Spark Web UI to track job progress.
- Both accept similar options; try the commands below.
spark-submit --help
pyspark --help
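Both also accept many of the same options, such as --master, --conf and --driver-memory. For example, reusing the hypothetical app.py from the earlier sketch:

pyspark --master local[4] --driver-memory 2g
spark-submit --master local[4] --driver-memory 2g app.py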
Happy Learning !!