Difference between spark-submit and pyspark commands

When you are learning Spark, you may wonder why there are two commands, spark-submit and pyspark. Let me take a moment to explain the differences between them.

  • pyspark is a REPL for the Python language, similar to spark-shell.
  • spark-submit is used to submit a Spark application to a cluster.

spark-submit vs pyspark command

You can run Spark statements using both the spark-submit and pyspark commands. Both are available in the $SPARK_HOME/bin directory, and each comes in two flavors: extension-less shell scripts for Linux/macOS and *.cmd files for Windows.

pyspark and pyspark.cmd commands

The pyspark command starts a REPL (read–eval–print loop), an interactive shell for testing and running individual PySpark statements. It is mostly used to quickly try out a few commands during development.

For Windows, use pyspark.cmd

For Linux/macOS, use pyspark

When you run the pyspark utility, regardless of the OS, you will get the shell prompt shown below.

pyspark shell

This is similar to the spark-shell command (used by Scala developers).
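For example, a quick interactive session might look like the sketch below (the startup banner is elided, and the exact output depends on your Spark installation; the shell pre-creates a SparkSession named spark):

```
$ pyspark
...
>>> spark.range(5).count()   # `spark` is the SparkSession created by the shell
5
>>> exit()
```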

spark-submit and spark-submit.cmd commands

The spark-submit command is used to run a Spark application on a cluster. The deploy modes, client vs. cluster, specify whether the Spark driver runs locally or inside the cluster. During development, we usually run Spark programs from editors such as IntelliJ/Eclipse for Scala and Java, or PyCharm/Spyder for PySpark (Python); these submit Spark applications in client mode by default.

For Windows, use spark-submit.cmd

For Linux/macOS, use spark-submit

Using this script, you can run programs:

  • In either client or cluster deploy mode
  • On different cluster managers (Standalone, YARN, Kubernetes, or Mesos).
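For example, a PySpark application could be submitted as sketched below. Here wordcount.py is a placeholder for your own script, and the executor settings are illustrative values; the --master, --deploy-mode, --num-executors, and --executor-memory options are standard spark-submit flags:

```shell
# Run on a YARN cluster, with the driver running inside the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  wordcount.py
```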

Note: the spark-submit utility eventually invokes the org.apache.spark.deploy.SparkSubmit Scala class.
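For reference, the Linux spark-submit script in the Apache Spark distribution is essentially a thin wrapper whose final line hands everything off to that class via spark-class:

```shell
# Final line of $SPARK_HOME/bin/spark-submit (Apache Spark distribution)
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
```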
Both commands also have some similarities; let's look at them.

  • Both launch the Spark Web UI to track job progress.
  • Both accept a similar set of options; try the commands below.

```shell
spark-submit --help
pyspark --help
```

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love designing, optimizing, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.

