Difference between spark-submit and pyspark commands

When you are learning Spark, you may wonder why there are two commands, spark-submit and pyspark. Let me take a moment to explain the differences between them.

  • pyspark is a REPL for the Python language, similar to spark-shell.
  • spark-submit is used to submit a Spark application to a cluster.

spark-submit vs pyspark command

You can run Spark statements using both the spark-submit and pyspark commands. Both are available in the $SPARK_HOME/bin directory, and each comes in two flavors: extension-less shell scripts for Linux/macOS and *.cmd files for Windows.

pyspark and pyspark.cmd commands

The pyspark command starts a REPL (read–eval–print loop), an interactive shell for testing and running individual PySpark statements. It is mostly used to quickly try out a few commands during development.

For Windows, use pyspark.cmd

For Linux/macOS, use pyspark

When you run the pyspark utility, regardless of the OS, you will get the shell prompt shown below.

pyspark shell

This is similar to the spark-shell command (used by Scala developers).
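For example, a quick interactive session might look like the sketch below (the startup banner is elided, and the exact output depends on your Spark installation; the shell pre-creates a SparkSession named spark):

```
$ pyspark
...
>>> spark.range(5).count()   # `spark` is the SparkSession created by the shell
5
>>> exit()
```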

spark-submit and spark-submit.cmd commands

The spark-submit command is used to run a Spark application on a cluster. The deploy modes, client vs. cluster, specify whether the Spark driver runs locally or inside the cluster. During development, we usually run Spark programs from editors such as IntelliJ/Eclipse for Scala and Java, or PyCharm/Spyder for PySpark (Python); these submit Spark applications in client mode by default.

For Windows, use spark-submit.cmd

For Linux/macOS, use spark-submit

Using this script, you can run programs:

  • In either client or cluster deploy mode
  • On different cluster managers (Standalone, YARN, Kubernetes, or Mesos).
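For example, a PySpark application could be submitted as sketched below. Here wordcount.py is a placeholder for your own script, and the executor settings are illustrative values; the --master, --deploy-mode, --num-executors, and --executor-memory options are standard spark-submit flags:

```shell
# Run on a YARN cluster, with the driver running inside the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  wordcount.py
```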

Note: the spark-submit utility eventually invokes the org.apache.spark.deploy.SparkSubmit Scala class.
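For reference, the Linux spark-submit script in the Apache Spark distribution is essentially a thin wrapper whose final line hands everything off to that class via spark-class:

```shell
# Final line of $SPARK_HOME/bin/spark-submit (Apache Spark distribution)
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
```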
Both commands also have some similarities; let's look at them.

  • Both launch the Spark Web UI to track job progress.
  • Both accept a similar set of options; try the commands below.

```shell
spark-submit --help
pyspark --help
```

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love designing, optimizing, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.

