PySpark Shell Command Usage with Examples

PySpark (Spark with Python) comes with an interactive pyspark shell command (with several options) that is used to learn and test PySpark examples and to analyze data from the command line. Since Spark supports Scala, Python, R, and Java, it provides a shell for each language except Java, which has no shell. If you are using Scala, use spark-shell, and for the R language use sparkr.


PySpark Shell Key Points:

  1. The PySpark shell is a REPL (Read Eval Print Loop) used to quickly test PySpark statements.
  2. A Spark shell is available for Scala, Python, and R (there is no shell for Java).
  3. The pyspark command launches Spark with the Python shell, also called PySpark.
  4. Use the spark-shell command to work with Spark in Scala.
  5. The sparkr command launches Spark with the R language.
  6. By default, the PySpark shell provides the spark and sc variables: spark is a SparkSession object and sc is a SparkContext object (see the sketch after this list).
  7. In the PySpark shell, you cannot create your own SparkContext.
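
For example, once the shell is running you can use these pre-created variables directly. The minimal sketch below only reads properties of the built-in spark and sc objects; the exact values returned will differ on your machine.

>>> spark.version              # Spark version of the shell, e.g. '3.5.0'
>>> sc.appName                 # typically 'PySparkShell'
>>> spark.sparkContext is sc   # returns True: both variables share the same SparkContext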

Prerequisites: Before you proceed, make sure you have PySpark installed.

1. Launch PySpark Shell Command

Go to the Spark installation directory from the command line, type bin/pyspark, and press enter; this launches the PySpark shell and gives you a prompt to interact with Spark in Python. If you have added Spark to your PATH, just enter pyspark in the command line or terminal (Mac users).


./bin/pyspark

Yields the below output. The shell also supports several command-line options, which I will cover in the sections below.

[Screenshot: pyspark shell command output]

To exit the pyspark shell, use quit(), exit(), or Ctrl-D (i.e., EOF). Let's walk through a few statements from the above screenshot.

  1. By default, pyspark creates a Spark context, which internally starts a Web UI at localhost:4040. Since it was unable to bind to 4040 on my machine, the UI was created on port 4042.
  2. The Spark context is created with an app id of the form local-*.
  3. By default it uses local[*] as the master (the sketch after this list shows how to confirm these defaults).
  4. The Spark context and session are available through the variables 'sc' and 'spark' respectively.
  5. It displays the Spark and Python versions.
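
You can confirm these defaults from inside the shell. A minimal sketch; the application id and version shown in the comments are only examples and will differ on your setup.

>>> sc.master           # 'local[*]' unless you passed --master
>>> sc.applicationId    # e.g. 'local-1699999999999'
>>> sc.version          # Spark version, e.g. '3.5.0'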

2. PySpark Shell Web UI

By default, PySpark launches the Web UI on port 4040; if it cannot bind to that port, it tries 4041, 4042, and so on until it finds a free one. In my case, it is bound to port 4046.

[Screenshot: pyspark shell Web UI]
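
If you are not sure which port the Web UI ended up on, you can ask the running SparkContext for its URL. A minimal sketch; the address in the comment is only an example.

>>> sc.uiWebUrl         # e.g. 'http://192.168.1.10:4046'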

3. Run PySpark Statements from Shell

Let's create a PySpark DataFrame with some sample data to validate the installation. Enter the following commands in the shell in the same order.


>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.show()

Yields the below output. For more examples of Spark with Python, refer to PySpark Tutorial with Examples.

[Screenshot: pyspark shell DataFrame example output]
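
You can take the same DataFrame a step further directly in the shell. The sketch below assigns column names and runs a simple filter; the names language and users_count are my own choice for illustration, not part of the original example.

>>> df2 = df.toDF("language", "users_count")     # give the columns readable names
>>> df2.printSchema()
>>> df2.filter(df2.language == "Python").show()  # a simple transformation to confirm Spark runs jobs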

Use quit(), exit() or Ctrl-D (i.e. EOF) to exit from the pyspark shell.

4. PySpark Shell Command Examples

Let's see the pyspark shell command with different options.

Example 1:


./bin/pyspark \
   --master yarn \
   --deploy-mode client

This launches the shell on a YARN cluster, with the driver program running on the machine where you run the shell (client mode, which is also the default). Note that cluster deploy mode is not applicable to interactive shells; it is only meaningful for applications submitted with spark-submit.
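
You can confirm the effective master and deploy mode from inside the running shell. A minimal sketch; the values in the comments assume the command above.

>>> sc.master                                    # 'yarn' for the command above, 'local[*]' by default
>>> sc.getConf().get("spark.submit.deployMode")  # 'client'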

Example 2: The below example passes other Python files as dependencies using --py-files; a sketch of importing them inside the shell follows the command.


./bin/pyspark \
   --master yarn \
   --deploy-mode client \
   --py-files file1.py,file2.py,file3.zip
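
Once this shell starts, the files passed with --py-files are shipped to the executors and added to the Python path, so you can typically import them directly. A hypothetical sketch, assuming file1.py defines a helper function named clean(); the module contents are not part of the original example.

>>> import file1              # file1.py was shipped with --py-files
>>> file1.clean("  spark  ")  # clean() is a hypothetical helper defined in file1.py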

Example 3: The below example launches the pyspark shell with additional configurations; the sketch after the command shows how to verify they took effect.


./bin/pyspark \
   --master yarn \
   --deploy-mode client \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2  \
   --conf "spark.sql.shuffle.partitions=20000" \
   --conf "spark.executor.memoryOverhead=5244" \
   --conf "spark.memory.fraction=0.8" \
   --conf "spark.memory.storageFraction=0.2" \
   --py-files file1.py,file2.py
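
After launching the shell with these options, you can check that the configuration values actually took effect. A minimal sketch using the pre-created spark and sc variables:

>>> spark.conf.get("spark.sql.shuffle.partitions")   # '20000' if the --conf above was applied
>>> sc.getConf().get("spark.executor.memory")        # '16g' from --executor-memory
>>> sc.getConf().get("spark.memory.fraction")        # '0.8' from --conf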

5. PySpark Shell Command Options

Like any other shell command, the pyspark shell also provides several options; you can see all of the available options with --help. Below are some of the important ones.

PySpark Shell Options:

  -I <file>                    Preload <file>, enforcing line-by-line interpretation.
  --master MASTER_URL          spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
  --py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --name NAME                  Specify the name of your application.
  --packages                   Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
  --files FILES                Comma-separated list of files to be placed in the working directory of each executor.

For the complete list of pyspark shell options, use the --help option.


./bin/pyspark --help

If you look closely, most of the options are similar to those of the spark-submit command, and the values you pass end up as entries in the configuration of the running shell, as the sketch below shows.
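
A minimal sketch that lists whatever configuration the current shell was started with; the entries printed depend entirely on your launch command.

>>> for key, value in sorted(sc.getConf().getAll()):  # sc.getConf() returns a copy of the shell's SparkConf
...     print(key, "=", value)                        # e.g. spark.app.name = PySparkShell, spark.master = local[*]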

Conclusion

In this article, you have learned what the PySpark shell is, how to launch and use it with several commands, and the different command-line options available.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.