PySpark (Spark with Python) comes by default with an interactive pyspark shell command (with several options) that is used to learn PySpark, test PySpark examples, and analyze data from the command line. Spark supports Scala, Python, R, and Java, and it provides a shell for each of these languages except Java. If you are using Scala, use spark-shell, and for the R language use sparkR.
PySpark Shell Key Points –
- The PySpark shell is a REPL (Read Eval Print Loop), which is used to quickly test PySpark statements.
- Spark shells are available for Scala, Python, and R; there is no shell for Java.
- The pyspark command launches Spark with the Python shell, also called the PySpark shell.
- Use the spark-shell command to work with Spark in Scala.
- The sparkR command launches Spark with the R language.
- By default, the PySpark shell provides spark, an object of SparkSession, and sc, an object of SparkContext.
- In the PySpark shell, you cannot create a second SparkContext while sc is active; use the one the shell provides.
Pre-requisites: Before you proceed make sure you have PySpark installed.
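One way to check this prerequisite from Python is to look for the pyspark launcher script. The sketch below is only an illustration: find_pyspark_launcher is a hypothetical helper name, and it assumes the launcher lives either under $SPARK_HOME/bin or in a directory listed in PATH.

```python
import os
import shutil
from pathlib import Path


def find_pyspark_launcher(environ=None):
    """Return the path of the pyspark launcher script, or None if not found.

    Hypothetical helper: checks $SPARK_HOME/bin/pyspark first, then PATH.
    """
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME")
    if spark_home:
        candidate = Path(spark_home) / "bin" / "pyspark"
        if candidate.is_file():
            return str(candidate)
    # Fall back to searching the directories listed in PATH.
    return shutil.which("pyspark", path=env.get("PATH"))


launcher = find_pyspark_launcher()
print(launcher or "pyspark launcher not found; check SPARK_HOME or PATH")
```

If this prints a path, that is the script to run in the next section.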
1. Launch PySpark Shell Command
Go to the Spark installation directory from the command line, type bin/pyspark, and press Enter. This launches the pyspark shell and gives you a prompt to interact with Spark in Python. If you have added Spark's bin directory to your PATH, just enter pyspark in the command line (or the terminal, for Mac users).
On startup, the shell prints a banner followed by the >>> prompt. It also supports several command-line options, which I will cover in the sections below.
To exit from the pyspark shell, press Ctrl-D (i.e. EOF). Let’s go through a few statements from the startup output.
- By default, pyspark creates a Spark context, which internally starts a Web UI at http://localhost:4040. Since it was unable to bind to port 4040 on my machine, the UI was created on port 4042.
- The Spark context is created with an app id of the form local-*.
- By default, it uses local[*] as the master.
- The Spark context and session are available as the variables sc and spark.
- It displays the Spark and Python versions.
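The local[*] master mentioned in the list above means "run locally with as many worker threads as logical cores", while local means one thread and local[K] means K threads. The sketch below only illustrates that naming scheme. local_worker_threads is a hypothetical helper, not Spark's own parser, and it ignores other forms such as local[K,F].

```python
import os
import re


def local_worker_threads(master):
    """Return the number of worker threads implied by a local master URL.

    Hypothetical helper for illustration. Returns None for masters that
    are not of the form local, local[K], or local[*] (for example "yarn").
    """
    if master == "local":
        return 1  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    value = match.group(1)
    if value == "*":
        return os.cpu_count() or 1  # one thread per logical core
    return int(value)


print(local_worker_threads("local[*]"))
```

The printed number depends on the machine, which is why the shell's default master scales with the cores you have.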
2. PySpark Shell Web UI
By default, PySpark launches the Web UI on port 4040. If it cannot bind to that port, it tries 4041, 4042, and so on until it binds. In my case, it is bound to port 4046.
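The "try the next port until one binds" behaviour can be sketched in plain Python. This is not Spark's implementation, only an illustration of the idea. first_free_port is a hypothetical helper, and the default limit of 16 retries is an assumption chosen to mirror what I understand to be the default of the spark.port.maxRetries setting.

```python
import socket


def first_free_port(start=4040, max_retries=16, host="127.0.0.1"):
    """Return the first port at or after `start` that can be bound on `host`."""
    last = min(start + max_retries, 65535)
    for port in range(start, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken, move on to the next one
            return port
    raise RuntimeError(f"no free port between {start} and {last}")
```

On a machine where 4040 and 4041 are already in use and 4042 is free, first_free_port() would return 4042, which matches the behaviour described above.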
3. Run PySpark Statements from Shell
Let’s create a PySpark DataFrame with some sample data to validate the installation. Enter the following commands in the shell in the same order.
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.show()
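This prints the three rows as a table. Because no column names are given, the columns are named _1 and _2 by default. For more examples on Spark with Python, refer to PySpark Tutorial with Examples.
Press Ctrl-D (i.e. EOF) to exit from the pyspark shell.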
4. PySpark Shell Command Examples
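Let’s see the pyspark shell command with different options.
Example 1: Below example launches the pyspark shell on a YARN cluster.
./bin/pyspark \
  --master yarn \
  --deploy-mode client
This launches the Spark driver program in client mode, which is the default and runs the driver on the same machine where you are running the shell. Cluster mode is not available here: spark-submit rejects --deploy-mode cluster for interactive shells, since the driver has to stay attached to your terminal.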
Example 2: Below example uses other python files as dependencies.
./bin/pyspark \
  --master yarn \
  --deploy-mode client \
  --py-files file1.py,file2.py,file3.zip
Example 3: Below example uses the pyspark shell with configs.
./bin/pyspark \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 2 \
  --conf "spark.sql.shuffle.partitions=20000" \
  --conf "spark.executor.memoryOverhead=5244" \
  --conf "spark.memory.fraction=0.8" \
  --conf "spark.memory.storageFraction=0.2" \
  --py-files file1.py,file2.py
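When the list of options grows, it can help to assemble the command from Python and print it before running it. The sketch below is one possible way to do that. build_pyspark_command is a hypothetical helper written for this article, and it only covers the --master, --conf, and --py-files options used in the examples above.

```python
import shlex


def build_pyspark_command(master, conf=None, py_files=None, launcher="./bin/pyspark"):
    """Assemble a pyspark shell command line as a list of arguments."""
    args = [launcher, "--master", master]
    # Each configuration entry becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    # --py-files takes a single comma-separated list.
    if py_files:
        args += ["--py-files", ",".join(py_files)]
    return args


command = build_pyspark_command(
    "yarn",
    conf={"spark.sql.shuffle.partitions": 20000, "spark.memory.fraction": 0.8},
    py_files=["file1.py", "file2.py"],
)
print(shlex.join(command))
# ./bin/pyspark --master yarn --conf spark.sql.shuffle.partitions=20000 --conf spark.memory.fraction=0.8 --py-files file1.py,file2.py
```

Keeping the arguments in a list also makes it easy to pass them to subprocess.run later without worrying about shell quoting.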
5. PySpark Shell Command Options
Like any other shell command, the PySpark shell provides several options. You can list all available options with --help. Below are some of the important ones.
| PySpark Shell Options | Option Description |
| -I <file> | Preload <file>, enforcing line-by-line interpretation (a Scala REPL option listed by spark-shell --help) |
| --master MASTER_URL | spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]) |
| --deploy-mode DEPLOY_MODE | Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”) |
| --py-files PY_FILES | Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps |
| --name NAME | Specify the name of your application |
| --packages | Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths |
| --files FILES | Comma-separated list of files to be placed in the working directory of each executor |
For the complete list of pyspark shell options, run pyspark --help. If you look closely at its output, most of the options are the same as those of the spark-submit command.
In this article, you have learned what the PySpark shell is, how to launch it with several commands, and which command options are available.
Happy Learning !!