Apache Spark ships with the spark-shell command, which lets you interact with Spark from the command line. It is typically used to analyze data quickly or to test Spark statements interactively. The Spark shell is a REPL (Read-Eval-Print Loop). Apache Spark provides spark-shell for Scala, pyspark for Python, and sparkR for R. Java is not supported.
Spark Shell Key Points:
- The Spark shell is a REPL (Read-Eval-Print Loop) used to quickly test Spark/PySpark statements.
- The Spark shell supports only Scala, Python, and R; Java is not supported.
- The spark-shell command launches Spark with the Scala shell. I have covered this in detail in this article.
- The pyspark command launches Spark with the Python shell, also called PySpark.
- The sparkR command launches Spark with the R shell.
- In the Spark shell, Spark provides two variables by default: spark, an object of SparkSession, and sc, an object of SparkContext.
- The shell already runs an active SparkContext, and only one may be active per JVM, so use the provided sc instead of creating your own.
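To illustrate the last two points, here is a minimal sketch meant to be typed at the scala> prompt. It assumes the shell's predefined spark and sc variables and only shows that asking for a context or a new session hands back the context the shell already started.

```scala
import org.apache.spark.SparkContext

// getOrCreate() returns the context the shell already started
// rather than building a second one.
val existing = SparkContext.getOrCreate()
println(existing eq sc)

// newSession() gives an isolated session (its own SQL configuration and
// temporary views) that still shares the shell's single SparkContext.
val other = spark.newSession()
println(other.sparkContext eq sc)
```

If a separate session is all you need, spark.newSession() is usually enough and leaves the shell's context untouched.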
Pre-requisites: Before you proceed make sure you have Apache Spark installed.
1. Launch Spark Shell (spark-shell) Command
From the command line, go to the Apache Spark installation directory, type bin/spark-shell, and press Enter. This launches the Spark shell and gives you a Scala prompt for interacting with Spark in Scala. If you have added Spark's bin directory to your PATH, just enter spark-shell in the command line or terminal (macOS users).
On startup, the shell prints a banner and a few log lines. Let's understand a few statements from that output.
- By default, spark-shell creates a Spark context, which starts a Web UI at http://localhost:4040. In my case port 4040 could not be bound, so the UI was created on port 4042.
- The Spark context is created with an app id of the form local-*.
- By default, it uses local[*] as the master.
- The Spark context and Spark session are available as the variables sc and spark.
- It shows the Spark, Scala, and Java versions in use.
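The same details can be read back from inside the shell after startup. This is a small sketch that assumes it is run at the scala> prompt, where spark and sc are already defined.

```scala
// Versions reported in the startup banner.
println(s"Spark version : ${spark.version}")
println(s"Scala version : ${scala.util.Properties.versionNumberString}")
println(s"Java version  : ${System.getProperty("java.version")}")

// Master URL and application id of the running context.
println(s"Master        : ${sc.master}")
println(s"Application id: ${sc.applicationId}")
```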
2. Spark Shell Web UI
By default, the Spark Web UI launches on port 4040. If it cannot bind to that port, it tries 4041, 4042, and so on until it binds.
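Because the port can shift, it can be handy to ask the running context where the UI actually ended up. A minimal sketch, assuming the shell's predefined sc variable:

```scala
// sc.uiWebUrl is an Option[String]; it is None when the UI is disabled.
sc.uiWebUrl match {
  case Some(url) => println(s"Spark Web UI: $url")
  case None      => println("Spark Web UI is disabled")
}
```

To request a specific starting port instead of 4040, the spark.ui.port setting can be passed at launch with --conf.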
3. Run Spark Statements from Shell
Let’s create a Spark DataFrame with some sample data to validate the installation. Enter the following commands in the Spark Shell in the same order.
import spark.implicits._
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val df = data.toDF()
df.show()
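This prints the three rows as a table. Because toDF() was called without column names, the columns are named _1 and _2. For more examples on Apache Spark, refer to Spark Tutorial with Examples.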
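As a small variation on the statements above, the sketch below names the columns and casts the count to an integer. The column names language and users_count are hypothetical, chosen only for illustration, and the code assumes the shell's predefined spark variable.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))

// Name the columns instead of relying on the default _1 and _2.
val named = data.toDF("language", "users_count")

// The counts are strings in the sample data; cast them so that
// sorting is numeric rather than lexicographic.
val typed = named.withColumn("users_count", col("users_count").cast("int"))

typed.printSchema()
typed.orderBy(col("users_count").desc).show()
```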
4. Spark Shell Examples
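Let's see the different spark-shell command options.
Example 1: Launch on a YARN cluster
./bin/spark-shell \
  --master yarn \
  --deploy-mode client
This connects the shell to a YARN cluster. Because the shell is interactive, it runs the driver in client mode, that is, on the same machine where you are running the shell. Cluster deploy mode is not applicable to Spark shells, so passing --deploy-mode cluster fails at launch. Client mode is also the default, so the --deploy-mode option can be omitted.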
Example 2: Adding dependency jars
./bin/spark-shell \
  --master yarn \
  --deploy-mode client \
  --jars file1.jar,file2.jar
The --jars option takes a comma-separated list of jars to include on the driver and executor classpaths.
Example 3: Adding jars to the driver classpath
If you want to add a jar to the driver's classpath only, use the --driver-class-path option.
spark-shell --driver-class-path /path/to/example.jar:/path/to/another.jar
Example 4: With configs
./bin/spark-shell \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 2 \
  --conf "spark.sql.shuffle.partitions=20000" \
  --conf "spark.executor.memoryOverhead=5244" \
  --conf "spark.memory.fraction=0.8" \
  --conf "spark.memory.storageFraction=0.2" \
  --jars file1.jar,file2.jar
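Once the shell is up, you can check that such settings took effect. The following is a sketch to run at the scala> prompt, assuming the predefined spark and sc variables. Keys that were never set are reported as not set rather than throwing.

```scala
// SQL settings passed with --conf are visible through the session's
// runtime configuration.
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Core settings live on the SparkConf of the running context.
// getOption avoids an exception when a key was never set.
val conf = sc.getConf
Seq("spark.executor.memory", "spark.executor.cores", "spark.memory.fraction")
  .foreach { key =>
    println(s"$key = ${conf.getOption(key).getOrElse("<not set>")}")
  }
```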
5. Commands in Spark Shell
While working in the shell, you will probably need some help, for example to see which imports are available or to view the command history. You can list all available commands by using :help.
All commands can be abbreviated, e.g., :he instead of :help.
:completions <string> Output completions for the given string
:edit <id>|<line> Edit history
:help [command] Print this summary or command-specific help
:history [num] Show the history (optional num of commands to show)
:h? <string> Search the history
:imports [name name ...] Show import history, identifying sources of names
:implicits [-v] Show the implicits in scope
:javap <path|class> Disassemble a file or class name
:line <id>|<line> Place line(s) at the end of history
:load <path> Interpret lines in a file
:paste [-raw] [path] Enter paste mode or paste a file
:power Enable power user mode
:quit Exit the interpreter
:replay [options] Reset the repl and replay all previous commands
:require <path> Add a jar to the classpath
:reset [options] Reset the repl to its initial state, forgetting all session entries
:save <path> Save replayable session to a file
:sh <command line> Run a shell command (result is implicitly => List[String])
:settings <options> Update compiler options, if possible; see reset
:silent Disable/enable automatic printing of results
:type [-v] <expr> Display the type of an expression without evaluating it
:kind [-v] <type> Display the kind of a type. see also :help kind
:warnings Show the suppressed warnings from the most recent line which had any
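As one illustration of these commands, :paste is useful when several definitions must be compiled together, such as a case class and its companion object. The sketch below is plain Scala with hypothetical names (Language, parse). The REPL steps are given as comments.

```scala
// At the scala> prompt type :paste, paste the lines below, then press Ctrl-D.
// Paste mode compiles everything as one unit, which a class and its
// companion object require.
case class Language(name: String, users: Int)

object Language {
  // Hypothetical helper: parse a "name,users" line into a Language.
  def parse(line: String): Language = {
    val Array(name, users) = line.split(",").map(_.trim)
    Language(name, users.toInt)
  }
}

val lang = Language.parse("Scala, 3000")
println(lang)
```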
6. Accessing Environment Variables
Sometimes you need to access environment variables from the shell. You can do this with the System.getenv() method. Note that this is a Java method, but you can call it from Scala.
For example, set a variable in the UNIX shell with export, then open spark-shell and read it from the Scala prompt.
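A minimal sketch of the Scala side follows. The variable name MY_APP_ENV is hypothetical; it stands for whatever you exported before starting spark-shell.

```scala
// Hypothetical variable; set it before launching the shell, for example:
//   export MY_APP_ENV=dev
val name = "MY_APP_ENV"

// System.getenv returns null when the variable is not set.
val raw: String = System.getenv(name)

// sys.env is Scala's immutable Map view of the same environment,
// which makes it easy to supply a fallback.
val safe: String = sys.env.getOrElse(name, "<not set>")

println(s"System.getenv: $raw")
println(s"sys.env      : $safe")
```

The sys.env form is often more convenient in Scala because it avoids null checks.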
7. Run Unix Shell Script File
In case you want to run a Unix shell script (.sh file) from the Scala prompt, you can do so with :sh <file-name>. I have an nnk.sh file with the following content:
echo 'SparkByExamples.com' > nnk.out
scala> :sh /Users/admin/nnk.sh
res0: scala.tools.nsc.interpreter.ProcessResult = `/Users/admin/nnk.sh` (0 lines, exit 0)
Running nnk.sh creates the nnk.out file with the content SparkByExamples.com.
8. Load Scala Script
With :load, you can load a Scala file from the shell. First, create a Scala file; I will be creating nnk.scala, which prints SparkByExamples.com.
Now let's launch the shell and load this Scala program. This comes in handy when you have commands in a Scala file and want to run them from the shell.
scala> :load nnk.scala
Loading nnk.scala...
SparkByExamples.com
scala>
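The original listing of nnk.scala is not shown here. A hypothetical one-line file that would produce the output above could look like this:

```scala
// nnk.scala (hypothetical content, inferred only from the printed output)
println("SparkByExamples.com")
```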
9. Spark Shell Options
Like any other shell command, the Apache Spark shell provides several options; you can list all of them with -h (help). Below are some of the important ones.
|Spark Shell Option|Description|
|---|---|
|-I <file>|Preload <file>, enforcing line-by-line interpretation.|
|--master MASTER_URL|spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]).|
|--deploy-mode DEPLOY_MODE|Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster").|
|--class CLASS_NAME|Main class you want to run. This is applicable to Java / Scala apps.|
|--py-files PY_FILES|Comma-separated list of .zip, .egg, or .py files to place. This is applicable only to Python.|
|--name NAME|Specify the name of your application.|
|--jars JARS|Comma-separated list of jars to include on the driver and executor classpaths.|
|--packages|Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.|
|--files FILES|Comma-separated list of files to be placed in the working directory of each executor.|
For the complete list of spark-shell options, run spark-shell -h.
If you look closely at its output, most of the options are the same as those of the spark-submit command.
In this article, you have learned what the Spark shell is, how to use it with examples, and the different options available inside the shell.