Spark Shell Command Usage with Examples


Apache Spark comes with the spark-shell command, which is used to interact with Spark from the command line. It is typically used to quickly analyze data or test Spark statements from the command line. The shell is referred to as a REPL (Read Eval Print Loop). Apache Spark provides spark-shell for Scala, pyspark for Python, and sparkr for the R language. Java is not supported at this time.

Spark Shell Key Points –

  1. The Spark shell is referred to as a REPL (Read Eval Print Loop), which is used to quickly test Spark/PySpark statements.
  2. The Spark shell supports only Scala, Python and R; Java is not supported.
  3. The spark-shell command is used to launch Spark with the Scala shell; it is covered in detail in this article.
  4. The pyspark command is used to launch Spark with the Python shell, also called the PySpark shell.
  5. The sparkr command is used to launch Spark with the R language.
  6. In the Spark shell, Spark by default provides the spark and sc variables: spark is an object of SparkSession and sc is an object of SparkContext, as shown in the snippet after this list.
  7. In the shell you cannot create your own SparkContext, because one is already created for you.
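As a quick illustration of points 6 and 7, the snippet below (a sketch; the res numbers will vary) shows that the pre-created variables are ready to use and that sc is the same SparkContext held by the spark session.


scala> sc.appName
res0: String = Spark shell

scala> spark.sparkContext eq sc
res1: Boolean = true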

Prerequisites: Before you proceed, make sure you have Apache Spark installed.

1. Launch Spark Shell (spark-shell) Command

Go to the Apache Spark installation directory from the command line, type bin/spark-shell, and press enter. This launches the Spark shell and gives you a scala prompt to interact with Spark in the Scala language. If you have added Spark to your PATH, just enter spark-shell in the command line or terminal (for Mac users).


./bin/spark-shell

This yields the below output.

[Screenshot: Spark Shell Command output]

Let’s understand a few statements from the above screenshot.

  1. By default, spark-shell creates a Spark context, which internally creates a Web UI with the URL http://localhost:4040. Since it was unable to bind to port 4040 on my machine, it was created on port 4042.
  2. The Spark context is created with an app id of the form local-*.
  3. By default, it uses local[*] as the master (you can verify this from the prompt, as shown below).
  4. The Spark context and session are created with the variables 'sc' and 'spark' respectively.
  5. It shows the Spark, Scala and Java versions used.
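You can confirm the master and the Web UI address from the prompt itself; the exact host and port will differ on your machine.


scala> sc.master
res0: String = local[*]

scala> sc.uiWebUrl
res1: Option[String] = Some(http://192.168.1.10:4042)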

2. Spark Shell Web UI

By default, the Spark Web UI launches on port 4040; if it cannot bind to that port, it tries 4041, 4042, and so on until it binds.
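If you prefer a fixed port instead of this fallback behavior, you can set the spark.ui.port configuration when launching the shell; port 4050 below is just an example value.


./bin/spark-shell --conf "spark.ui.port=4050"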

3. Run Spark Statements from Shell

Let’s create a Spark DataFrame with some sample data to validate the installation. Enter the following commands in the Spark Shell in the same order.


import spark.implicits._
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val df = data.toDF() 
df.show()

This yields the below output. For more examples on Apache Spark, refer to Spark Tutorial with Examples.
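Since no column names were passed to toDF(), Spark assigns the default names _1 and _2:


+------+------+
|    _1|    _2|
+------+------+
|  Java| 20000|
|Python|100000|
| Scala|  3000|
+------+------+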


4. Spark Shell Examples

Let’s look at the different spark-shell command options.

Example 1: Launch on a cluster (YARN)


./bin/spark-shell \
   --master yarn

This connects the shell to a YARN cluster. Note that the Spark shell always runs the driver in client mode, which launches the driver on the same machine where you run the shell; --deploy-mode cluster is not applicable to interactive shells, so only the executors run on the cluster.

Example 2: In case you want to add dependency jars


./bin/spark-shell \
   --master yarn \
   --jars file1.jar,file2.jar

Example 3: Adding jars to spark-shell

If you want to add a jar to the spark-shell driver classpath, use the --driver-class-path option.


spark-shell --driver-class-path /path/to/example.jar:/path/to/another.jar

Example 4: With Configs


./bin/spark-shell \
   --master yarn \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2  \
   --conf "spark.sql.shuffle.partitions=20000" \
   --conf "spark.executor.memoryOverhead=5244" \
   --conf "spark.memory.fraction=0.8" \
   --conf "spark.memory.storageFraction=0.2" \
   --jars file1.jar,file2.jar
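Example 5: Adding Maven packages

Instead of shipping jar files with --jars, dependencies can also be pulled from Maven coordinates with the --packages option. The spark-avro coordinate below is only an illustrative example; use an artifact and version that match your Spark build.


./bin/spark-shell \
   --master yarn \
   --packages org.apache.spark:spark-avro_2.12:3.4.1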

5. Commands in Spark Shell

While you are interacting with the shell, you may need some help, for example, to see which imports are available, the command history, etc. You can get all available commands by using :help. A couple of these commands are demonstrated after the listing below.

scala> :help

All commands can be abbreviated, e.g., :he instead of :help.

:completions <string>    Output completions for the given string

:edit <id>|<line>        Edit history

:help [command]          Print this summary or command-specific help

:history [num]           Show the history (optional num of commands to show)

:h? <string>             Search the history

:imports [name name ...] Show import history, identifying sources of names

:implicits [-v]          Show the implicits in scope

:javap <path|class>      Disassemble a file or class name

:line <id>|<line>        Place line(s) at the end of history

:load <path>             Interpret lines in a file

:paste [-raw] [path]     Enter paste mode or paste a file

:power                   Enable power user mode

:quit                    Exit the interpreter

:replay [options]        Reset the repl and replay all previous commands

:require <path>          Add a jar to the classpath

:reset [options]         Reset the repl to its initial state, forgetting all session entries

:save <path>             Save replayable session to a file

:sh <command line>       Run a shell command (result is implicitly => List[String])

:settings <options>      Update compiler options, if possible; see reset

:silent                  Disable/enable automatic printing of results

:type [-v] <expr>        Display the type of an expression without evaluating it

:kind [-v] <type>        Display the kind of a type. see also :help kind

:warnings                Show the suppressed warnings from the most recent line which had any
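For example, :type prints the static type of an expression without evaluating it, which is handy for inspecting the built-in shell variables.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type sc
org.apache.spark.SparkContext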

6. Accessing Environment Variables

Sometimes you may need to access environment variables in the shell; you can achieve this with the System.getenv() method. Note that this is a Java method, but you can call it from Scala.

For example, set a variable in a UNIX shell.


export ENV_NAME='SparkByExamples.com'

Now open spark-shell and access it from the scala prompt.


scala> System.getenv("ENV_NAME")
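Assuming the variable was exported before spark-shell was launched, the call returns its value (the result name will vary).

res0: String = SparkByExamples.com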

7. Run Unix Shell Script File

In case you want to run a Unix shell script (.sh file) from the scala prompt, you can do so by using :sh <file-name>. I have an nnk.sh file with the content echo 'SparkByExamples.com' > nnk.out


scala> :sh /Users/admin/nnk.sh
res0: scala.tools.nsc.interpreter.ProcessResult = `/Users/admin/nnk.sh` (0 lines, exit 0)

This executes the nnk.sh file, which creates an nnk.out file with the content 'SparkByExamples.com'.

8. Load Scala Script

By using :load from the shell, you can load a Scala file. First, create a scala file; I will be creating nnk.scala with the content println("SparkByExamples.com")

Now let’s launch the shell and load this scala program. This comes in handy when you have commands in a scala file and want to run them from the shell.


scala> :load nnk.scala
Loading nnk.scala...
SparkByExamples.com

scala> 
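The same mechanism works for scripts that contain Spark statements. As a sketch, a hypothetical create_df.scala like the one below could be loaded with :load create_df.scala to build and display a DataFrame in one step.


// create_df.scala (hypothetical) - builds a small DataFrame and prints it
import spark.implicits._
val langs = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val langDf = langs.toDF("language", "users_count")
langDf.show()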

9. Spark Shell Options

Like any other shell command, the Apache Spark shell also provides several options; you can see all available options with -h (help). Below are some of the important options.

Spark Shell Options:

-I <file>                    Preload <file>, enforcing line-by-line interpretation.
--master MASTER_URL          spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
--class CLASS_NAME           Main class you want to run. This is applicable to Java / Scala apps.
--py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH. This is applicable only to Python.
--name NAME                  Specify the name of your application.
--jars JARS                  Comma-separated list of jars to include on the driver and executor classpaths.
--packages                   Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
--files FILES                Comma-separated list of files to be placed in the working directory of each executor.
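For example, the -I option from the table above can be used to preload a Scala script (such as the nnk.scala file from section 8, assuming it is in the current directory) before the prompt appears.


./bin/spark-shell -I nnk.scala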

For the complete list of spark-shell options, use the -h option.


./bin/spark-shell -h

This yields the below output. If you look closely, most of the options are similar to the spark-submit command.

[Screenshot: Spark Shell Help output]

Conclusion

In this article, you have learned what the Spark shell is, how to use it with examples, and the different options available inside the shell.

