In this article, I will explain step by step how to install Apache Spark on Windows 7, 10, and later versions, and how to start a history server and monitor your jobs using the Web UI.
Install Java 8 or Later
To install Apache Spark on Windows, you need Java 8 or later, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download and install that instead.
After the download completes, double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows system. Choose any custom directory or keep the default location.
Note: This article explains installing Apache Spark on Java 8; the same steps also work for Java 11 and 13.
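To confirm that the installed Java meets the Java 8 minimum, you can inspect the output of java -version. The following is a minimal Python sketch, not part of the Spark tooling. The helper names parse_java_major and installed_java_major are hypothetical, and the sketch assumes the version string looks like the ones Oracle JDK and OpenJDK print (for example, java version "1.8.0_201" or openjdk version "11.0.2").

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the Java major version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found in the output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier use the legacy 1.x scheme, e.g. "1.8.0_201" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` (java must be on PATH) and return its major version."""
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

With these helpers, a check such as installed_java_major() >= 8 tells you whether the prerequisite is met.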
Apache Spark Installation on Windows
Apache Spark is distributed as a compressed tar/zip file, so installation on Windows is simple: you just download and untar the file. Download Apache Spark from the Spark Download page by selecting the link at “Download Spark” (point 3 on that page).
If you want a different version of Spark and Hadoop, select it from the drop-downs; the link at point 3 changes to the selected version and gives you an updated download link.
After the download, untar the binary using 7zip or any other archive utility and copy the extracted directory to the location you will use as SPARK_HOME (this article uses C:\apps\opt\spark-3.0.0-bin-hadoop2.7).
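If you would rather script the extraction than use 7zip, Python's standard tarfile module can unpack the archive. This is only a sketch under a few assumptions: the function name extract_spark_archive is hypothetical, the download is a gzip-compressed tar file, and the archive contains a single top-level folder.

```python
import tarfile
from pathlib import Path


def extract_spark_archive(archive_path: str, target_dir: str) -> Path:
    """Extract a Spark .tgz archive into target_dir and return the top-level folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Assumption: the first path component of the first member is the
        # single top-level folder of the archive.
        top_level = archive.getnames()[0].split("/")[0]
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters can reject unsafe members.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target / top_level
```

The returned path is the directory you would then point SPARK_HOME at.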
Spark Environment Variables
After installing Java and Apache Spark on Windows, set the JAVA_HOME, SPARK_HOME, HADOOP_HOME, and PATH environment variables. If you already know how to set environment variables on Windows, add the following.
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;%JAVA_HOME%\bin

SPARK_HOME = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
PATH = %PATH%;%SPARK_HOME%\bin
Follow the steps below if you are not familiar with adding or editing environment variables on Windows.
1. Open the System Environment Variables window and select Environment Variables.
2. On the Environment Variables screen that follows, add JAVA_HOME by selecting the New option.
3. This opens the New User Variable window, where you can enter the variable name and value. Repeat this for SPARK_HOME and HADOOP_HOME.
4. Now edit the PATH variable.
5. Add the Spark, Java, and Hadoop bin locations by selecting the New option.
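After saving the variables, it can help to verify them from a newly opened command prompt, since prompts that were already open do not pick up the changes. The check below is a rough Python sketch; the function name find_env_problems is hypothetical, and it only assumes what this article sets up, namely that each of the three variables points at a directory containing a bin folder.

```python
import os
from typing import List, Mapping

# The three variables this article asks you to define.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def find_env_problems(env: Mapping[str, str]) -> List[str]:
    """Return one readable message for each variable that looks wrong."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(os.path.join(value, "bin")):
            problems.append(f"{name} does not contain a bin directory: {value}")
    return problems
```

Calling find_env_problems(os.environ) returns an empty list when all three variables are in place; otherwise each message names the variable to fix.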
Spark with winutils.exe on Windows
Many beginners think Apache Spark needs a Hadoop cluster to run, but that is not true: Spark can run on AWS using S3, or on Azure using Blob Storage, without Hadoop, HDFS, etc.
To run Apache Spark on Windows, you need winutils.exe, because the Hadoop libraries that Spark uses perform POSIX-like file access operations on Windows through the Windows API. winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.
Download winutils.exe for Hadoop 2.7 and copy it to the %SPARK_HOME%\bin folder. winutils builds differ for each Hadoop version, so download the one that matches the Hadoop version of your Spark distribution from https://github.com/steveloughran/winutils
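To work out which winutils build matches your download, you can read the Hadoop version from the Spark folder name. The sketch below is illustrative only: both function names are hypothetical, and it assumes the folder keeps the naming pattern of the official download used in this article (spark-3.0.0-bin-hadoop2.7). It uses ntpath so Windows-style paths are handled the same way on any machine.

```python
import ntpath
import re


def hadoop_version_from_spark_dir(spark_home: str) -> str:
    """Infer the Hadoop version from a folder named like spark-X-bin-hadoopY."""
    name = ntpath.basename(spark_home.rstrip("\\/"))
    match = re.search(r"-bin-hadoop(\d+(?:\.\d+)*)$", name)
    if match is None:
        raise ValueError(f"cannot infer a Hadoop version from {name!r}")
    return match.group(1)


def expected_winutils_path(spark_home: str) -> str:
    """Return the location this article copies winutils.exe to."""
    return ntpath.join(spark_home, "bin", "winutils.exe")
```

For the SPARK_HOME used in this article, the first helper yields 2.7, which matches the winutils build recommended above, and the second gives the path where the file should exist after you copy it.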
Apache Spark shell
spark-shell is a CLI utility that comes with the Apache Spark distribution. Open a command prompt, run cd %SPARK_HOME%/bin, and type the spark-shell command to start the Apache Spark shell. You should see the Spark startup output followed by a scala> prompt (ignore the error you see at the end). It may take a minute or two for your Spark instance to initialize.
At the spark-shell command line, you can run any Spark statement, such as creating an RDD or getting the Spark version.
scala> spark.version
res2: String = 3.0.0

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at console:24

scala>
This completes the installation of Apache Spark on Windows 7, 10, and later versions.
Where to go Next?
You can continue with the sections below to see how to debug logs using the Spark Web UI and how to enable the Spark history server.
Web UI on Windows
Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and Spark configurations. In the Spark Web UI, you can see how operations are executed.
The history server keeps a log of all Spark applications you submit from spark-shell. You can enable Spark to collect the logs by adding the configs below to the spark-defaults.conf file, which is located in the conf directory of your Spark installation.
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
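One detail worth checking: spark.history.fs.logDirectory tells the history server where to read event logs, while applications write them to spark.eventLog.dir, which defaults to file:///tmp/spark-events. The Python sketch below generates the config lines with both properties pointing at the same directory. Adding spark.eventLog.dir is a suggestion of this sketch rather than something the two lines above include, and the function name history_conf_lines is hypothetical. The URI is built by simple concatenation, so it assumes a path without spaces or other characters that would need escaping.

```python
from pathlib import PureWindowsPath
from typing import List


def history_conf_lines(log_dir: str) -> List[str]:
    """Build spark-defaults.conf lines that write and read event logs in log_dir."""
    # Convert a Windows path such as c:\logs\path into file:///c:/logs/path.
    uri = "file:///" + PureWindowsPath(log_dir).as_posix()
    return [
        "spark.eventLog.enabled true",
        # Where applications write their event logs (not in the original snippet).
        f"spark.eventLog.dir {uri}",
        # Where the history server looks for those logs.
        f"spark.history.fs.logDirectory {uri}",
    ]
```

Printing "\n".join(history_conf_lines(r"c:\logs\path")) gives text you could paste into spark-defaults.conf; the directory itself should already exist before you start Spark.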
After setting the above properties, start the history server by running the command below.
%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
By default, the history server listens on port 18080, and you can access it from a browser at http://localhost:18080/
Clicking an App ID shows the details of that application in the Spark Web UI.
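The same list of applications is also available from the history server's REST API under /api/v1/applications, which is handy if you want to check it from a script instead of a browser. Here is a small Python sketch; the function names are hypothetical, and the host and port defaults assume the local setup described in this article.

```python
import json
from typing import List
from urllib.request import urlopen


def applications_url(host: str = "localhost", port: int = 18080) -> str:
    """Return the REST endpoint that lists applications known to the history server."""
    return f"http://{host}:{port}/api/v1/applications"


def parse_app_ids(payload: str) -> List[str]:
    """Pull the App IDs out of the JSON array returned by the endpoint."""
    return [app["id"] for app in json.loads(payload)]


def fetch_app_ids(host: str = "localhost", port: int = 18080,
                  timeout: float = 5.0) -> List[str]:
    """Query a running history server and return its App IDs."""
    with urlopen(applications_url(host, port), timeout=timeout) as response:
        return parse_app_ids(response.read().decode("utf-8"))
```

fetch_app_ids() only works while the history server is running; if it raises a connection error, the server is most likely not started or is listening on a different port.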
In summary, you have learned how to install Apache Spark on Windows, run sample statements in spark-shell, and start the Spark Web UI and history server.
If you have any issues setting up, please leave a message in the comments section and I will try to respond with a solution.
Happy Learning !!