Steps to install Apache Spark 3.5 on Windows: in this article, I will explain step by step how to install Apache Spark 3.5 on Windows 10 and 11, and how to start a history server and monitor your jobs using the Web UI.
Install Java 8 or Later
To install Apache Spark 3.5 on Windows 10 or 11, you first need Java on your system. Spark 3.5 runs on Java/JDK 8, 11, or 17, which you can download from oracle.com, https://openjdk.org/, or https://jdk.java.net/.
After downloading, double-click on the downloaded .exe file to install it on your Windows system, choose any custom directory, or keep the default location.
Note: This article explains installing Apache Spark 3.5 with Java 17; the same steps also work for Java 8 and 11.
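If you are not sure which Java release you have, the version string printed by java -version can be mapped to a major release. The sketch below is only an illustration and assumes Python 3 is available on your machine; the function names java_major_version and is_supported_for_spark35 are hypothetical helpers, not part of Spark or Java.

```python
import re

# Java releases that the Spark 3.5 documentation lists as supported.
SUPPORTED_JAVA_MAJORS = {8, 11, 17}


def java_major_version(version: str) -> int:
    """Return the major Java release for strings like '1.8.0_201' or '17.0.9'."""
    parts = re.split(r"[._+-]", version.strip())
    first = int(parts[0])
    # Java 8 and earlier use the legacy '1.x' scheme, so the major is the second field.
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first


def is_supported_for_spark35(version: str) -> bool:
    """Return True when the version string maps to Java 8, 11, or 17."""
    return java_major_version(version) in SUPPORTED_JAVA_MAJORS


if __name__ == "__main__":
    for v in ("1.8.0_201", "11.0.21", "17.0.9"):
        print(v, "->", java_major_version(v), is_supported_for_spark35(v))
```

Pass in the version text exactly as your JDK reports it; the helper only inspects the leading fields.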
Apache Spark Installation on Windows
Apache Spark comes as a compressed tar/zip file, so installation on Windows is simple: you just need to download and untar the file. Download Apache Spark from the Spark Download page by selecting the link under "Download Spark" (point 3 on that page).
If you want to use a different version of Spark and Hadoop, select the one you want from the drop-downs; the link at point 3 then changes to the selected version and gives you an updated download link.
After the download, untar the binary using 7zip or any other archive utility, and copy the extracted directory to the location you will use as SPARK_HOME (this article uses C:\apps\opt\spark-3.5.0-bin-hadoop3).
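If you prefer a script over 7zip, the extraction step can be sketched with Python's standard tarfile module. This is a minimal sketch that assumes Python 3 is installed and that the download is a gzip-compressed tar (.tgz); extract_spark_archive is a hypothetical helper name, and the paths in the usage comment are examples only.

```python
import tarfile
from pathlib import Path


def extract_spark_archive(archive: Path, target_dir: Path) -> Path:
    """Extract a .tgz archive into target_dir and return the top-level extracted folder."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # The first path component of the first member is the archive's root folder.
        top_level = Path(tar.getnames()[0]).parts[0]
        tar.extractall(target_dir)
    return target_dir / top_level


# Example usage (paths are assumptions; adjust them to your download location):
# spark_home = extract_spark_archive(
#     Path(r"C:\Users\me\Downloads\spark-3.5.0-bin-hadoop3.tgz"),
#     Path(r"C:\apps\opt"),
# )
```

The returned path is the folder you would then use as SPARK_HOME.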
Spark Environment Variables
After installing Java and Apache Spark on Windows, set the JAVA_HOME, SPARK_HOME, HADOOP_HOME, and PATH environment variables. If you already know how to set environment variables on Windows, add the following, adjusting the paths to your own install locations.
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
SPARK_HOME = C:\apps\opt\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\opt\spark-3.5.0-bin-hadoop3
PATH=%PATH%;%SPARK_HOME%\bin;%JAVA_HOME%\bin
Follow the steps below if you are unaware of how to add or edit environment variables on Windows.
1. Open the System Environment Variables window and select Environment Variables.
2. On the Environment Variables screen, add JAVA_HOME, SPARK_HOME, and HADOOP_HOME by selecting the New option.
3. This opens the New User Variable window, where you can enter the variable name and value. Enter the respective path for each variable.
4. Now Edit the PATH variable.
5. Add Spark, Java, and Hadoop bin locations by selecting the New option.
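Once the variables are saved, you can sanity-check them from a new command prompt. The following is a rough sketch, assuming Python 3 is available; check_spark_env is a hypothetical helper that only inspects the variables named in this article and does not verify that the folders actually exist.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def _normalize(path: str) -> str:
    """Lower-case a Windows path, unify separators, and drop trailing separators."""
    return path.strip().replace("/", "\\").rstrip("\\").lower()


def check_spark_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # Windows separates PATH entries with ';' and compares paths case-insensitively.
    path_entries = {_normalize(p) for p in env.get("PATH", "").split(";") if p.strip()}
    for name in ("SPARK_HOME", "JAVA_HOME"):
        home = env.get(name)
        if home and _normalize(home) + "\\bin" not in path_entries:
            problems.append(f"{name}\\bin is not on PATH")
    return problems


if __name__ == "__main__":
    for line in check_spark_env(os.environ) or ["All checks passed"]:
        print(line)
```

Open a fresh command prompt before running it, because windows that were already open do not pick up newly added variables.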
Spark with winutils.exe on Windows
Many beginners think Apache Spark needs a Hadoop cluster installed to run, but that’s not true; for example, Spark can run on AWS using S3 or on Azure using blob storage, without Hadoop or HDFS.
To run Apache Spark on Windows, however, you need winutils.exe, because the Hadoop libraries bundled with Spark rely on POSIX-like file access operations, which winutils.exe provides on Windows through the Windows API.
winutils.exe enables Spark to use Windows-specific services, including running shell commands on a Windows environment.
Download winutils.exe for Hadoop 3.3 and copy it to the %SPARK_HOME%\bin folder. Winutils builds differ for each Hadoop version, so download the one that matches the Hadoop version of your Spark distribution.
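To confirm the file ended up in the right place, a small check like the one below can help. It is only a sketch, assuming Python 3 is available and that HADOOP_HOME points at the same folder as SPARK_HOME, as configured earlier in this article; find_winutils is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Optional


def find_winutils(hadoop_home: str) -> Optional[Path]:
    """Return the path to winutils.exe under <hadoop_home>/bin, or None if it is missing."""
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    home = os.environ.get("HADOOP_HOME", "")
    found = find_winutils(home) if home else None
    print(found if found else "winutils.exe not found under HADOOP_HOME\\bin")
```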
Apache Spark shell
spark-shell is a CLI utility that comes with the Apache Spark distribution. Open a command prompt, run cd %SPARK_HOME%/bin, and type spark-shell to start the Apache Spark shell. You should see the Spark startup output ending in a scala> prompt (you can ignore the error shown at the end). Sometimes, your Spark instance may take a minute or two to initialize.
Spark-shell also creates a Spark context web UI, which by default can be accessed at http://localhost:4040 (if port 4040 is already in use, Spark tries the next port, such as 4041).
On the spark-shell command line, you can run any Spark statement, such as creating an RDD or getting the Spark version.
scala> spark.version
res2: String = 3.5.0

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at console:24

scala>
This completes the installation of Apache Spark on Windows 10, 11, or any later version.
Where to go Next?
You can continue with the sections below to see how to debug using the Spark Web UI and enable the Spark history server, or follow the related links as next steps.
Web UI on Windows
Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, resource consumption of Spark cluster, and Spark configurations. On Spark Web UI, you can see how the operations are executed.
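The History server keeps a log of all Spark applications you submit through spark-shell. You can enable Spark to collect the logs by adding the configs below to the spark-defaults.conf file in Spark's conf directory. Setting spark.eventLog.dir to the same directory as spark.history.fs.logDirectory makes Spark write its event logs where the history server reads them.
spark.eventLog.enabled true
spark.eventLog.dir file:///c:/logs/path
spark.history.fs.logDirectory file:///c:/logs/path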
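A quick way to double-check the file is to parse it and confirm the two history-related properties are present. The sketch below assumes Python 3 and handles only the simple whitespace-separated "key value" lines shown in this article; parse_spark_defaults and history_config_problems are hypothetical helper names, not Spark APIs.

```python
from typing import Dict, List


def parse_spark_defaults(text: str) -> Dict[str, str]:
    """Parse whitespace-separated 'key value' lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def history_config_problems(props: Dict[str, str]) -> List[str]:
    """Report which of the history server settings from this article are missing."""
    problems = []
    if props.get("spark.eventLog.enabled", "false").lower() != "true":
        problems.append("spark.eventLog.enabled is not true")
    if "spark.history.fs.logDirectory" not in props:
        problems.append("spark.history.fs.logDirectory is not set")
    return problems


if __name__ == "__main__":
    sample = (
        "spark.eventLog.enabled true\n"
        "spark.history.fs.logDirectory file:///c:/logs/path\n"
    )
    print(history_config_problems(parse_spark_defaults(sample)) or "Looks fine")
```

To check your own file, read its contents with open(...).read() and pass the text to parse_spark_defaults.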
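After setting the above properties, start the history server by running the command below. On Windows, this is typically done through the spark-class script:
%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer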
By default, the History server listens on port 18080, and you can access it from the browser at http://localhost:18080/
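If the page does not load, it can help to check whether anything is listening on the port at all. The following is a minimal sketch using Python's standard socket module, assuming Python 3 is available; is_port_open is a hypothetical helper, and it only tests that a TCP connection succeeds, not that the service behind it is Spark.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 18080 is the history server's default port mentioned above.
    state = "open" if is_port_open("localhost", 18080) else "closed"
    print(f"History server port 18080 is {state}")
```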
By clicking on each App ID, you will get the details of the application in Spark web UI.
In summary, you have learned how to install Apache Spark on Windows, run sample statements in spark-shell, and start the Spark Web UI and history server.
If you have any issues setting this up, please leave a message in the comments section, and I will try to respond with a solution.
Happy Learning !!
- Apache Spark Installation on Linux
- Install Apache Spark Latest Version on Mac
- How to Check Spark Version
- What does setMaster(local[*]) mean in Spark
- Spark Start History Server
- Spark with Scala setup on IntelliJ
- Spark Submit Command Explained with Examples
- Spark Shell Command Usage with Examples
- Spark Hello World Example in IntelliJ IDEA
- Spark Word Count Explained with Example
- Spark Setup on Hadoop Cluster with Yarn
- What is SparkSession and How to create it?
- What is SparkContext and How to create it?
- Install PySpark on Ubuntu running on Linux
- Install PySpark in Jupyter on Mac using Homebrew
- Install PySpark in Anaconda & Jupyter Notebook
- How to Install PySpark on Mac
- How to Install PySpark on Windows
- Install PySpark using pip or conda