PySpark is the Python API for Apache Spark; it lets you run Python applications using Apache Spark capabilities. It ships with Spark itself, so in this setup there is no separate PySpark library to download. All you need is Spark.
Follow the steps below to install PySpark on Windows.
Install Python or Anaconda distribution
Download and install either Python from Python.org or the Anaconda distribution, which includes Python, the Spyder IDE, and Jupyter Notebook. I would recommend Anaconda, as it is popular and widely used by the machine learning and data science community.
Install Java 8
To run a PySpark application you need Java 8 or a later version, so download Java from Oracle and install it on your system.
After installation, set the JAVA_HOME and PATH variables.
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
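To confirm that the two variables took effect, a small script can inspect them. This is a minimal sketch using only the Python standard library; the function name java_home_problems is made up for this example and is not part of Java or Spark.

```python
import os


def java_home_problems(env):
    """Return a list of human-readable problems with the Java settings in `env`."""
    problems = []
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems

    java_bin = os.path.join(java_home, "bin")
    if not os.path.isdir(java_bin):
        problems.append("JAVA_HOME has no bin folder: " + java_bin)

    # PATH entries are separated by os.pathsep (";" on Windows).
    # Normalise both sides so case and trailing slashes do not matter.
    entries = [
        os.path.normcase(os.path.normpath(p))
        for p in env.get("PATH", "").split(os.pathsep)
        if p
    ]
    if os.path.normcase(os.path.normpath(java_bin)) not in entries:
        problems.append("PATH does not include " + java_bin)
    return problems
```

Calling java_home_problems(os.environ) from a newly opened command prompt should return an empty list once both settings are in place.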
PySpark Install on Windows
1. On the Spark download page, select the link “Download Spark (point 3)” to download. If you want a different version of Spark and Hadoop, select it from the drop-downs; the link at point 3 then updates to point to the selected version.
2. After the download finishes, extract the archive using 7zip and copy the extracted folder (spark-3.0.0-bin-hadoop2.7 in this example) to C:\apps, so that it matches the paths used in the next step.
3. Now set the following environment variables.
SPARK_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH = %PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
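All three values derive from a single install folder, so they can also be applied to one Python process instead of system-wide. The following is a sketch under that assumption; apply_spark_env is a hypothetical helper name, and the change only affects the mapping you pass in (for os.environ, the current process and anything it launches).

```python
import ntpath  # Windows path rules, regardless of where the script runs


def apply_spark_env(install_dir, env):
    """Set SPARK_HOME, HADOOP_HOME and PATH in the mapping `env`."""
    bin_dir = ntpath.join(install_dir, "bin")
    env["SPARK_HOME"] = install_dir
    env["HADOOP_HOME"] = install_dir

    # Windows separates PATH entries with ";" and ignores case.
    entries = [p for p in env.get("PATH", "").split(";") if p]
    if bin_dir.lower() not in [p.lower() for p in entries]:
        entries.append(bin_dir)
    env["PATH"] = ";".join(entries)
    return env
```

For example, apply_spark_env(r"C:\apps\spark-3.0.0-bin-hadoop2.7", os.environ) would apply the settings shown above before Spark is started from that same script. Calling it twice does not add the bin folder to PATH a second time.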
Install winutils.exe on Windows
Download the winutils.exe file and copy it to the %SPARK_HOME%\bin folder. The winutils binaries differ for each Hadoop version, so download the one that matches your Hadoop version from https://github.com/steveloughran/winutils
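A quick way to check that the file landed in the right place is to look for it under the bin folder of HADOOP_HOME, which in this guide is the same folder as SPARK_HOME. This is only a sketch; winutils_path and has_winutils are illustrative names, not part of Hadoop or Spark.

```python
import ntpath  # build the path with Windows separators
import os


def winutils_path(hadoop_home):
    """Return the location where winutils.exe is expected under `hadoop_home`."""
    return ntpath.join(hadoop_home, "bin", "winutils.exe")


def has_winutils(env):
    """True if HADOOP_HOME is set in `env` and winutils.exe exists in its bin folder."""
    hadoop_home = env.get("HADOOP_HOME", "")
    return bool(hadoop_home) and os.path.isfile(winutils_path(hadoop_home))
```

Running has_winutils(os.environ) on the Windows machine should return True after the copy step.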
Now open the command prompt and run the pyspark command to start the PySpark shell. You should see the Spark startup banner followed by a Python prompt.
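Once the prompt appears, a few sample statements are enough to confirm that the installation works. The sketch below relies on the spark session object that the PySpark shell creates for you; the rows, column names, and the show_sample function are made up for illustration.

```python
# Illustrative sample data; the values have no particular meaning.
rows = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
columns = ["language", "users_count"]


def show_sample(spark):
    """Build a small DataFrame from `rows`, print it, and return it."""
    df = spark.createDataFrame(rows, schema=columns)
    df.show()
    return df


# In the PySpark shell, `spark` is already defined, so you can run:
#   show_sample(spark)
```

If a three-row table is printed, Python, Java, and Spark are all talking to each other.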
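The PySpark shell also starts a Spark context web UI, which by default is available at http://localhost:4040. If that port is already in use, Spark tries the next one, so the UI may appear at http://localhost:4041 instead.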
Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application.
The history server keeps a log of all PySpark applications you submit through spark-submit or the pyspark shell. Before you start it, set the following configuration:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
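The configuration lines can be generated from a Windows folder path, which avoids typing the file:/// form by hand. This is a sketch with hypothetical helper names. It also adds spark.eventLog.dir, which is not in the configuration above: Spark applications write their event logs to that location (file:///tmp/spark-events unless changed), so pointing it at the same folder is an assumption made here so that the history server reads what the applications write. The lines would typically go into the spark-defaults.conf file in Spark's conf folder.

```python
def to_file_uri(windows_path):
    """Turn a Windows path into a file:/// URI (simple form, no percent-encoding)."""
    return "file:///" + windows_path.replace("\\", "/")


def history_config_lines(log_dir):
    """Return the event-log settings for the given log folder."""
    uri = to_file_uri(log_dir)
    return [
        "spark.eventLog.enabled true",
        # Assumption, not in the original config: write logs where the server reads them.
        "spark.eventLog.dir " + uri,
        "spark.history.fs.logDirectory " + uri,
    ]


# Example: print the lines for the folder used in this guide.
for line in history_config_lines(r"c:\logs\path"):
    print(line)
```

The log folder generally needs to exist before the history server is started.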
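Now start the history server. On Linux or Mac, run the start-history-server.sh script from Spark's sbin folder.
If you are running PySpark on Windows, where that shell script cannot be used directly, start the history server class through the spark-class command instead.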
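The launch command for each platform can be sketched as follows. The script and class names are the ones found in a standard Spark distribution (sbin/start-history-server.sh, bin\spark-class.cmd, and org.apache.spark.deploy.history.HistoryServer); the function history_server_command is a hypothetical helper, and your layout may differ.

```python
import ntpath
import posixpath

HISTORY_SERVER_CLASS = "org.apache.spark.deploy.history.HistoryServer"


def history_server_command(spark_home, windows):
    """Return the command (as an argument list) that starts the history server."""
    if windows:
        # On Windows the server class is launched through spark-class.cmd.
        launcher = ntpath.join(spark_home, "bin", "spark-class.cmd")
        return [launcher, HISTORY_SERVER_CLASS]
    # On Linux or Mac the distribution ships a dedicated start script.
    return [posixpath.join(spark_home, "sbin", "start-history-server.sh")]


# Possible usage (not executed here):
#   import os, subprocess
#   subprocess.run(history_server_command(os.environ["SPARK_HOME"], os.name == "nt"))
```

Typing the resulting command directly into a terminal or command prompt has the same effect.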
By default, the history server listens on port 18080, and you can access it from a browser at http://localhost:18080/
Clicking an App ID shows the details of that application in the PySpark web UI.
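If a page does not load in the browser, it helps to check whether anything is listening on the port at all. The sketch below uses only the standard library and the default ports (4040 for a running application's UI, 18080 for the history server); port_is_open and ui_status are illustrative names.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def ui_status(host="localhost"):
    """Report which of the default Spark web UI ports accept connections."""
    defaults = (("application UI", 4040), ("history server", 18080))
    return {name: port_is_open(host, port) for name, port in defaults}
```

A result of False for the history server usually means it is not running or is configured to use a different port.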
In summary, you have learned how to install PySpark on Windows and run sample statements in the PySpark shell.
If you run into any issues during setup, please leave a message in the comments section and I will try to respond with a solution.
Happy Learning !!