How to Install PySpark on Windows

In this article, I explain how to install and run PySpark on Windows, how to start the history server, and how to monitor your jobs using the Web UI.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. It ships with the Spark distribution, so for this setup there is no separate PySpark library to download; all you need is Spark.

Follow the steps below to install PySpark on Windows.

Install Python or Anaconda distribution

Download and install either Python from Python.org or the Anaconda distribution, which includes Python, the Spyder IDE, and Jupyter Notebook. I recommend Anaconda because it is widely used in the machine learning and data science community.

If you choose Anaconda, follow the article "Install PySpark using Anaconda & run Jupyter notebook".

Install Java 8

To run a PySpark application you need Java 8 or a later version that your Spark release supports (Spark 3.0 runs on Java 8 and 11), so download a JDK from Oracle and install it on your system.

After installation, set the JAVA_HOME and PATH environment variables.


JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
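
To confirm that JAVA_HOME points at a real JDK before moving on, a small check like the one below can help. It is only a sketch: the function name find_java is my own, and it assumes the java executable lives in the bin folder under JAVA_HOME, as in the paths above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_java(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the java executable under JAVA_HOME/bin, or None if it is missing."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    bin_dir = Path(java_home) / "bin"
    # Windows installs ship java.exe; Linux and macOS ship a plain "java" file.
    for name in ("java.exe", "java"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None
```

Calling find_java() in a new Python session returns the path of the executable when the variable is set correctly and None otherwise. Open a new command prompt first, because windows that were already open do not see the updated variables.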

PySpark Install on Windows

Follow the steps below to install Spark, which includes PySpark, on Windows.

1. On the Spark download page, click the link next to “Download Spark” (point 3) to download the release. If you want a different version of Spark and Hadoop, select it from the drop-downs; the link in point 3 updates to match the selected version.

2. After the download finishes, extract the archive using 7zip and copy the extracted folder spark-3.0.0-bin-hadoop2.7 to c:\apps

3. Now set the following environment variables.


SPARK_HOME  = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
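
If 7zip is not installed, step 2 can also be done with Python's standard tarfile module. The following is a rough sketch, not the only way to do it: extract_spark is a name I made up, and the archive path you pass in depends on where your browser saved the download.

```python
import tarfile
from pathlib import Path


def extract_spark(archive: str, dest: str) -> Path:
    """Unpack a Spark .tgz archive into dest and return the top-level folder."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # The first path component of the first member is the folder name,
        # for example spark-3.0.0-bin-hadoop2.7
        top_level = tar.getnames()[0].split("/")[0]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data" rejects
            # unsafe members such as absolute paths.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return dest_dir / top_level
```

For example, extract_spark(r"C:\Users\me\Downloads\spark-3.0.0-bin-hadoop2.7.tgz", r"C:\apps") would return the folder to use for SPARK_HOME; the Downloads path here is only an illustration.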

Install winutils.exe on Windows

Download winutils.exe and copy it to the %SPARK_HOME%\bin folder. Winutils builds are different for each Hadoop version, so download the one that matches your Hadoop version from https://github.com/steveloughran/winutils
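
Most setup problems at this point come from a missing variable or a missing PATH entry, so it can be worth checking everything in one go. The sketch below is an assumption-laden helper of my own (check_spark_setup is not part of Spark): it looks for SPARK_HOME, HADOOP_HOME, the bin folder on PATH, and, on Windows only, winutils.exe in %SPARK_HOME%\bin as described above.

```python
import os
from pathlib import Path
from typing import List, Mapping


def _norm(path: str) -> str:
    """Normalize a path so that comparisons ignore case and separators on Windows."""
    return os.path.normcase(os.path.normpath(path))


def check_spark_setup(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems found in the Spark environment variables."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    spark_bin = Path(spark_home) / "bin"
    if not spark_bin.is_dir():
        problems.append(f"{spark_bin} does not exist")
    if not env.get("HADOOP_HOME"):
        problems.append("HADOOP_HOME is not set")
    # PATH must contain the Spark bin folder, otherwise the pyspark command is not found
    path_entries = [_norm(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if _norm(str(spark_bin)) not in path_entries:
        problems.append(f"{spark_bin} is not on PATH")
    # winutils.exe is only needed on Windows
    if os.name == "nt" and not (spark_bin / "winutils.exe").is_file():
        problems.append(f"winutils.exe is missing from {spark_bin}")
    return problems
```

An empty list means the variables look consistent; otherwise each entry names one thing to fix before running the pyspark command.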

PySpark shell

Now open a new command prompt and type the pyspark command to run the PySpark shell. You should see something like the output below.

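The PySpark shell also starts a Spark context Web UI, which by default is available at http://localhost:4040. If that port is already in use, Spark tries the next one (4041, 4042, and so on), so check the shell output for the actual address.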
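
Because Spark binds the UI to port 4040 by default and moves to the next port when that one is taken, the exact URL can vary between sessions. The sketch below probes a few ports in order and returns the first one that accepts a connection. The name find_spark_ui and the number of attempts are my own choices, and an open port is only a hint: the code does not verify that the listener is actually Spark.

```python
import socket
from typing import Optional


def find_spark_ui(host: str = "localhost", first_port: int = 4040,
                  attempts: int = 5, timeout: float = 0.5) -> Optional[str]:
    """Return the URL of the first port in the range that accepts a connection."""
    for port in range(first_port, first_port + attempts):
        try:
            # A successful TCP connection means something is listening on the port
            with socket.create_connection((host, port), timeout=timeout):
                return f"http://{host}:{port}"
        except OSError:
            continue
    return None
```

Run find_spark_ui() from a second Python session while the PySpark shell is open, then paste the returned URL into your browser.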

Web UI

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application.

History Server

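The history server keeps a log of all PySpark applications you submit through spark-submit or the pyspark shell. Before you start it, set the configuration below in spark-defaults.conf. spark.eventLog.dir is where applications write their event logs and spark.history.fs.logDirectory is where the history server reads them, so both should point to the same directory, and that directory must already exist.


spark.eventLog.enabled true
spark.eventLog.dir file:///c:/logs/path
spark.history.fs.logDirectory file:///c:/logs/path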
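
The history server stops at startup with a FileNotFoundException when its log directory does not exist, so it helps to create the directory and generate the matching lines together. This is a minimal sketch under my own assumptions: history_server_conf is a hypothetical helper, and it emits spark.eventLog.dir alongside spark.history.fs.logDirectory so that applications write their logs where the history server reads them.

```python
from pathlib import Path


def history_server_conf(log_dir: str) -> str:
    """Create log_dir if needed and return matching spark-defaults.conf lines."""
    path = Path(log_dir)
    # The history server fails at startup if this directory does not exist
    path.mkdir(parents=True, exist_ok=True)
    # as_uri() turns an absolute path into a file:/// URI
    uri = path.resolve().as_uri()
    return "\n".join([
        "spark.eventLog.enabled true",
        f"spark.eventLog.dir {uri}",
        f"spark.history.fs.logDirectory {uri}",
    ])
```

Printing history_server_conf(r"C:\logs\path") gives three lines you can paste into spark-defaults.conf. Make sure none of them starts with #, because such lines are treated as comments.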

Now, start the history server on Linux or Mac by running:


$SPARK_HOME/sbin/start-history-server.sh

If you are running PySpark on Windows, start the history server by running the command below.


%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer

By default, the history server listens on port 18080, and you can access it from a browser at http://localhost:18080/

By clicking on each App ID, you will get the details of the application in PySpark web UI.
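
The same list of applications is also exposed as JSON through the history server's REST API under /api/v1/applications, which is handy if you want to check your jobs from a script instead of the browser. The helper names below are my own, and the sketch assumes the default host and port mentioned above.

```python
import json
from typing import List
from urllib.request import urlopen


def applications_url(host: str = "localhost", port: int = 18080) -> str:
    """Build the URL of the history server endpoint that lists applications."""
    return f"http://{host}:{port}/api/v1/applications"


def summarize(payload: str) -> List[str]:
    """Turn the JSON text returned by the endpoint into 'id  name' lines."""
    return [f"{app['id']}  {app['name']}" for app in json.loads(payload)]


def list_applications(host: str = "localhost", port: int = 18080) -> List[str]:
    """Fetch and summarize the applications known to a running history server."""
    with urlopen(applications_url(host, port), timeout=5) as response:
        return summarize(response.read().decode("utf-8"))
```

With the history server running, list_applications() returns one line per App ID, matching the entries shown on the history server page.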

Conclusion

In summary, you have learned how to install PySpark on Windows, run the pyspark shell, and monitor your applications using the Web UI and the history server.

If you have any issues setting up, please leave a message in the comments section and I will try to respond with a solution.

Happy Learning !!

Naveen (NNK)

This Post Has 6 Comments

  1. Basha

    ‘pyspark’ is not recognized as an internal or external command, operable program or batch file– please provide the solution for this error. It will be useful for others as well

    1. NNK

      Hi Basha, are you using Windows or Linux? Regardless, you need to set the following environment variables. Let me know if you don’t know how to set them.
      SPARK_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
      HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
      PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin

      Change the path according to your setup.

  3. Sindhu

    I have set the variables as shown below in spark-defaults.conf, yet I’m getting the exception “file not found /tmp/spark-events”. Please let me know what I am doing wrong.
    # spark.eventLog.enabled true
    # spark.eventLog.dir C://logs/path
    # spark.history.fs.logDirectory C://logs/path
    ```
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    21/09/16 13:54:59 INFO HistoryServer: Started daemon with process name: [email protected]
    21/09/16 13:54:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    21/09/16 13:55:00 INFO SecurityManager: Changing view acls to: sindhu
    21/09/16 13:55:00 INFO SecurityManager: Changing modify acls to: sindhu
    21/09/16 13:55:00 INFO SecurityManager: Changing view acls groups to:
    21/09/16 13:55:00 INFO SecurityManager: Changing modify acls groups to:
    21/09/16 13:55:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sindhu); groups with view permissions: Set(); users with modify permissions: Set(sindhu); groups with modify permissions: Set()
    21/09/16 13:55:00 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
    21/09/16 13:55:00 INFO Utils: Successfully started service on port 18080.
    21/09/16 13:55:00 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://LAPTOP-ECK2KRNH.mshome.net:18080
    Exception in thread "main" java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
    at org.apache.spark.deploy.history.FsHistoryProvider.startPolling(FsHistoryProvider.scala:280)
    at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:228)
    at org.apache.spark.deploy.history.FsHistoryProvider.start(FsHistoryProvider.scala:410)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:303)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    Caused by: java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.spark.deploy.history.FsHistoryProvider.startPolling(FsHistoryProvider.scala:270)
    ```

    1. NNK

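      Hi Sindhu, why do you have # in front of the properties? Lines that start with # are comments, so Spark ignores them and falls back to the default /tmp/spark-events directory. Remove the # and try again.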
