In this article, I will explain how to setup and run the PySpark application on the Spyder IDE. Spyder IDE is a popular tool to write and run Python applications and you can use this tool to run PySpark application during the development phase.
Install Java 8 or later version
PySpark uses Py4J library which is a Java library that integrates python to dynamically interface with JVM objects when running the PySpark application. Hence, you would need Java to be installed. Download the Java 8 or later version from Oracle and install it on your system.
Post installation, set
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201 PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
Install Apache Spark
Download Apache spark by accessing Spark Download page and select the link from “Download Spark (point 3)”. If you wanted to use a different version of Spark & Hadoop, select the one you wanted from drop downs and the link on point 3 changes to the selected version and provides you with an updated link to download.
After download, untar the binary using 7zip and copy the underlying folder
Now set the following environment variables.
SPARK_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7 HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7 PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
Download wunutils.exe file from winutils, and copy it to %SPARK_HOME%\bin folder. Winutils are different for each Hadoop version hence download the right version from https://github.com/steveloughran/winutils
Now open command prompt and type
pyspark command to run PySpark shell. You should see something like below.
Run PySpark application from Spyder IDE
To write PySpark applications, you would need an IDE, there are 10’s of IDE to work with and I choose to use Spyder IDE. If you have not installed Spyder IDE along with Anaconda distribution, install these before you proceed.
Now, set the following environment variable.
PYTHONPATH => %SPARK_HOME%/python;$SPARK_HOME/python/lib/py4j-0.10.9-src.zip;%PYTHONPATH%
Now open Spyder IDE and create a new file with below simple PySpark program and run it. You should see 5 in output.
Happy Learning !!