How to Import PySpark in Python Script

Let's see how to import the PySpark library in a Python script or use it in a shell. Sometimes, even after successfully installing Spark on Linux/Windows/Mac, you may have issues importing PySpark libraries in Python. Below, I have explained some possible ways to resolve these import issues.

You should either use the spark-submit command to run a PySpark (Spark with Python) application or use the PySpark shell to run interactive commands for testing.

Note: Do not use the plain Python shell or the python command to run a PySpark program.

1. Check the PySpark Installation is Right

Sometimes you may have issues with the PySpark installation itself, which leads to errors while importing the libraries in Python. After a successful installation of PySpark, use the PySpark shell, which is a REPL (read–eval–print loop) that starts an interactive session where you can test or run individual PySpark commands. This is mostly used to quickly test a few commands during development.

(Screenshot: importing PySpark in the PySpark shell)
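
Once the shell starts, a quick sanity check like the one below confirms that the session is working. The spark variable is pre-created by the PySpark shell, so no imports are needed; the small range job is just an illustrative example.


# Inside the PySpark shell, `spark` (a SparkSession) is already available
print(spark.version)       # prints the Spark version the shell is running
spark.range(5).show()      # runs a tiny job to confirm the session works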

The following examples demonstrate how to fix the issue below, as well as other problems with importing the PySpark library.


ModuleNotFoundError: No module named 'pyspark'
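
Before applying any of the fixes below, it can help to confirm whether the Python interpreter you are running can see the pyspark package at all. The following diagnostic is a minimal sketch using only the standard library; it makes no assumptions about where Spark is installed.


# Check whether the current Python interpreter can locate the pyspark module
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark is not importable from this Python environment")
else:
    print("pyspark found at:", spec.origin)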

2. Import PySpark in Python Using findspark

Even after a successful PySpark installation, you may have issues importing pyspark in Python. You can resolve this by installing and importing findspark. In case you are not sure what it is: findspark searches for the pyspark installation on the server and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules.

First, install findspark using the pip command.


pip install findspark 

After a successful installation, import it in your Python program or shell to validate the PySpark imports. Run the below commands in sequence.


import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
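
If findspark cannot locate Spark automatically, findspark.init() also accepts an explicit Spark home path. The path below is only a placeholder; replace it with your actual installation directory.


import findspark

# Point findspark at a specific Spark installation (placeholder path)
findspark.init("/path/to/spark-home")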

If for any reason you can't install findspark, you can still resolve the issue by manually setting the environment variables, as explained in the next section.

3. Setting PySpark Environment Variables

To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.


pip show pyspark

Now set SPARK_HOME & PYTHONPATH according to your installation. I run my PySpark programs on Linux, Mac, and Windows, so below are the configurations I use for each. After setting these, you should no longer see No module named 'pyspark' while importing PySpark in Python.

3.1 Linux (Ubuntu)


export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Put these in your ~/.bashrc file (adjusting the paths to your own Spark installation directory) and reload the file using source ~/.bashrc

3.2 Mac OS

On Mac, I have Spark version 2.4.0 installed via Homebrew, hence the variables below.


export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Put these in your ~/.bashrc file and reload the file using source ~/.bashrc

3.3 Windows PySpark environment

For my Windows environment, I have Spark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows environment variables screen.


set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%

If you have a different Spark version, use the version accordingly.
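
If you prefer not to touch shell or system configuration at all, a similar effect can be achieved from inside the Python script itself by setting the variables at runtime before importing pyspark; this is essentially what findspark does for you. The sketch below is a minimal illustration, and the SPARK_HOME path is a placeholder you would replace with your own installation directory.


import glob
import os
import sys

# Placeholder path -- replace with your actual Spark installation directory
os.environ["SPARK_HOME"] = "/opt/spark-3.0.0-bin-hadoop2.7"

# Make the PySpark and py4j packages bundled with Spark importable
spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))
sys.path[:0] = [spark_python] + py4j_zips

import pyspark  # should now import without ModuleNotFoundError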

Conclusion

In summary, you have learned how to import PySpark libraries in Jupyter, the shell, or a script, either by setting the right environment variables or by installing and using the findspark module.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences working with data. Follow Naveen @ LinkedIn and Medium