How do you resolve the No module named ‘pyspark’ error in a Jupyter notebook or any Python editor? In Python, when you try to import the PySpark library without installing it or without properly setting the environment variables, you get the following error.
ModuleNotFoundError: No module named 'pyspark'
1. Install PySpark to resolve No module named ‘pyspark’ Error
Note that PySpark doesn’t come with the Python installation, so it is not available by default. To use it, you first need to install PySpark using the pip or conda (if you are using Anaconda) command.
$ pip install pyspark
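If you are on Anaconda, PySpark is also available from the conda-forge channel; a command along these lines should work, assuming conda is on your PATH:

$ conda install -c conda-forge pyspark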
Even after successfully installing Spark/PySpark on Linux, Windows, or Mac, you may still have issues importing PySpark libraries in Python. Below I have explained some possible ways to resolve the import issues.
Note: Do not use the Python shell or the python command to run a PySpark program.
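Instead, a standalone PySpark script is normally submitted with the spark-submit command that ships with Spark. As a rough sketch (app.py is just a hypothetical script name, and this assumes $SPARK_HOME/bin is on your PATH):

spark-submit app.py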
2. Using findspark
Even after installing PySpark, if you are still getting “No module named pyspark” in Python, it could be due to environment variable issues. You can solve this by installing and importing findspark. The findspark library searches for the PySpark installation on the machine and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules. In order to use it, first you need to install findspark using the pip command.
pip install findspark
Now run the below commands in sequence in a Jupyter Notebook or in a Python script.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
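If findspark.init() cannot locate Spark on its own, you can also pass the installation path explicitly; the path below is only a placeholder for your own installation directory:

import findspark
findspark.init("/path/to/spark")  # point findspark at your Spark installation directory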
3. Setting Environment Variables
To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.
pip show pyspark
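Among other details, pip show pyspark prints a Location field with the site-packages directory where the package was installed. The (trimmed) output looks roughly like this; the version and path are only illustrative and will differ on your machine:

Name: pyspark
Version: 3.0.0
Location: /usr/local/lib/python3.7/site-packages
Requires: py4j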
Now set SPARK_HOME and PYTHONPATH according to your installation. For my articles, I run my PySpark programs on Linux, Mac, and Windows, hence I will show the configurations I have for each. After setting these, you should not see “No module named pyspark” while importing PySpark in Python.
3.1 Linux (Ubuntu)
export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
Put these in the .bashrc file and reload it by running source ~/.bashrc.
3.2 Mac OS
On Mac, I have the Spark 2.4.0 version, hence the variables below.
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
Put these in the .bashrc file and reload it by running source ~/.bashrc.
3.3 Windows PySpark environment
For my Windows environment, I have PySpark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows environment variables screen.
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%/python;%SPARK_HOME%/python/lib/py4j-0.10.9-src.zip;%PYTHONPATH%
If you have a different Spark version, use the version accordingly.
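Whichever platform you are on, a quick way to verify the setup is to open a new terminal (so the variables are picked up) and import PySpark directly from the command line:

python -c "import pyspark; print(pyspark.__version__)"

If this prints the Spark version instead of ModuleNotFoundError, the environment variables are set correctly.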
Conclusion
In summary, you can resolve the No module named pyspark error when importing PySpark modules/libraries in a Python shell or script either by setting the right environment variables or by installing and using the findspark module.
Happy Learning !!