Python: No module named 'pyspark' Error

How do you resolve the No module named 'pyspark' error in Jupyter Notebook or any other Python editor? In Python, when you try to import the PySpark library without installing it or without properly setting the environment variables, you get the 'No module named pyspark' error.


ModuleNotFoundError: No module named 'pyspark'

1. Install PySpark to resolve No module named ‘pyspark’ Error

Note that PySpark doesn't come with the Python installation, so it is not available by default. To use it, you first need to install PySpark using the pip command or the conda command (if you are using Anaconda).


$ pip install pyspark
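
If you use Anaconda, the equivalent conda command is shown below as a sketch, assuming the conda-forge channel, which hosts the pyspark package.


$ conda install -c conda-forge pyspark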

Even after successfully installing Spark/PySpark on Linux/Windows/Mac, you may still have issues importing PySpark libraries in Python. Below, I have explained some possible ways to resolve the import issues.

Note: Do not use the Python shell or the python command to run PySpark programs.
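
If you have installed the full Spark distribution, a common alternative is to run your program with the spark-submit command; a minimal sketch, assuming your script is named app.py:


$ spark-submit app.py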

2. Using findspark

If you are still getting “No module named pyspark" in Python even after installing PySpark, it could be due to environment variable issues. You can solve this by installing and importing findspark.

The findspark library searches for the PySpark installation on the machine and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules. To use it, first install findspark using the pip command.


pip install findspark 

Now run the below statements in sequence in a Jupyter Notebook or a Python script.


import findspark
findspark.init()  # locate the Spark installation and add it to sys.path

import pyspark
from pyspark.sql import SparkSession

# create a local SparkSession to confirm that PySpark imports and runs
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
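
If the import succeeds, you can quickly confirm that the session works by printing the Spark version (a minimal sanity check added here for illustration).


print(spark.version)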

3. Setting Environment Variables

To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.


pip show pyspark
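
The output includes a Location field showing the site-packages directory where PySpark is installed; typical output looks like the following (the version and path here are illustrative, not from a specific machine).


Name: pyspark
Version: 3.0.0
Location: /home/user/.local/lib/python3.8/site-packages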

Now set SPARK_HOME and PYTHONPATH according to your installation. For my articles, I run PySpark programs on Linux, Mac, and Windows, so I will show the configuration I use for each. After setting these, you should not see "No module named pyspark" while importing PySpark in Python.

3.1 Linux (Ubuntu)


export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Put these in the ~/.bashrc file and reload it by running source ~/.bashrc

3.2 macOS

On Mac, I have Spark version 2.4.0, hence the variables below.


export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Put these in the ~/.bashrc file and reload it by running source ~/.bashrc

3.3 Windows PySpark environment

For my Windows environment, I have Spark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows Environment Variables screen.


set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%

If you have a different Spark version, adjust these paths to match your version accordingly.
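
Once the variables are set, you can confirm from a new terminal session that PySpark is importable (a quick check added here for illustration).


python -c "import pyspark; print(pyspark.__version__)"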

Conclusion

In summary, you can resolve the No module named 'pyspark' error by installing PySpark, setting the right environment variables, or installing and using the findspark module.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium