To resolve the "No module named 'pyspark'" error in Jupyter Notebook or any other Python editor, make sure PySpark is installed correctly and the required environment variables are configured. This error occurs when you try to import the PySpark library without a proper installation or environment setup.
ModuleNotFoundError: No module named 'pyspark'
1. Install PySpark to resolve the "No module named 'pyspark'" error
Note that PySpark is not bundled with the standard Python installation, so it is not available by default. Before you can use it, install it with pip or with conda (if you are using Anaconda).
# Install pyspark
$ pip install pyspark
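If you are on Anaconda/Miniconda, the equivalent conda-based install (a suggested alternative, assuming the conda-forge channel is available in your environment) is:
# Install pyspark with conda
conda install -c conda-forge pyspark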
Even after a successful installation of Spark/PySpark on Linux, Windows, or macOS, you might still run into problems importing PySpark libraries in Python. Below, I've outlined several potential solutions to these import issues.
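As a quick sanity check (an optional step, not required by the solutions below), confirm that the interpreter you are actually running can import PySpark; the printed version should match the one you installed:
# Verify pyspark is importable from the current interpreter
python -c "import pyspark; print(pyspark.__version__)"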
2. Using findspark
If you still get "No module named 'pyspark'" in Python even after installing PySpark, it is usually caused by environment variable issues. You can solve this by installing and importing findspark.
You need findspark to help Python locate and integrate PySpark into your Python environment. Without it, Python may not recognize the PySpark installation directory and therefore cannot import the PySpark libraries. findspark bridges this gap by dynamically adding PySpark to the Python path, allowing you to import PySpark libraries seamlessly.
To use it, first install findspark with the pip command.
# Install findspark
pip install findspark
Now run the commands below, in sequence, in a Jupyter Notebook or in a Python script. The findspark.init() function from the findspark library initializes the PySpark environment by locating the Spark installation directory and adding it to the Python path. This allows Python to recognize and import PySpark libraries without explicitly setting the PYTHONPATH environment variable or modifying system configurations.
# Import findspark
import findspark
findspark.init()
# Import PySpark and create a SparkSession to verify the setup
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
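If findspark cannot locate Spark automatically, findspark.init() also accepts the Spark installation directory as an argument. The path below is only an example and should be replaced with your own Spark home:
# Point findspark at a specific Spark installation (example path, adjust to yours)
import findspark
findspark.init("/opt/spark-3.5.0-bin-hadoop3")

import pyspark
print(pyspark.__version__)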
3. Setting Environment Variables
Setting environment variables resolves the "No module named 'pyspark'" error by ensuring that Python can locate the PySpark installation directory. When you install PySpark, its modules are stored in a specific directory, and Python needs to know where that directory is in order to import the modules successfully.
To find where PySpark is currently installed, run the command below.
# Get pyspark installation location
pip show pyspark
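The output includes a Location field pointing to the site-packages directory that contains the pyspark package. The values below are purely illustrative and will differ on your machine:
# Example (illustrative) output
Name: pyspark
Version: 3.5.0
Location: /usr/local/lib/python3.10/site-packages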
By setting environment variables such as SPARK_HOME and PYTHONPATH, you provide Python with the information it needs to locate the PySpark installation. This allows Python to find and import the PySpark modules without encountering the "No module named 'pyspark'" error.
Note: Change the Apache Spark/PySpark version in the paths below to match the version you have installed.
3.1 Linux on Ubuntu
Open the .bashrc file and add the following statements, then reload the file by running source ~/.bashrc.
# On Linux
export SPARK_HOME=/Users/prabha/apps/spark-3.5.0-bin-hadoop3
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
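After reloading .bashrc, you can optionally confirm that the variables are visible to the shell and that Python now resolves PySpark (a quick check, not part of the required steps):
# Verify the variables and the import
echo $SPARK_HOME
python3 -c "import pyspark; print(pyspark.__version__)"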
3.2 macOS
On macOS, add the following environment variables to your shell profile file (for example, ~/.bash_profile or ~/.zshrc), then reload the file with source.
# On Mac
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.5.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
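If Spark was installed through Homebrew, you can confirm the installation prefix to use for SPARK_HOME with the command below (assuming brew is on your PATH; on Apple Silicon the prefix is typically under /opt/homebrew rather than /usr/local/Cellar):
# Find the Homebrew install prefix for Apache Spark
brew --prefix apache-spark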
3.3 Windows
To set environment variables on Windows, follow these steps:
- Open Control Panel: Press Win + R, type control panel, and press Enter.
- Navigate to System: In Control Panel, select "System and Security", then click on "System".
- Open Advanced System Settings: On the left sidebar, click on “Advanced system settings”.
- Open Environment Variables: In the System Properties window, click on the “Environment Variables” button.
- Add or Edit User Variables: Under the “User variables” section (for your user account), click “New” to add a new environment variable, or select an existing variable and click “Edit” to modify it.
Add these to the user variables.
# On Windows (Command Prompt)
set SPARK_HOME=C:\apps\opt\spark-3.5.0-bin-hadoop3
set HADOOP_HOME=%SPARK_HOME%
# Use the py4j zip file name that matches the one under %SPARK_HOME%\python\lib
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
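Note that set only applies to the current Command Prompt session. To persist the variables for future sessions without going through the GUI, setx can be used as an alternative (new terminals must be opened for the change to take effect; the paths mirror the example above):
# Persist the variables for new sessions (run in Command Prompt)
setx SPARK_HOME "C:\apps\opt\spark-3.5.0-bin-hadoop3"
setx HADOOP_HOME "C:\apps\opt\spark-3.5.0-bin-hadoop3"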
Conclusion
In conclusion, resolving the "No module named 'pyspark'" error involves several methods, each addressing a specific issue:
- Correct Installation: Ensure PySpark is correctly installed, either through pip, conda, or by downloading and configuring it manually.
- Setting Environment Variables: Set environment variables such as SPARK_HOME and add the Spark bin directory to the PATH variable to help Python locate the PySpark installation.
- Using findspark: Utilize the findspark library to dynamically locate and integrate PySpark into the Python environment, simplifying the import process.
- Proper Configuration: Verify that the Python environment and PySpark versions are compatible, and that all necessary dependencies are installed.
Happy Learning !!