Python: No module named 'pyspark' Error

To resolve the "No module named 'pyspark'" error in Jupyter Notebook or any Python editor, ensure you have correctly installed PySpark and configured the required environment variables. This error occurs when you attempt to import the PySpark library without a proper installation or environment setup.


# Error when importing PySpark without a proper installation
>>> import pyspark
ModuleNotFoundError: No module named 'pyspark'

1. Install PySpark to Resolve the "No module named 'pyspark'" Error

Note that PySpark is not bundled with the standard Python installation, so it is not available by default. To use it, you first need to install PySpark with pip, or with conda if you are using Anaconda.


# Install pyspark
$ pip install pyspark
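
If you use Anaconda, the conda equivalent is shown below; this is a sketch that assumes you install from the conda-forge channel, which hosts the package. The second command is a quick check that the interpreter you run can actually import PySpark.


# Install pyspark with conda (Anaconda/Miniconda)
$ conda install -c conda-forge pyspark

# Verify the installation from the command line
$ python -c "import pyspark; print(pyspark.__version__)"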

Even following a successful installation of Spark/PySpark on Linux, Windows, or macOS, you might encounter challenges importing PySpark libraries in Python. Below, I’ve outlined several potential solutions to address these import issues.

2. Using findspark

If you still get "No module named 'pyspark'" in Python even after installing PySpark, this is often caused by environment variable issues. You can solve it by installing and importing findspark.

You need findspark to help Python locate and integrate PySpark into your Python environment. Without it, Python may not recognize the PySpark installation directory and therefore be unable to import PySpark libraries. findspark helps bridge this gap by dynamically adding PySpark to the Python path, allowing you to import PySpark libraries seamlessly.

To use it, first install findspark with pip.


# Install findspark
$ pip install findspark

Now run the commands below in sequence in a Jupyter Notebook or a Python script. The findspark.init() function from the findspark library initializes the PySpark environment by locating the Spark installation directory and adding it to the Python path. This allows Python to recognize and import PySpark libraries without needing to explicitly set the PYTHONPATH environment variable or modify system configurations.


# Import findspark
import findspark
findspark.init()

# Import PySpark and create a local SparkSession to confirm the import works
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
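
As a quick sanity check (not part of the original steps, just an illustrative sketch), you can run a trivial job on the session created above; if both lines execute, PySpark is installed and importable.


# Verify the session by running a trivial job
print(spark.version)           # Prints the Spark version in use
print(spark.range(5).count())  # Creates a 5-row DataFrame and counts it; should print 5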

3. Setting Environment Variables

Setting environment variables resolves the "No module named 'pyspark'" error by ensuring that Python can locate the PySpark installation directory. When you install PySpark, its modules are stored in a specific directory. However, Python needs to know where this directory is located in order to import these modules successfully.

To find where PySpark is currently installed, run the command below.


# Get pyspark installation location
pip show pyspark
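
As an additional diagnostic (a minimal sketch, not part of the original steps), you can ask the interpreter directly whether it can locate the package; importlib.util.find_spec() prints None when the Python you are running cannot see PySpark.


# Ask the current Python interpreter whether it can locate pyspark
$ python -c "import importlib.util; print(importlib.util.find_spec('pyspark'))"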

By setting environment variables such as SPARK_HOME and PYTHONPATH, you provide Python with the necessary information to locate the PySpark installation. This allows Python to find and import the PySpark modules without encountering the "No module named 'pyspark'" error.

Note: Change the Apache Spark/PySpark version according to your versions.

3.1 Linux (Ubuntu)

Open the ~/.bashrc file and add the following statements, then reload the file by running source ~/.bashrc.


# On Linux
export SPARK_HOME=/home/prabha/apps/spark-3.5.0-bin-hadoop3
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

3.2 macOS

On macOS, add the following environment variables to your shell profile file (for example, ~/.zshrc for zsh or ~/.bash_profile for bash), then reload it with source.


# On Mac
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.5.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

3.3 Windows

To set environment variables on Windows, follow these steps:

  1. Open Control Panel: Press Win + R, type control panel, and press Enter.
  2. Navigate to System: In Control Panel, select “System and Security”, then click on “System”.
  3. Open Advanced System Settings: On the left sidebar, click on “Advanced system settings”.
  4. Open Environment Variables: In the System Properties window, click on the “Environment Variables” button.
  5. Add or Edit User Variables: Under the “User variables” section (for your user account), click “New” to add a new environment variable, or select an existing variable and click “Edit” to modify it.

Add the variables below as user variables. Alternatively, the set commands shown here apply the same values to the current Command Prompt session only.


# On windows
set SPARK_HOME=C:\apps\opt\spark-3.5.0-bin-hadoop3
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
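
On any platform, the following minimal check (illustrative only, not part of the original article) confirms that the variables are visible to Python and that the import now succeeds.


# Confirm SPARK_HOME is set and PySpark can be imported
import os, sys

print(os.environ.get("SPARK_HOME"))                   # Spark installation path, or None if not set
print([p for p in sys.path if "spark" in p.lower()])  # PySpark entries added via PYTHONPATH

import pyspark
print(pyspark.__version__)                            # Succeeds once the paths are correct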

Conclusion

In conclusion, resolving the "No module named 'pyspark'" error involves several methods, each addressing a specific issue:

  1. Correct Installation: Ensure PySpark is correctly installed, either through pip, conda, or by downloading and configuring it manually.
  2. Setting Environment Variables: Set environment variables such as SPARK_HOME and PYTHONPATH so that Python can locate the PySpark installation.
  3. Using findspark: Utilize the findspark library to dynamically locate and integrate PySpark into the Python environment, simplifying the import process.
  4. Proper Configuration: Verify that the Python environment and PySpark versions are compatible, and that all necessary dependencies are installed.

Happy Learning !!