How do you resolve the Python "No module named ‘findspark’" error in a Jupyter notebook or any other Python editor while working with PySpark? In Python, you get this error when you try to import the findspark module without installing it or without properly setting the environment variables. The findspark module doesn't come with the Python or PySpark installation, as it is not part of the Python or Spark distribution, so you need to install it and set the appropriate environment variables before using it.
>>> import findspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'findspark'
The error message No module named ‘findspark’ means that the Python interpreter is unable to find the findspark module, which is used to locate your Apache Spark installation so that PySpark can be imported and used from Python.
Here are some steps you can take to resolve this error:
1. Using Pip to Install findspark Module
To resolve the No module named ‘findspark’ error, check whether you have installed the findspark module; if not, install it using pip.
- Open your command prompt or terminal and run pip show findspark to check whether the module is installed.
- If the module is not installed, install it using pip by running pip install findspark.
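To quickly confirm the installation from Python itself, you can run a small check like the one below. This is just an illustrative sanity check, not part of the original steps: it prints where findspark was found, or reports that it is still missing.
# Sanity check (illustrative): confirm findspark is importable in the
# same Python environment that runs your script or notebook kernel
try:
    import findspark
    print("findspark found at:", findspark.__file__)
except ImportError:
    print("findspark is not installed; run: pip install findspark")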
2. Check Spark Installation Path
Check whether you have set the correct path for Spark. If you have installed Spark on your system, make sure the path points to that installation. To resolve the error No module named ‘findspark’, set the path using the findspark.init() function, as follows:
# Import findspark module
import findspark
findspark.init("/path/to/spark")
Replace “/path/to/spark” with the actual path where Spark is installed on your system.
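As a rough end-to-end sketch (the Spark path and the application name below are placeholders, not values from this article), a typical notebook cell calls findspark.init() before importing PySpark and then creates a SparkSession:
# Illustrative sketch: initialize findspark first, then import PySpark
import findspark
findspark.init("/path/to/spark")  # must run before importing pyspark

from pyspark.sql import SparkSession

# Placeholder app name; replace with your own
spark = SparkSession.builder.appName("findspark-check").getOrCreate()
print(spark.version)  # prints the Spark version if everything is wired up correctly
spark.stop()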
3. Setting Environment Variables
Check if you have set the SPARK_HOME environment variable. After installing Spark/PySpark, you need to set the SPARK_HOME environment variable to the Spark installation directory. Use pip show pyspark to find the PySpark installation location and use that value to set SPARK_HOME.
# Show pyspark installation path
pip show pyspark
On Linux/macOS:
# On Linux/MacOS
export SPARK_HOME=/path/to/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
Replace “/path/to/spark” with the actual path where Spark is installed on your system. Put these exports in your ~/.bashrc file and reload it by running source ~/.bashrc.
On Windows:
For my Windows environment, I have Spark installed as spark-3.0.0-bin-hadoop2.7, so below are my environment variables.
:: On Windows
set SPARK_HOME=C:\path\to\spark
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
Replace “C:\path\to\spark” with the actual path where Spark is installed on your system.
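Once SPARK_HOME is set, findspark can usually pick it up without an explicit path. The short snippet below is an illustrative check (not from the original article) that reads the variable and lets findspark.init() resolve Spark from it:
# Illustrative check: findspark.init() without arguments falls back to SPARK_HOME
import os
import findspark

print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
findspark.init()  # resolves Spark from the SPARK_HOME environment variable

import pyspark
print("PySpark imported from:", pyspark.__file__)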
4. Other methods to Fix “No module named ‘findspark‘”
If none of the above steps resolves the error No module named ‘findspark’, try uninstalling and reinstalling the findspark module using pip.
- Uninstall the module using the command pip uninstall findspark.
- Then reinstall it using the command pip install findspark.
Hopefully, one of these steps will help you resolve the error.
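If you are reinstalling from inside a Jupyter notebook, keep in mind that the pip on your PATH may belong to a different Python than the notebook kernel. One cautious option, shown as a sketch below (an assumption about your setup, adjust as needed), is to run the uninstall/reinstall through sys.executable so pip targets the exact interpreter the notebook uses.
# Run pip against the same interpreter executing this notebook/script,
# so the reinstall targets the environment where the import actually fails
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "uninstall", "-y", "findspark"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "findspark"])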
5. Conclusion
The “No module named ‘findspark'” error occurs when the findspark module is not installed or cannot be found in the Python environment you are using for PySpark. To resolve this error, install the findspark module using pip and point it to the Spark installation directory, either by passing the path to the findspark.init() function or by setting the SPARK_HOME environment variable.
Related Articles
- Python: No module named ‘pyspark’ Error
- Spark Exception: Python in worker has different version 3.4 than that in driver 2.7, PySpark cannot run with different minor versions
- Spark Session configuration in PySpark.
- PySpark Create DataFrame from List
- Python json.loads() Method with Examples
- PySpark Write to CSV File