Let’s see how to import the PySpark library in a Python script or use it in the shell. Sometimes, even after successfully installing Spark on Linux/Windows/Mac, you may have issues importing PySpark libraries in Python. Below, I have explained some possible ways to resolve these import issues.
You should either use the spark-submit command to run your PySpark (Spark with Python) application or use the PySpark shell to run interactive commands for testing.
Note: Do not use the Python shell or the python command to run PySpark programs.
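For reference, these two entry points are typically invoked as shown below (app.py is a hypothetical script name used only for illustration).
# Submit a PySpark application (app.py is a placeholder for your script)
spark-submit app.py
# Or start the interactive PySpark shell for ad-hoc testing
pyspark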
1. Check That the PySpark Installation Is Correct
Sometimes you may have issues with the PySpark installation itself, and hence you will get errors while importing libraries in Python. After a successful installation of PySpark, use the PySpark shell, which is a REPL (read–eval–print loop), to start an interactive session and test/run a few individual PySpark commands. This is mostly used to quickly test commands during development, as shown below.
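A quick sanity check (assuming the pyspark launcher is on your PATH) could look like the following; the range/count call is just an arbitrary command to confirm the shell works.
pyspark
>>> spark.range(5).count()   # the shell pre-creates the `spark` SparkSession; returns 5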
The following examples demonstrate how to fix the error below, as well as other issues with importing the PySpark library.
ModuleNotFoundError: No module named 'pyspark'
2. Import PySpark in Python Using findspark
Even after a successful PySpark installation, you may have issues importing pyspark in Python. You can resolve this by installing and importing findspark. In case you are not sure what it is, findspark searches for the pyspark installation on the server and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules.
First, install findspark using the pip command.
pip install findspark
After a successful installation, import it in your Python program or shell to validate the PySpark imports. Run the below commands in sequence.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
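To confirm that the imports now work, you can run a quick check against the session created above; this is just a minimal sanity test.
print(spark.version)     # prints the Spark version in use
spark.range(3).show()    # displays a tiny DataFrame with ids 0, 1, 2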
In case, for any reason, you can’t install findspark, you can resolve the issue in other ways: by adding the paths in code as sketched below, or by manually setting the environment variables described in the next section.
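For completeness, this is roughly what findspark does under the hood. The Spark location /opt/spark is a hypothetical path and must be replaced with your actual installation directory; the py4j zip name also depends on your Spark version, so it is matched with a glob here.
import os
import sys
import glob

# Hypothetical Spark installation directory; adjust to your environment
spark_home = "/opt/spark"
os.environ["SPARK_HOME"] = spark_home

# Add the PySpark sources and the bundled py4j zip to sys.path
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path[:0] = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))

import pyspark  # should now succeed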
3. Setting PySpark Environment Variables
To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.
pip show pyspark
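The Location field in the pip show output points to the directory containing the pyspark package. On Linux/Mac you can extract it directly, for example (assuming grep is available):
pip show pyspark | grep Location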
Now set SPARK_HOME and PYTHONPATH according to your installation. For my articles, I run my PySpark programs on Linux, Mac, and Windows, hence I will show the configurations I have for each. After setting these, you should no longer see the error No module named 'pyspark' while importing PySpark in Python.
3.1 Linux on Ubuntu
export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
Put these in the .bashrc file and re-load it by running source ~/.bashrc
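After reloading, you can optionally verify that the variable is set and that the import now succeeds (the python3 command name is an assumption; use whichever interpreter runs your PySpark programs):
echo $SPARK_HOME
python3 -c "import pyspark; print(pyspark.__version__)"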
3.2 Mac OS
On Mac, I have Spark version 2.4.0, hence the variables below.
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
Put these in the .bashrc file and re-load it by running source ~/.bashrc
3.3 Windows PySpark environment
For my Windows environment, I have PySpark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows environment variables screen.
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
If you have a different Spark version, adjust the version in these paths accordingly (note that the py4j zip file name also changes between Spark releases).
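To confirm the exact py4j zip file name bundled with your Spark distribution, you can list the lib folder from the Windows command prompt, for example:
dir "%SPARK_HOME%\python\lib"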
Conclusion
In summary, you have learned how to import PySpark libraries in Jupyter or a shell/script, either by setting the right environment variables or by installing and using the findspark module.
Happy Learning !!