How to Import PySpark in Python Script

Let’s see how to import the PySpark libraries in a Python script or use them in the shell. Sometimes, even after successfully installing Spark on Linux/Windows/Mac, you may get an error like “No module named pyspark” while importing PySpark libraries in Python. Below I have explained some possible ways to resolve this import issue.


ModuleNotFoundError: No module named 'pyspark'

You should either use the spark-submit command to run a PySpark (Spark with Python) application, or use the PySpark shell to run interactive commands for testing.

Note: Do not use the plain Python shell or the python command to run a PySpark program.
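For example, either of the following works from a terminal (my_app.py is a hypothetical script name used only for illustration):


# Start the interactive PySpark shell
pyspark

# Or submit a standalone PySpark script
spark-submit my_app.py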

1. Make sure PySpark Installation is Right

Sometimes you may have issues with the PySpark installation itself, hence you will get errors while importing libraries in Python. After a successful installation of PySpark, use the PySpark shell, which is a REPL (read–eval–print loop), to start an interactive session and test/run a few individual PySpark commands. This is mostly used to quickly test some commands during development.

(Screenshot: importing PySpark in the PySpark shell)
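As a quick sanity check, you can run a couple of commands inside the PySpark shell; the spark (SparkSession) and sc (SparkContext) objects are already created for you when the shell starts:


# Run these inside the PySpark shell
spark.version            # shows the Spark version you installed
spark.range(5).count()   # returns 5 if the session is working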

2. Import PySpark in Python Using findspark

The simplest way to resolve “No module named pyspark” in Python is to install and import findspark (https://github.com/minrk/findspark). In case you are not sure what it is, findspark searches for the pyspark installation on the server and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules.

First, install findspark using the pip command.


pip install findspark 

Post successful installation, import it in a Python program or shell to validate the PySpark imports. Run the below commands in sequence.


# Locate the PySpark installation and add it to sys.path
import findspark
findspark.init()

# Now the PySpark modules can be imported
import pyspark
from pyspark.sql import SparkSession

# Create a local SparkSession to verify that the import works
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()

If for any reason you can’t install findspark, you can resolve the issue in another way by manually setting the environment variables.

3. Setting PySpark Environment Variables

To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.


pip show pyspark
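The output looks roughly like the following (the Version and Location values here are only illustrative and will differ on your machine); the Location field tells you where pip installed the package:


Name: pyspark
Version: 3.0.0
Location: /home/prabha/.local/lib/python3.7/site-packages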

Now set SPARK_HOME & PYTHONPATH according to your installation. For my articles, I run my PySpark programs on Linux, Mac and Windows, hence I will show the configurations I use for each. After setting these, you should no longer see “No module named pyspark” while importing PySpark in Python.

3.1 Linux (Ubuntu)


export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Add these to your ~/.bashrc file and reload it by running source ~/.bashrc.

3.2 Mac OS

On Mac, I have Spark version 2.4.0, hence the below variables.


export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Add these to your ~/.bashrc file and reload it by running source ~/.bashrc.

3.3 Windows PySpark environment

For my Windows environment, I have PySpark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows Environment Variables screen.


set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%

If you have a different Spark version, adjust the paths and the py4j version accordingly.
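Once the variables are set, open a new terminal (or Command Prompt) so they take effect, then verify that the import error is gone by importing PySpark from the plain Python interpreter:


python -c "import pyspark; print(pyspark.__version__)"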

Conclusion

In summary, you can resolve the “No module named pyspark” error when importing PySpark modules/libraries (in the shell or a script) either by setting the right environment variables or by installing and using the findspark module.

Happy Learning !!
