How to Import PySpark in Python Script

Let's see how to import the PySpark library in a Python script or use it in a shell. Sometimes, even after successfully installing Spark on Linux/Windows/Mac, you may have issues importing PySpark libraries in Python. Below, I have explained some possible ways to resolve these import issues.

You should either use the spark-submit command to run a PySpark (Spark with Python) application or use the PySpark shell to run interactive commands for testing.

Note: Do not use the plain Python shell or the python command to run a PySpark program.

1. Check the PySpark Installation is Right

Sometimes you may have issues with the PySpark installation itself, which leads to errors while importing the libraries in Python. After a successful installation of PySpark, use the PySpark shell, which is a REPL (read–eval–print loop) that starts an interactive session where you can test or run individual PySpark commands. This is mostly used to quickly test a few commands during development.

(Screenshot: importing PySpark in the PySpark shell)
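
Once the shell starts, a quick sanity check like the one below confirms that the session is working. The spark variable is pre-created by the PySpark shell, so no imports are needed; the small range job is just an illustrative example.


# Inside the PySpark shell, `spark` (a SparkSession) is already available
print(spark.version)       # prints the Spark version the shell is running
spark.range(5).show()      # runs a tiny job to confirm the session works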

The following examples demonstrate how to fix the issue below, as well as other problems with importing the PySpark library.


ModuleNotFoundError: No module named 'pyspark'
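
Before applying any of the fixes below, it can help to confirm whether the Python interpreter you are running can see the pyspark package at all. The following diagnostic is a minimal sketch using only the standard library; it makes no assumptions about where Spark is installed.


# Check whether the current Python interpreter can locate the pyspark module
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark is not importable from this Python environment")
else:
    print("pyspark found at:", spec.origin)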

2. Import PySpark in Python Using findspark

Even after a successful PySpark installation, you may have issues importing pyspark in Python. You can resolve this by installing and importing findspark. In case you are not sure what it is: findspark searches for the pyspark installation on the server and adds the PySpark installation path to sys.path at runtime so that you can import PySpark modules.

First, install findspark using the pip command.


pip install findspark 

After a successful installation, import it in your Python program or shell to validate the PySpark imports. Run the below commands in sequence.


import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
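
If findspark cannot locate Spark automatically, findspark.init() also accepts an explicit Spark home path. The path below is only a placeholder; replace it with your actual installation directory.


import findspark

# Point findspark at a specific Spark installation (placeholder path)
findspark.init("/path/to/spark-home")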

If for any reason you can't install findspark, you can still resolve the issue by manually setting the environment variables, as explained in the next section.

3. Setting PySpark Environment Variables

To set the PySpark environment variables, first get the PySpark installation directory path by running the pip show command.


pip show pyspark

Now set SPARK_HOME & PYTHONPATH according to your installation. I run my PySpark programs on Linux, Mac, and Windows, so below are the configurations I use for each. After setting these, you should no longer see No module named 'pyspark' while importing PySpark in Python.

3.1 Linux (Ubuntu)


export SPARK_HOME=/Users/prabha/apps/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Put these in your ~/.bashrc file (adjusting the paths to your own Spark installation directory) and reload the file using source ~/.bashrc

3.2 Mac OS

On Mac, I have Spark version 2.4.0 installed via Homebrew, hence the variables below.


export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Put these in your ~/.bashrc file and reload the file using source ~/.bashrc

3.3 Windows PySpark environment

For my Windows environment, I have Spark version spark-3.0.0-bin-hadoop2.7, so below are my environment variables. Set these on the Windows environment variables screen.


set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%

If you have a different Spark version, use the version accordingly.
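
If you prefer not to touch shell or system configuration at all, a similar effect can be achieved from inside the Python script itself by setting the variables at runtime before importing pyspark; this is essentially what findspark does for you. The sketch below is a minimal illustration, and the SPARK_HOME path is a placeholder you would replace with your own installation directory.


import glob
import os
import sys

# Placeholder path -- replace with your actual Spark installation directory
os.environ["SPARK_HOME"] = "/opt/spark-3.0.0-bin-hadoop2.7"

# Make the PySpark and py4j packages bundled with Spark importable
spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))
sys.path[:0] = [spark_python] + py4j_zips

import pyspark  # should now import without ModuleNotFoundError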

Conclusion

In summary, you have learned how to import PySpark libraries in Jupyter, the shell, or a script, either by setting the right environment variables or by installing and using the findspark module.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences working with data. Follow Naveen @ LinkedIn and Medium