PySpark “ImportError: No module named py4j.java_gateway” Error

  • Post category:PySpark
  • Post last modified:April 16, 2021

Problem: After successfully installing PySpark on Linux, I ran some PySpark commands and got the error “ImportError: No module named py4j.java_gateway”. I spent some time understanding what the Py4J module is and how to resolve the issue, and I would like to share it here.

ImportError: No module named py4j.java_gateway

Solution: Resolve ImportError: No module named py4j.java_gateway

To resolve the “ImportError: No module named py4j.java_gateway” error, first understand what the py4j module is. Spark is written in Scala; later, as industry adoption grew, its Python API, PySpark, was released, built on Py4J.

Py4J is a library integrated within PySpark that allows Python to dynamically interface with JVM objects, so it is a mandatory module for running a PySpark application. It is located in the $SPARK_HOME/python/lib directory as a py4j-*-src.zip file.

After installing Spark, you need to add the Py4J module to the PYTHONPATH environment variable in order to run a PySpark application. If it is not on PYTHONPATH, you get the ImportError: No module named py4j.java_gateway error.

export SPARK_HOME=/Users/prabha/apps/spark-3.0.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH

Put these lines in your .bashrc file and reload it by running source ~/.bashrc
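After reloading, one quick check is whether every PYTHONPATH entry actually exists on disk, since a mistyped py4j path fails silently until the import. The following is a minimal sketch, not part of PySpark; the helper name check_pythonpath is made up for illustration.

```python
import os


def check_pythonpath(pythonpath=None):
    """Split a PYTHONPATH string and report which entries exist on disk.

    Hypothetical helper for illustration. Returns (existing, missing).
    """
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    # Entries are separated by ':' on Linux/macOS and ';' on Windows.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    existing = [p for p in entries if os.path.exists(p)]
    missing = [p for p in entries if not os.path.exists(p)]
    return existing, missing


if __name__ == "__main__":
    existing, missing = check_pythonpath()
    for path in existing:
        print("ok     ", path)
    for path in missing:
        print("MISSING", path)
```

If the py4j entry shows up as MISSING, the file name in the export most likely does not match the file shipped under $SPARK_HOME/python/lib.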

The py4j version changes with the PySpark version you are using. To pick up whichever version is present under $SPARK_HOME/python/lib without hard-coding it, use the following.

export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
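The same lookup can be done from inside a Python script when changing the shell environment is not convenient. This is a rough sketch under the assumption that SPARK_HOME points at a standard Spark distribution; the function names spark_python_paths and add_spark_to_sys_path are invented for this example and are not a PySpark API.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the Spark python dir plus any bundled py4j source zips.

    Hypothetical helper; mirrors the shell glob py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + zips


def add_spark_to_sys_path(spark_home):
    """Prepend the paths above to sys.path, skipping ones already present."""
    new = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = new  # prepend, keeping the order of `new`
    return new


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if home:
        print(add_spark_to_sys_path(home))
    else:
        print("SPARK_HOME is not set")
```

Call add_spark_to_sys_path before the first import of pyspark. If the returned list contains no py4j zip, the glob matched nothing and the import will still fail.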

To find where PySpark is installed, use the pip show command.

pip show pyspark
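The interpreter itself can answer the same question, which helps when pip and python belong to different environments. Below is a small sketch using the standard importlib module; the helper name package_dir is hypothetical.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory a top-level package would be imported from.

    Hypothetical helper. Returns None when the current interpreter
    cannot find the package at all.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    for name in ("pyspark", "py4j"):
        print(name, "->", package_dir(name))
```

A result of None for py4j means this interpreter would raise the ImportError, regardless of what pip show reports.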

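On Windows, set the equivalent environment variables to resolve the ImportError: No module named py4j.java_gateway error. The py4j file name below assumes the Spark 3.0.0 distribution; use the name that is actually present in %SPARK_HOME%\python\lib.

set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%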


In short, to resolve the ImportError: No module named py4j.java_gateway error in PySpark, add the py4j module from the ${SPARK_HOME}/python/lib directory to the PYTHONPATH environment variable.

Happy Learning !!

NNK is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
