Debug Spark Application Locally or Remote

We often need to debug a Spark application or job to inspect values at runtime in order to fix issues. We typically use the IntelliJ IDEA or Eclipse IDE to debug locally or remotely running applications written in Scala or Java.


In this article, I will explain how to debug a Spark application running locally and remotely using the IntelliJ IDEA IDE.

Before you proceed with this article, install and set up Spark to run both locally and on a remote server, and have your IntelliJ IDEA IDE set up to run Spark applications.

1. Debug Spark application running Locally

To debug a Scala or Java application, you need to run it with the JVM option -agentlib:jdwp, which loads the Java Debug Wire Protocol (JDWP) agent, followed by a comma-separated list of sub-options (transport, server, suspend, and address).


// Debug Spark application running locally
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
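
For a plain Scala or Java application (outside spark-submit), you pass this option directly to the java command. A minimal sketch; the class and jar names are illustrative and match the spark-submit example below:


// Attach the JDWP agent to a plain JVM application (illustrative names)
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  -cp spark-by-examples.jar org.sparkbyexamples.SparkWordCountExample

Because suspend=y, the JVM pauses at startup and waits for a debugger to attach on port 5005; with suspend=n it would start immediately and accept a debugger at any later point.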

But to run with spark-submit, you need to pass -agentlib:jdwp through --conf spark.driver.extraJavaOptions, as shown below.


spark-submit \
  --name SparkByExamples.com \
  --class org.sparkbyexamples.SparkWordCountExample \
  --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  spark-by-examples.jar

When you run the above command, it prints the message below and your application pauses, waiting for a debugger to attach.


Listening for transport dt_socket at address: 5005

Now, open the IntelliJ IDEA editor and do the following.

  • Open the Spark project you want to debug.
  • Add some debugging breakpoints to the Scala classes, for example in a word-count job like the sketch below.
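
Here is a minimal sketch of what such a class might look like (an assumed shape for org.sparkbyexamples.SparkWordCountExample with a hypothetical input path, not the exact source):


// Minimal Spark word count in Scala; the marked lines are good breakpoint spots
package org.sparkbyexamples

import org.apache.spark.sql.SparkSession

object SparkWordCountExample {
  def main(args: Array[String]): Unit = {
    // master is supplied by spark-submit
    val spark = SparkSession.builder()
      .appName("SparkByExamples.com")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("data/input.txt")      // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))          // breakpoint here fires in local mode
      .reduceByKey(_ + _)

    counts.collect().foreach(println)  // driver-side breakpoint candidate
    spark.stop()
  }
}

Keep in mind that breakpoints inside transformations such as map run on the executors; in local mode the executors share the driver JVM, so they are hit, but on a cluster only driver-side breakpoints stop with this setup.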

Then, follow the steps below to create a Remote debug configuration and start debugging.

  • Open the Spark application you want to debug in the IntelliJ IDEA IDE.
  • Go to Run -> Edit Configurations; this brings up the Run/Debug Configurations window.
  • Select the + sign in the top-left corner and select the Remote option.
  • Enter a name for the debugger in the Name field, for example, SparkLocalDebug.
  • For the Debugger mode option, select Attach to local JVM.
  • For Transport, select Socket (selected by default).
  • For Host, enter localhost since we are debugging locally, and enter the port number for Port. For our example, we are using 5005.
  • Finally, select OK. This only creates the debug configuration; it does not start it.
[Image: Spark debug locally with IntelliJ]

To start debugging, select Run -> Debug SparkLocalDebug. This attempts to attach to the application on port 5005.

Now you should see your spark-submit application resume, and when it encounters a debug breakpoint, control transfers to IntelliJ.

Use the debug control keys or options to step through the application. If you are not sure how to step through, follow this IntelliJ step-through article.

If your Spark application is not listening on port 5005 on localhost, attaching fails with the error message below.


Error running 'SparkLocalDebug': Unable to open debugger port (localhost:5005): java.net.ConnectException "Connection refused: connect"
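
If this happens, you can confirm whether anything is actually listening on the debug port by checking from a shell on the same machine (Linux/macOS):


// Check whether a process is listening on the debug port
lsof -i :5005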

2. Debug Spark application running on Remote server

If you are running the Spark application on a remote node and want to debug it via IntelliJ, you need to set the environment variable SPARK_SUBMIT_OPTS with the debug information.


// Debug Spark application running on Remote server
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5050"

Now run your spark-submit on the remote node, which will wait for the debugger to attach.
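
For example, reusing the illustrative class and jar names from earlier (spark-submit picks up SPARK_SUBMIT_OPTS automatically):


// Run spark-submit on the remote node; it inherits SPARK_SUBMIT_OPTS
spark-submit \
  --class org.sparkbyexamples.SparkWordCountExample \
  spark-by-examples.jar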

Finally, open IntelliJ and follow the same steps as above; for Host, enter the remote host where your Spark application is running, and for Port, enter 5050 to match the address in SPARK_SUBMIT_OPTS.

3. Conclusion

In this article, you learned how to debug a Spark application or job running on a local or remote server using the IntelliJ IDEA IDE. You can follow similar steps to debug from Eclipse as well.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.
