What are the different types of issues you get while running Apache Spark or PySpark projects? If you are attending an Apache Spark interview, you will most often be asked what problems or challenges you face while running Spark applications/jobs in a cluster (EMR, Cloudera, Azure Databricks, MapR, etc.).
Regardless of which cluster you use to run your Spark/PySpark application, you will face the common issues I explain here. Besides these, you might also hit other issues specific to the cluster you are using.
The following are the most common issues we face while running Spark/PySpark applications. As you know, each project and cluster is different; if you have faced any other issues, please share them in the comments so the Spark community can learn from your experiences.
- Serialization Issues
- Out of Memory Exceptions
- Optimizing Long Running Jobs
- Result Exceeds Driver Memory
- Using coalesce() – Creates Uneven Partitions
- Broadcasting Large Data
- Data Skewness
- Too Small and Too Large Partitions
- Reading Encoding Files
- Creating Small Files
- inferSchema Issues
- Missing Data
Understanding the Apache Spark architecture is one of the keys to writing better Spark programs. Writing Spark code without knowing the architecture can result in slow-running jobs and many of the other issues explained in this article. So to write the best Spark programs, you need to understand the Spark architecture and how it executes an application in a distributed way on the cluster (EMR, Cloudera, Azure Databricks, MapR, etc.).
1. Serialization Issues
If you know the Spark architecture, Spark splits your application into multiple tasks and sends them to executors to execute. In order to send a task to the executors over the network, Spark needs to serialize the objects it references. If for any reason Spark is unable to serialize an object, you get the below error.
org.apache.spark.SparkException: Task not serializable
In Spark, you can get this Task not serializable error in several scenarios. To resolve it, do the following based on your scenario.
Make Custom Classes Serializable: If you write custom classes in Spark/Scala, make sure your class extends Serializable. Serialization converts an object's state into a byte stream so that it can be transferred over the network. If you are using Scala, prefer a case class, which is serializable by default.
Use @transient: If you don't want to transfer an object to the executors, mark it with the @transient annotation. Any field marked with this annotation is skipped during serialization and not transferred to the executors; database connections, file handles, etc. are typical examples.
Use Object: If you are using Scala, put shared utility methods inside an object (a singleton); its members are accessed statically on each executor, so calling them does not drag extra state into the task closure.
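PySpark hits the same problem: it ships the functions and objects referenced in a transformation to executors by pickling them, so a non-serializable field surfaces as a pickling error. Below is a minimal plain-Python sketch (no Spark required; the logger classes are made up for illustration) showing why an object holding an OS-level resource such as a file handle cannot be serialized, and the Python analogue of Scala's @transient: dropping the handle in `__getstate__` and recreating it in `__setstate__`.

```python
import pickle
import tempfile

class BadLogger:
    """Holds an open file handle -- like a DB connection, it cannot be pickled."""
    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")   # OS-level resource: not serializable

class GoodLogger(BadLogger):
    """Python analogue of Scala's @transient: skip the handle when pickling."""
    def __getstate__(self):
        state = self.__dict__.copy()
        del state["fh"]             # drop the non-serializable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.fh = open(self.path, "a")  # reopen on the "executor" side

path = tempfile.NamedTemporaryFile(delete=False).name

# Without __getstate__, pickling fails with TypeError
# ("cannot pickle '_io.TextIOWrapper' object").
try:
    pickle.dumps(BadLogger(path))
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# With the transient-style hooks, the round trip works and the
# deserialized copy recreates its own file handle.
clone = pickle.loads(pickle.dumps(GoodLogger(path)))
print(clone.fh is not None)  # True
```

The same idea applies in Spark: keep non-serializable resources out of the task closure, and create them lazily on the executor side instead.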
2. Out Of Memory Exceptions
OutOfMemoryError is the most common and annoying issue we all get while running Spark applications in the cluster. We might get this error across multiple components of Spark execution.
Use one of the following options in the spark-submit command line to increase driver memory.
--driver-memory <XX>G #(or) --conf spark.driver.memory=<XX>g
Use one of the following commands in Spark Submit Command Line Options to increase executor memory.
--executor-memory <XX>G #(or) --conf spark.executor.memory=<XX>g
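Putting the two together, a spark-submit invocation might look like the following (the sizes and application file name are illustrative; `spark.executor.memoryOverhead` covers off-heap usage that also contributes to container OOM kills on YARN):

```shell
spark-submit \
  --master yarn \
  --driver-memory 8G \
  --executor-memory 16G \
  --conf spark.executor.memoryOverhead=2g \
  my_app.py
```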
3. Optimizing Long Running Jobs
Before you go to production with your Spark project, you need to make sure your jobs are going to complete within a given SLA. To verify this, run some performance tests; if you are not meeting the SLA, you need to improve your Spark application's performance.
The following are a few things you can try to optimize the Spark applications to run faster.
- Caching intermediate results
- Use broadcast hash join
- Repartition to the right number of partitions
- Reduce shuffle operations
- Use Apache Parquet or Avro
4. Result Exceeds Driver Memory
When you are working with very large datasets, actions sometimes fail with the below error because the total size of the results is greater than the spark.driver.maxResultSize value.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of z tasks (x MB) is bigger than spark.driver.maxResultSize (y MB)
Though collecting large data to the driver is not recommended, sometimes you may be required to collect data using the collect() method. If you do, you may also get this error, and you need to increase spark.driver.maxResultSize using a spark-submit configuration or the SparkConf class.
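You can raise the limit with a spark-submit configuration; the 4g value below is illustrative (the default is 1g, and 0 disables the limit entirely, which risks a driver OOM). Where possible, prefer df.take(n) or df.toLocalIterator() over collect() so only part of the result sits on the driver at once.

```shell
--conf spark.driver.maxResultSize=4g  #(or set it on SparkConf before starting the session)
```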
5. Using coalesce() – Creates Uneven Partitions
coalesce() is used to reduce the number of partitions in an efficient way, and it is often preferred over repartition() as a Spark performance optimization; for the differences between the two, refer to Spark coalesce vs repartition differences.
However, if you use the result of coalesce() in a join with another Spark DataFrame, you might see a performance issue: coalesce() can produce uneven partitions, and joining an unevenly partitioned DataFrame with an evenly partitioned one can lead to a data skew issue.
So you have to be careful about when to use the coalesce() function. I have used it in projects where I wanted to write a single part file from a DataFrame.
6. Broadcasting Large Data
If you are using broadcasting, either for broadcast variables or broadcast joins, you need to make sure the data you are broadcasting fits in driver memory; if you try to broadcast data larger than the driver memory capacity, you will get an out-of-memory error.
To resolve this, either remove the unwanted data from your object or increase the driver memory.
--driver-memory <XX>G #(or) --conf spark.driver.memory=<XX>g