Spark – Different Types of Issues While Running in Cluster?

What are the different types of issues you get while running Apache Spark projects or PySpark? If you are attending Apache Spark Interview most often you will get what are the different problems or challenges you face while running Spark application/job in the cluster (EMR, Cloudera, Azure Databricks, MapR e.t.c).

Regardless of what cluster you are using to run the Spark/PySpark application, you would face some common issues that I explained here. Besides these, you might also get other different issues based on what cluster you are using.

The following are the most common different issues we face while running Spark/PySpark applications. As you know each project and cluster is different hence, if you faced any other issues please share in the comment. Spark community can learn from your experiences.

Understanding the apache spark architecture is one of the keys to writing better Spark programming. Writing a good Spark code without knowing the architecture would result in slow-running jobs and many other issues explained in this article. So to write the best spark programming you need to understand how Spark architecture and how it executes the application in a distributed way in the cluster (EMR, Cloudera, Azure Databricks, MapR e.t.c).

1. Serialization Issues

If you know the Spark architecture, Spark splits your application into multiple chunks and sends these to executors to execute. In order to send it to executors over the network, it needs to serialize the object. For any reason, if Spark is unable to serialize the objects you get the below error.


org.apache.spark.SparkException: Task not serializable

In Spark, you get this org.apache.spark.SparkException: Task not serializable in several scenarios. To resolve this error, you need to do the following based on your scenario.

Make Custom Classes Serializable: If you write custom classes in Spark/Scala make sure your class uses extends Serializable. A serializable converts its state to a byte stream so that it can be transferred over the network. If you are using Scala use Case class which is by default serializable.

Use @transient: If you don’t want to transfer an object to executors mark these objects with annotation @transient. Any object marked with this annotation will be ignored to serialized and not transferred to the executors. For example object of Database connections, File e.t.c

Use Object: If you are using Scala, use Object as it is by default Serializable.

2. Out Of Memory Exceptions

OutOfMermoryError is the most common and annoying issue we all get while running Spark applications in the cluster. We might get this error across multiple components of Spark execution.

  • Driver
  • Executor

Driver:

Use one of the following commands in Spark Submit Command Line Options to increase drive memory


--driver-memory <XX>G
#(or)
--conf spark.driver.memory= <XX>g

Executor:

Use one of the following commands in Spark Submit Command Line Options to increase executor memory.


--executor-memory <XX>G
#(or)
--conf spark.executor.memory= <XX>g

3. Optimizing Long Running Jobs

Before you go to production with your Spark project you need to make sure your jobs going to complete in a given SLA. In order to do so, you need to run some performance tests, if you are not meeting your SLA, you need to improve your spark application performance.

The following are a few things you can try to optimize the Spark applications to run faster.

  • Caching intermediate results
  • Use broadcast hash join
  • Repartition to the right number of partitions
  • Reduce shuffle operations
  • Use Apache Parquet or Avro

4. Result Exceeds Driver Memory

When you are working with very large datasets and sometimes actions result in the below error when the total size of results is greater than the Spark Driver Max Result Size value spark.driver.maxResultSize .

Spark Error: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of z tasks (x MB) is bigger than spark.driver.maxResultSize (y MB).


org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of z tasks (x MB) is bigger
than spark.driver.maxResultSize (y MB)

Though it is not recommended to collect large data, sometimes you may be required to collect the data using the collect() method, if you do you will also get this error and you need to increase the size by using spark-submit configuration or SparkConf class.

5. Using coalesce() – Creates Uneven Partitions

coalesce() is used to reduce the number of partitions in an efficient way and this function is used as one of the Spark performance optimizations over using repartition(), for differences between these two refer to Spark coalesce vs repartition differences.

However, if you are using the result of coalesce() on a join with another Spark DataFrame you might see a performance issue as coalescing results in uneven partition, and using an uneven partition DataFrame on an even partition DataFrame results in a Data Skew issue.

So you have to be careful when to use coalesce() function, I have used this function in my project where I wanted to create a single part file from DataFrame.

6. Broadcasting Large Data

If you are using broadcasting either for broadcasting variable or broadcast join, you need to make sure the data you are broadcasting fits in driver memory, if you try to broadcast a larger data size greater than driver memory capacity you will get out of memory error.

To resolve this either you need to remove the unwanted data from your object or increase the size of the driver memory.


--driver-memory <XX>G
#(or)
--conf spark.driver.memory= <XX>g


spark cluster

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Spark – Different Types of Issues While Running in Cluster?