What are the different types of issues you get while running Apache Spark or PySpark projects? If you are attending an Apache Spark interview, you will most often be asked what problems or challenges you face while running Spark applications/jobs in a cluster (EMR, Cloudera, Azure Databricks, MapR, etc.).
Regardless of which cluster you use to run your Spark/PySpark application, you will face the common issues I explain here. Besides these, you might also hit other issues specific to the cluster you are using.
The following are the most common issues we face while running Spark/PySpark applications. As you know, each project and cluster is different; if you have faced any other issues, please share them in the comments so the Spark community can learn from your experiences.
- Serialization Issues
- Out of Memory Exceptions
- Optimizing Long Running Jobs
- Result Exceeds Driver Memory
- Using coalesce() – Creates Uneven Partitions
- Broadcasting Large Data
- Data Skewness
- Too Small and Too Large Partitions
- Reading Encoding Files
- Creating Small Files
- inferSchema Issues
- Missing Data
Understanding the Apache Spark architecture is one of the keys to writing better Spark programs. Writing Spark code without knowing the architecture can result in slow-running jobs and many of the other issues explained in this article. So to write the best Spark programs, you need to understand the Spark architecture and how it executes an application in a distributed way on the cluster (EMR, Cloudera, Azure Databricks, MapR, etc.).
1. Serialization Issues
If you know the Spark architecture, Spark splits your application into multiple tasks and sends them to executors to execute. In order to send a task to the executors over the network, Spark needs to serialize the objects it references. If for any reason Spark is unable to serialize an object, you get the below error.
org.apache.spark.SparkException: Task not serializable
In Spark, you can get this Task not serializable error in several scenarios. To resolve it, do the following based on your scenario.
Make Custom Classes Serializable: If you write custom classes in Spark/Scala, make sure your class extends Serializable. Serialization converts an object's state into a byte stream so that it can be transferred over the network. If you are using Scala, prefer a case class, which is serializable by default.
Use @transient: If you don't want to transfer an object to the executors, mark it with the @transient annotation. Any field marked with this annotation is skipped during serialization and not transferred to the executors; database connections, file handles, etc. are typical examples.
Use Object: If you are using Scala, put shared utility methods inside an object (a singleton); its members are accessed statically on each executor, so calling them does not drag extra state into the task closure.
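PySpark hits the same problem: it ships the functions and objects referenced in a transformation to executors by pickling them, so a non-serializable field surfaces as a pickling error. Below is a minimal plain-Python sketch (no Spark required; the logger classes are made up for illustration) showing why an object holding an OS-level resource such as a file handle cannot be serialized, and the Python analogue of Scala's @transient: dropping the handle in `__getstate__` and recreating it in `__setstate__`.

```python
import pickle
import tempfile

class BadLogger:
    """Holds an open file handle -- like a DB connection, it cannot be pickled."""
    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")   # OS-level resource: not serializable

class GoodLogger(BadLogger):
    """Python analogue of Scala's @transient: skip the handle when pickling."""
    def __getstate__(self):
        state = self.__dict__.copy()
        del state["fh"]             # drop the non-serializable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.fh = open(self.path, "a")  # reopen on the "executor" side

path = tempfile.NamedTemporaryFile(delete=False).name

# Without __getstate__, pickling fails with TypeError
# ("cannot pickle '_io.TextIOWrapper' object").
try:
    pickle.dumps(BadLogger(path))
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# With the transient-style hooks, the round trip works and the
# deserialized copy recreates its own file handle.
clone = pickle.loads(pickle.dumps(GoodLogger(path)))
print(clone.fh is not None)  # True
```

The same idea applies in Spark: keep non-serializable resources out of the task closure, and create them lazily on the executor side instead.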
2. Out Of Memory Exceptions
OutOfMemoryError is the most common and annoying issue we all get while running Spark applications in the cluster. We might get this error across multiple components of Spark execution.
Use one of the following options in the spark-submit command line to increase driver memory.
--driver-memory <XX>G #(or) --conf spark.driver.memory=<XX>g
Use one of the following commands in Spark Submit Command Line Options to increase executor memory.
--executor-memory <XX>G #(or) --conf spark.executor.memory=<XX>g
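Putting the two together, a spark-submit invocation might look like the following (the sizes and application file name are illustrative; `spark.executor.memoryOverhead` covers off-heap usage that also contributes to container OOM kills on YARN):

```shell
spark-submit \
  --master yarn \
  --driver-memory 8G \
  --executor-memory 16G \
  --conf spark.executor.memoryOverhead=2g \
  my_app.py
```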
3. Optimizing Long Running Jobs
Before you go to production with your Spark project, you need to make sure your jobs are going to complete within a given SLA. To verify this, run some performance tests; if you are not meeting the SLA, you need to improve your Spark application's performance.
The following are a few things you can try to optimize the Spark applications to run faster.
- Caching intermediate results
- Use broadcast hash join
- Repartition to the right number of partitions
- Reduce shuffle operations
- Use Apache Parquet or Avro
4. Result Exceeds Driver Memory
When you are working with very large datasets, actions sometimes fail with the below error because the total size of the results is greater than the spark.driver.maxResultSize value.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of z tasks (x MB) is bigger than spark.driver.maxResultSize (y MB)
Though collecting large data to the driver is not recommended, sometimes you may be required to collect data using the collect() method. If you do, you may also get this error, and you need to increase spark.driver.maxResultSize using a spark-submit configuration or the SparkConf class.
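You can raise the limit with a spark-submit configuration; the 4g value below is illustrative (the default is 1g, and 0 disables the limit entirely, which risks a driver OOM). Where possible, prefer df.take(n) or df.toLocalIterator() over collect() so only part of the result sits on the driver at once.

```shell
--conf spark.driver.maxResultSize=4g  #(or set it on SparkConf before starting the session)
```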
5. Using coalesce() – Creates Uneven Partitions
coalesce() is used to reduce the number of partitions in an efficient way, and it is often preferred over repartition() as a Spark performance optimization; for the differences between the two, refer to Spark coalesce vs repartition differences.
However, if you use the result of coalesce() in a join with another Spark DataFrame, you might see a performance issue: coalesce() can produce uneven partitions, and joining an unevenly partitioned DataFrame with an evenly partitioned one can lead to a data skew issue.
So you have to be careful about when to use the coalesce() function. I have used it in projects where I wanted to write a single part file from a DataFrame.
6. Broadcasting Large Data
If you are using broadcasting, either for broadcast variables or broadcast joins, you need to make sure the data you are broadcasting fits in driver memory; if you try to broadcast data larger than the driver memory capacity, you will get an out-of-memory error.
To resolve this, either remove the unwanted data from your object or increase the driver memory.
--driver-memory <XX>G #(or) --conf spark.driver.memory=<XX>g