You can drop a DataFrame from the cache in Spark or PySpark by using the unpersist() method of RDD/DataFrame/Dataset. In this article, let's understand what caching is and how to drop a Spark DataFrame from the cache.
1. What is Cache in Spark?
In Spark and PySpark, caching a DataFrame is the most commonly used technique for reusing a computation. Spark can speed up queries that use the same data by serving them from the cached results of previous operations.
Caching a Spark DataFrame is a lazy transformation, so immediately after calling cache() nothing happens to the data; the Cache Manager simply updates the query plan by adding a new operator, InMemoryRelation. The actual caching happens only when a Spark action is called. At a high level, this operator just records information that is used later during query execution, when some action runs. Spark will look for the data in the caching layer and read it from there if it is available. If it does not find the data in the caching layer (which is guaranteed the first time the query runs), Spark becomes responsible for putting the data there, and it will use it immediately afterward.
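As a minimal sketch (assuming a toy df built from spark.range, used here only for illustration), you can see this operator by inspecting the plan right after calling cache(), before any action runs:

// Hypothetical DataFrame for illustration
val df = spark.range(1, 100).toDF("id")

// cache() only updates the plan; nothing is materialized yet
df.cache()

// The physical plan now shows InMemoryTableScan / InMemoryRelation,
// the operator added by the Cache Manager
df.explain()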
To cache a DataFrame, use the following. Here, count() is an action, so it initiates the actual caching of the DataFrame.
// Cache the DataFrame
df.cache()
// count() is an action; it materializes the cache
df.count()
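If you want to confirm that the DataFrame is cached, Spark 2.1+ exposes the storage level on the Dataset itself (StorageLevel.NONE means it is not persisted):

// Check the storage level; MEMORY_AND_DISK after cache()
println(df.storageLevel)
// useMemory is true once cache() has been called
println(df.storageLevel.useMemory)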
2. Monitoring Cache
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion, so the least recently used partitions are evicted from the cache first.
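As a small sketch (this API is Scala-only; the Storage tab of the Spark UI shows the same information), you can list what the driver is currently tracking in the cache:

// List all RDDs tracked as persistent, with their storage levels
spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: ${rdd.getStorageLevel.description}")
}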
3. Drop DataFrame from Cache
You can also manually remove a DataFrame from the cache using the unpersist() method in Spark/PySpark. unpersist() marks the DataFrame as non-persistent and removes all of its blocks from memory and disk. Calling unpersist(blocking = true) blocks until all blocks have been deleted from the cache.
Syntax
// Syntax
unpersist() : Dataset.this.type
unpersist(blocking : scala.Boolean) : Dataset.this.type
Example
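A minimal example, reusing the df cached above:

// Mark the DataFrame as non-persistent (returns immediately by default)
df.unpersist()

// Block until all cached blocks are removed from memory and disk
df.unpersist(blocking = true)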
4. Uncache Table
If a table is registered with the createOrReplaceTempView() method and then cached, you have to use a different approach to remove it from the cache. In the example below, count() triggers the actual caching.
// Create a temporary view and cache it
df.createOrReplaceTempView("nums")
spark.table("nums").cache()
spark.table("nums").count()
To remove this table from the cache, use the following:
// Remove a specific table from the cache
spark.catalog.uncacheTable("nums")
To remove all tables from the cache:
// Clear all tables from cache
spark.catalog.clearCache()
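You can verify the result with spark.catalog.isCached (using the "nums" view from the example above):

// Returns false after uncacheTable("nums") or clearCache()
println(spark.catalog.isCached("nums"))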
If you are using an older version of Spark (before 2.0, where SparkSession and spark.catalog were introduced), use the equivalent methods on SQLContext.
// Remove a specific table from the cache
sqlContext.uncacheTable("nums")
// Clear all tables from cache
sqlContext.clearCache()
5. Conclusion
In this article, you have learned how to cache() a Spark DataFrame and how to remove it from the cache using the unpersist() method. cache() is an optimization technique that saves the interim computation results of a DataFrame or Dataset so they can be reused in subsequent actions. You have also learned how to drop a DataFrame from the cache in Spark with Scala examples.
6. Related Articles
- Spark Create DataFrame with Examples
- Spark DataFrame Cache and Persist Explained
- Spark – Difference between Cache and Persist?
- What does setMaster(local[*]) mean in Spark
- Difference in DENSE_RANK and ROW_NUMBER in Spark
- Spark Drop, Delete, Truncate Differences
- Spark Drop Rows with NULL Values in DataFrame