Spark Drop DataFrame from Cache

By using the unpersist() method of RDD/DataFrame/Dataset, you can drop the DataFrame from the cache in Spark or PySpark. In this article, let's understand how to drop a Spark DataFrame from the cache and what exactly the cache is.

1. What is Cache in Spark?

In Spark or PySpark, caching a DataFrame is the most widely used technique for reusing computation. Spark can speed up queries that use the same data by reusing the cached results of previous operations.

Caching a Spark DataFrame is a lazy transformation, so nothing happens to the data immediately after calling the cache() function; instead, the Cache Manager updates the query plan by adding a new operator — InMemoryRelation. The actual caching happens only when you call a Spark action.

At a high level, InMemoryRelation just stores information that is used later during query execution, when an action is called. Spark will look for the data in the caching layer and read it from there if it is available. If it doesn't find the data in the caching layer (which is guaranteed the first time the query runs), it becomes responsible for getting the data there and uses it immediately afterward.

To create a cache, use the following. Here, count() is an action, hence calling it initiates caching of the DataFrame.


// Cache the DataFrame
df.cache()
df.count()  // Action triggers the actual caching
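
To confirm that the Cache Manager added the InMemoryRelation operator to the query plan, you can inspect the physical plan with explain(). A minimal sketch, assuming df is the DataFrame cached above:

// After caching, the physical plan shows InMemoryRelation/InMemoryTableScan
df.explain()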

2. Monitoring Cache

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion, so the least recently used partitions are removed from the cache first.
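
Besides the Storage tab in the Spark UI, you can check a DataFrame's cache status programmatically through its storage level. A quick sketch, again assuming the df cached earlier:

// storageLevel reports how (and whether) the DataFrame is cached
println(df.storageLevel)           // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
println(df.storageLevel.useMemory) // true while blocks are held in memory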

3. Drop DataFrame from Cache

You can also manually remove a DataFrame from the cache using the unpersist() method in Spark/PySpark. unpersist() marks the DataFrame as non-persistent and removes all of its blocks from memory and disk. unpersist(Boolean) with argument true blocks until all blocks are deleted from the cache.

Syntax


// Syntax
unpersist() : Dataset.this.type
unpersist(blocking : scala.Boolean) : Dataset.this.type

Example

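A minimal sketch of dropping the cached DataFrame, assuming the df cached earlier:

// Mark the DataFrame as non-persistent; its blocks are dropped from memory and disk
df.unpersist()

// Blocking variant: waits until all cached blocks are deleted
df.unpersist(blocking = true)

By default, unpersist() is non-blocking; pass true when you need to be sure the blocks are actually freed before continuing.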

4. Uncache Table

If a table is created using the createOrReplaceTempView() method and then cached, you have to use a different approach to remove it from the cache. Here, count() triggers the actual caching.


// Create a temporary view and cache it
df.createOrReplaceTempView("nums")
spark.table("nums").cache()
spark.table("nums").count()  // Action triggers the actual caching

To remove this table from the cache, use the following.


// Remove a specific table from the cache
spark.catalog.uncacheTable("nums")
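
To confirm the table was actually removed from the cache, spark.catalog.isCached() can be used; a quick sketch with the "nums" view from above:

// Returns false once the table has been uncached
println(spark.catalog.isCached("nums"))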

To remove all tables from the cache, use the following.


// Clear all tables from cache
spark.catalog.clearCache()

If you are using an older version of Spark (< 2.0), use the equivalent methods from SQLContext.


// Remove a specific table from the cache
sqlContext.uncacheTable(tableName)

// Clear all tables from cache
sqlContext.clearCache()

5. Conclusion

In this article, you have learned how to cache() a Spark DataFrame and how to remove it using the unpersist() method. cache() is an optimization technique that saves the interim computation results of a DataFrame or Dataset so they can be reused in subsequent actions. You have also learned how to drop a DataFrame from the cache in Spark with Scala examples.

rimmalapudi

Data Engineer. I write about Big Data architecture and the tools and techniques used to build Big Data pipelines, along with other general posts.