PySpark createOrReplaceTempView() Explained

How does the createOrReplaceTempView() method work in PySpark and what is it used for? One of the main advantages of Apache PySpark is working with SQL along with DataFrame/Dataset API. So if you are comfortable with SQL, you can create a temporary view on DataFrame/Dataset by using createOrReplaceTempView() and using SQL to select and manipulate the data.

1. Syntax of createOrReplaceTempView()

Following is the syntax of the createOrReplaceTempView() method. This method is from the Dataset class. So you can use this method from PySpark DataFrame and Dataset objects.


#Syntax of createOrReplaceTempView()
createOrReplaceTempView(viewName)

2. How Does createOrReplaceTempView() work in PySpark?

createOrReplaceTempView() in PySpark creates a view only if not exist, if it exits it replaces the existing view with the new one. PySpark SQL views are lazily evaluated meaning it does not persist in memory unless you cache the dataset by using the cache() method.

Some KeyPoints to note:

createOrReplaceTempView() is used when you wanted to store the table for a specific SparkSession.
Once created you can use it to run SQL queries.
These temporary views are session-scoped i.e. valid only that running SparkSession.
It can’t be shared between the sessions
These views will be dropped when the session ends unless you created it as Hive table.
Use saveAsTable() to materialize the contents of the DataFrame and create a pointer to the data in the metastore.

3. Create a Temporary View

The createOrReplaceTempView() is used to create a temporary view/table from the PySpark DataFrame or Dataset objects. Since it is a temporary view, the lifetime of the table/view is tied to the current SparkSession. Hence, It will be automatically removed when your SparkSession ends.

If you are using an older version prior to PySpark 2.0, you can use registerTempTable() to create a temporary table.

Following are the steps to create a temporary view in PySpark and access it.

Step 1: Create a PySpark DataFrame
Step 2: Convert it to an SQL table (a.k.a view)
Step 3: Access view using SQL query

3.1 Create a DataFrame

First, let’s create a PySpark DataFrame with columns firstname, lastname, country and state columns.


from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .enableHiveSupport() \
    .getOrCreate()
    
data = [("James","Smith","USA","CA"),
  ("Michael","Rose","USA","NY"),
  ("Robert","Williams","USA","CA"),
  ("Maria","Jones","USA","FL")
  ]
columns = ["firstname","lastname","country","state"]

# Create dataframe
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)
sampleDF.show()

3.2 Create Table/View using PySpark createOrReplaceTempView()

Let’s create a temporary view on top of the DataFrame object by using df.createOrReplaceTempView(). PySpark SQL temporary views are session-scoped and will not be available if the session that creates it terminates. The following examples create a Person table.


# Create Temporary View/Table 
sampleDF.createOrReplaceTempView("Person")

If you want to have a temporary view that is shared among all sessions and keep alive until the PySpark application terminates, you can create a global temporary view using createGlobalTempView()

3.3 Access View using PySpark SQL Query

Using SparkSession you can access PySpark SQL capabilities in Apache PySpark. Once you have a temporary view you can run any ANSI SQL queries using spark.sql() method.


# Run SQL Query
spark.sql("select firstname, lastname from Person").show()

Yields below output.

4. Use createOrReplaceTempView() on Azure Databricks

Below is a simple snippet on how to use createOrReplaceTempView() on Azure Databricks and how to access it using PySpark SQL query.

createOrReplaceTempView PySpark — createOrReplaceTempView in Spark

5. Conclusion

In this article, we discussed how to create a temporary view or table in PySpark by using createOrReplaceTempView() and explained how it works. Temp views are lazily evaluated views in PySpark and can be used same as like a table but they are not materialized and don’t have pointers for the data in metastore.

References

https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html