How does the createOrReplaceTempView() method work in PySpark, and what is it used for? One of the main advantages of Apache Spark is that it lets you work with SQL alongside the DataFrame API. So if you are comfortable with SQL, you can create a temporary view on a DataFrame by calling
createOrReplaceTempView() and then use SQL to select and manipulate the data.
A temporary view in PySpark is similar to a real SQL table in that it contains rows and columns, but the view is not materialized into files. In this article, we will discuss what
createOrReplaceTempView() is and how to use it to create a temporary view and run PySpark SQL queries.
Note that throughout this article, I will use the terms table and view interchangeably.
1. Syntax of createOrReplaceTempView()
Following is the syntax of the
createOrReplaceTempView() method. This method belongs to the DataFrame class, so you can call it on any PySpark DataFrame object.
# Syntax of createOrReplaceTempView()
createOrReplaceTempView(viewName)
2. How Does createOrReplaceTempView() work in PySpark?
createOrReplaceTempView() in PySpark creates a view if it does not exist; if it already exists, it replaces the existing view with the new one. PySpark SQL views are lazily evaluated, meaning the result is not persisted in memory unless you cache the underlying DataFrame by using the cache() method.
Some key points to note:
- createOrReplaceTempView() is used when you want to register a table for a specific SparkSession.
- Once created, you can use it to run SQL queries.
- These temporary views are session-scoped, i.e. valid only within the running SparkSession.
- They can't be shared between sessions.
- These views are dropped when the session ends, unless you created them as Hive tables.
- Use saveAsTable() to materialize the contents of the DataFrame and create a pointer to the data in the metastore.
3. Create a Temporary View
createOrReplaceTempView() is used to create a temporary view/table from a PySpark DataFrame. Since it is a temporary view, the lifetime of the table/view is tied to the current SparkSession. Hence, it will be automatically removed when your SparkSession ends.
If you are using a version prior to PySpark 2.0, you can use
registerTempTable() to create a temporary table.
Following are the steps to create a temporary view in PySpark and access it.
- Step 1: Create a PySpark DataFrame
- Step 2: Convert it to an SQL table (a.k.a view)
- Step 3: Access view using SQL query
3.1 Create a DataFrame
First, let’s create a PySpark DataFrame with the columns firstname, lastname, country, and state.
from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder \
    .appName("SparkByExamples.com") \
    .enableHiveSupport() \
    .getOrCreate()

data = [("James","Smith","USA","CA"),
        ("Michael","Rose","USA","NY"),
        ("Robert","Williams","USA","CA"),
        ("Maria","Jones","USA","FL")]
columns = ["firstname","lastname","country","state"]

# Create dataframe
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)
sampleDF.show()
3.2 Create Table/View using PySpark createOrReplaceTempView()
Let’s create a temporary view on top of the DataFrame object by using
df.createOrReplaceTempView(). PySpark SQL temporary views are session-scoped and will not be available if the session that created them terminates. The following example creates a temporary view named Person.
# Create Temporary View/Table sampleDF.createOrReplaceTempView("Person")
If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view using createGlobalTempView().
3.3 Access View using PySpark SQL Query
Using SparkSession, you can access PySpark SQL capabilities in Apache Spark. Once you have a temporary view, you can run SQL queries on it using the spark.sql() method.
# Run SQL Query spark.sql("select firstname, lastname from Person").show()
Yields below output.

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michael|    Rose|
|   Robert|Williams|
|    Maria|   Jones|
+---------+--------+
4. Use createOrReplaceTempView() on Azure Databricks
Below is a simple snippet on how to use
createOrReplaceTempView() on Azure Databricks and how to access it using PySpark SQL query.
In this article, we discussed how to create a temporary view or table in PySpark by using createOrReplaceTempView() and explained how it works. Temporary views are lazily evaluated in PySpark and can be queried just like a table, but they are not materialized and have no pointer to data in the metastore.
- PySpark parallelize() – Create RDD from a list data
- PySpark cache() Explained.
- PySpark repartition() – Explained with Examples
- PySpark SparkContext Explained
- What is PySpark DataFrame?
- PySpark filter and where Functions | Multiple Conditions
- PySpark Replace Column Values in DataFrame
- PySpark JSON Functions with Examples
- PySpark Read and Write SQL Server Table