
Spark Types of Tables and Views

Spark Tables

Apache Spark & PySpark support SQL natively through the Spark SQL API, which allows us to run SQL queries by creating tables and views on top of DataFrames.

In this article, we shall discuss the types of tables and views available in Apache Spark & PySpark. Apache Spark is a distributed data processing engine that allows you to create three main types of non-temporary cataloged objects: EXTERNAL tables, MANAGED tables, and different types of VIEWs.

VIEW is used for persistent views, whereas EXTERNAL and MANAGED are used for tables.
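
To see what is already registered in the catalog, Spark SQL provides SHOW commands. A minimal sketch (SHOW VIEWS requires Spark 3.0+, and the exact output columns can vary by version):


-- List tables (and temporary views) in the current database
SHOW TABLES;

-- List views; temporary views are flagged as isTemporary = true
SHOW VIEWS;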

1. What is a Table?

Like any RDBMS table, a Spark table is a collection of rows and columns stored as data files in object storage (S3, HDFS, Azure Blob, etc.). There are mainly two types of tables in Apache Spark (internally, these are Hive tables):

Related: Hive Difference Between Internal vs External Tables

1.1. Spark Internal Table

An Internal table is a Spark SQL table for which Spark manages both the data and the metadata.

Let’s create an internal table and look at the table properties in Databricks.


-- Creates a Delta table in the default Spark SQL warehouse directory
CREATE TABLE IF NOT EXISTS internal_table (id INT, FirstName STRING, LastName STRING);
INSERT INTO internal_table VALUES (1, "John", "Smith");
SELECT * FROM internal_table;

-- Output:
id|FirstName|LastName
1|John|Smith

Above, we have created a managed Spark table (internal_table) and inserted a few records into it. Now let us look at the table properties using the DESCRIBE command.


-- Describe table
DESCRIBE EXTENDED internal_table

Output:

Internal Spark Table

In the table properties, we see the type of the Spark table as “MANAGED”, and the table data is stored in the default Spark SQL warehouse path, i.e., /user/hive/warehouse/internal_table.

Note: If we run the DROP command on this internal table, both the data at the storage location and the table’s metadata in the catalog are removed.
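
For illustration, dropping the managed table looks like the following. Note that the view examples later in this article still reference internal_table, so re-create it if you run this:


-- Dropping a managed table removes the catalog metadata and deletes
-- the data files under /user/hive/warehouse/internal_table
DROP TABLE internal_table;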

1.2. Spark External Table

An External table is a Spark SQL table for which Spark manages only the metadata, while we control the location of the table data.

Let us create an external Spark table, specifying a storage location at creation time.


-- Creates a Delta table under the storage location "/user/tmp/external_table"
CREATE TABLE IF NOT EXISTS external_table (id INT, FirstName STRING, LastName STRING) LOCATION '/user/tmp/external_table';
INSERT INTO external_table VALUES (1, "John", "Smith");
SELECT * FROM external_table;

-- Output:
id|FirstName|LastName
1|John|Smith

Above, we have created an external Spark table (external_table) with a storage location specified and inserted a few records into it. Now let us look at the table properties using the DESCRIBE command.
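
As with the internal table, we can inspect the catalog entry:


-- Describe table
DESCRIBE EXTENDED external_table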

External Spark table

In the table properties, we see the type of the Spark table as “EXTERNAL”, and the table data is stored in the location specified while creating the Spark SQL table, i.e., “/user/tmp/external_table”.

Note: If we run the DROP command on this external table, only the table’s metadata is removed from the catalog; the data remains as it is.
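
Because only the catalog entry is dropped, the table can be re-registered on top of the existing files. A minimal sketch, assuming the same path and schema as above:


-- Dropping an external table removes only the metadata
DROP TABLE external_table;

-- Re-registering at the same location brings back the previously inserted data
CREATE TABLE IF NOT EXISTS external_table (id INT, FirstName STRING, LastName STRING) LOCATION '/user/tmp/external_table';
SELECT * FROM external_table;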

2. What is a View?

A Spark view is more like a virtual table with no physical data. We have mainly three types of views in Apache Spark:

2.1. Temporary View

TEMPORARY Spark views are SparkSession-scoped: they are only available to the session that created them and are dropped automatically as soon as the session ends.

These Spark views are not accessible from any other sessions or clusters, and they are not stored in the metastore.

Related: Spark createOrReplaceTempView() Explained

Syntax:


-- Syntax for creating a temp view
CREATE OR REPLACE TEMP VIEW viewName AS SELECT column_list FROM tableName;

Example:


-- Creating a temp view
CREATE OR REPLACE TEMP VIEW tempView AS SELECT * FROM internal_table;
SELECT * FROM tempView;

-- Output:
id|FirstName|LastName
1|John|Smith

Above we have created a temporary Spark view using the internal_table that we created earlier.
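
Because the temp view lives only in the session catalog, it shows up flagged as temporary when listing views (a minimal sketch; SHOW VIEWS is available from Spark 3.0):


-- Temporary views are listed with isTemporary = true
SHOW VIEWS;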

2.2. Global Temporary View

Temporary views in Spark SQL are session-scoped. If you want a temporary view that can be shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view.

Syntax:


-- Syntax for creating a global temp view
CREATE OR REPLACE GLOBAL TEMP VIEW viewName AS SELECT column_list FROM tableName;

Example:


-- Creating a global temp view
CREATE OR REPLACE GLOBAL TEMP VIEW globalTempView AS SELECT * FROM internal_table;

Note: The global temporary view remains accessible as long as the application is alive.
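
Also note that global temporary views are registered in Spark’s global_temp database, so they must be qualified with that prefix when queried, even from the session that created them:


-- Global temp views live in the global_temp database
SELECT * FROM global_temp.globalTempView;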

2.3. Global Permanent View

Global Permanent views are created by persisting the view definition (the query text) in the metastore, so they remain available across sessions and applications.

Syntax:


-- Syntax for creating a global permanent view
CREATE OR REPLACE VIEW viewName AS SELECT column_list FROM tableName;

Example:


-- Creating a global permanent view
CREATE OR REPLACE VIEW globalPermanentView AS SELECT * FROM internal_table;
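
We can verify how the view is cataloged using the DESCRIBE command:


-- Describe the view; Type shows VIEW and the defining query appears under View Text
DESCRIBE EXTENDED globalPermanentView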

In the Spark view properties, we see the Type as “VIEW” and the view definition under View Text.

Spark Global Permanent View

3. Conclusion

In Apache Spark and PySpark, we have these options for tables and views so that we can pick the one whose behavior best matches our application needs.

We create internal (managed) tables when we want Spark to own the full lifecycle of both the data and the metadata, accepting that dropping the table deletes the data.

We create an external table when we want to control the storage location ourselves, for example when the data is shared with other tools, so that dropping the table removes only the metadata.
