In Spark SQL, you can create tables in several ways depending on your requirements. For example, you can create a table from a temporary view or from an external source file. In this article, we will learn how to create a table in Spark/PySpark with Hive and in Databricks.
In order to create a Hive table from Spark or PySpark SQL, you need to create a SparkSession with enableHiveSupport(). The enableHiveSupport() method enables integration with Apache Hive. When you enable Hive support, Spark gains access to Hive SerDes (serializers/deserializers), Hive user-defined functions, and the persistent Hive metastore.
// enableHiveSupport() enables the SparkSession to connect with Hive
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkCreateTableExample")
  .enableHiveSupport()
  .getOrCreate()
1. Create a Table in Hive from Spark
You can create a Hive table in Spark directly from a DataFrame using saveAsTable(), from a temporary view using spark.sql(), or in Databricks. Let's create a DataFrame and, on top of it, a temporary view using the built-in DataFrame function createOrReplaceTempView().
import spark.implicits._

// Create DataFrame
val sampleDF = Seq(
  (1, "Tiger"),
  (2, "Lion"),
  (3, "Monkey")
).toDF("id", "animal")

// Create temporary view
sampleDF.createOrReplaceTempView("sampleView")
1.1. Using Spark SQL Temporary View
Use the spark.sql() method and a CREATE TABLE statement to create a table in Hive from a Spark temporary view. Above we have created a temporary view "sampleView". Now we shall create a database and a table in the Hive metastore using SQL, and insert data into the Hive table using the view we created above.
// Create a database named ct
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

// Create a table named sampleTable under the ct database
spark.sql("CREATE TABLE ct.sampleTable (id INT, animal STRING)")

// Insert into sampleTable using the sampleView
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

// Let's view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()
//Output
+---+------+
| id|animal|
+---+------+
| 1| Tiger|
| 2| Lion|
| 3|Monkey|
+---+------+
When Spark Hive support is activated, Spark typically saves data to the default Hive warehouse location, which is /user/hive/warehouse when connected to a Hive cluster. However, when operating locally, data is saved in the current directory. To modify this behavior, you can adjust the spark.sql.warehouse.dir configuration when initializing a SparkSession.
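As a small illustration, the warehouse location can be overridden when building the SparkSession; the path below is only an example, not a required value:

// Override the default warehouse location while creating the SparkSession
// (example path; point it to your own location)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkCreateTableExample")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()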
Since we are running it locally, Spark generates a metadata database called metastore_db and a spark-warehouse directory in the current directory. Within spark-warehouse, we can see that the database (ct) and the table (sampletable) are registered in the Hive metastore as an internal table.
1.2. Create Table using Spark DataFrame saveAsTable()
Use the saveAsTable() method from DataFrameWriter to create a Hive table from a Spark or PySpark DataFrame. The DataFrame can be written into a new or an existing table.
1.2.1 Create Internal Table from Spark
The saveAsTable() method by default creates an internal or managed table in the Hive metastore. Internal tables, also known as managed tables, are owned and managed by Hive: Hive controls both the table structure and the underlying files that store the table data. Dropping an internal table removes both the metadata and the data files, ensuring that the table is fully deleted from the system.
Let's use the same DataFrame that we used above to create the Hive table.
// Create a Hive internal (managed) table
import org.apache.spark.sql.SaveMode

sampleDF.write.mode(SaveMode.Overwrite)
  .saveAsTable("ct.sampletable2")
1.2.2 Create External Table
An external table in Spark is a table whose data is stored outside of the Spark environment, typically in a user-defined location in a file system such as HDFS or S3. Spark SQL creates metadata for the external table in the Hive metastore, similar to internal tables. However, it does not manage the data files, as they are stored externally.
// Create a Hive external table
sampleDF.write.mode(SaveMode.Overwrite)
  .option("path", "/path/to/external/table")
  .saveAsTable("ct.sampletable2")
Here,
- sampleDF.write: This initiates the process of writing the DataFrame to a storage location.
- .mode(SaveMode.Overwrite): This specifies the mode for writing data to the storage location. In this case, it is set to overwrite existing data, if any.
- .option("path", "/path/to/external/table"): This specifies the path where the external table's data will be stored. Replace /path/to/external/table with the desired storage location.
- .saveAsTable("ct.sampletable2"): This saves the DataFrame as a Hive external table named sampletable2 in the database ct. If the specified table already exists, its contents will be overwritten.
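The practical difference between the two table types shows up when they are dropped. The snippet below is only an illustration of that behavior, reusing the table names from the examples above:

// Dropping a managed table (section 1.2.1) removes both the metastore entry
// and the data files under the warehouse directory.
// Dropping an external table (section 1.2.2) removes only the metastore entry;
// the files at /path/to/external/table are left untouched.
spark.sql("DROP TABLE IF EXISTS ct.sampletable2")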
There are multiple SaveMode options available when writing a DataFrame with the saveAsTable() method.
1.2.3 Save Modes
- Overwrite: Overwrites the existing data of the table.
- Append: Appends the incoming data to the table.
- Ignore: If data already exists at the target, the save operation does not write the contents of the DataFrame and leaves the existing data unchanged.
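As a brief sketch of how these modes are passed to the writer (reusing the DataFrame and table from the examples above):

// Append the DataFrame rows to the existing table
sampleDF.write.mode(SaveMode.Append).saveAsTable("ct.sampletable2")

// Ignore: leave the table untouched because ct.sampletable2 already exists
sampleDF.write.mode(SaveMode.Ignore).saveAsTable("ct.sampletable2")

The default mode, SaveMode.ErrorIfExists, throws an error if the target table already exists.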
1.2.4 List of Tables
To get the list of tables and views in the current database, use the following method.
// List tables and views in the current database
spark.catalog.listTables().show()
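The Catalog API can also list the tables of a specific database or all databases in the metastore; a small sketch using the ct database created earlier:

// List tables and views in the ct database
spark.catalog.listTables("ct").show()

// List all databases registered in the metastore
spark.catalog.listDatabases().show()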
2. Create a Table in Databricks
By default, all tables created in Databricks are Delta tables, with the underlying data stored in Parquet format.
Let us see how to create a Spark table in Databricks and examine its properties. First, create a SQL notebook in Databricks and add the commands below into a cell.
%sql
-- Creates a Delta table in the default Spark SQL warehouse directory
CREATE DATABASE IF NOT EXISTS sparkExamples;
CREATE TABLE IF NOT EXISTS sparkExamples.sampleTable (id INT, FirstName STRING, LastName STRING);

-- Insert into the table
INSERT INTO sparkExamples.sampleTable VALUES (1, "John", "Smith");

-- Select from the table
SELECT * FROM sparkExamples.sampleTable;

Output:
id | FirstName | LastName
1  | John      | Smith
Above we have created a managed Spark table (sparkExamples.sampleTable) and inserted a few records into it. Now let us look at the table properties using the DESCRIBE EXTENDED command.
%sql
-- Describe table
DESCRIBE EXTENDED sparkExamples.sampleTable;
In the output of DESCRIBE EXTENDED, we see that the type of the Spark table is "MANAGED" and that the table data is stored in the default Spark SQL warehouse path, i.e., /user/hive/warehouse/sparkexamples.db/sampletable.
We can also create external tables in Hive and Databricks by specifying a LOCATION while creating the table, as shown in the sketch below.
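A minimal sketch of such an external table in SQL follows; the table name and path are placeholders, not values from the article:

%sql
-- Creates an external table; Spark manages only the metadata,
-- while the data files live at the given location
CREATE TABLE IF NOT EXISTS sparkExamples.sampleExternalTable (id INT, FirstName STRING, LastName STRING)
LOCATION '/mnt/data/sampleExternalTable';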
For more information on types of tables, see Spark Types of Tables and Views.
3. Conclusion
In this article, you have learned how to use Apache Spark or PySpark to create tables in Hive, Databricks, and many external storage systems. While creating a table, we can also specify whether Spark should manage only the table metadata or both the table and its data (by creating an external or an internal table, respectively). This helps in improving data security and in limiting the resources to be managed. Each of the methods shown above for creating a table has its own advantages, and choosing among them depends purely on the application requirement.
Related Articles
- Spark Types of Tables and Views
- Spark createOrReplaceTempView() Explained
- Time Travel with Delta Tables in Databricks?
- Spark Internal Execution plan
- Broadcast Join in Spark
- Spark SQL Left Outer Join with Example
- Spark SQL Left Anti Join with Example
- Spark SQL Right Outer Join with Example