Spark and PySpark SQL allow you to create a database and table either directly from a DataFrame, from temporary views, or from external source files. In this article, we shall discuss how to create a table in Hive and Databricks.
In order to create a Hive table from Spark or PySpark SQL, you need to create a SparkSession with enableHiveSupport(). So, let's create a SparkSession with Hive support enabled using its builder() method.
import org.apache.spark.sql.SparkSession

// enableHiveSupport() -> enables the SparkSession to connect to Hive
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkCreateTableExample")
  .enableHiveSupport()
  .getOrCreate()
1. Create a Table in Hive from Spark
You can create a Hive table in Spark directly from a DataFrame using saveAsTable(), from a temporary view using spark.sql(), or using Databricks. Let's create a DataFrame and then create a temporary view on top of it using the DataFrame's built-in function createOrReplaceTempView().
import spark.implicits._

// Create DataFrame
val sampleDF = Seq(
  (1, "Tiger"),
  (2, "Lion"),
  (3, "Monkey")
).toDF("id", "animal")

// Create temporary view
sampleDF.createOrReplaceTempView("sampleView")
1.1. Using Spark SQL Temporary View
Use the spark.sql() method with a CREATE TABLE statement to create a table in Hive from a Spark temporary view. Above, we created the temporary view "sampleView". Now we shall create a database and table in the Hive metastore using SQL, and insert data into the Hive table using the view we created above.
// Create a database named ct
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

// Create a table named sampleTable under the ct database
spark.sql("CREATE TABLE ct.sampleTable (id Int, animal String)")

// Insert into sampleTable using sampleView
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

// Let's view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()

// Output:
// +---+------+
// | id|animal|
// +---+------+
// |  1| Tiger|
// |  2|  Lion|
// |  3|Monkey|
// +---+------+
With Spark Hive support enabled, Spark by default writes the data to the default Hive warehouse location, which is /user/hive/warehouse when you use a Hive cluster. When running locally, however, it creates the warehouse in the current directory. You can change this behavior with the spark.sql.warehouse.dir configuration while creating the SparkSession.
Since we are running it locally from IntelliJ, Spark creates a spark-warehouse directory under the current directory. We can see below that spark-warehouse holds the database (ct) and the table (sampletable), registered in the Hive metastore as an internal table.
1.2. Create Table using Spark DataFrame saveAsTable()
Use the saveAsTable() method from DataFrameWriter to create a Hive table from a Spark or PySpark DataFrame. We can use the DataFrame to write into a new or existing table. Pass the table name you want to save as an argument to this function, and make sure the table name is in the form of database.tablename.
1.2.1 Create Internal Table from Spark
Note that by default this method creates an internal (also known as managed) table, which is owned and managed by Hive: Hive owns both the table structure and the data files. When you drop an internal table, it drops the data as well as the table's metadata.
Let's use the same DataFrame that we used above to create the Hive table.
import org.apache.spark.sql.SaveMode

// Create a Hive internal table
sampleDF.write.mode(SaveMode.Overwrite)
  .saveAsTable("ct.sampletable2")
1.2.2 Create External Table
To create an external table, specify a path of your choice using option(). The data in an external table is not owned or managed by Hive. Dropping an external table drops only the metadata, not the actual data; the data remains accessible outside of Hive.
// Create a Hive external table
sampleDF.write.mode(SaveMode.Overwrite)
  .option("path", "/path/to/external/table")
  .saveAsTable("ct.sampletable2")
Both examples above write the DataFrame into the ct.sampletable2 table; if the table already exists, it is overwritten. The saveAsTable() method supports multiple save modes.
1.2.3 Save Modes
- Overwrite: Overwrites the existing data of the table
- Append: Appends the incoming data to the existing data of the table
- Ignore: If data already exists at the destination, the save operation does not write the contents of the DataFrame and does not change the existing data
- ErrorIfExists (the default): Throws an error if the table already exists
1.2.4 List of Tables
To get the list of tables use the following method.
// List tables
spark.catalog.listTables()
2. Create a Table in Databricks
By default, all tables created in Databricks are Delta tables, with the underlying data stored in Parquet format.
Let us see how we create a Spark or PySpark table in Databricks and its properties. First, we create a SQL notebook in Databricks and add the below command into the cell.
%sql
-- Creates a Delta table in the default Spark SQL warehouse directory
CREATE DATABASE IF NOT EXISTS sparkExamples;

CREATE TABLE IF NOT EXISTS sparkExamples.sampleTable (id INT, FirstName String, LastName String);

-- Insert into the table
INSERT INTO sparkExamples.sampleTable VALUES (1, "John", "Smith");

-- Select from the table
SELECT * FROM sparkExamples.sampleTable;

-- Output:
-- id | FirstName | LastName
-- 1  | John      | Smith
Above, we have created a managed Spark table (sparkExamples.sampleTable) and inserted a record into it. Now let us look at the table properties using the DESCRIBE command.
%sql
-- Describe table
DESCRIBE EXTENDED sparkExamples.sampleTable;
In the table properties, we see the type of the Spark table as "MANAGED", and the data of the table is stored in the default Spark SQL warehouse path.
We can also create external tables in Hive and Databricks by passing a table LOCATION while creating the table.
For more information on types of tables see: Spark Types of Tables and Views
In this article, you have learned how to create tables in Hive, Databricks, and many external storage systems using Apache Spark or PySpark. While creating a table, we can also specify whether Spark should manage only the table metadata or both the table and its data (by creating an internal or external table, respectively). This helps improve data security and limits resource management. Each of the methods shown above has its own advantages, and choosing among them depends purely on the application requirements.
- Spark Types of Tables and Views
- Spark createOrReplaceTempView() Explained
- Time Travel with Delta Tables in Databricks?
- Spark Internal Execution plan
- Broadcast Join in Spark
- Spark SQL Left Outer Join with Example
- Spark SQL Left Anti Join with Example
- Spark SQL Right Outer Join with Example