
Spark and PySpark SQL allow you to create a database and a table either directly from a DataFrame, from temporary views, or from external source files. In this article, we discuss how to create a table in Hive and in Databricks.

To create a Hive table from Spark or PySpark SQL, you need a SparkSession with enableHiveSupport(). So, let’s create a SparkSession with Hive support enabled using its builder() method.


import org.apache.spark.sql.SparkSession

// enableHiveSupport() enables the SparkSession to connect with Hive
val spark = SparkSession.builder()
    .master("local[*]")
    .appName("SparkCreateTableExample")
    .enableHiveSupport()
    .getOrCreate()

1. Create a Table in Hive from Spark

You can create a Hive table in Spark directly from a DataFrame using saveAsTable(), from a temporary view using spark.sql(), or using Databricks. Let’s create a DataFrame and, on top of it, a temporary view using the DataFrame’s built-in createOrReplaceTempView() function.


import spark.implicits._

// Create DataFrame
val sampleDF = Seq(
    (1, "Tiger"),
    (2, "Lion"),
    (3, "Monkey")
  ).toDF("id", "animal")

// Create temporary view
sampleDF.createOrReplaceTempView("sampleView")

1.1. Using Spark SQL Temporary View

Use the spark.sql() method with a CREATE TABLE statement to create a table in Hive from a Spark temporary view. Above we created the temporary view “sampleView“. Now we create a database and a table in the Hive metastore using SQL and insert data into the Hive table from the view we created above.


// Create a database named ct
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

// Create a table named sampleTable under the ct database
spark.sql("CREATE TABLE ct.sampleTable (id INT, animal STRING)")

// Insert into sampleTable using the sampleView temporary view
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

// View the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()

//Output
+---+------+
| id|animal|
+---+------+
|  1| Tiger|
|  2|  Lion|
|  3|Monkey|
+---+------+

With Hive support enabled, Spark by default writes table data to the Hive warehouse location, which is /user/hive/warehouse when you use a Hive cluster; when running locally, it creates the warehouse in the current directory. You can change this behavior with the spark.sql.warehouse.dir configuration while creating the SparkSession.
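
For example, here is a minimal sketch of setting a custom warehouse location while building the session (the path below is just a placeholder):


import org.apache.spark.sql.SparkSession

// Point the Hive warehouse to a custom location (placeholder path)
val spark = SparkSession.builder()
    .master("local[*]")
    .appName("SparkCreateTableExample")
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
    .enableHiveSupport()
    .getOrCreate()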

Since we are running it locally from IntelliJ, Spark creates a metadata database (metastore_db) and a spark-warehouse directory under the current directory. As shown below, spark-warehouse holds the database (ct) and the table (sampletable) registered in the Hive metastore as an internal table.

(Screenshot: the spark-warehouse directory containing the ct database and the sampletable table)

1.2. Create Table using Spark DataFrame saveAsTable()

Use the saveAsTable() method of DataFrameWriter to create a Hive table from a Spark or PySpark DataFrame. We can use it to write into a new or an existing table. Pass the table name you want to save as an argument to this function, and make sure the table name is in the form database.tablename.

1.2.1 Create Internal Table from Spark

Note that by default this method creates an internal table. Internal tables, also known as managed tables, are owned and managed by Hive: Hive owns both the table structure and the underlying files. When you drop an internal table, both the data and the table metadata are dropped.

Let’s use the same DataFrame that we used above to create a Hive table.


import org.apache.spark.sql.SaveMode

// Create a Hive internal (managed) table
sampleDF.write.mode(SaveMode.Overwrite)
        .saveAsTable("ct.sampletable2")

1.2.2 Create External Table

To create an external table, specify a path of your choice using option(). The data in an external table is not owned or managed by Hive: dropping an external table drops only the metadata, not the actual data, which remains accessible outside of Hive.


// Create Hive External table
sampleDF.write.mode(SaveMode.Overwrite)
        .option("path", "/path/to/external/table")
        .saveAsTable("ct.sampletable2")

Both saveAsTable() examples above take the DataFrame and create the ct.sampletable2 table; if the table already exists, it is overwritten. Several SaveMode options are available for the saveAsTable() method.

1.2.3 Save Modes

  • Overwrite: overwrites the existing data of the table.
  • Append: appends the incoming data to the existing data of the table.
  • Ignore: mostly used when saving a DataFrame to a data source; if the data already exists, the save operation neither writes the contents of the DataFrame nor changes the existing data.
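
For example, here is a short sketch of appending additional rows to the table created above (the moreAnimals DataFrame is a made-up example; note that if no mode is specified, saveAsTable() fails when the table already exists):


// Append new rows to ct.sampletable2 (creates the table if it does not exist)
// Requires the spark.implicits._ import from earlier for toDF()
val moreAnimals = Seq((4, "Elephant")).toDF("id", "animal")
moreAnimals.write.mode(SaveMode.Append).saveAsTable("ct.sampletable2")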

1.2.4 List of Tables

To get the list of tables, use the following method.


// List tables
spark.catalog.listTables()
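
listTables() returns a Dataset of table metadata, so you can display it or scope it to a single database. A small usage sketch, assuming the ct database created earlier:


// List only the tables in the ct database and print them
spark.catalog.listTables("ct").show(truncate = false)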

2. Create a Table in Databricks

By default, all tables created in Databricks are Delta tables, with the underlying data stored in Parquet format.

Let us see how to create a Spark or PySpark table in Databricks and look at its properties. First, create a SQL notebook in Databricks and add the command below to a cell.


%sql
-- Creates a Delta table in the default Spark SQL warehouse directory
CREATE DATABASE IF NOT EXISTS sparkExamples;
CREATE TABLE IF NOT EXISTS sparkExamples.sampleTable (id INT, FirstName STRING, LastName STRING);

-- Insert into the table
INSERT INTO sparkExamples.sampleTable VALUES (1, "John", "Smith");

-- Select from the table
SELECT * FROM sparkExamples.sampleTable;

-- Output:
-- id | FirstName | LastName
--  1 | John      | Smith

Above, we created a managed Spark table (sparkExamples.sampleTable) and inserted a few records into it. Now let us look at the table properties using the DESCRIBE command.


%sql
-- Describe the table
DESCRIBE EXTENDED sparkExamples.sampleTable;

Output:

(Screenshot: DESCRIBE EXTENDED output showing the table properties)

In the table properties, we see the type of the Spark table as “MANAGED”, and the table data is stored in the default Spark SQL warehouse path, i.e. /user/hive/warehouse/sparkexamples.db/sampletable.

We can also create external tables in Hive and Databricks by passing a table LOCATION while creating the table, as shown in the sketch below.
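
As a hypothetical sketch (the table name, columns, and path are placeholders, and the file format should match your environment), an external table can be declared by supplying a LOCATION, for example via spark.sql():


// Hypothetical example: an external (unmanaged) table backed by files at the given path
spark.sql("""
  CREATE TABLE IF NOT EXISTS sparkExamples.externalTable (id INT, FirstName STRING, LastName STRING)
  USING PARQUET
  LOCATION '/path/to/external/table'
""")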

For more information on types of tables see: Spark Types of Tables and Views

3. Conclusion

In this article, you have learned that with Apache Spark or PySpark we can create tables in Hive, Databricks, and many external storage systems. While creating a table, we can also specify whether Spark should manage only the table metadata or both the metadata and the data (by creating an external or an internal table), which helps improve data security and limits the resources to manage. Each of the methods shown above for creating a table has its own advantages; choosing among them depends purely on the application requirements.

