How to read a Hive table into a Spark DataFrame? Spark SQL supports reading a Hive table into a DataFrame in two ways: the spark.read.table() method and the spark.sql() statement. spark.read is an object of the DataFrameReader class.
In order to read a Hive table, you need to create a SparkSession with enableHiveSupport(). This method, available on the SparkSession builder, enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
Steps to Read Hive Table into Spark DataFrame
- Step 1 – Add Spark Hive dependencies to the classpath
- Step 2 – Create SparkSession with Hive enabled
- Step 3 – Read Hive table into Spark DataFrame
1. Spark Hive Dependencies
To enable Hive support, you need the following dependencies in your Maven pom.xml file. If you are using sbt, add the equivalent dependencies accordingly (a sample build.sbt snippet is shown after the Maven block).
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.2.1</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>3.2.1</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.13</artifactId>
  <version>3.2.1</version>
</dependency>
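If you are using sbt, the equivalent dependencies for build.sbt look like the following. This is a minimal sketch that assumes the same Spark 3.2.1 release and a Scala 2.13 build, matching the Maven coordinates above.
// build.sbt (assumes scalaVersion := "2.13.x" so that %% resolves the _2.13 artifacts)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1",
  "org.apache.spark" %% "spark-sql"  % "3.2.1",
  "org.apache.spark" %% "spark-hive" % "3.2.1"
)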
2. Create Spark Session with Hive Enabled
Create a SparkSession with Hive support enabled. In case you want to read from a remote Hive cluster, refer to How to connect Remote Hive Cluster from Spark.
import org.apache.spark.sql.SparkSession
//enableHiveSupport() -> enables sparkSession to connect with Hive
val spark = SparkSession.builder()
.master("local[*]")
.appName("SparkCreateTableExample")
.enableHiveSupport()
.getOrCreate()
Spark reads the data from the default Hive warehouse location, which is /user/hive/warehouse, when you use a Hive cluster. But when running locally, it reads from the spark-warehouse directory under the current directory. You can change this behavior using the spark.sql.warehouse.dir configuration while creating the SparkSession.
To connect to a remote Hive cluster and metastore, use the configuration below.
import org.apache.spark.sql.SparkSession
// Change default warehouse location and metastore
val spark = SparkSession.builder().master("local[*]")
.appName("SparkCreateTableExample")
.config("spark.sql.warehouse.dir", "/hive/warehouse/dir")
.config("hive.metastore.uris", "thrift://remote-host:9083")
.enableHiveSupport()
.getOrCreate()
// Alternatively, build a SparkConf up front and pass it to the builder.
// spark.sql.warehouse.dir is a static config, so it must be set before the session is created.
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .set("hive.metastore.uris", "thrift://localhost:9083")
val sparkWithConf = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
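Once the session is created, a quick sanity check is to list the databases and tables visible through the configured metastore. The snippet below is only a verification sketch and is not required for reading the table.
// Verify the metastore connection by listing databases and tables
spark.sql("SHOW DATABASES").show()
spark.catalog.listTables("default").show()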
3. Spark Read Hive Table into DataFrame
In my previous article, I created a Hive table from a Spark DataFrame as shown below. By default, the Hive files are created inside the spark-warehouse directory within the current directory.

Let’s read the employee Hive table into a Spark DataFrame.
import org.apache.spark.sql.SparkSession
// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
.appName("SparkByExamples.com")
.enableHiveSupport()
.getOrCreate()
// Read Hive table
val df = spark.sql("select * from employee")
df.show()
Yields the below output.

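Since spark.sql() accepts any Spark SQL/HiveQL statement, you can also filter or aggregate the Hive table at query time instead of reading it in full. The id column below is only illustrative; use whatever columns your employee table actually has.
// Push a filter into the query (column name is illustrative)
val filteredDf = spark.sql("SELECT * FROM employee WHERE id > 1")
filteredDf.printSchema()
filteredDf.show()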
4. Using spark.read.table()
Alternatively, you can read the table using the spark.read.table() method. Here, spark.read is an object of the DataFrameReader class.
// Read Hive table
val df = spark.read.table("employee")
df.show()
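If the table is not in the default database, the name passed to spark.read.table() can be qualified with the database name. The emp_db database below is just an example name.
// Read a table from a specific database (database name is illustrative)
val df2 = spark.read.table("emp_db.employee")
df2.printSchema()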
5. Conclusion
In this article, you have learned how to read a Hive table into a Spark DataFrame by creating a SparkSession with enableHiveSupport() and adding the dependencies required to connect to Hive.
You can find the complete example at Github Spark Hive Example project.
Related Articles
- How to Connect Spark to Remote Hive
- Spark Save DataFrame to Hive Table
- Spark SQL Create a Table
- Spark Types of Tables and Views
- Spark Drop, Delete, Truncate Differences
- Time Travel with Delta Tables in Databricks?
- Spark createOrReplaceTempView() Explained
- Difference in DENSE_RANK and ROW_NUMBER in Spark