Spark SQL Read Hive Table

How do you read a Hive table into a Spark DataFrame? Spark SQL supports reading a Hive table into a DataFrame in two ways: the spark.read.table() method and the spark.sql() statement. Here, spark.read is an object of the DataFrameReader class.

In order to read a Hive table, you need to create a SparkSession with enableHiveSupport(). This method, available on org.apache.spark.sql.SparkSession.Builder, enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

Steps to Read Hive Table into Spark DataFrame

  • Step 1 – Add Spark Hive dependencies to the classpath
  • Step 2 – Create SparkSession with Hive enabled
  • Step 3 – Read Hive table into Spark DataFrame

1. Spark Hive Dependencies

To enable Hive support, you need the following dependencies in your Maven pom.xml file. If you are using sbt, add the equivalent dependencies shown after the Maven snippet.


<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.13</artifactId>
   <version>3.2.1</version>
   <scope>compile</scope>
</dependency>

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.13</artifactId>
   <version>3.2.1</version>
   <scope>compile</scope>
</dependency>

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_2.13</artifactId>
   <version>3.2.1</version>
</dependency>
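
For sbt, the equivalent entries in build.sbt would look like the following (assuming your project is built with Scala 2.13, so that %% resolves to the same _2.13 artifacts):


// build.sbt equivalent of the Maven dependencies above
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1",
  "org.apache.spark" %% "spark-sql"  % "3.2.1",
  "org.apache.spark" %% "spark-hive" % "3.2.1"
)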

2. Create Spark Session with Hive Enabled

Create a SparkSession with Hive support enabled. If you want to read from a remote Hive cluster, refer to How to connect Remote Hive Cluster from Spark.


import org.apache.spark.sql.SparkSession

//enableHiveSupport() -> enables sparkSession to connect with Hive
val spark = SparkSession.builder()
    .master("local[*]")
    .appName("SparkCreateTableExample")
    .enableHiveSupport()
    .getOrCreate()

When you use a Hive cluster, Spark reads the data from the default Hive warehouse location, which is /user/hive/warehouse. When running locally, it reads from the spark-warehouse directory under the current directory. You can change this behavior using the spark.sql.warehouse.dir configuration while creating the SparkSession.

To connect to a remote Hive cluster and metastore, use the following.


import org.apache.spark.sql.SparkSession

// Change the default warehouse location and metastore
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkCreateTableExample")
    .config("spark.sql.warehouse.dir", "/hive/warehouse/dir")
    .config("hive.metastore.uris", "thrift://remote-host:9083")
    .enableHiveSupport()
    .getOrCreate()

// Or set the same properties on a SparkConf before creating the session.
// Note: spark.sql.warehouse.dir is a static configuration and cannot be
// changed on an already-running SparkSession.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .set("hive.metastore.uris", "thrift://localhost:9083")
val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

3. Spark Read Hive Table into DataFrame

In my previous article, I created a Hive table from a Spark DataFrame. When running locally, the Hive files are created by default inside the spark-warehouse directory within the current directory.
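
If you don't have that table yet, here is a minimal sketch that creates one so the examples below are runnable; the column names and rows are illustrative assumptions, not the exact schema from the previous article:


// Minimal sketch: create an employee table to read back later.
// Column names and rows here are illustrative assumptions.
import spark.implicits._

val empDF = Seq((1, "James"), (2, "Ann"), (3, "Jeff")).toDF("id", "name")
empDF.write.mode("overwrite").saveAsTable("employee")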


Let’s read the employee Hive table into a Spark DataFrame.


import org.apache.spark.sql.SparkSession

// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkByExamples.com")
    .enableHiveSupport()
    .getOrCreate()

// Read Hive table
val df = spark.sql("SELECT * FROM employee")
df.show()

This displays the contents of the employee table.
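
Since df is a regular DataFrame, you can inspect it with the usual DataFrame API; for example:


// Inspect the DataFrame loaded from the Hive table
df.printSchema()      // column names and types from the Hive table
println(df.count())   // number of rows read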

4. Using spark.read.table()

Alternatively, you can read the table using the spark.read.table() method. Here, spark.read is an object of the DataFrameReader class.


// Read Hive table
val df = spark.read.table("employee")
df.show()
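
If the table lives in a database other than the default, you can qualify the table name with the database name; emp below is a hypothetical database name:


// Read a table from a specific database ("emp" is a hypothetical name)
val df2 = spark.read.table("emp.employee")
df2.show()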

5. Conclusion

In this article, you have learned how to read a Hive table into a Spark DataFrame by creating a SparkSession with enableHiveSupport(), along with the dependencies required to connect to Hive.

You can find the complete example at the GitHub Spark Hive Example project.


