When you are working with Spark and Hive, you may need to connect Spark to a remote Hive cluster. In my last article, I explained how to write a Spark DataFrame to a Hive table; now, let's learn how to connect Spark to a remote Hive cluster instance and perform operations like creating and reading a table.
Steps to connect to a remote Hive cluster from Spark:
- Step 1 – Add Spark Hive dependencies
- Step 2 – Identify the Hive metastore database connection details
- Step 3 – Create SparkSession with Hive enabled
- Step 4 – Create a DataFrame and save it as a Hive table
Before you proceed, make sure you have the following set up.
1. Spark Hive Dependencies
Note: Apache Hive has a large number of dependencies, and these dependencies are not included in the default Spark distribution, hence you need to provide them on the Spark classpath (to the driver and executors). These Hive dependencies also need to be present on all of the worker nodes, as they need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
The below snippet has the Maven dependencies you would require.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.2.1</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>3.2.1</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.13</artifactId>
  <version>3.2.1</version>
</dependency>
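If your project is built with sbt rather than Maven, the equivalent dependency declarations would look roughly like the below sketch (assuming a Scala 2.13 build to match the _2.13 artifacts above).

// build.sbt – equivalent Spark Hive dependencies (sketch for an sbt build)
scalaVersion := "2.13.8"  // any binary-compatible 2.13.x version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1",
  "org.apache.spark" %% "spark-sql"  % "3.2.1",
  "org.apache.spark" %% "spark-hive" % "3.2.1"
)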
2. Spark Configurations to Connect to Remote Hive Cluster
Besides the above dependencies, you would also require the following files in your $SPARK_HOME/conf directory or on the classpath.
- hive-site.xml – Hive configurations like metastore details
- core-site.xml – Used for security configurations
- hdfs-site.xml – Used for HDFS configurations
You can find hive-site.xml at $HIVE_HOME/conf and the other two in your Hadoop conf directory. If you don't have hive-site.xml, you can create one with the below minimum config (with the Hive metastore details).
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.1.190:9083</value>
  </property>
</configuration>
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse.
3. Create Spark Session by Connecting to Remote Hive
In order to connect Spark to a remote Hive cluster, you first need to create the SparkSession with Hive enabled.
import org.apache.spark.sql.SparkSession
// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
.appName("SparkByExamples.com")
.config("spark.sql.warehouse.dir","/users/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
The following are two important properties you would need to understand to connect Spark to a remote Hive database.
- Hive Metastore – Specified by using the hive.metastore.uris property
- Hive Warehouse – Specified by using the spark.sql.warehouse.dir property
Hive Metastore: The Hive metastore stores the metadata about databases and tables. By default, it uses the Derby database; you can change this to any RDBMS, such as MySQL or PostgreSQL. By default, the metastore database name is metastore_db.
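For example, to back the metastore with MySQL instead of Derby, hive-site.xml would carry JDBC connection properties roughly like the below sketch; the host, database name, driver class, and credentials here are assumptions you must adjust for your environment.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.1.190:3306/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>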
Hive Warehouse: Hive stores table files by default at the /user/hive/warehouse location on the HDFS file system. You need to create this directory on HDFS before you use Hive. In this location, you will find a directory for every database you create and subdirectories named after each table.
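If the warehouse directory does not exist yet, a minimal sketch for creating it on HDFS (assuming the default /user/hive/warehouse path and a configured hdfs client):

# Create the Hive warehouse directory on HDFS and make it group-writable
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse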
Alternatively, if you don't have hive-site.xml, you can also create the SparkSession by pointing hive.metastore.uris to the remote Hive cluster metastore. In the below example, change the IP address and port according to your Hive metastore server.
import org.apache.spark.sql.SparkSession
// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
.appName("SparkByExamples.com")
.config("hive.metastore.uris", "thrift://192.168.1.190:9083")
.config("spark.sql.warehouse.dir","/users/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
Spark SQL connects to a remote Hive metastore using Thrift, so we need to provide the metastore Thrift server URI while creating the Spark session.
4. Save DataFrame to Remote Hive Table
Make sure you have the Hive services running; if not, first start HiveServer2.
$HIVE_HOME/bin/hiveserver2
Second, start the Hive metastore service by using the below command.
$HIVE_HOME/bin/hive --service metastore
...
Starting Hive Metastore Server
Next, run the following Spark program, which connects to the remote Hive metastore, creates the database emp, and finally saves the Spark DataFrame to a Hive table.
import org.apache.spark.sql.{SaveMode, SparkSession}
// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
.appName("SparkByExamples.com")
.config("hive.metastore.uris", "thrift://192.168.1.190:9083")
.config("spark.sql.warehouse.dir","/users/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
// Create DataFrame
val sampleDF = Seq((1, "James",30,"M"),
(2, "Ann",40,"F"), (3, "Jeff",41,"M"),
(4, "Jennifer",20,"F")
).toDF("id", "name","age","gender")
// Create Database
spark.sql("CREATE DATABASE IF NOT EXISTS emp")
// Create Hive Internal table
sampleDF.write.mode(SaveMode.Overwrite)
.saveAsTable("emp.employee")
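As a quick sanity check, you can list the tables Spark now sees in the emp database of the remote metastore:

// Verify the table was registered in the remote Hive metastore
spark.sql("SHOW TABLES IN emp").show()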
5. Read the Remote Hive Table
By using spark.sql(), you can read the Hive table into a Spark DataFrame.
// Read Hive table
val df = spark.sql("select * from emp.employee")
df.show()
Yields below output.
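Assuming the four sample rows above were written successfully, df.show() prints something similar to this (row order may vary):

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  1|   James| 30|     M|
|  2|     Ann| 40|     F|
|  3|    Jeff| 41|     M|
|  4|Jennifer| 20|     F|
+---+--------+---+------+

Alternatively, you can read the same table with the DataFrameReader table API instead of a SQL string:

// Equivalent read using the table API
val df2 = spark.read.table("emp.employee")
df2.show()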

6. Connect to Remote Hive From Shell
To connect to remote Hive from spark-shell, use the below.
$SPARK_HOME/bin/spark-shell \
--files hive-site.xml,core-site.xml,hdfs-site.xml \
--conf spark.sql.warehouse.dir=hdfs://namenode-name:9000/user/hive/warehouse
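Once the shell starts with these files available (and assuming your Spark build includes Hive support), you can query the remote tables directly, for example:

// Inside spark-shell: query the remote Hive metastore
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM emp.employee").show()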
Conclusion
In this article, you have learned how to connect a Spark application to a remote Hive cluster. You also learned what the metastore URI and warehouse location are, and finally how to create a Spark session with Hive enabled and read a Hive table from a remote Hive database.
Related Articles
- Spark Read Hive Table to DataFrame
- Spark Save DataFrame to Hive Table
- Spark SQL Create a Table
- Spark Types of Tables and Views
- Spark Drop, Delete, Truncate Differences
- Time Travel with Delta Tables in Databricks?
- Spark createOrReplaceTempView() Explained
- Difference in DENSE_RANK and ROW_NUMBER in Spark
- Spark Enable Hive Support