How to Connect Spark to Remote Hive

When working with Spark and Hive, you often need to connect Spark to a remote Hive cluster. In my last article, I explained how to write a Spark DataFrame to a Hive table; now, let’s learn how to connect Spark to a remote Hive cluster instance and perform operations like creating and reading a table.

Steps to connect to a remote Hive cluster from Spark:

  • Step 1 – Have Spark Hive Dependencies
  • Step 2 – Identify the Hive metastore database connection details
  • Step 3 – Create SparkSession with Hive enabled
  • Step 4 – Create DataFrame and Save as a Hive table

Before you proceed, make sure you have the following in place.

1. Spark Hive Dependencies

Note: Apache Hive has a large number of dependencies, and these dependencies are not included in the default Spark distribution, so you need to provide them on the Spark classpath (for both the driver and the executors). These Hive dependencies must also be present on all of the worker nodes, as they need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

The snippet below lists the Maven dependencies you will need.


<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.13</artifactId>
   <version>3.2.1</version>
   <scope>compile</scope>
</dependency>

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.13</artifactId>
   <version>3.2.1</version>
   <scope>compile</scope>
</dependency>

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_2.13</artifactId>
   <version>3.2.1</version>
</dependency>
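
If you use sbt instead of Maven, the equivalent dependencies look like the sketch below; the versions are taken from the snippet above and must match your Spark build.


// sbt equivalent of the Maven dependencies above
// (versions are assumptions; match them to your Spark build)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1",
  "org.apache.spark" %% "spark-sql"  % "3.2.1",
  "org.apache.spark" %% "spark-hive" % "3.2.1"
)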

2. Spark Configurations to Connect to Remote Hive Cluster

Besides the above dependencies, you also need the following files in your $SPARK_HOME/conf directory or on the classpath.

  • hive-site.xml – Hive configurations like metastore details
  • core-site.xml – Used for Security configurations
  • hdfs-site.xml – Used for HDFS configurations

You can find hive-site.xml in $HIVE_HOME/conf and the other two in your Hadoop configuration directory.
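
For example, assuming a standard installation layout (the $HIVE_HOME and $HADOOP_HOME paths below are assumptions; adjust them to your environment), copying the files into Spark's conf directory looks like this.


# Copy the Hive and Hadoop client configs into Spark's conf directory
# (paths are assumptions; adjust to your installation)
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/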

If you don’t have hive-site.xml, you can create one with the minimum configuration below (containing the Hive metastore details).


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.1.190:9083</value>
  </property>
</configuration>

Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse.

3. Create Spark Session By Connecting to Remote Hive

To connect Spark to a remote Hive cluster, you first need to create a SparkSession with Hive support enabled.


import org.apache.spark.sql.SparkSession

// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkByExamples.com")
    .config("spark.sql.warehouse.dir","/users/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

The following are two important Hive properties you need to understand to connect Spark to a remote Hive database.

  • Hive Metastore – specified using the hive.metastore.uris property
  • Hive Warehouse – specified using the spark.sql.warehouse.dir property

Hive Metastore: The Hive Metastore stores the metadata about databases and tables. By default it uses the Derby database; you can change this to any RDBMS, such as MySQL or PostgreSQL. By default, the metastore database name is metastore_db.
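
For reference, switching the metastore backend to an external RDBMS is configured on the Hive side in hive-site.xml through the JDBC connection properties. The snippet below is a minimal sketch for MySQL; the host, database name, and credentials are placeholders, and the driver class assumes MySQL Connector/J 8.


<!-- Minimal sketch: Hive metastore backed by MySQL (all values are placeholders) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>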

Hive Warehouse: Hive stores table files by default at the /user/hive/warehouse location on the HDFS file system. You need to create this directory on HDFS before you use Hive. Under this location, you will find a directory for each database you create and subdirectories named after each table.
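
Once the session above is created, a quick sanity check is to list the databases visible through the remote metastore; if the connection is working, you should see the databases from the Hive cluster rather than an empty local catalog.


// Quick check: list databases from the remote Hive metastore
spark.sql("SHOW DATABASES").show()

// Equivalent check using the catalog API
spark.catalog.listDatabases().show()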

Alternatively, if you don’t have hive-site.xml, you can also create the SparkSession by pointing hive.metastore.uris to the remote Hive metastore. In the example below, change the IP address and port to match your Hive metastore server.


import org.apache.spark.sql.SparkSession

// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkByExamples.com")
    .config("hive.metastore.uris", "thrift://192.168.1.190:9083")
    .config("spark.sql.warehouse.dir","/users/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

Spark SQL connects to the remote Hive metastore using Thrift, so we need to provide the Thrift server URI when creating the Spark session.

4. Save DataFrame to Remote Hive Table

Make sure you have the Hive services running. If not, first start HiveServer2.


$HIVE_HOME/bin/hiveserver2

Next, start the Hive metastore service using the command below.


$HIVE_HOME/bin/hive --service metastore
...
Starting Hive Metastore Server
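
Optionally, before running the Spark job you can confirm that the metastore Thrift port is reachable from the machine where Spark runs. The check below uses the IP and default port from the earlier example and assumes nc (netcat) is installed.


# Verify the metastore Thrift port (default 9083) is reachable
nc -z 192.168.1.190 9083 && echo "metastore reachable"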

Next, run the following Spark program, which connects to the remote Hive metastore, creates the database emp, and finally saves the Spark DataFrame to a Hive table.


import org.apache.spark.sql.{SaveMode, SparkSession}

// Create Spark Session with Hive enabled
val spark = SparkSession.builder().master("local[*]")
    .appName("SparkByExamples.com")
    .config("hive.metastore.uris", "thrift://192.168.1.190:9083")
    .config("spark.sql.warehouse.dir","/users/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

import spark.implicits._

// Create DataFrame
val sampleDF = Seq((1, "James",30,"M"),
    (2, "Ann",40,"F"), (3, "Jeff",41,"M"),
    (4, "Jennifer",20,"F")
    ).toDF("id", "name","age","gender")

// Create Database
spark.sql("CREATE DATABASE IF NOT EXISTS emp")

// Create Hive Internal table
sampleDF.write.mode(SaveMode.Overwrite)
    .saveAsTable("emp.employee")

5. Read the Remote Hive Table

By using spark.sql(), you can read the Hive table into a Spark DataFrame.


// Read Hive table
val df = spark.sql("select * from emp.employee")
df.show()

Running df.show() displays the contents of the emp.employee table with the id, name, age, and gender columns.
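
You can also read the same table through the DataFrame reader API, which is equivalent to the SQL query above.


// Equivalent read using the table() API
val df2 = spark.read.table("emp.employee")
df2.show()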

6. Connect to Remote Hive From Shell

To connect to the remote Hive cluster from spark-shell, use the command below.


$SPARK_HOME/bin/spark-shell \
  --files hive-site.xml,core-site.xml,hdfs-site.xml \
  --conf spark.sql.warehouse.dir=hdfs://namenode-name:9000/user/hive/warehouse
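
Similarly, to run a packaged application against the remote Hive cluster, you can pass the same configuration files to spark-submit. The class name and JAR below are placeholders.


# Submit a packaged job with the Hive/Hadoop client configs
# (class name and JAR are placeholders)
$SPARK_HOME/bin/spark-submit \
  --class com.sparkbyexamples.hive.SparkHiveExample \
  --files hive-site.xml,core-site.xml,hdfs-site.xml \
  --conf spark.sql.warehouse.dir=hdfs://namenode-name:9000/user/hive/warehouse \
  spark-hive-example.jar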

Conclusion

In this article, you have learned how to connect a Spark application to a remote Hive cluster. You also learned what the metastore URI and warehouse location are, how to create a SparkSession with Hive enabled, and how to read a Hive table from a remote Hive database.

