PySpark SQL Read Hive Table

| *** Please Subscribe for Ad Free & Premium Content ***

Post author:Naveen Nelamali
Post category:Apache Hive / PySpark
Post last modified:March 27, 2024
Reading time:6 mins read

You are currently viewing PySpark SQL Read Hive Table

How to read or query a Hive table into PySpark DataFrame? PySpark SQL supports reading a Hive table to DataFrame in two ways: the SparkSesseion.read.table() method and the SparkSession.sql() statement.

1. Create Spark Session with Hive Enabled

In order to read the hive table into pySpark DataFrame first, you need to create a SparkSession with Hive support enabled. In case you wanted to read from remove hive cluster refer to How to connect Remote Hive Cluster from Spark.


from os.path import abspath
from pyspark.sql import SparkSession

#enableHiveSupport() -> enables sparkSession to connect with Hive
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

PySpark reads the data from the default Hive warehouse location which is /user/hive/warehouse when you use a Hive cluster. But on local, it reads from the current directory. You can change this behavior using the spark.sql.warehouse.dir configuration while creating a SparkSession .

2. PySpark Read or Query Hive Table into DataFrame

In my previous article, I saved a Hive table from PySpark DataFrame, which created Hive files at the default location, which is inside the spark-warehouse directory within the current directory.

Let’s read the Hive table into PySpark DataFrame.


# Read Hive table
df = spark.sql("select * from emp.employee")
df.show()

Yields the below output.

3. Using spark.read.table()

Alternatively, you can also read by using spark.read.table() method. here, spark.read is an object of the class DataFrameReader.


# Read Hive table
df = spark.read.table("employee")
df.show()

4. PySpark Read Hive Table from Remote Hive


from os.path import abspath
from pyspark.sql import SparkSession

#enableHiveSupport() -> enables sparkSession to connect with Hive
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", "/hive/warehouse/dir") \
    .config("hive.metastore.uris", "thrift://remote-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# or Use the below approach
# Change using conf
spark.sparkContext().conf().set("spark.sql.warehouse.dir", "/user/hive/warehouse");
spark.sparkContext().conf().set("hive.metastore.uris", "thrift://localhost:9083");

5. Conclusion

In this article, you have learned how to read the Hive table into Spark DataFrame by creating SparkSession with enableHiveSupport() and using the dependencies required to connect to the Hive. Also, learned how to read by using SparkSesseion.read.table() method and the SparkSession.sql().

You can find the complete working example at GitHub PySpark Hive Example

Tags: pyspark-hive-examples

1. Create Spark Session with Hive Enabled

2. PySpark Read or Query Hive Table into DataFrame

3. Using spark.read.table()

4. PySpark Read Hive Table from Remote Hive

5. Conclusion

Related Articles