While trying to create a Hive table from PySpark, I got the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
from os.path import abspath
from pyspark.sql import SparkSession
# warehouse_location
warehouse_location = abspath('spark-warehouse')
# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
columns = ["id", "name","age","gender"]
# Create DataFrame
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
(3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)
# Create temporary view
sampleDF.createOrReplaceTempView("sampleView")
# Create a Database CT
spark.sql("CREATE DATABASE IF NOT EXISTS ct")
# Create a Table naming as sampleTable under CT database.
spark.sql("CREATE TABLE ct.sampleTable (id Int, name String, age Int, gender String)")
# Insert into sampleTable using the sampleView.
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")
# Lets view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()
Running the above PySpark application, I got the following error.
AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
'CreateTable `ct`.`sampleTable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
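For reference, once Hive support is configured correctly (see the solution below), the same program runs and the final show() prints the four rows, something like:

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  1|   James| 30|     M|
|  2|     Ann| 40|     F|
|  3|    Jeff| 41|     M|
|  4|Jennifer| 20|     F|
+---+--------+---+------+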
Why are we getting this error?
From the Spark documentation and https://stackoverflow.com/ I found the following explanation; I hope it helps you understand the error.
SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.
SessionState is the state separation layer between Spark SQL sessions, including SQL configuration, tables, functions, UDFs, SQL parser, and everything else that depends on a SQLConf.
SessionState is available as the sessionState property of a SparkSession.
Internally, SessionState clones the optional parent SessionState (if given when creating the SparkSession) or creates a new SessionState using BaseSessionStateBuilder, as defined by the spark.sql.catalogImplementation configuration property:
- in-memory (default) for org.apache.spark.sql.internal.SessionStateBuilder
- hive for org.apache.spark.sql.hive.HiveSessionStateBuilder
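As a quick check, you can read this property back from an existing session to see which catalog implementation is active (a minimal sketch):

# Check which catalog implementation the current session is using.
# Prints "hive" when Hive support is enabled, otherwise "in-memory".
print(spark.conf.get("spark.sql.catalogImplementation"))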
Solution
To use Hive, Spark has to use the org.apache.spark.sql.hive.HiveSessionStateBuilder class, and according to the documentation this is done by setting the spark.sql.catalogImplementation configuration property to hive when creating a SparkSession object.
As suggested by others, I also tried setting the spark.sql.catalogImplementation property to hive while creating the PySpark SparkSession, but the problem was not solved.
# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()
# Also tried setting the property at runtime; note that
# spark.sql.catalogImplementation is a static configuration,
# so it cannot be changed on an already-running session.
spark.conf.set("spark.sql.catalogImplementation", "hive")
Finally, passing the same property to spark-submit with --conf spark.sql.catalogImplementation=hive solved the problem.
spark-submit --deploy-mode cluster \
  --master yarn \
  --conf spark.sql.catalogImplementation=hive \
  .........
Alternatively, you can configure it for all Spark jobs by setting the property in the /etc/spark/conf/spark-defaults.conf file.
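For reference, the corresponding entry in spark-defaults.conf is a single whitespace-separated key/value line (the exact file path can vary by distribution):

# /etc/spark/conf/spark-defaults.conf
spark.sql.catalogImplementation  hive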
I hope this helps you solve the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT).
Happy Learning !!
Related Articles
- PySpark Save DataFrame to Hive Table
- PySpark Read and Write SQL Server Table
- PySpark Read and Write MySQL Database Table
- PySpark Read JDBC Table to DataFrame
- PySpark SQL Read Hive Table
- PySpark createOrReplaceTempView() Explained
- What is PySpark DataFrame?
- PySpark partitionBy() – Write to Disk Example