PySpark AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);

While trying to create a Hive table from PySpark, I got the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);


from os.path import abspath
from pyspark.sql import SparkSession

# Resolve the warehouse directory to an absolute path
warehouse_location = abspath('spark-warehouse')

# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

columns = ["id", "name","age","gender"]

# Create DataFrame 
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
    (3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)

# Create temporary view
sampleDF.createOrReplaceTempView("sampleView")

# Create a Database CT
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

# Create a table named sampleTable under the ct database.
spark.sql("CREATE TABLE ct.sampleTable (id Int, name String, age Int, gender String)")

# Insert into sampleTable using the sampleView. 
spark.sql("INSERT INTO TABLE ct.sampleTable  SELECT * FROM sampleView")

# Let's view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()

Running the above PySpark application, I got the following error:

AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
‘CreateTable ct.sampleTable, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists

Why are we getting this error?

From the Spark documentation and https://stackoverflow.com/ I found the following explanation; I hope it helps you understand the error.

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

SessionState is the state separation layer between Spark SQL sessions, including SQL configuration, tables, functions, UDFs, SQL parser, and everything else that depends on a SQLConf.

SessionState is available as the sessionState property of a SparkSession.

Internally, SessionState clones the optional parent SessionState (if one is given when creating the SparkSession) or creates a new SessionState using BaseSessionStateBuilder, as determined by the spark.sql.catalogImplementation configuration property:

  • in-memory (default) for org.apache.spark.sql.internal.SessionStateBuilder
  • hive for org.apache.spark.sql.hive.HiveSessionStateBuilder
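You can verify which catalog implementation an active session ended up with by reading the property back. A minimal sketch (reading this static property via spark.conf.get works in the PySpark versions I am aware of, but treat that as an assumption):

# Check the catalog implementation of the current session.
# Expect "hive" when Hive support is enabled, "in-memory" otherwise.
print(spark.conf.get("spark.sql.catalogImplementation"))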

Solution

To use Hive, Spark must use the class org.apache.spark.sql.hive.HiveSessionStateBuilder, and according to the documentation this is done by setting the property spark.sql.catalogImplementation to hive when creating the SparkSession object.

As suggested by others, I also tried setting the spark.sql.catalogImplementation property to hive while creating the PySpark SparkSession, but the problem was not solved.


# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

# Also tried setting it at runtime; spark.sql.catalogImplementation is a
# static configuration, so this cannot change an already-created session
spark.conf.set("spark.sql.catalogImplementation", "hive")

Finally, passing the same property to spark-submit with --conf spark.sql.catalogImplementation=hive solved the problem.


spark-submit --deploy-mode cluster \
             --master yarn \
             --conf spark.sql.catalogImplementation=hive \
             .........
             ........

Or you can configure it for all Spark jobs by setting the property in the /etc/spark/conf/spark-defaults.conf file.
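For example, the entry would look like this (the path and layout assume a typical Hadoop-distribution install):

# /etc/spark/conf/spark-defaults.conf
spark.sql.catalogImplementation  hive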

I hope you are able to solve the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium