PySpark AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);

While trying to create a Hive table from PySpark, I was getting the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT); Below is the application that reproduces it.


from os.path import abspath
from pyspark.sql import SparkSession

# Warehouse directory where Spark stores managed table data
warehouse_location = abspath('spark-warehouse')

# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# Create a DataFrame with some sample data
columns = ["id", "name", "age", "gender"]
data = [(1, "James", 30, "M"), (2, "Ann", 40, "F"),
        (3, "Jeff", 41, "M"), (4, "Jennifer", 20, "F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)

# Create temporary view
sampleDF.createOrReplaceTempView("sampleView")

# Create a database named ct if it does not exist
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

# Create a table named sampleTable under the ct database
spark.sql("CREATE TABLE ct.sampleTable (id Int, name String, age Int, gender String)")

# Insert into sampleTable using the sampleView
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

# Let's view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()

Running the above PySpark application, I got the following error.

AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
‘CreateTable ct.sampleTable, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists

Why are we getting this error?

From the Spark documentation and https://stackoverflow.com/ I found the following explanation; I hope it helps you understand the error.

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

SessionState is the state separation layer between Spark SQL sessions, including SQL configuration, tables, functions, UDFs, SQL parser, and everything else that depends on a SQLConf.

SessionState is available as the sessionState property of a SparkSession.

Internally, SessionState clones the optional parent SessionState (if given when creating the SparkSession) or creates a new SessionState using the BaseSessionStateBuilder implementation selected by the spark.sql.catalogImplementation configuration property (in-memory by default, or hive).
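
A quick way to see which catalog implementation your session actually picked up is to read the property back from the running session. This is a minimal sketch, assuming an active SparkSession named spark:

# Read the catalog implementation back from the active session.
# "hive" means Hive support is enabled; "in-memory" means it is not,
# and Hive DDL such as CREATE TABLE will fail with this AnalysisException.
print(spark.conf.get("spark.sql.catalogImplementation"))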

Solution

To use Hive, Spark must use the org.apache.spark.sql.hive.HiveSessionStateBuilder class, and according to the documentation this is done by setting the spark.sql.catalogImplementation property to hive when creating the SparkSession object.

As suggested by others, I also tried setting the spark.sql.catalogImplementation property to hive while creating the PySpark SparkSession, but it did not solve the problem.


# Create spark session with hive enabled
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

# Also tried setting it on the existing session; this did not work either,
# since spark.sql.catalogImplementation is a static configuration that
# cannot be changed after the session is created
spark.conf.set("spark.sql.catalogImplementation", "hive")
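
One likely reason the builder settings appear to be ignored is that getOrCreate() reuses any SparkSession that is already running (for example, one started by a notebook or shell), and a static setting like the catalog implementation cannot be applied to an existing session. Below is a minimal sketch, assuming PySpark 3.0+ for getActiveSession() and that it is safe in your job to stop and recreate the session:

from pyspark.sql import SparkSession

# Stop any pre-existing session so the builder settings below actually
# take effect; static configs are fixed at session creation time
active = SparkSession.getActiveSession()
if active is not None:
    active.stop()

spark = SparkSession.builder \
    .appName("SparkByExamples.com") \
    .enableHiveSupport() \
    .getOrCreate()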

Finally, passing the same property to spark-submit with --conf spark.sql.catalogImplementation=hive solved the problem.


spark-submit --deploy-mode cluster \
             --master yarn \
             --conf spark.sql.catalogImplementation=hive \
             .........
             ........

Alternatively, you can configure it for all Spark jobs by setting the property in the /etc/spark/conf/spark-defaults.conf file, as shown below.
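
The entry in spark-defaults.conf is a plain whitespace-separated key/value pair; a minimal sketch of the relevant line:

# /etc/spark/conf/spark-defaults.conf
spark.sql.catalogImplementation  hive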

I hope this helps you solve the error AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);

Happy Learning !!
