PySpark – What is SparkSession?

Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. In this article, I will focus on explaining what SparkSession is, how to create one, and how to use the default SparkSession spark variable available in the pyspark shell.

What is SparkSession

SparkSession was introduced in version 2.0. It is the entry point to underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames. Its object spark is available by default in the pyspark shell, and it can also be created programmatically using SparkSession.

1. SparkSession

With Spark 2.0, a new class SparkSession (from pyspark.sql import SparkSession) was introduced. SparkSession is a combined class for all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.

As mentioned in the beginning, SparkSession is the entry point to PySpark, and creating a SparkSession instance is the first statement you would write to program with RDDs and DataFrames. A SparkSession is created using the SparkSession.builder builder pattern.

Though SparkContext was the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to the SparkSession builder.
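
For example, given a SparkSession object spark (the default shell object described in section 2, or one created with the builder in section 3), the SparkContext it creates internally is available as an attribute:


# The SparkContext created internally by SparkSession is exposed as an attribute
print(spark.sparkContext)          # e.g. <SparkContext master=local[1] appName=SparkByExamples.com>
print(spark.sparkContext.appName)  # the app name the session was built with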

SparkSession also includes all the APIs available in different contexts –

  • SparkContext,
  • SQLContext,
  • StreamingContext,
  • HiveContext.

How many SparkSessions can you create in a PySpark application?

You can create as many SparkSession objects as you want in a PySpark application, using either SparkSession.builder or SparkSession.newSession(). Multiple Spark session objects are useful when you want to keep PySpark tables (relational entities) logically separated.
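
As a minimal sketch (the view name t and the variable names are just illustrative), temporary views created in one session are not visible from another:


# Temp views are session-scoped, so each SparkSession keeps its own set
from pyspark.sql import SparkSession

spark_a = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
spark_b = spark_a.newSession()

spark_a.createDataFrame([(1,)], ["id"]).createOrReplaceTempView("t")
print([tbl.name for tbl in spark_a.catalog.listTables()])  # includes 't'
print([tbl.name for tbl in spark_b.catalog.listTables()])  # 't' is not listed here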

2. SparkSession in PySpark shell

By default, the PySpark shell provides a spark object, which is an instance of the SparkSession class. We can use this object directly wherever required. Start the pyspark shell from the $SPARK_HOME/bin folder by entering the pyspark command.

Once you are in the PySpark shell, enter the below command to get the PySpark version.


# Usage of spark object in PySpark shell
>>> spark.version
'3.1.2'

Similar to the PySpark shell, most tools create a default SparkSession object for you, so you don't have to worry about creating one yourself.

3. Create SparkSession

To create a SparkSession programmatically (in a .py file) in PySpark, you use the builder pattern SparkSession.builder as explained below. The getOrCreate() method returns an existing SparkSession if one exists; otherwise, it creates a new one.


# Create SparkSession from builder
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

master() – If you are running on a cluster, you need to pass your master name as an argument to master(); usually it would be yarn or mesos, depending on your cluster setup (see the hypothetical YARN example after the bullet below).

  • Use local[x] when running in local (single-machine) mode. x should be an integer greater than 0; it represents the number of worker threads to use, which also determines how many partitions are created by default when working with RDDs and DataFrames. Ideally, x should be the number of CPU cores you have.
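
For instance, a hypothetical builder call for a YARN cluster (this requires an actual YARN cluster and Hadoop configuration; it is shown only to illustrate master()):


# Hypothetical: create a session against a YARN cluster instead of local mode
spark_cluster = SparkSession.builder \
    .master("yarn") \
    .appName("SparkByExamples.com") \
    .getOrCreate()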

appName() – Used to set your application name.

getOrCreate() – Returns an existing SparkSession if one exists; otherwise, it creates a new one.

Note: The SparkSession object spark is available by default in the PySpark shell.

4. Create Another SparkSession

You can also create a new SparkSession from an existing one using the newSession() method. It uses the same app name and master as the existing session. The underlying SparkContext will be the same for both sessions, as you can have only one SparkContext per PySpark application.


# Create new SparkSession
spark2 = spark.newSession()
print(spark2)

This always creates a new SparkSession object.
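
You can confirm that both sessions wrap the same SparkContext:


# Both sessions share one underlying SparkContext
print(spark.sparkContext is spark2.sparkContext)  # True - one context per application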

5. Get Existing SparkSession

You can get the existing SparkSession in PySpark using builder.getOrCreate(), for example:


# Get Existing SparkSession
spark3 = SparkSession.builder.getOrCreate()
print(spark3)
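
Since getOrCreate() hands back the already-active session rather than building a new one, spark3 above typically refers to the same object as the original spark session:


# getOrCreate() returns the existing active session
print(spark3 is spark)  # typically True when 'spark' is the active session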

6. Using Spark Config

If you want to set some configurations on the SparkSession, use the config() method.


# Usage of config()
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .config("spark.some.config.option", "config-value") \
      .getOrCreate()

7. Create SparkSession with Hive Enable

In order to use Hive with PySpark, you need to enable it using the enableHiveSupport() method.


# Enabling Hive to use in Spark
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .config("spark.sql.warehouse.dir", "<path>/spark-warehouse") \
      .enableHiveSupport() \
      .getOrCreate()

8. Using PySpark Configs

Once the SparkSession is created, you can set Spark configurations at runtime or retrieve existing ones.


# Set Config
spark.conf.set("spark.executor.memory", "5g")

# Get a Spark Config
partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(partitions)
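
To inspect everything the underlying SparkContext was started with, one option is to read its SparkConf (a minimal sketch):


# List all configurations the underlying SparkContext was created with
for key, value in spark.sparkContext.getConf().getAll():
    print(key, value)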

9. Create PySpark DataFrame

SparkSession also provides several methods to create a Spark DataFrame. The example below uses the createDataFrame() method, which takes a list of data.


# Create DataFrame
df = spark.createDataFrame(
    [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)])
df.show()

# Output
#+-----+-----+
#|   _1|   _2|
#+-----+-----+
#|Scala|25000|
#|Spark|35000|
#|  PHP|21000|
#+-----+-----+
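
The default column names _1 and _2 come from the tuple positions. If you want meaningful names, createDataFrame() also accepts a list of column names as the schema; the names below are just illustrative:


# Same data with explicit column names (illustrative names)
df_named = spark.createDataFrame(
    [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)],
    ["language", "fee"])
df_named.show()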

10. Working with Spark SQL

Using SparkSession you can access PySpark/Spark SQL capabilities. To use SQL features, you first need to create a temporary view. Once you have a temporary view, you can run ANSI SQL queries against it using the spark.sql() method.


# Spark SQL
df.createOrReplaceTempView("sample_table")
df2 = spark.sql("SELECT _1,_2 FROM sample_table")
df2.show()

PySpark SQL temporary views are session-scoped and will no longer be available once the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, create a global temporary view using createGlobalTempView().
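
A minimal sketch of a global temporary view (the view name is just an example); note that global temporary views live in the global_temp database and must be queried with that prefix:


# Global temporary view - shared across sessions of the same application
df.createGlobalTempView("sample_global_table")
spark.sql("SELECT _1, _2 FROM global_temp.sample_global_table").show()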

11. Create Hive Table

As explained above, SparkSession can be used to create and query Hive tables. Note that for testing you don't need Hive to be installed. saveAsTable() creates a Hive managed table, which you can then query using spark.sql().


# Create Hive table & query it.  
spark.table("sample_table").write.saveAsTable("sample_hive_table")
df3 = spark.sql("SELECT _1,_2 FROM sample_hive_table")
df3.show()

12. Working with Catalogs

To get catalog metadata, the PySpark session exposes the catalog attribute. The methods spark.catalog.listDatabases() and spark.catalog.listTables() return lists of Database and Table objects respectively.


# Get metadata from the Catalog
# List databases
dbs = spark.catalog.listDatabases()
print(dbs)

# Output
#[Database(name='default', description='default database', 
#locationUri='file:/Users/admin/.spyder-py3/spark-warehouse')]

# List Tables
tbls = spark.catalog.listTables()
print(tbls)

# Output
#[Table(name='sample_hive_table', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_hive_table1', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_hive_table121', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_table', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Notice the two kinds of tables we have created: the temporary view appears as a TEMPORARY table, while the Hive tables appear as MANAGED tables.

13. SparkSession Commonly Used Methods

version() – Returns the Spark version your application is running on, typically the Spark version your cluster is configured with.

createDataFrame() – Creates a DataFrame from a collection or an RDD.

getActiveSession() – Returns the currently active Spark session.

read() – Returns an instance of DataFrameReader class, this is used to read records from csv, parquet, avro, and more file formats into DataFrame.

readStream() – Returns an instance of the DataStreamReader class, which is used to read streaming data into a DataFrame.

sparkContext() – Returns a SparkContext.

sql() – Returns a DataFrame after executing the SQL mentioned.

sqlContext() – Returns SQLContext.

stop() – Stops the current SparkContext.

table() – Returns a DataFrame of a table or view.

udf() – Creates a PySpark UDF to use on DataFrames and in SQL.
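
A few of these in action, as a minimal sketch (it assumes the spark session and the sample_table view from earlier; the UDF name to_upper is just an example):


# getActiveSession - the session active in the current thread
print(SparkSession.getActiveSession())

# udf - register a Python function for use in SQL queries
spark.udf.register("to_upper", lambda s: s.upper() if s else None, "string")
spark.sql("SELECT to_upper('spark') AS upper_name").show()

# table() - returns a DataFrame for a table or view
spark.table("sample_table").show()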

14. FAQs on SparkSession

How to create SparkSession?

A SparkSession is created programmatically (in a .py file) in PySpark using the builder pattern SparkSession.builder, as shown below. The getOrCreate() method returns an existing SparkSession if one exists; otherwise, it creates a new one.
spark = SparkSession.builder.master("local[1]").appName('SparkByExamples.com').getOrCreate()

How many SparkSessions can I create?

You can create as many SparkSession objects as you want in a Spark application, using either SparkSession.builder or SparkSession.newSession(). Multiple Spark session objects are useful when you want to keep Spark tables (relational entities) logically separated.

How to stop SparkSession?

To stop a SparkSession in Apache Spark, use the stop() method of the SparkSession object. If you have spark as a SparkSession object, call spark.stop() to stop the session. Calling stop() when you're finished with your Spark application is important; it ensures that resources are properly released and the application terminates gracefully.
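
For example:


# Stop the session and release driver/executor resources
spark.stop()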

Do we need to stop SparkSession?

It is recommended to end the Spark session after finishing the Spark job in order for the JVMs to close and free the resources.

How do I know if my Spark session is active?

To check whether your SparkSession is active, you can use the SparkSession object's sparkContext attribute and check its isActive property. If you have spark as a SparkSession object, call spark.sparkContext.isActive; this returns True if the context is active, otherwise False.

15. Conclusion

In this PySpark article, you have learned that SparkSession is the entry point to PySpark, that creating a SparkSession instance is the first statement you write in a program, and how to create one using the builder pattern. You have also learned some of the commonly used SparkSession methods.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
