Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, describing how to create a SparkSession, and using the default SparkSession spark variable from the pyspark shell.
SparkSession was introduced in version 2.0. It is the entry point to the underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames. Its object spark is available by default in the pyspark shell, and it can also be created programmatically using SparkSession.
1. SparkSession
With Spark 2.0 a new class, SparkSession (pyspark.sql.SparkSession), was introduced. SparkSession combines all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
As mentioned in the beginning, SparkSession is the entry point to PySpark, and creating a SparkSession instance is the first statement you would write to program with RDDs and DataFrames. A SparkSession is created using the SparkSession.builder builder pattern.
Though SparkContext used to be the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext with the configuration provided to SparkSession (a short sketch after the list below illustrates this).
SparkSession also includes all the APIs available in the different contexts –
- SparkContext,
- SQLContext,
- StreamingContext,
- HiveContext.
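As a minimal sketch (assuming spark is an existing SparkSession, such as the one the pyspark shell provides), you can inspect the SparkContext and configuration the session created:
# Accessing the SparkContext and conf created by the session
sc = spark.sparkContext                      # SparkContext created by the SparkSession
print(sc.appName)                            # application name from the underlying SparkConf
print(sc.master)                             # master URL the context was started with
print(spark.conf.get("spark.app.name"))      # same value through the runtime conf API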
How many SparkSessions can you create in a PySpark application?
You can create as many SparkSession objects as you want in a PySpark application, using either SparkSession.builder or SparkSession.newSession(). Multiple session objects are useful when you want to keep PySpark tables (relational entities) logically separated.
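As a rough sketch of that separation (assuming a spark session already exists, e.g. the default one in the shell), a temporary view registered in one session is not visible in another:
# Sketch: temporary views are scoped to the session that created them
spark2 = spark.newSession()
spark.range(3).createOrReplaceTempView("nums")          # registered only in `spark`
print([t.name for t in spark.catalog.listTables()])     # includes 'nums'
print([t.name for t in spark2.catalog.listTables()])    # 'nums' is not listed here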
2. SparkSession in PySpark shell
By default the PySpark shell provides a "spark" object, which is an instance of the SparkSession class. We can use this object directly wherever it is required. Start the shell from the $SPARK_HOME/bin folder by entering the pyspark command.
Once you are in the PySpark shell, enter the command below to get the PySpark version.
# Usage of spark object in PySpark shell
>>> spark.version
'3.1.2'

Similar to the PySpark shell, most tools and notebook environments create a default SparkSession object for you, so you don't have to worry about creating one yourself.
3. Create SparkSession
To create a SparkSession programmatically (in a .py file), use the builder pattern via SparkSession.builder as shown below. The getOrCreate() method returns an existing SparkSession if one is already active; otherwise it creates a new one.
# Create SparkSession from builder
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('SparkByExamples.com') \
    .getOrCreate()
master() – If you are running on a cluster, pass your cluster's master URL as the argument to master(); usually it would be either yarn or mesos, depending on your cluster setup (see the sketch after this list).
- Use local[x] when running Spark locally. x should be an integer greater than 0 and represents the number of worker threads (roughly, CPU cores) to use; this also determines the default parallelism when working with RDDs and DataFrames. Ideally, x should be the number of CPU cores you have.
appName() – Used to set your application name.
getOrCreate() – Returns an existing SparkSession if one already exists; otherwise it creates a new one.
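For example, on a YARN cluster the same builder call might look like the following sketch (the exact master URL depends entirely on your cluster setup):
# Sketch: the same builder call when targeting a YARN cluster
spark = SparkSession.builder \
    .master("yarn") \
    .appName("SparkByExamples.com") \
    .getOrCreate()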
Note: the SparkSession object spark is available by default in the PySpark shell.
4. Create Another SparkSession
You can also create a new SparkSession using the newSession() method. It uses the same app name and master as the existing session, and the underlying SparkContext is shared by both sessions because you can have only one context per PySpark application.
# Create new SparkSession
spark2 = spark.newSession()
print(spark2)
This always creates a new SparkSession object.
5. Get Existing SparkSession
You can get the existing SparkSession in PySpark using builder.getOrCreate(), for example:
# Get Existing SparkSession
spark3 = SparkSession.builder.getOrCreate()
print(spark3)
6. Using Spark Config
If you want to set some configurations on the SparkSession, use the config() method.
# Usage of config()
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()
7. Create SparkSession with Hive Enabled
In order to use Hive with PySpark, you need to enable it using the enableHiveSupport() method.
# Enabling Hive to use in Spark
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", "<path>/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
8. Using PySpark Configs
Once the SparkSession is created, you can set Spark configurations at runtime or retrieve the existing ones.
# Set Config
spark.conf.set("spark.executor.memory", "5g")
# Get a Spark Config
partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(partitions)
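To list every configuration the application was started with, one option is to go through the underlying SparkContext's SparkConf; a minimal sketch follows. Keep in mind that launch-time settings such as spark.executor.memory generally only take effect if they are set before the application starts.
# Sketch: list every configuration the application was launched with
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)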
9. Create PySpark DataFrame
SparkSession also provides several methods to create a Spark DataFrame. The example below uses the createDataFrame() method, which takes a list of data.
# Create DataFrame
df = spark.createDataFrame(
    [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)])
df.show()
# Output
#+-----+-----+
#| _1| _2|
#+-----+-----+
#|Scala|25000|
#|Spark|35000|
#| PHP|21000|
#+-----+-----+
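Because no schema was supplied, the columns default to _1 and _2. A small variation of the same call (the column names here are just illustrative) assigns explicit names:
# Sketch: same data with explicit column names
columns = ["language", "fee"]
df = spark.createDataFrame(
    [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)], columns)
df.show()
# +--------+-----+
# |language|  fee|
# +--------+-----+
# |   Scala|25000|
# |   Spark|35000|
# |     PHP|21000|
# +--------+-----+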
10. Working with Spark SQL
Using SparkSession you can access Spark SQL capabilities in PySpark. To use SQL features, you first need to create a temporary view; once you have one, you can run ANSI SQL queries against it using the spark.sql() method.
# Spark SQL
df.createOrReplaceTempView("sample_table")
df2 = spark.sql("SELECT _1,_2 FROM sample_table")
df2.show()
PySpark SQL temporary views are session-scoped and will not be available after the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, create a global temporary view using createGlobalTempView().
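A minimal sketch of a global temporary view, reusing the df created above (the view name is illustrative); note that global views are registered under the global_temp database:
# Sketch: a global temporary view lives under the global_temp database
df.createGlobalTempView("sample_global_table")
spark.sql("SELECT _1, _2 FROM global_temp.sample_global_table").show()
# Still visible from another session of the same application
spark.newSession().sql("SELECT COUNT(*) FROM global_temp.sample_global_table").show()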
11. Create Hive Table
As explained above, SparkSession can be used to create and query Hive tables. Note that you don't need Hive installed to try this out. saveAsTable() creates a Hive managed table, which you can then query using spark.sql().
# Create Hive table & query it.
spark.table("sample_table").write.saveAsTable("sample_hive_table")
df3 = spark.sql("SELECT _1,_2 FROM sample_hive_table")
df3.show()
12. Working with Catalogs
To get the catalog metadata, the PySpark session exposes the catalog variable. Note that spark.catalog.listDatabases() and spark.catalog.listTables() return lists of Database and Table metadata objects.
# Get metadata from the Catalog
# List databases
dbs = spark.catalog.listDatabases()
print(dbs)
# Output
#[Database(name='default', description='default database',
#locationUri='file:/Users/admin/.spyder-py3/spark-warehouse')]
# List Tables
tbls = spark.catalog.listTables()
print(tbls)
#Output
#[Table(name='sample_hive_table', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_hive_table1', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_hive_table121', database='default', description=None, tableType='MANAGED', isTemporary=False),
# Table(name='sample_table', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
Notice the two tables we created: the Spark temporary view is listed as a temporary table, while the Hive table is listed as a managed table.
13. SparkSession Commonly Used Methods
- version – Returns the Spark version your application is running on, usually the version your cluster is configured with.
- createDataFrame() – Creates a DataFrame from a collection or an RDD.
- getActiveSession() – Returns the currently active SparkSession, if any.
- read – Returns a DataFrameReader instance, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.
- readStream – Returns a DataStreamReader instance, used to read streaming data into a DataFrame.
- sparkContext – Returns the underlying SparkContext.
- sql() – Returns a DataFrame after executing the given SQL statement.
- sqlContext – Returns the underlying SQLContext.
- stop() – Stops the underlying SparkContext.
- table() – Returns a DataFrame for the given table or view.
- udf – Used to register user-defined functions (UDFs) for use on DataFrames and in SQL.
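As a rough sketch of a few of these in action (the CSV path and the strlen UDF name below are illustrative, and spark / sample_table refer to the session and view created earlier):
# Sketch: a few of the methods above in action
print(spark.version)                                    # Spark version string
print(SparkSession.getActiveSession())                  # currently active session (or None)
df_csv = spark.read.csv("/tmp/data.csv", header=True)   # hypothetical path, uses DataFrameReader
tbl = spark.table("sample_table")                       # DataFrame over the temp view created earlier
spark.udf.register("strlen", lambda s: len(s), "int")   # register a SQL UDF (illustrative)
spark.sql("SELECT strlen(_1) FROM sample_table").show()
spark.stop()                                            # stops the underlying SparkContext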
14. Conclusion
In this PySpark article, you have learned that SparkSession is the entry point to PySpark, that creating a SparkSession instance is the first statement you write in a PySpark program, how to create one using SparkSession.builder, and some of the commonly used SparkSession methods.
Related Articles
- Spark Get the Current SparkContext Settings
- How to resolve NameError: Name ‘Spark’ is not Defined?
- How to resolve Spark Context ‘sc’ Not Defined?
- How to Import PySpark in Python Script
- PySpark “ImportError: No module named py4j.java_gateway” Error
- Spark – What is SparkSession Explained
- SparkSession vs SparkContext
- Spark – Create a SparkSession and SparkContext
- SparkSession vs SQLContext
- PySpark SparkContext Explained
Happy Learning !!