Since Spark 2.0 SparkSession has become an entry point to PySpark to work with RDD, and DataFrame. Prior to 2.0, SparkContext used to be an entry point. Here, I will mainly focus on explaining what is SparkSession by defining and describing how to create SparkSession and using default SparkSession
spark variable from
SparkSession was introduced in version 2.0, It is an entry point to underlying PySpark functionality in order to programmatically create PySpark RDD, DataFrame. It’s object
spark is default available in pyspark-shell and it can be created programmatically using SparkSession.
With Spark 2.0 a new class SparkSession (
pyspark.sql import SparkSession) has been introduced. SparkSession is a combined class for all different contexts we used to have prior to 2.0 release (SQLContext and HiveContext e.t.c). Since 2.0 SparkSession can be used in replace with SQLContext, HiveContext, and other contexts defined prior to 2.0.
As mentioned in the beginning SparkSession is an entry point to PySpark and creating a SparkSession instance would be the first statement you would write to program with RDD, DataFrame, and Dataset. SparkSession will be created using
SparkSession.builder builder patterns.
Though SparkContext used to be an entry point prior to 2.0, It is not completely replaced with SparkSession, many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates SparkConfig and SparkContext with the configuration provided with SparkSession.
SparkSession also includes all the APIs available in different contexts –
How many SparkSessions can you create in a PySpark application?
You can create as many
SparkSession as you want in a PySpark application using either
SparkSession.newSession(). Many Spark session objects are required when you wanted to keep PySpark tables (relational entities) logically separated.
2. SparkSession in PySpark shell
Be default PySpark shell provides “
spark” object; which is an instance of SparkSession class. We can directly use this object where required in spark-shell. Start your “
pyspark” shell from
$SPARK_HOME\bin folder and enter the
Once you are in the PySpark shell enter the below command to get the PySpark version.
# Usage of spark object in PySpark shell >>>spark.version 3.1.2
Similar to the PySpark shell, in most of the tools, the environment itself creates a default SparkSession object for us to use so you don’t have to worry about creating a SparkSession object.
3. Create SparkSession
In order to create SparkSession programmatically (in .py file) in PySpark, you need to use the builder pattern method
builder() as explained below.
getOrCreate() method returns an already existing SparkSession; if not exists, it creates a new SparkSession.
# Create SparkSession from builder import pyspark from pyspark.sql import SparkSession spark = SparkSession.builder.master("local") \ .appName('SparkByExamples.com') \ .getOrCreate()
master() – If you are running it on the cluster you need to use your master name as an argument to master(). usually, it would be either
mesos depends on your cluster setup.
local[x]when running in Standalone mode. x should be an integer value and should be greater than 0; this represents how many partitions it should create when using RDD, DataFrame, and Dataset. Ideally, x value should be the number of CPU cores you have.
appName() – Used to set your application name.
getOrCreate() – This returns a SparkSession object if already exists, and creates a new one if not exist.
Note: SparkSession object
spark is by default available in the PySpark shell.
4. Create Another SparkSession
You can also create a new SparkSession using
newSession() method. This uses the same app name, master as the existing session. Underlying SparkContext will be the same for both sessions as you can have only one context per PySpark application.
# Create new SparkSession spark2 = SparkSession.newSession print(spark2)
This always creates a new SparkSession object.
5. Get Existing SparkSession
You can get the existing SparkSession in PySpark using the
builder.getOrCreate(), for example.
# Get Existing SparkSession spark3 = SparkSession.builder.getOrCreate print(spark3)
6. Using Spark Config
If you wanted to set some configs to SparkSession, use the
# Usage of config() spark = SparkSession.builder \ .master("local") \ .appName("SparkByExamples.com") \ .config("spark.some.config.option", "config-value") \ .getOrCreate()
7. Create SparkSession with Hive Enable
In order to use Hive with PySpark, you need to enable it using the
# Enabling Hive to use in Spark spark = SparkSession.builder \ .master("local") \ .appName("SparkByExamples.com") \ .config("spark.sql.warehouse.dir", "<path>/spark-warehouse") \ .enableHiveSupport() \ .getOrCreate()
8. Using PySpark Configs
Once the SparkSession is created, you can add the spark configs during runtime or get all configs.
# Set Config spark.conf.set("spark.executor.memory", "5g") # Get a Spark Config partitions = spark.conf.get("spark.sql.shuffle.partitions") print(partitions)
9. Create PySpark DataFrame
SparkSession also provides several methods to create a Spark DataFrame and DataSet. The below example uses the
createDataFrame() method which takes a list of data.
# Create DataFrame df = spark.createDataFrame( [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)]) df.show() # Output #+-----+-----+ #| _1| _2| #+-----+-----+ #|Scala|25000| #|Spark|35000| #| PHP|21000| #+-----+-----+
4.3 Working with Spark SQL
Using SparkSession you can access PySpark/Spark SQL capabilities in PySpark. In order to use SQL features first, you need to create a temporary view in PySpark. Once you have a temporary view you can run any ANSI SQL queries using
# Spark SQL df.createOrReplaceTempView("sample_table") df2 = spark.sql("SELECT _1,_2 FROM sample_table") df2.show()
PySpark SQL temporary views are session-scoped and will not be available if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view using
4.4 Create Hive Table
As explained above SparkSession is used to create and query Hive tables. Note that in order to do this for testing you don’t need Hive to be installed.
saveAsTable() creates Hive managed table. Query the table using
# Create Hive table & query it. spark.table("sample_table").write.saveAsTable("sample_hive_table") df3 = spark.sql("SELECT _1,_2 FROM sample_hive_table") df3.show()
4.5 Working with Catalogs
To get the catalog metadata, PySpark Session exposes
catalog variable. Note that these methods
spark.catalog.listTables and returns the DataSet.
# Get metadata from the Catalog # List databases dbs = spark.catalog.listDatabases() print(dbs) # Output #[Database(name='default', description='default database', #locationUri='file:/Users/admin/.spyder-py3/spark-warehouse')] # List Tables tbls = spark.catalog.listTables() print(tbls) #Output #[Table(name='sample_hive_table', database='default', description=None, tableType='MANAGED', #isTemporary=False), Table(name='sample_hive_table1', database='default', description=None, #tableType='MANAGED', isTemporary=False), Table(name='sample_hive_table121', database='default', #description=None, tableType='MANAGED', isTemporary=False), Table(name='sample_table', database=None, #description=None, tableType='TEMPORARY', isTemporary=True)]
Notice the two tables we have created, Spark table is considered a temporary table and Hive table as managed table.
8. SparkSession Commonly Used Methods
version() – Returns the Spark version where your application is running, probably the Spark version your cluster is configured with.
getActiveSession() – returns an active Spark session.
readStream() – Returns an instance of
DataStreamReader class, this is used to read streaming data. that can be used to read streaming data into DataFrame.
sparkContext() – Returns a SparkContext.
sql() – Returns a DataFrame after executing the SQL mentioned.
sqlContext() – Returns SQLContext.
stop() – Stop the current SparkContext.
table() – Returns a DataFrame of a table or view.
udf() – Creates a PySpark UDF to use it on DataFrame, Dataset, and SQL.
In this PySpark article, you have learned SparkSession can be created using the builder() method and learned SparkSession is an entry point to PySpark, and creating a SparkSession instance would be the first statement you would write to program finally have learned some of the commonly used SparkSession methods.
- Spark Get the Current SparkContext Settings
- How to resolve NameError: Name ‘Spark’ is not Defined?
- How to resolve Spark Context ‘sc’ Not Defined?
- How to Import PySpark in Python Script
- PySpark “ImportError: No module named py4j.java_gateway” Error
Happy Learning !!