Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create one, and how to use the default SparkSession variable spark.
SparkSession was introduced in version 2.0. It is the entry point to the underlying PySpark functionality and is used to programmatically create PySpark RDDs and DataFrames. Its object spark is available by default in the pyspark shell, and it can also be created programmatically using SparkSession.
With Spark 2.0, a new class SparkSession (from pyspark.sql import SparkSession) was introduced. SparkSession combines the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
As mentioned in the beginning, SparkSession is the entry point to PySpark, and creating a SparkSession instance is the first statement you write to program with RDDs and DataFrames. A SparkSession is created using the SparkSession.builder builder pattern.
Though SparkContext was the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to SparkSession.
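To make that relationship concrete, here is a minimal sketch that reads the SparkContext and its configuration back from a session. It assumes an existing SparkSession is passed in (for example, the default spark object in the pyspark shell); the helper name describe_session is hypothetical and not part of PySpark.

```python
def describe_session(spark):
    """Summarize a session and the SparkContext it wraps.

    `spark` is assumed to be an existing SparkSession; the function
    name is illustrative only.
    """
    sc = spark.sparkContext   # the SparkContext behind the session
    conf = sc.getConf()       # a SparkConf holding the context's settings
    return {
        "version": spark.version,
        "app_name": sc.appName,
        "master": sc.master,
        "conf_app_name": conf.get("spark.app.name"),
    }
```

In the pyspark shell, calling describe_session(spark) should return a small dictionary in which app_name and conf_app_name agree, since both come from the same underlying context.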
SparkSession also includes the APIs available in the different contexts –
- Spark Context,
- SQL Context,
- Streaming Context,
- Hive Context.
You can create as many SparkSession objects as you want using either SparkSession.builder or newSession().
SparkSession in PySpark shell
By default, the PySpark shell provides the “spark” object, which is an instance of the SparkSession class. We can use this object directly wherever required in the shell. Start your “pyspark” shell from the $SPARK_HOME/bin folder and enter the below statement.
sqlcontext = spark.sqlContext
As with the PySpark shell, in most tools the environment itself creates a default SparkSession object for us to use, so you don’t have to worry about creating one.
In order to create a SparkSession programmatically (in a .py file) in PySpark, you need to use the builder pattern through SparkSession.builder, as explained below.
The getOrCreate() method returns an already existing SparkSession; if none exists, it creates a new one.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local") \
    .appName('SparkByExamples.com') \
    .getOrCreate()
master() – If you are running on a cluster, you need to pass your master name as an argument to master(). Usually, this is the name of your cluster manager, such as mesos, depending on your cluster setup.
Use local[x] when running locally without a cluster. x should be an integer greater than 0; it sets the number of worker threads Spark runs with, which in turn determines the default number of partitions when using RDDs and DataFrames. Ideally, x should be the number of CPU cores you have.
appName() – Used to set your application name.
getOrCreate() – Returns the existing SparkSession object if there is one; otherwise creates a new one.
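The master(), appName(), and getOrCreate() calls described above can be sketched as a small helper. This is only one possible arrangement: local_master and build_session are hypothetical names, the thread count defaults to the number of CPU cores as suggested above, and PySpark is imported inside build_session so that the URL helper can be used on its own.

```python
import os


def local_master(threads=None):
    """Return a 'local[N]' master URL; N defaults to the CPU core count."""
    if threads is None:
        threads = os.cpu_count() or 1
    if threads < 1:
        raise ValueError("threads must be greater than 0")
    return f"local[{threads}]"


def build_session(app_name, threads=None):
    """Create (or reuse) a local SparkSession with the given app name."""
    # Imported here so local_master() works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)
        .getOrCreate()
    )
```

For example, build_session("SparkByExamples.com", threads=2) would request a local session with two worker threads. If a session already exists in the process, getOrCreate() returns that one instead of creating a new one.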
Note: SparkSession object “spark” is by default available in PySpark shell.
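You can also create a new SparkSession using the newSession() method of an existing session.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sparkSession3 = spark.newSession()
This always creates a new SparkSession object, which shares the underlying SparkContext with the session it was created from.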
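As a rough illustration of what newSession() gives you, the following sketch compares a new session with the one it was created from. It assumes an existing SparkSession is passed in; compare_sessions is a hypothetical helper name, not a PySpark API.

```python
def compare_sessions(spark):
    """Create a sibling session and report what it shares with `spark`."""
    other = spark.newSession()
    return {
        # newSession() returns a separate SparkSession object ...
        "distinct_session": other is not spark,
        # ... that reuses the SparkContext of the session it came from.
        "shared_context": other.sparkContext is spark.sparkContext,
    }
```

With a real session, both values are expected to be True. Temporary views and registered UDFs, by contrast, are kept per session and are not visible from the sibling.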
SparkSession Commonly Used Methods
version – Returns the Spark version your application is running on, typically the version your cluster is configured with. Note that in PySpark, version, read, readStream, sparkContext, and udf are attributes, so they are accessed without parentheses.
getActiveSession() – Returns the active SparkSession for the current thread.
read – Returns an instance of the DataFrameReader class, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.
readStream – Returns an instance of the DataStreamReader class, used to read streaming data into a DataFrame.
sparkContext – Returns the underlying SparkContext.
sql() – Returns a DataFrame after executing the given SQL statement.
sqlContext() – Returns SQLContext.
stop() – Stops the underlying SparkContext.
table() – Returns a DataFrame for a table or view.
udf – Returns a UDFRegistration object for registering a PySpark UDF to use on DataFrames and in SQL.
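A few of these members can be combined in a short sketch. It assumes an existing SparkSession is passed in; the function name demo_common_members, the view name demo_view, and the UDF name shout are all made up for illustration.

```python
def demo_common_members(spark):
    """Exercise a few SparkSession members on an existing session."""
    # sql() runs a statement and returns the result as a DataFrame.
    df = spark.sql("SELECT 1 AS id, 'spark' AS name")
    # Expose the DataFrame under a (hypothetical) temporary view name.
    df.createOrReplaceTempView("demo_view")
    # udf gives access to UDF registration; the default return type is string.
    spark.udf.register("shout", lambda s: s.upper())
    # table() returns a DataFrame for the named table or view.
    loud = spark.table("demo_view").selectExpr("id", "shout(name) AS loud")
    return {"version": spark.version, "rows": loud.collect()}
```

In the pyspark shell you could call demo_common_members(spark) directly. Note that version and udf are read without parentheses, while sql() and table() are called as methods.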
In this PySpark article, you have learned that a SparkSession can be created using SparkSession.builder, that SparkSession is the entry point to PySpark and creating a SparkSession instance is the first statement you write in a program, and finally some of the commonly used SparkSession methods.
Happy Learning !!