PySpark – What is SparkSession?

Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create one, and how to use the default SparkSession spark variable from the pyspark shell.

What is SparkSession

SparkSession, introduced in version 2.0, is the entry point to the underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames. Its object spark is available by default in the pyspark shell, and it can also be created programmatically using the SparkSession builder.

SparkSession

With Spark 2.0 a new class, SparkSession (from pyspark.sql import SparkSession), was introduced. SparkSession is a combined class for all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.

As mentioned in the beginning, SparkSession is the entry point to PySpark, and creating a SparkSession instance is the first statement you would write in a program that works with RDDs and DataFrames. A SparkSession is created using the SparkSession.builder builder pattern.

Though SparkContext was the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to SparkSession.

SparkSession also includes all the APIs available in the different contexts –

  • SparkContext
  • SQLContext
  • StreamingContext
  • HiveContext

You can create as many SparkSession objects as you want, using either SparkSession.builder or the newSession() method on an existing session.

SparkSession in PySpark shell

By default, the PySpark shell provides a “spark” object, which is an instance of the SparkSession class. We can use this object directly wherever required. Start your pyspark shell from the $SPARK_HOME/bin folder and enter the statement below.


sqlcontext = spark.sqlContext
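
Once in the shell, you can use the default spark object directly; a minimal sketch (the range size of 5 is just an illustrative choice):

print(spark.version)       # Spark version of the running shell
spark.range(5).show()      # creates and displays a small DataFrame with a single id column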

Similar to the PySpark shell, most interactive tools create a default SparkSession object for you, so you don’t have to worry about creating one yourself.

Create SparkSession

In order to create a SparkSession programmatically (in a .py file) in PySpark, you need to use the builder pattern via SparkSession.builder, as explained below. The getOrCreate() method returns an already existing SparkSession; if one does not exist, it creates a new SparkSession.


from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

master() – If you are running on a cluster, you pass your cluster manager’s master URL as the argument to master(); usually this is yarn or mesos, depending on your cluster setup.

  • Use local[x] when running in local mode. x should be an integer greater than 0; it represents the number of worker threads to use, which also determines the default number of partitions for RDDs and DataFrames. Ideally, x should be the number of CPU cores you have. Both cases are sketched below.
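
A minimal sketch of the two common cases (the thread count of 3 is just an illustrative choice):

# Local mode with 3 worker threads
spark = SparkSession.builder.master("local[3]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# On a YARN cluster you would typically use:
# spark = SparkSession.builder.master("yarn") \
#                     .appName('SparkByExamples.com') \
#                     .getOrCreate()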

appName() – Used to set your application name; this name appears in the Spark web UI.

getOrCreate() – Returns the existing SparkSession if one exists; otherwise it creates a new one.
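
You can verify this behavior with a quick check; repeated calls return the same session object:

s1 = SparkSession.builder.getOrCreate()
s2 = SparkSession.builder.getOrCreate()
print(s1 is s2)   # True – the same SparkSession is returned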

Note: The SparkSession object “spark” is available by default in the PySpark shell.

You can also create a new SparkSession from an existing one using the newSession() method.


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sparkSession3 = spark.newSession()

This always creates a new SparkSession object. The new session has its own SQL configuration and registered temporary views and UDFs, but it shares the underlying SparkContext with the original session.
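
A quick check of that relationship:

print(sparkSession3 is spark)                             # False – a distinct session object
print(sparkSession3.sparkContext is spark.sparkContext)   # True – the SparkContext is shared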

SparkSession Commonly Used Methods


version – Returns the Spark version your application is running on, typically the Spark version your cluster is configured with.

createDataFrame() – Creates a DataFrame from a collection or an RDD.
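
A minimal sketch (the column names and data here are illustrative):

data = [("Java", 20000), ("Python", 100000)]
df = spark.createDataFrame(data, ["language", "users_count"])
df.show()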

getActiveSession() – Returns the currently active SparkSession, if any.

read – A property that returns an instance of the DataFrameReader class, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.
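
For example, reading a CSV file (the file path here is hypothetical):

df = spark.read.csv("/tmp/resources/sample.csv", header=True, inferSchema=True)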

readStream – A property that returns an instance of the DataStreamReader class, used to read streaming data into a DataFrame.

sparkContext – A property that returns the underlying SparkContext.

sql() – Returns a DataFrame after executing the given SQL statement.
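
A minimal sketch, registering a temporary view and querying it (the names here are illustrative):

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).createOrReplaceTempView("sample")
df2 = spark.sql("SELECT id, value FROM sample WHERE id > 1")
df2.show()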

sqlContext – A property that returns the underlying SQLContext.

stop() – Stops the current SparkSession along with its underlying SparkContext.

table() – Returns the specified table or view as a DataFrame.

udf – A property that returns a UDFRegistration, used to register Python functions as UDFs for use with DataFrames and SQL.
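
A minimal sketch of registering and using a UDF (the function name is illustrative):

spark.udf.register("to_upper", lambda s: s.upper() if s else None, "string")
spark.sql("SELECT to_upper('hello') AS upper_value").show()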

Conclusion

In this PySpark article, you have learned that SparkSession is the entry point to PySpark, that creating a SparkSession instance is the first statement you write in a program, that it can be created using SparkSession.builder, and you have seen some of its commonly used methods.

Happy Learning !!
