Spark – What is SparkSession Explained


Since Apache Spark 2.0, SparkSession has been the unified entry point to Spark for working with RDDs, DataFrames, and Datasets. Prior to 2.0, SparkContext was the entry point. In this article, I will focus on explaining what SparkSession is and how to create one. Note that spark-shell, the CLI for interacting with Spark, provides a default SparkSession in the spark variable, so you don't have to create one yourself when using spark-shell.

What is SparkSession

SparkSession was introduced in Spark 2.0 as the entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. The SparkSession object spark is available by default in spark-shell, and a SparkSession can also be created programmatically using the SparkSession builder pattern.

If you are looking for PySpark, please refer to how to create SparkSession in PySpark.

1. SparkSession in Spark 2.0

  • With Spark 2.0, a new class org.apache.spark.sql.SparkSession was introduced. It combines all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.), so SparkSession can be used in place of SQLContext, HiveContext, and the other contexts.
  • As mentioned in the beginning, SparkSession is the entry point to Spark, and creating a SparkSession instance is the first statement you would write in a program that works with RDDs, DataFrames, and Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.
  • Prior to Spark 2.0, SparkContext was the entry point, and it has not been completely replaced by SparkSession; many SparkContext features are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and SparkContext from the configuration provided to the builder.
  • Spark Session also includes all the APIs available in different contexts –
    • SparkContext
    • SQLContext
    • StreamingContext
    • HiveContext
How many SparkSessions can you create in an application?

You can create as many SparkSessions as you want in a Spark application, using either SparkSession.builder() or SparkSession.newSession(). Multiple SparkSession objects are useful when you want to keep Spark tables (relational entities) logically separated.

2. SparkSession in spark-shell

By default, the Spark shell provides a spark object, which is an instance of the SparkSession class. We can use this object directly whenever it is needed in spark-shell.


// Usage of spark variable
scala> spark.version

Similar to the Spark shell, most tools and notebook environments, including Azure Databricks, create a default SparkSession object for you, so you don't have to worry about creating a Spark session yourself.

3. Create SparkSession From Scala Program

To create a SparkSession in Scala or Python, use the builder pattern method builder() followed by getOrCreate(). If a SparkSession already exists, getOrCreate() returns it; otherwise, it creates a new SparkSession.


// Create SparkSession object
import org.apache.spark.sql.SparkSession
object SparkSessionTest extends App {
  val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate();
  println(spark)
  println("Spark Version : "+spark.version)
}

// Outputs
// org.apache.spark.sql.SparkSession@2fdf17dc
// Spark Version : 3.4.1

From the above code –

SparkSession.builder() – Returns a SparkSession.Builder instance, which is the builder for SparkSession. master(), appName(), and getOrCreate() are methods of SparkSession.Builder.

master() – Sets the master URL, which determines where the Spark application connects and runs (local, standalone cluster, Mesos, YARN), depending on the configuration.

  • Use local[x] when running on your local machine. x should be an integer greater than 0; it specifies how many worker threads Spark uses, which in turn determines the default parallelism (number of partitions) when creating RDDs, DataFrames, and Datasets. Ideally, x should be the number of CPU cores you have.
  • For a standalone cluster, use a URL such as spark://master:7077

appName() – Sets the name of the Spark application that is shown in the Spark web UI. If no application name is set, Spark assigns a randomly generated name.

getOrCreate() – Returns an existing SparkSession if one exists; otherwise, it creates a new one.

3.1 Get Existing SparkSession

You can get the existing SparkSession in Scala programmatically using the below example.


// Get existing SparkSession 
import org.apache.spark.sql.SparkSession
val spark2 = SparkSession.builder().getOrCreate()
print(spark2)

// Output:
// org.apache.spark.sql.SparkSession@2fdf17dc

Compare the hashes of the spark and spark2 objects. Since getOrCreate() returned the existing session, both objects have the same hash value.
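
Besides comparing the printed hashes, you can also check reference equality directly. A minimal sketch, assuming spark and spark2 from the examples above are in scope:


// Both variables point to the same SparkSession instance
println(spark eq spark2)   // true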

3.2 Create Another SparkSession

Sometimes you might be required to create multiple sessions, which you can easily achieve by using the newSession() method. The new session uses the same app name and master as the existing session, and the underlying SparkContext is shared between the two sessions, as you can have only one SparkContext per Spark application.


// Create a new SparkSession
val spark3 = spark.newSession()
print(spark3)

// Output:
// org.apache.spark.sql.SparkSession@692dba54

Compare this hash with the one from the previous example; it should be different.
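
Temporary views illustrate why separate sessions are useful: a temp view registered in one session is not visible from another. A minimal sketch, assuming the spark and spark3 sessions from above (the view name nums is just an example):


// Temp views are session-scoped: visible only in the session that created them
spark.range(3).createOrReplaceTempView("nums")
println(spark.catalog.tableExists("nums"))   // true
println(spark3.catalog.tableExists("nums"))  // false - spark3 is a separate session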

3.3 Setting Spark Configs

If you want to set configuration options on the SparkSession, use the config() method of the builder.


// Usage of config()
val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .config("spark.some.config.option", "config-value")
      .getOrCreate();

3.4 Create SparkSession with Hive Enable

In order to use Hive with Spark, you need to enable Hive support using the enableHiveSupport() method. Starting with Spark 2.0, SparkSession provides built-in support for Hive operations such as writing queries on Hive tables using HQL, accessing Hive UDFs, and reading data from Hive tables.


// Enabling Hive to use in Spark
val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .config("spark.sql.warehouse.dir", "<path>/spark-warehouse")
      .enableHiveSupport()
      .getOrCreate();

4. Other Usages of SparkSession

4.1 Set & Get All Spark Configs

Once the SparkSession is created, you can set Spark configs at runtime or retrieve all of them.


// Set Config
spark.conf.set("spark.sql.shuffle.partitions", "30")

// Get all Spark Configs
val configMap:Map[String, String] = spark.conf.getAll
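
To read a single config value back at runtime, use spark.conf.get(). A small sketch (the value shown simply mirrors what was set above):


// Read back a single config value
println(spark.conf.get("spark.sql.shuffle.partitions"))  // 30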

4.2 Create DataFrame

SparkSession also provides several methods to create Spark DataFrames and Datasets. The example below uses the createDataFrame() method, which takes a collection of data.


// Create DataFrame
val df = spark.createDataFrame(
    List(("Scala", 25000), ("Spark", 35000), ("PHP", 21000)))
df.show()

// Output:
// +-----+-----+
// |   _1|   _2|
// +-----+-----+
// |Scala|25000|
// |Spark|35000|
// |  PHP|21000|
// +-----+-----+
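
Since createDataFrame() on a list of tuples produces the default column names _1 and _2, you will often convert the collection with toDF() and supply names instead. A small sketch (the column names and the dfNamed variable are just illustrative), which requires importing the session's implicits:


// Name the columns explicitly using toDF() on a Scala collection
import spark.implicits._
val dfNamed = List(("Scala", 25000), ("Spark", 35000), ("PHP", 21000))
  .toDF("language", "fee")
dfNamed.show()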

4.3 Working with Spark SQL

Using SparkSession you can access the Spark SQL capabilities of Apache Spark. To use SQL features, you first need to create a temporary view of the DataFrame. Once you have a temporary view, you can run SQL queries on it using the spark.sql() method.


// Spark SQL
df.createOrReplaceTempView("sample_table")
val df2 = spark.sql("SELECT _1,_2 FROM sample_table")
df2.show()

Spark SQL temporary views are session-scoped and will not be available once the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, create a global temporary view using createGlobalTempView(), as shown below.
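
A global temporary view is registered under the global_temp database and can be queried from any session in the application. A minimal sketch (the view name is just an example):


// Global temp views live in the global_temp database and are shared across sessions
df.createGlobalTempView("sample_global_view")
spark.newSession().sql("SELECT _1, _2 FROM global_temp.sample_global_view").show()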

4.4 Create Hive Table

As explained above, SparkSession can also be used to create Hive tables and query them. Note that for testing purposes you don't need Hive to be installed. saveAsTable() creates a Hive managed table, which you can then query using spark.sql().


// Create Hive table & query it.  
spark.table("sample_table").write.saveAsTable("sample_hive_table")
val df3 = spark.sql("SELECT _1,_2 FROM sample_hive_table")
df3.show()

4.5 Working with Catalogs

To access catalog metadata, SparkSession exposes the catalog variable. Note that the methods spark.catalog.listDatabases and spark.catalog.listTables return Datasets.


// Get metadata from the Catalog
// List databases
val ds = spark.catalog.listDatabases
ds.show(false)

// Output:
// +-------+----------------+----------------------------+
// |name   |description     |locationUri                 |
// +-------+----------------+----------------------------+
// |default|default database|file:/<path>/spark-warehouse|
// +-------+----------------+----------------------------+

// List Tables
val ds2 = spark.catalog.listTables
ds2.show(false)

// Output:
// +-----------------+--------+-----------+---------+-----------+
// |name             |database|description|tableType|isTemporary|
// +-----------------+--------+-----------+---------+-----------+
// |sample_hive_table|default |null       |MANAGED  |false      |
// |sample_table     |null    |null       |TEMPORARY|true       |
// +-----------------+--------+-----------+---------+-----------+

Notice the two tables we have created so far: sample_table, created with createOrReplaceTempView(), is listed as a temporary view, while the Hive table is listed as a managed table.
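
You can also use the catalog to check whether a specific table or view exists, or to drop a temporary view once you no longer need it. A small sketch using the tables created above:


// Check for a table/view and drop the temporary view
println(spark.catalog.tableExists("sample_hive_table"))  // true
spark.catalog.dropTempView("sample_table")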

5. SparkSession Commonly Used Methods

version – Returns the Spark version your application is running on, typically the version your cluster is configured with.

catalog – Returns the catalog object to access metadata.

conf – Returns the RuntimeConfig object.

builder() – Used to create a new SparkSession; returns a SparkSession.Builder.

newSession() – Creates a new SparkSession.

range(n) – Returns a single column Dataset with LongType and column named id, containing elements in a range from 0 to n (exclusive) with step value 1. There are several variations of this function, refer to Spark documentation.

createDataFrame() – Creates a DataFrame from a collection or an RDD.

createDataset() – Creates a Dataset from a collection, DataFrame, or RDD.

emptyDataFrame() – Creates an empty DataFrame.

emptyDataset() – Creates an empty Dataset.

getActiveSession() – Returns an active Spark session for the current thread.

getDefaultSession() – Returns the default SparkSession that is returned by the builder.

implicits – A nested Scala object that provides implicit conversions and encoders (for example, for converting Scala collections to DataFrames and Datasets); access it with import spark.implicits._.

read() – Returns an instance of the DataFrameReader class, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.

readStream() – Returns an instance of the DataStreamReader class, used to read streaming data into a DataFrame.

sparkContext() – Returns a SparkContext.

sql(String sql) – Returns a DataFrame after executing the SQL mentioned.

sqlContext() – Returns SQLContext.

stop() – Stops the underlying SparkContext.

table() – Returns a DataFrame of a table or view.

udf() – Creates a Spark UDF to use it on DataFrame, Dataset, and SQL.
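
A few of the methods listed above in action. A minimal sketch, assuming the spark session created earlier is still in scope:


// A few commonly used SparkSession methods
val nums = spark.range(5)                  // single-column Dataset named "id": 0..4
nums.show()

println(spark.emptyDataFrame.count())      // 0
println(SparkSession.getActiveSession)     // Some(org.apache.spark.sql.SparkSession@...)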

6. FAQ’s on SparkSession

How to create SparkSession?

A SparkSession is created using SparkSession.builder().master("master-details").appName("app-name").getOrCreate(). Here, the getOrCreate() method returns the existing SparkSession if one exists; if not, it creates a new one.

How many SparkSessions can I create?

You can create as many SparkSessions as you want in a Spark application, using either SparkSession.builder() or SparkSession.newSession(). Multiple SparkSession objects are useful when you want to keep Spark tables (relational entities) logically separated.

How to stop SparkSession?

To stop a SparkSession in Apache Spark, use the stop() method of the SparkSession object. If you have spark as a SparkSession object, call spark.stop() to stop the session. Calling stop() when you are finished with your Spark application is important: it ensures that resources are properly released and the application terminates gracefully.
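
One common pattern is to stop the session in a finally block so it is released even if the job fails. A minimal sketch:


// Ensure the session is stopped even if the job throws an exception
try {
  spark.range(10).count()
} finally {
  spark.stop()
}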

How SparkSession is different from SparkContext?

SparkSession and SparkContext are two core components of Apache Spark. Though they sound similar, they serve different purposes and are used in different contexts within a Spark application.
SparkContext provides the connection to a Spark cluster and is responsible for coordinating and distributing the operations on that cluster. SparkContext is used for low-level RDD (Resilient Distributed Dataset) programming.
SparkSession was introduced in Spark 2.0 to provide a more convenient and unified API for working with structured data. It’s designed to work with DataFrames and Datasets, which provide more structured and optimized operations than RDDs.

Do we need to stop SparkSession?

It is recommended to stop the Spark session after finishing the Spark job so that the JVMs can shut down and free their resources.

How do I know if my Spark session is active?

To check whether your SparkSession is active, call SparkSession.getActiveSession, which returns the active session for the current thread, if any. You can also check whether the underlying SparkContext has been stopped via spark.sparkContext.isStopped, which returns true once the context is stopped.
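
A minimal sketch of both checks, assuming spark is the session created earlier:


// Check for an active session in the current thread, and whether the context is stopped
println(SparkSession.getActiveSession.isDefined)  // true while the session is active
println(spark.sparkContext.isStopped)             // false while the context is running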

7. Conclusion

In this article, you have learned what SparkSession is, its usage, how to create a SparkSession programmatically, and some of its commonly used methods. In summary:

  • SparkSession was introduced in Spark 2.0 as a unified API for working with structured data.
  • It combines SparkContext, SQLContext, and HiveContext. It’s designed to work with DataFrames and Datasets, which provide more structured and optimized operations than RDDs.
  • SparkSession natively supports SQL queries, structured streaming, and DataFrame-based machine learning APIs.
  • spark-shell, Databricks, and other tools provide the spark variable as a default SparkSession object.

Happy Learning !!


