SparkSession vs SparkContext

SparkSession vs SparkContext – In earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to a Spark cluster. Since Spark 2.0, SparkSession has been introduced and has become the entry point for programming with DataFrames and Datasets.

Here, I will mainly focus on explaining the difference between SparkSession and SparkContext by defining each, describing how to create these two instances, and showing how to use them from spark-shell.

What is SparkContext

SparkContext is an entry point to Spark, defined in the org.apache.spark package since the 1.x versions, and is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Since Spark 2.0, most of the functionality (methods) available in SparkContext is also available in SparkSession. Its object sc is available by default in spark-shell, and it can be created programmatically using the SparkContext class.

What is SparkSession

SparkSession was introduced in version 2.0 and is an entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. Its object spark is available by default in spark-shell, and it can be created programmatically using the SparkSession builder pattern.

1. SparkContext

SparkContext has been available since the Spark 1.x versions and is the entry point to Spark when you want to program with Spark RDDs. Most of the operations/methods or functions we use in Spark come from SparkContext, for example accumulators, broadcast variables, parallelize, and more.
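As a quick illustration, here is a minimal sketch (assuming an existing SparkContext named sc, such as the one provided by spark-shell) of the kind of operations that come from SparkContext:


  // Create an RDD from a local collection
  val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Accumulator: a write-only shared variable updated by tasks
  val sumAcc = sc.longAccumulator("sum")
  numbersRdd.foreach(n => sumAcc.add(n))
  println("Sum via accumulator: " + sumAcc.value)

  // Broadcast variable: read-only data shared with all executors
  val factor = sc.broadcast(10)
  val scaledRdd = numbersRdd.map(_ * factor.value)
  println(scaledRdd.collect().mkString(", "))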

SparkContext in spark-shell

By default, the Spark shell provides the “sc” object, which is an instance of the SparkContext class. We can directly use this object wherever required.


  val rdd = sc.textFile("/src/main/resources/text/alice.txt")

Creating SparkContext from Scala program

When you program with Scala, PySpark, or Java, you first need to create a SparkConf instance by assigning an app name and setting the master using the SparkConf methods setAppName() and setMaster() respectively, and then pass the SparkConf object as an argument to the SparkContext constructor to create a SparkContext.


  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)

Once you have created a SparkContext object, use it to create Spark RDDs.


  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")

SparkContext complete program


package com.sparkbyexamples.spark.stackoverflow

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextOld extends App{

  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)
  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")

  sparkContext.setLogLevel("ERROR")

  println("First SparkContext:")
  println("APP Name :"+sparkContext.appName);
  println("Deploy Mode :"+sparkContext.deployMode);
  println("Master :"+sparkContext.master);
  // Only one active SparkContext is allowed per JVM, so stop the first one
  // before creating a second context.
  sparkContext.stop()

  val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
  val sparkContext2 = new SparkContext(conf2)

  println("Second SparkContext:")
  println("APP Name :"+sparkContext2.appName);
  println("Deploy Mode :"+sparkContext2.deployMode);
  println("Master :"+sparkContext2.master);
  
}

2. SparkSession

With Spark 2.0, a new class org.apache.spark.sql.SparkSession has been introduced; it is a combined class for all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.), hence SparkSession can be used in place of SQLContext and HiveContext.

As mentioned in the beginning, SparkSession is an entry point to Spark, and creating a SparkSession instance would be the first statement you write to program with RDDs, DataFrames, and Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.

SparkSession also includes all the APIs available in the different contexts (see the sketch after this list) –

  • Spark Context,
  • SQL Context,
  • Streaming Context,
  • Hive Context.
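
As a minimal sketch (assuming a SparkSession instance named spark, such as the one provided by spark-shell), the underlying SparkContext and SQLContext can be reached from the session itself, and SQL can be run directly on the session:


  // The older entry points are exposed as fields of the SparkSession
  val sc = spark.sparkContext        // SparkContext
  val sqlContext = spark.sqlContext  // SQLContext (kept for backward compatibility)

  // SQL that previously went through SQLContext/HiveContext runs on the session
  val df = spark.sql("SELECT 1 AS id")
  df.show()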

SparkSession in spark-shell

By default, the Spark shell provides the “spark” object, which is an instance of the SparkSession class. We can directly use this object wherever required.


scala> val sqlcontext = spark.sqlContext

Similar to the Spark shell, most tools and environments create a default SparkSession object for us to use.

Creating SparkSession from Scala program


    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
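
Once the SparkSession is created, it can be used to create DataFrames and Datasets; below is a minimal sketch (the column names and sample data are just illustrative):


    import spark.implicits._

    // Create a small DataFrame from a local collection
    val df = Seq(("Scala", 2004), ("Spark", 2014)).toDF("name", "year")
    df.show()

    // RDDs are still available through the session's SparkContext
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    println(rdd.count())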

Conclusion

In this Spark SparkSession vs SparkContext article, you have learned the differences between SparkSession and SparkContext, the versions in which each was introduced, and how to create each from the Spark shell and from a Scala program.


Happy Learning !!

