
SparkSession vs SparkContext – In earlier versions of Spark or PySpark, SparkContext (JavaSparkContext for Java) was the entry point for Spark programming with RDDs and for connecting to a Spark cluster. Since Spark 2.0, SparkSession has been introduced and is the entry point for programming with DataFrames and Datasets.

Here, I will mainly focus on explaining the difference between SparkSession and SparkContext by defining each, describing how to create these two instances, and showing how to use them from spark-shell.

What is SparkContext

SparkContext is an entry point to Spark. It has been defined in the org.apache.spark package since 1.x and is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Since Spark 2.0, most of the functionality (methods) available in SparkContext is also available in SparkSession. Its object sc is available by default in spark-shell, and it can be created programmatically using the SparkContext class.

What is SparkSession

SparkSession was introduced in version 2.0 and is an entry point to underlying Spark functionality that lets you programmatically create Spark RDDs, DataFrames, and Datasets. Its object spark is available by default in spark-shell, and it can be created programmatically using the SparkSession builder pattern.

1. SparkContext

SparkContext has been available since the Spark 1.x versions, and it is the entry point to Spark when you want to program with and use Spark RDDs. Most of the operations/methods or functions we use in Spark come from SparkContext, for example accumulators, broadcast variables, parallelize, and more.
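
For example, here is a minimal sketch of these SparkContext features (assuming sc is an existing SparkContext, such as the one provided by spark-shell; the sample data is made up for illustration):


  // Create an RDD from a local collection
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
  // An accumulator shared across tasks and a read-only broadcast variable
  val sum = sc.longAccumulator("sum")
  val factor = sc.broadcast(10)
  // Use both inside an action
  numbers.foreach(n => sum.add(n * factor.value))
  println(sum.value)   // 150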

SparkContext in spark-shell

By default, the Spark shell provides an “sc” object, which is an instance of the SparkContext class. We can directly use this object where required.


  val rdd = sc.textFile("/src/main/resources/text/alice.txt")

Creating SparkContext from Scala program

When you program with Scala, PySpark, or Java, you first need to create a SparkConf instance, assigning the application name and master using the SparkConf methods setAppName() and setMaster() respectively, and then pass the SparkConf object as an argument to the SparkContext constructor to create a SparkContext.


  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)

Once you create a SparkContext object, use it to create a Spark RDD.


  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")
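
Once the RDD is available you can run actions on it; for example (a small hypothetical check, assuming the file exists at that path):


  println("Lines in file : " + rdd.count())
  println("First line    : " + rdd.first())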

SparkContext complete program


package com.sparkbyexamples.spark.stackoverflow

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextOld extends App{

  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)
  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")

  sparkContext.setLogLevel("ERROR")

  println("First SparkContext:")
  println("APP Name :"+sparkContext.appName);
  println("Deploy Mode :"+sparkContext.deployMode);
  println("Master :"+sparkContext.master);
  // Only one active SparkContext is allowed per JVM, so stop the first one
  // before creating a second.
  sparkContext.stop()

  val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
  val sparkContext2 = new SparkContext(conf2)

  println("Second SparkContext:")
  println("APP Name :"+sparkContext2.appName);
  println("Deploy Mode :"+sparkContext2.deployMode);
  println("Master :"+sparkContext2.master);
  
}

2. SparkSession

With Spark 2.0, a new class, org.apache.spark.sql.SparkSession, has been introduced. It is a combined class for all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.), hence SparkSession can be used in place of SQLContext and HiveContext.

As mentioned in the beginning, SparkSession is an entry point to Spark, and creating a SparkSession instance would be the first statement you write to program with RDDs, DataFrames, and Datasets. SparkSession is created using the SparkSession.builder() builder pattern.

SparkSession also includes all the APIs available in the different contexts (a short sketch of accessing them follows this list) –

  • Spark Context,
  • SQL Context,
  • Streaming Context,
  • Hive Context.
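
For example, once a SparkSession is available (the spark variable below is assumed to be an existing SparkSession instance), the older entry points can be reached through it instead of being created separately; a minimal sketch:


  // The underlying SparkContext and the legacy SQLContext are exposed on the session
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  // Hive support is enabled on the session itself, e.g. SparkSession.builder().enableHiveSupport()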

SparkSession in spark-shell

By default, the Spark shell provides a “spark” object, which is an instance of the SparkSession class. We can directly use this object where required.


scala> val sqlcontext = spark.sqlContext
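
Besides sqlContext, the default spark object also exposes the other entry points; for example, continuing in spark-shell:


scala> val sc = spark.sparkContext
scala> spark.version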

Similar to the Spark shell, most tools create a default SparkSession object in the environment for us to use.

Creating SparkSession from Scala program


    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
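
With this SparkSession instance you can start creating DataFrames; a minimal sketch (the column names and sample data below are made up for illustration):


    // Import the session's implicits to enable toDF() on local collections
    import spark.implicits._
    val df = Seq(("Scala", 1), ("Python", 2)).toDF("language", "rank")
    df.show()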

Conclusion

In this Spark SparkSession vs SparkContext article, you have learned the differences between SparkSession and SparkContext, the versions they were introduced in, and how to create each from spark-shell and a Scala program.

Happy Learning !!
