SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and was the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext is the first step to using RDDs and connecting to a Spark cluster. In this article, you will learn how to create it with examples.
Since Spark 1.x, SparkContext has been an entry point to Spark and is defined in the org.apache.spark package. It is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
Note that you can create only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.

The Spark driver program creates and uses SparkContext to connect to the cluster manager, submit Spark jobs, and know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart of the Spark application.
For more internal details on SparkContext, refer to What does SparkContext do?
Related: How to get current SparkContext & its configurations in Spark
1. SparkContext in spark-shell
By default, the Spark shell provides the sc object, which is an instance of the SparkContext class. We can use this object directly wherever required.
// 'sc' is a SparkContext variable in spark-shell
scala> sc.appName
This yields the output below.
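In spark-shell the default application name is "Spark shell", so the result looks roughly like this (the exact REPL output can vary slightly by version):
// Example spark-shell output
res0: String = Spark shell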

Similar to the Spark shell, most tools, notebooks, and Azure Databricks create a default SparkContext object for you, so you don't have to worry about creating one yourself.
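In such environments you can typically use the pre-created variables directly, as in the minimal sketch below (the exact variable names are provided by the environment, commonly sc and spark):
// 'sc' and 'spark' are typically pre-created by the notebook environment
println(sc.appName)
println(spark.sparkContext.appName)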
2. Spark 2.X – Create SparkContext using Scala Program
Since Spark 2.0, we mostly use SparkSession, as most of the methods available in SparkContext are also present in SparkSession. SparkSession internally creates the Spark Context and exposes it through the sparkContext variable.
At any given time, only one SparkContext instance should be active per JVM. If you want to create another, you should stop the existing SparkContext (using stop()) before creating a new one.
// Imports
import org.apache.spark.sql.SparkSession

object SparkSessionTest extends App {

  // Create SparkSession object
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Access spark context
  println(spark.sparkContext)
  println("Spark App Name : " + spark.sparkContext.appName)
}
// Output:
//org.apache.spark.SparkContext@2fdf17dc
//Spark App Name : SparkByExamples.com
As I explained in the SparkSession article, you can create any number of SparkSession objects; however, all of them share a single underlying SparkContext.
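As a quick illustration (a minimal sketch reusing the spark session created above), a second session obtained with newSession() still points to the same SparkContext:
// Both sessions share the same underlying SparkContext
val spark2 = spark.newSession()
println(spark.sparkContext eq spark2.sparkContext)  // true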
3. Create RDD
Once you have a SparkContext object, you can use it to create Spark RDDs as shown below.
// Create RDD
val rdd = spark.sparkContext.range(1, 5)
rdd.collect().foreach(print)
// Create RDD from Text file
val rdd2 = spark.sparkContext.textFile("src/main/resources/text/alice.txt")
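You can also create an RDD from a local Scala collection with parallelize(), as in this short sketch reusing the spark session from above:
// Create RDD from a local collection
val rdd3 = spark.sparkContext.parallelize(Seq("spark", "scala", "rdd"))
println(rdd3.count())  // 3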
4. Stop SparkContext
You can stop the SparkContext by calling the stop() method. As explained above, you can have only one SparkContext per JVM; if you want to create another, you need to shut the existing one down first using the stop() method and then create a new SparkContext.
// SparkContext stop() method
spark.sparkContext.stop()
When Spark executes this statement, it logs the message INFO SparkContext: Successfully stopped SparkContext to the console or to a log file.
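Note that calling stop() on the SparkSession has the same effect, since it stops the underlying SparkContext:
// Stopping the SparkSession also stops the underlying SparkContext
spark.stop()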
5. Spark 1.X – Creating SparkContext using Scala Program
In Spark 1.x, you first need to create a SparkConf instance by setting the application name and the master using the SparkConf methods setAppName() and setMaster() respectively, and then pass the SparkConf object as an argument to the SparkContext constructor to create the Spark Context.
// Create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
// Create SparkConf object
val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
// Create Spark context (not recommended since 2.0)
val sparkContext = new SparkContext(sparkConf)
Constructing a SparkContext directly is discouraged since Spark 2.0; the recommendation is to use the static method getOrCreate(), which creates the SparkContext if one does not already exist and registers it as a singleton object.
// Create Spark Context
val sc = SparkContext.getOrCreate(sparkConf)
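Calling getOrCreate() again returns the already-registered instance rather than creating a new one, as in this small sketch reusing sparkConf from above:
// getOrCreate() returns the registered singleton SparkContext
val sc2 = SparkContext.getOrCreate(sparkConf)
println(sc eq sc2)  // true – same instance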
6. SparkContext Commonly Used Methods
The following are the most commonly used methods of SparkContext. For the complete list, refer to the Spark documentation; a short sketch exercising a few of them follows the list.
- longAccumulator() – Creates an accumulator variable of the long data type. Only the driver can read accumulator values.
- doubleAccumulator() – Creates an accumulator variable of the double data type. Only the driver can read accumulator values.
- applicationId – Returns a unique ID of the Spark application.
- appName – Returns the app name that was given when creating the SparkContext.
- broadcast – Creates a read-only variable that is broadcast to the entire cluster. You can broadcast a variable to a Spark cluster only once.
- emptyRDD – Creates an empty RDD.
- getPersistentRDDs – Returns all persisted RDDs.
- getOrCreate() – Creates or returns a SparkContext.
- hadoopFile – Returns an RDD of a Hadoop file.
- master() – Returns the master URL that was set when creating the SparkContext.
- newAPIHadoopFile – Creates an RDD for a Hadoop file with the new API InputFormat.
- sequenceFile – Gets an RDD for a Hadoop SequenceFile with the given key and value types.
- setLogLevel – Changes the log level (e.g. DEBUG, INFO, WARN, ERROR, FATAL).
- textFile – Reads a text file from HDFS, local, or any Hadoop-supported file system and returns an RDD.
- union – Builds the union of two or more RDDs.
- wholeTextFiles – Reads the text files in a folder from HDFS, local, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the file.
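Below is a short sketch exercising a few of these methods, assuming an active spark session as created earlier:
// Sketch: a few commonly used SparkContext methods
val ctx = spark.sparkContext
val acc = ctx.longAccumulator("counter")        // longAccumulator()
val names = ctx.broadcast(Seq("alice", "bob"))  // broadcast
val empty = ctx.emptyRDD[String]                // emptyRDD
ctx.parallelize(Seq(1L, 2L, 3L, 4L, 5L)).foreach(v => acc.add(v))
println(acc.value)                              // 15 – accumulator values are read on the driver
println("App Id : " + ctx.applicationId)        // applicationId
println("Master : " + ctx.master)               // master()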
7. SparkContext Example
// Complete example of SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample extends App {

  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)

  val rdd = sparkContext.textFile("src/main/resources/text/alice.txt")
  sparkContext.setLogLevel("ERROR")

  println("First SparkContext:")
  println("APP Name :" + sparkContext.appName)
  println("Deploy Mode :" + sparkContext.deployMode)
  println("Master :" + sparkContext.master)

  // Stop the first context before creating another one;
  // only one SparkContext can be active per JVM.
  sparkContext.stop()

  val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
  val sparkContext2 = new SparkContext(conf2)

  println("Second SparkContext:")
  println("APP Name :" + sparkContext2.appName)
  println("Deploy Mode :" + sparkContext2.deployMode)
  println("Master :" + sparkContext2.master)
}
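With the stop() call in place, the program prints output roughly like the following (in local mode the deploy mode is client):
// Output (abridged):
// First SparkContext:
// APP Name :sparkbyexamples.com
// Deploy Mode :client
// Master :local[1]
// Second SparkContext:
// APP Name :sparkbyexamples.com-2
// Deploy Mode :client
// Master :local[1]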
FAQs on SparkContext
What is SparkContext?
SparkContext has been the entry point to a Spark application since Spark 1.x. It is the central entry point and controller for Spark applications: it manages resources, coordinates tasks, and provides the necessary infrastructure for distributed data processing in Spark. It plays a vital role in ensuring the efficient and fault-tolerant execution of Spark jobs.
How do you create a SparkContext?
A SparkContext is created using the SparkContext class. By default, the Spark "driver" is the application that creates the SparkContext in order to execute the job or jobs on a cluster. You can access the Spark context from the Spark session object as spark.sparkContext. If you want to create a Spark context yourself, use the snippet below.
// Create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
val sparkContext = new SparkContext(sparkConf)
How do you stop a SparkContext?
Once you have finished using Spark, you can stop the SparkContext using the stop() method. This releases all resources associated with the SparkContext and shuts down the Spark application gracefully.
Can I have multiple SparkContexts in a single application?
There can be only one active SparkContext per JVM. Having multiple SparkContext instances in a single application can cause issues such as resource conflicts, configuration conflicts, and unexpected behavior.
8. Conclusion
In this Spark Context article, you have learned what SparkContext is, how to create it in Spark 1.x and Spark 2.0, and how to use it with a few basic examples. In summary,
- SparkContext is the entry point to any Spark functionality. It represents the connection to a Spark cluster and is responsible for coordinating and distributing the operations on that cluster.
- It was the primary entry point for Spark applications before Spark 2.0.
- SparkContext is used for low-level RDD (Resilient Distributed Dataset) operations, which were the core data abstraction in Spark before DataFrames and Datasets were introduced.
- It is not thread-safe, so in a multi-threaded or multi-user environment, you need to be careful when using a single SparkContext instance.
Happy Learning !!