Apache Spark Interview Questions


In this list of Apache Spark basic and core interview questions, I will cover the most frequently asked questions along with answers and links to articles where you can learn more. When you are looking for a job in Apache Spark, it is always good to have in-depth knowledge of the subject, and I hope SparkByExamples.com provides you with the knowledge required to crack the interview. I wish you all the best.

One of the most asked interview questions is: what are the different issues you faced while running your Spark application? This question needs a very detailed answer based on your own experience, hence I have created a separate article discussing it.

1. Apache Spark Basic Interview Questions

What is Apache Spark?

Apache Spark is an open-source, in-memory computing engine that can process data on top of the Hadoop ecosystem. It processes both batch and real-time data in a parallel and distributed manner.

Difference between Spark and MapReduce?

MapReduce: MapReduce is I/O intensive; it reads its input from disk and writes intermediate and final results back to disk. It supports batch processing only, programs are natively written in Java, and it is not suited to iterative or interactive workloads. Because it spills to disk rather than relying on memory, MapReduce can process larger datasets than Spark on the same hardware.

Spark: Spark is a lightning-fast in-memory computing engine, up to 100 times faster than MapReduce when data fits in memory and up to 10 times faster on disk. Spark supports Scala, Python, R, and Java, and it processes both batch and real-time data.

What are the components/modules of Apache Spark?

Apache Spark comes with Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

What are the different installation modes of Spark?

Spark can be installed in 3 different ways.

  • Standalone mode: everything, including the driver and executors, runs on a single machine; mainly used for development and testing.
  • Pseudo-distribution mode: all of the Spark daemons run on a single machine, simulating a real cluster; useful for testing cluster behavior locally.
  • Multi cluster mode: the daemons run across multiple machines; this is how production clusters are deployed.

What are the different cluster managers Spark supports?

Any Spark application runs as an independent set of processes on a cluster, coordinated by the SparkContext object in the driver program. To run on a cluster, the SparkContext connects to a cluster manager (Standalone, Hadoop YARN, or Apache Mesos), which allocates resources across applications. Once connected, Spark acquires executors on the worker nodes, which are processes that run computations and store data for the application. The SparkContext then sends the application code and, finally, the tasks to the executors. The master URL you pass decides which cluster manager is used, as the sketch after the list below shows.

  • Standalone cluster manager: Spark's own built-in cluster manager
  • Hadoop YARN: the resource manager of the Hadoop ecosystem
  • Apache Mesos: a general-purpose cluster manager that can also run Spark
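
A minimal sketch (host names and ports are placeholders): the master URL you set on the builder selects the cluster manager.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
      .master("spark://master-host:7077")   // Standalone cluster manager
      // .master("yarn")                    // Hadoop YARN (reads HADOOP_CONF_DIR)
      // .master("mesos://mesos-host:5050") // Apache Mesos
      // .master("local[2]")                // no cluster manager: 2 local threads
      .appName("SparkByExamples.com")
      .getOrCreate()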

What is SparkSession?

SparkSession was introduced in Spark 2.0 and is the entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. The SparkSession object spark is available by default in spark-shell, and it can be created programmatically using the SparkSession builder pattern.

Can you create multiple SparkSession objects in a single application?

Yes, you can create as many SparkSession objects as you want in a Spark application. Multiple SparkSession objects are useful when you want to keep Spark tables (relational entities) logically separated, as the sketch below shows.
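
A minimal sketch (the view name t1 and the app name are placeholders): newSession() returns a new session that shares the same SparkContext but keeps its own configuration and temporary views.

import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
val spark2 = spark.newSession()                // same SparkContext, separate session state
spark.range(5).createOrReplaceTempView("t1")   // this t1 is visible only in spark
spark2.range(3).createOrReplaceTempView("t1")  // an independent t1, local to spark2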

How do you create SparkSession object?

To create a SparkSession in Scala or Python, use the builder() method and then call getOrCreate(), which returns the existing SparkSession if one already exists and otherwise creates a new one.


val spark = SparkSession.builder()
      .master("local[1]")              // run locally with 1 core
      .appName("SparkByExamples.com")  // application name shown in the Spark UI
      .getOrCreate()                   // reuse the existing session if there is one

What is SparkContext?

SparkContext has been the entry point to Spark core functionality since Spark 1.x. It represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables. In spark-shell it is available by default as the variable sc, and from a SparkSession you can access it as spark.sparkContext.

Can you create multiple SparkContext in an Application?

No. You can create only one active SparkContext per JVM. You should call stop() on the active SparkContext before creating a new one.
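
A minimal sketch using the RDD-level API (master and app name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[1]").setAppName("SparkByExamples.com")
val sc   = new SparkContext(conf)   // the one active context in this JVM
// ... use sc ...
sc.stop()                           // stop it first
val sc2  = new SparkContext(conf)   // only now can a new context be created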

Difference Between SparkSession vs SparkContext?

SparkContext (Spark 1.x) is the entry point to the low-level RDD API and to cluster connectivity. SparkSession (introduced in Spark 2.0) is the unified entry point that wraps SparkContext, SQLContext, and HiveContext, and is used to work with DataFrames and Datasets as well as RDDs (via spark.sparkContext). In new applications you normally create a SparkSession and let it manage the underlying SparkContext.

Can you explain the difference between RDD vs DataFrame vs Dataset?

RDD is Spark's low-level distributed collection of objects; it gives fine-grained control but has no schema, and Spark cannot optimize RDD operations with Catalyst. DataFrame is a distributed collection of rows organized into named columns (a schema), and its operations are optimized by the Catalyst optimizer and the Tungsten execution engine. Dataset combines both: it offers the DataFrame optimizations plus compile-time type safety, and is available in Scala and Java (a DataFrame is simply Dataset[Row]).
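
A minimal sketch (the Person case class and the sample data are made up for illustration):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val df  = rdd.toDF()     // DataFrame: rows with a schema, optimized by Catalyst
val ds  = df.as[Person]  // Dataset: adds compile-time type safety (Scala/Java only)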

What are Spark Driver, Worker, and Executor?

The driver is the process that runs your main() function, creates the SparkContext/SparkSession, builds the execution plan (DAG), and schedules tasks. A worker is a cluster node that provides resources and hosts executors. An executor is a JVM process launched on a worker node; it runs the tasks the driver assigns to it and stores data for the application.

What is coalesce vs repartition?

Both change the number of partitions of an RDD or DataFrame. repartition() performs a full shuffle and can either increase or decrease the number of partitions. coalesce() can only decrease the number of partitions; it merges existing partitions without a full shuffle, so it is the cheaper choice when reducing partitions.
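
A minimal sketch, assuming an existing SparkSession named spark:

val df10 = spark.range(0, 100).repartition(10) // full shuffle; can increase or decrease partitions
val df2  = df10.coalesce(2)                    // merges existing partitions; no full shuffle
println(df10.rdd.getNumPartitions)             // 10
println(df2.rdd.getNumPartitions)              // 2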

What is Spark shuffle?

A shuffle is the process of redistributing data across partitions, which usually means moving data between executors and nodes. It is triggered by wide transformations such as groupByKey(), reduceByKey(), join(), and repartition(). Shuffles are expensive because they involve serialization, disk I/O, and network I/O, so minimizing them is one of the main Spark optimizations.

What is Lazy Evaluation?

Lazy evaluation means Spark does not execute a transformation at the moment you call it. It only records the transformation in a logical plan (DAG) and executes the whole plan when an action such as count(), collect(), or a write is called. This allows Spark to optimize the complete plan before running any work.
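
A minimal sketch, assuming an existing SparkSession named spark:

val nums  = spark.range(1, 1000000)   // transformation: no job runs yet
val evens = nums.filter("id % 2 = 0") // still no job: Spark only records the plan
println(evens.count())                // count() is an action, so the job runs here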

Can you name few DataFrame Actions & Transformations?

Transformations are lazy and return a new DataFrame: select(), filter(), where(), groupBy(), join(), withColumn(), orderBy(), and distinct(). Actions trigger execution and return a result to the driver or write output: show(), count(), collect(), first(), take(), and foreach(), as well as the save operations on DataFrameWriter.

Explain Map() vs flatMap()?

Both apply a function to every element. map() produces exactly one output element per input element, so the result has the same number of elements as the input. flatMap() can produce zero or more output elements per input element and flattens the results into a single collection.
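
A minimal sketch (the sample sentences are made up), assuming an existing SparkSession named spark:

val lines     = spark.sparkContext.parallelize(Seq("hello world", "apache spark"))
val mapped    = lines.map(_.split(" "))     // one output per input: RDD[Array[String]]
val flattened = lines.flatMap(_.split(" ")) // flattened into words: RDD[String]
println(mapped.count())    // 2
println(flattened.count()) // 4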

Explain Map() vs foreach()?

map() is a transformation: it returns a new RDD/Dataset and is evaluated lazily. foreach() is an action: it returns nothing and is executed immediately on the executors, so it is used for side effects such as writing each element to an external system.
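
A minimal sketch, assuming an existing SparkSession named spark:

val rdd     = spark.sparkContext.parallelize(1 to 3)
val doubled = rdd.map(_ * 2)     // transformation: returns a new RDD, nothing runs yet
doubled.foreach(x => println(x)) // action: returns Unit; println runs on the executors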

What is Client vs Cluster deploy modes?

The deploy mode determines where the driver runs. In client mode the driver runs on the machine where spark-submit is invoked, which is convenient for interactive use and debugging. In cluster mode the driver runs inside the cluster on one of the nodes, which is recommended for production because the job no longer depends on the client machine.
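
A minimal sketch (the class and jar names are placeholders):

# driver runs on the machine where spark-submit is invoked
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# driver runs inside the cluster on one of the nodes
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar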

2. Apache Spark Intermediate Interview Questions

Here I will cover Spark intermediate interview questions.

How do you debug your Spark application?

Common approaches: run the application locally with the master set to local[*] and step through it in an IDE; inspect jobs, stages, and tasks in the Spark Web UI (port 4040 by default on the driver); and review the driver and executor logs, which on YARN you can collect with yarn logs -applicationId <application_id>.

How do you kill running Spark Application?

On YARN, find the application id (from the Spark Web UI or the ResourceManager UI) and run yarn application -kill <application_id>. In standalone mode you can kill a running driver from the Spark Master web UI.

How do you submit the Spark application?

Use the spark-submit script that ships with Spark. You typically specify the master, the deploy mode, the main class (for JVM languages), resource options such as executor memory and cores, and finally the application jar or Python file with its arguments.
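
A minimal sketch (the class name, jar, and resource values are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my-app.jar arg1 arg2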

3. Apache Spark Advanced Interview Questions

Here I will cover Spark advanced interview questions.

4. Apache Spark Performance Interview Questions

Here I will cover Spark performance and optimization interview questions.

What are the few things you will check to improve Spark performance?

A few common checks: prefer the DataFrame/Dataset API over RDDs so Catalyst can optimize your queries; reduce shuffles, for example by broadcasting small tables in joins; cache()/persist() data that is reused across actions; right-size the number of partitions (for example via spark.sql.shuffle.partitions); use columnar file formats such as Parquet and an efficient serializer such as Kryo; and tune executor memory, cores, and the number of executors.
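
A minimal sketch (the paths, the join key, and the values are illustrative), assuming an existing SparkSession named spark:

import org.apache.spark.sql.functions.broadcast

spark.conf.set("spark.sql.shuffle.partitions", "200") // right-size shuffle partitions
val facts  = spark.read.parquet("/data/facts")        // columnar format: read only needed columns
val dims   = spark.read.parquet("/data/dims")
val joined = facts.join(broadcast(dims), "id")        // broadcast the small side to avoid a shuffle
joined.cache()                                        // reuse the result across several actions
println(joined.count())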

5. Conclusion

I hope these questions and answers help you prepare for your Apache Spark interview, and I wish you all the best.
