Apache Spark Interview Questions

Note: This interview questions page is a work in progress; I will finish this article as soon as possible. If you are looking for an answer to a question that I have not answered yet, please ask in a comment. I will try to reply within a day or so.

In this Apache Spark basic (core) interview questions article, I cover the most frequently asked questions, along with answers and links to articles where you can learn each topic in more detail. When you are looking for a job in Apache Spark, it is always good to have in-depth knowledge of the subject, and I hope SparkByExamples.com provides you with the knowledge required to crack the interview. I wish you all the best.

One of the most frequently asked interview questions is: What are the different issues you faced while running your Spark application? This question needs a very detailed answer based on your own experience, hence I have created a separate article discussing it.

1. Apache Spark Basic Interview Questions

What is Apache Spark?

Apache Spark is an open-source, in-memory distributed computing engine. It can run on the Hadoop ecosystem (as well as standalone or on other cluster managers) and processes both batch and real-time data in a parallel and distributed manner.
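As a small illustration of this parallel processing model, here is the classic word count written against the RDD API. This is a minimal sketch: the input file name `input.txt` is just a placeholder, and `local[*]` runs Spark locally rather than on a cluster.

```scala
import org.apache.spark.sql.SparkSession

// Minimal local sketch of Spark's parallel processing model.
// "input.txt" is a placeholder path, not a real file in this article.
val spark = SparkSession.builder()
  .master("local[*]")          // use all local cores; on a cluster this
  .appName("WordCount")        // would point at a cluster manager instead
  .getOrCreate()

val counts = spark.sparkContext.textFile("input.txt")
  .flatMap(_.split("\\s+"))    // split each line into words
  .map(word => (word, 1))      // pair each word with a count of 1
  .reduceByKey(_ + _)          // sum counts per word across partitions

counts.collect().foreach(println)
spark.stop()
```

Each stage (flatMap, map, reduceByKey) is executed in parallel across the partitions of the input data, which is the behavior described above.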

Difference between Spark and MapReduce?

MapReduce: MapReduce is I/O-intensive; it reads from and writes to disk between stages. It is a batch-processing framework, and its jobs are typically written in Java. It is neither iterative nor interactive. Because it spills intermediate results to disk, MapReduce can process data sets larger than the cluster's available memory more readily than Spark.

Spark: Spark is a lightning-fast, in-memory computing engine, up to 100 times faster than MapReduce in memory and roughly 10 times faster on disk. Spark supports Scala, Python, R, and Java, and processes both batch and real-time data.

What are the components/modules of Apache Spark?

Apache Spark comes with Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX:

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

What are the different installation modes of Spark?

Spark can be installed in three different modes.

  • Standalone mode
  • Pseudo-distributed mode
  • Multi-cluster mode

What are the different cluster managers Spark supports?

Any Spark application can be divided into independent sets of processes that run in parallel on a cluster, coordinated by the SparkContext object in the driver program. To run on a cluster, SparkContext connects to a cluster manager (Standalone, Hadoop YARN, or Apache Mesos), which allocates resources across applications. Once connected, Spark acquires executors on the worker nodes; executors are processes that run computations and store data for the application. SparkContext then sends the application code to the executors and, finally, sends tasks to the executors for processing.

  • Standalone cluster manager
  • Hadoop YARN
  • Apache Mesos
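The cluster manager is selected through the master URL when the application is configured. A small sketch showing the common choices; the host names and ports below are placeholders, not real endpoints:

```scala
import org.apache.spark.sql.SparkSession

// The master URL decides which cluster manager coordinates the application.
// Host names and ports are placeholders for illustration only.
val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // .master("spark://master-host:7077")  // Standalone cluster manager
  // .master("yarn")                      // Hadoop YARN
  // .master("mesos://mesos-host:5050")   // Apache Mesos
  .master("local[*]")                     // no cluster manager; run locally
  .getOrCreate()
```

The same choice can also be made at submit time with the `--master` option of `spark-submit`, leaving the code unchanged.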

What is SparkSession?

SparkSession was introduced in Spark 2.0. It is the entry point to underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. SparkSession's object spark is available by default in spark-shell, and it can be created programmatically using the SparkSession builder pattern.
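In spark-shell, the pre-created `spark` variable can be used directly to build each of these abstractions. A small sketch (the sample data is made up; in a compiled application `spark` would come from the builder pattern):

```scala
// In spark-shell, `spark` and this import are the usual starting point.
import spark.implicits._

// RDD, via the underlying SparkContext
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

// DataFrame, from a local collection (sample data is made up)
val df = Seq(("James", 30), ("Anna", 25)).toDF("name", "age")

// Dataset, by mapping the DataFrame onto a typed case class
case class Person(name: String, age: Int)
val ds = df.as[Person]
```

Note that in compiled code the case class should be defined at the top level so Spark can derive an encoder for it; in spark-shell this works as written.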

Can you create multiple SparkSession objects in a single application?

Yes, you can create as many SparkSession objects as you want in a Spark application. Multiple SparkSession objects are useful when you want to keep Spark tables (relational entities) logically separated.
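One way to obtain such logically separated sessions is SparkSession.newSession(), which shares the same SparkContext but keeps its own SQL configuration and temporary-view catalog. A short sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("MultipleSessions")
  .getOrCreate()

// newSession() shares the underlying SparkContext, but each session
// has an isolated catalog of temporary views and SQL configuration.
val spark2 = spark.newSession()

spark.range(5).createOrReplaceTempView("numbers")
// The view "numbers" is visible in `spark` but not in `spark2`,
// which is what keeps the two sessions logically separated.
```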

How do you create SparkSession object?

To create a SparkSession in Scala or Python, use the builder pattern method builder() and call getOrCreate(). If a SparkSession already exists, getOrCreate() returns it; otherwise, it creates a new one.


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

What is SparkContext?

SparkContext is the entry point to the low-level Spark Core functionality; it represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables. Since Spark 2.0, it is available from SparkSession as spark.sparkContext.

Can you create multiple SparkContext in an Application?

No. You can have only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.
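The stop-then-recreate pattern looks like this (a minimal local sketch; the app names are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[1]").setAppName("First")

// getOrCreate() returns the active SparkContext if one exists,
// otherwise it creates one from the given configuration.
val sc = SparkContext.getOrCreate(conf)

// Stop the active context before creating another one; attempting to
// construct a second active SparkContext in the same JVM raises an error.
sc.stop()

val sc2 = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("Second"))
```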

Difference Between SparkSession vs SparkContext?

Can you explain the difference between RDD vs DataFrame vs Dataset?

What are Spark Driver, Worker, and Executor?

What is coalesce vs repartition?

What is Spark shuffle?

What is Lazy Evaluation?

Can you name a few DataFrame actions and transformations?

Explain map() vs flatMap()?

Explain map() vs foreach()?

What are client vs cluster deploy modes?

2. Apache Spark Intermediate Interview Questions

Here I will cover Spark intermediate interview questions.

How do you debug your Spark application?

How do you kill a running Spark application?

How do you submit the Spark application?

3. Apache Spark Advanced Interview Questions

Here I will cover Spark advanced interview questions.

4. Apache Spark Performance Interview Questions

Here I will cover Spark performance and optimization interview questions.

What are a few things you would check to improve Spark performance?

5. Conclusion

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.


This Post Has 17 Comments

  1. Mujeeb

    Hi,
    Please help complete this article, this is really important article, and the way you explain everything , I am visiting everyday on this page.
    thanks a lot for help me.

  2. Madhu

    Can you please write a detailed answer for ‘what are different types issues we face in Spark projects’

  3. Avinash

    Hi Sir Plese complete the interview questions and answers section as soon as possible

  4. Anonymous

    Can you please write a detailed answer for ‘what are different types issues we face in Spark projects ,its really useful us

  5. Anonymous

    Hi, Please complete this article soon. This is really usefull.

  6. vinay

    can you please complete this question Can you explain the difference between RDD vs DataFrame vs Dataset?

  7. Gouse

    Hi In your Github Codes , Main class script is missing i think , Which calls other scripts .Could you please Add .
    Thanks in advance.

    1. NNK

      Hi, There is no main class. All examples in GitHub are independent which explains a specific functionality. You can run any Scala program without issues.


  9. ChethanMg

    Hi NNK,

    Your articles are too good. Easy to understand and lot of information to take away. Appreciate all your efforts. Thanks a lot!!

    Yes, like most of the people, even I am waiting for answers related to spark interview questions.

    Regards,
    Chethan MG

    1. NNK

      Hi Chethan, Currently I am busy with other projects. will soon complete this article.
      Wondering if anyone from the community wanna help with this article. please reach out to me.
      Thanks,
      NNK


  11. GS

    I just discovered this website and I am now addicted to it! Nicely presented articles and very easy to understand.
