Note: This interview questions page is a work in progress; I will finish this article as soon as possible. If you are looking for an answer to any question that I have not answered yet, please ask in a comment and I will try to reply within a day or so.
In this Apache Spark basic/core interview questions article, I cover the most frequently asked questions, along with answers and links to articles where you can learn each topic in more detail. When you are looking for a job in Apache Spark, it is always good to have in-depth knowledge of the subject, and I hope SparkByExamples.com provides you with the knowledge required to crack the interview. I wish you all the best.
One of the most frequently asked interview questions is: What are the different issues you faced while running your Spark application? This question needs a very detailed answer based on your own experience, hence I created a separate article discussing it.
1. Apache Spark Basic Interview Questions
What is Apache Spark?
Apache Spark is an open-source, in-memory distributed computing engine that processes data on the Hadoop ecosystem. It processes both batch and real-time data in a parallel and distributed manner.
Difference between Spark and MapReduce?
MapReduce: MapReduce is I/O-intensive; it reads from and writes to disk between stages. It is a batch-processing framework, and MapReduce jobs are written in Java only. It is neither iterative nor interactive, but it can process datasets larger than the available memory.
Spark: Spark is a lightning-fast in-memory computing engine, up to 100 times faster than MapReduce in memory and about 10 times faster on disk. Spark supports languages like Scala, Python, R, and Java, and it processes both batch and real-time data.
What are the components/modules of Apache Spark?
Apache Spark comes with Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
What are the different installation modes of Spark?
Spark can be installed in 3 different ways.
- Standalone mode
- Pseudo-distributed mode
- Multi-node cluster mode
What are the different cluster managers Spark supports?
Any Spark application can be divided into an independent set of processes that run in parallel on a cluster, coordinated by the SparkContext object in the driver program. To run on a cluster, SparkContext connects to a cluster manager (Standalone, Hadoop YARN, or Apache Mesos), which allocates resources across applications. Once connected, Spark acquires executors on the worker nodes; executors are processes that run computations and store data for the application. SparkContext then sends the application code to the executors and, finally, sends tasks to the executors to run.
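As a sketch of how the cluster manager is chosen, the master URL passed when building the session tells SparkContext which cluster manager to connect to (local[*] is used here only so the snippet runs without a cluster; the host names in the comments are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// The master URL selects the cluster manager that SparkContext connects to:
//   "local[*]"          - run locally, no cluster manager
//   "spark://host:7077" - Spark standalone cluster manager
//   "yarn"              - Hadoop YARN
//   "mesos://host:5050" - Apache Mesos
val spark = SparkSession.builder()
  .master("local[*]")   // swap in one of the cluster URLs above to run on a cluster
  .appName("ClusterManagerDemo")
  .getOrCreate()
```

The same choice can also be made at submit time with spark-submit's --master option, which overrides a hard-coded master in most setups.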
- Standalone cluster manager:
- Yarn
- Mesos
What is SparkSession?
SparkSession was introduced in Spark 2.0. It is the entry point to underlying Spark functionality and is used to programmatically create Spark RDDs, DataFrames, and Datasets. The SparkSession object spark is the default variable available in spark-shell, and it can be created programmatically using the SparkSession builder pattern.
Can you create multiple SparkSession objects in a single application?
Yes, you can create as many SparkSession objects as you want in a Spark application. Multiple SparkSession objects are useful when you want to keep Spark tables (relational entities) logically separated.
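For instance, a second session can be obtained with newSession(); it shares the underlying SparkContext but keeps its own configuration and temporary views, which is what keeps the sessions logically separated. A minimal sketch (the app name is just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("MultiSessionDemo")
  .getOrCreate()

// newSession() creates a second SparkSession that shares the same
// SparkContext but has its own SQL configuration and temporary views
val spark2 = spark.newSession()

// A temp view registered in one session is not visible in the other
spark.range(3).createOrReplaceTempView("nums")
```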
How do you create a SparkSession object?
To create a SparkSession in Scala or Python, use the builder pattern method builder() and call the getOrCreate() method. If a SparkSession already exists, it is returned; otherwise, a new SparkSession is created.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
What is SparkContext?
Can you create multiple SparkContext in an Application?
No. You can create only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.
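A minimal sketch of this rule using the lower-level SparkContext API (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[1]").setAppName("SingleContextDemo")

val sc1 = new SparkContext(conf)
// Calling new SparkContext(conf) again here would fail,
// because sc1 is still the active context in this JVM.

sc1.stop()                         // stop the active context first...
val sc2 = new SparkContext(conf)   // ...then a new one can be created
```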
Difference Between SparkSession vs SparkContext?
Can you explain the difference between RDD vs DataFrame vs Dataset?
What are Spark Driver, Worker, and Executor?
What is coalesce vs repartition?
What is Spark shuffle?
What is Lazy Evaluation?
Can you name a few DataFrame actions & transformations?
Explain map() vs flatMap()?
Explain map() vs foreach()?
What are the client and cluster deploy modes?
2. Apache Spark Intermediate Interview Questions
Here I will cover Spark intermediate interview questions.
How do you debug your Spark application?
How do you kill a running Spark application?
How do you submit the Spark application?
3. Apache Spark Advanced Interview Questions
Here I will cover Spark advanced interview questions.
4. Apache Spark Performance Interview Questions
Here I will cover Spark performance and optimization interview questions.
What are a few things you would check to improve Spark performance?