Note: This interview questions page is a work in progress; I will finish this article as soon as possible. If you are looking for an answer to a question that I have not answered yet, please ask in a comment and I will try to reply within a day or so.
In this set of Apache Spark basic (core) interview questions, I cover the most frequently asked questions, along with answers and links to articles that explain each topic in more detail. When you are looking for a job in Apache Spark, it is always good to have in-depth knowledge of the subject, and I hope SparkByExamples.com provides you with the knowledge required to crack the interview. I wish you all the best.
1. Apache Spark Basic Interview Questions
What is Apache Spark?
Apache Spark is an open-source, in-memory distributed computing engine that can process data on the Hadoop ecosystem. It processes both batch and real-time data in a parallel and distributed manner.
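To make the "parallel and distributed" part concrete, here is a minimal sketch of the classic word count in Scala, run in local mode (the application name and input strings are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object WordCountExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")     // run locally, using all available cores
    .appName("WordCountExample")
    .getOrCreate()

  val sc = spark.sparkContext

  // The data is split into partitions and processed in parallel, in memory
  val counts = sc.parallelize(Seq("spark is fast", "spark is simple"))
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.collect().foreach(println)
  spark.stop()
}
```

On a real cluster the same code runs unchanged; only the master URL passed to the session builder differs.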
Difference between Spark and MapReduce?
MapReduce: MapReduce is I/O intensive; it reads from and writes to disk between stages. It is designed for batch processing, and jobs are typically written in Java. It is not well suited to iterative or interactive workloads. Because it works from disk, MapReduce can handle datasets larger than the memory available in the cluster.
Spark: Spark is a lightning-fast in-memory computing engine, up to 100 times faster than MapReduce in memory and about 10 times faster on disk. Spark supports languages such as Scala, Python, R, and Java, and processes both batch and real-time data.
What are the components/modules of Apache Spark?
Apache Spark comes with Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.
- Spark Core: Spark Core is the foundation of all Spark applications. It is responsible for scheduling, dispatching, and monitoring jobs, and it performs the basic I/O functionality. It provides the fundamental data abstraction, the RDD (Resilient Distributed Dataset), which is immutable and fault-tolerant. RDDs are partitioned across the nodes of a cluster and have no schema.
- Spark SQL: Spark SQL is a module for processing structured data in an optimized way. It runs on top of Spark Core and uses extra information about the data being processed to optimize queries. It gives tighter integration between relational and procedural processing through the DataFrame and Dataset APIs, and it allows SQL queries to be run against DataFrames.
- Spark Streaming: Spark Streaming is a low-latency, fault-tolerant module for processing streaming data. A continuous stream of data is divided into micro-batches called DStreams (discretized streams); DStreams are processed in parallel, and the processed results are sent to file systems, external data stores, or live dashboards. Data can be ingested from many sources such as Kafka, Apache Flume, Amazon Kinesis, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
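As a quick illustration of the Spark SQL module described above, here is a minimal sketch showing the same data queried through both the procedural (DataFrame) and relational (SQL) APIs; the names, ages, and the view name `people` are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample extends App {
  // Entry point for the DataFrame/SQL APIs
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("SparkSqlExample")
    .getOrCreate()

  import spark.implicits._

  // A small in-memory DataFrame; data and column names are made up
  val df = Seq(("James", 30), ("Anna", 25)).toDF("name", "age")

  // Procedural access through the DataFrame API
  df.filter($"age" > 26).show()

  // Relational access: register a temp view and query it with SQL
  df.createOrReplaceTempView("people")
  spark.sql("SELECT name FROM people WHERE age > 26").show()

  spark.stop()
}
```

Both queries go through the same Catalyst optimizer, which is what "gets more information about the data being processed" refers to.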
What are the different installation modes of Spark?
Spark can be installed in 3 different ways.
- Standalone (local) mode: everything runs in a single JVM on one machine; useful for development and testing.
- Pseudo-distributed mode: all daemons run on a single machine, simulating a cluster.
- Multi-cluster (fully distributed) mode: daemons run across multiple machines in a real cluster.
What are the different cluster managers Spark supports?
Any Spark application runs as an independent set of processes on a cluster, coordinated by the SparkContext object in the driver program. To run on a cluster, the SparkContext connects to a cluster manager, which allocates resources across applications. Once connected, Spark acquires executors on worker nodes, which are processes that run computations and store data for the application. The SparkContext then sends the application code to the executors and finally sends them tasks to process. Spark supports the following cluster managers:
- Standalone cluster manager: a simple cluster manager included with Spark.
- Hadoop YARN: the resource manager of Hadoop.
- Apache Mesos: a general-purpose cluster manager.
What is SparkSession?
SparkSession was introduced in Spark 2.0. It is the entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. A SparkSession object named spark is available by default in spark-shell, and one can be created programmatically using the SparkSession builder pattern.
Can you create multiple SparkSession objects in a single application?
Yes, you can create as many SparkSession objects as you want in a Spark application. Multiple SparkSession objects are needed when you want to keep Spark tables (relational entities) logically separated.
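One way to obtain a second session is SparkSession.newSession(), which shares the underlying SparkContext but keeps SQL configuration and temporary views separate. A minimal sketch (the view name `numbers` is made up):

```scala
import org.apache.spark.sql.SparkSession

object MultiSessionExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("MultiSessionExample")
    .getOrCreate()

  // newSession() returns a session with its own SQL conf and temp views,
  // but it shares the same SparkContext as the original session
  val spark2 = spark.newSession()

  // A temp view registered in one session is not visible in the other
  spark.range(5).createOrReplaceTempView("numbers")
  spark.sql("SELECT * FROM numbers").show()   // works
  // spark2.sql("SELECT * FROM numbers")      // would fail: table not found

  spark.stop()
}
```

This separation is what makes multiple sessions useful for keeping relational entities logically isolated within one application.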
How do you create SparkSession object?
To create a SparkSession in Scala or Python, use the builder() method and call getOrCreate() on it. If a SparkSession already exists, getOrCreate() returns it; otherwise it creates a new one.
```scala
val spark = SparkSession.builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
```
What is SparkContext?
SparkContext is the entry point to the low-level Spark Core functionality, such as RDDs, accumulators, and broadcast variables, and it represents the connection to a Spark cluster. In spark-shell it is available as the variable sc, and since Spark 2.0 it can also be accessed from a SparkSession via spark.sparkContext.
Can you create multiple SparkContext in an Application?
No. You can have only one active SparkContext per JVM. You should call stop() on the active SparkContext before creating a new one.
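A sketch of that rule using the SparkConf/SparkContext API (the application name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleContextExample extends App {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SingleContextExample")

  val sc = new SparkContext(conf)

  // Attempting to create a second active SparkContext in the same JVM
  // raises an exception, so the first one must be stopped first
  sc.stop()

  val sc2 = new SparkContext(conf)  // fine: the previous context was stopped
  sc2.stop()
}
```

Note that SparkSession.builder().getOrCreate() sidesteps this problem by reusing the existing context instead of creating a new one.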
Difference Between SparkSession vs SparkContext?
Can you explain the difference between RDD vs DataFrame vs Dataset?
What are Spark Driver, Worker, and Executor?
What is coalesce vs repartition?
What is Spark shuffle?
What is Lazy Evaluation?
Can you name a few DataFrame actions and transformations?
Explain map() vs flatMap()?
Explain map() vs foreach()?
What are client vs cluster deploy modes?
2. Apache Spark Intermediate Interview Questions
Here I will cover Spark intermediate interview questions
How do you debug your Spark application?
How do you kill a running Spark application?
How do you submit a Spark application?
3. Apache Spark Advanced Interview Questions
Here I will cover Spark advanced interview questions
4. Apache Spark Performance Interview Questions
Here I will cover Spark performance and optimization interview questions.
What are a few things you would check to improve Spark performance?