Question 1: How do you differentiate between Hadoop and Spark?
Answer: Hadoop is more of an ecosystem: it provides a platform for storing data in different formats, a compute engine, cluster management, and a distributed file system (HDFS). Spark, on the other hand, is primarily a compute engine, which integrates well with Hadoop; you can even say Spark works as one of the compute engines for Hadoop. Spark does not have its own storage engine, but it can connect to various storage systems such as HDFS, the local file system, an RDBMS, etc.

Question 2: What is the basic functionality of Spark Core?
Answer: Spark Core is the heart of the entire Spark engine, and it provides the following functionality:
· Managing the memory pool
· Scheduling tasks on the cluster
· Recovering from failed jobs
· Integrating with various storage systems such as RDBMS, HDFS, AWS S3, etc.
· Providing the RDD APIs, which are the basis for the higher-level APIs
Spark Core abstracts the native APIs and lower-level technicalities from the end user.
Question 3: If you have existing queries created using Hive, what changes do you need to make to run them on Spark?
Answer: The Spark SQL module can run queries in Spark, and Spark SQL can also use the Hive metastore, so there is no need to change anything to run Hive queries in Spark SQL; it can even use UDFs defined in Hive.
Question 4: What is the Driver Program in Spark?
Answer: This is the main program of a Spark application; in the case of the REPL, the spark-shell itself is the driver program. The driver program creates the instance of SparkSession (Spark 2.0) or SparkContext. It is the responsibility of the driver program to communicate with the cluster manager to distribute the tasks to the cluster worker nodes.

Question 5: What is the role of the cluster manager, and which cluster managers are supported by Spark?
Answer: The cluster manager helps in managing the entire cluster of worker nodes and communicates with the driver node as well. Spark supports the following three cluster managers:
· YARN: Yet Another Resource Negotiator (part of the Hadoop ecosystem)
· Mesos
· Standalone cluster manager: This comes with Spark itself and is good for testing and POCs.
Question 6: Which daemon processes are part of the standalone cluster manager?
Answer: As mentioned, the standalone cluster manager is not good for production use cases; you should consider it only for testing and POC purposes. When you start Spark with the standalone cluster manager, it creates two daemon processes:
· Master (runs on the master node)
· Worker (runs on each worker node)
Question 7: How do you define a Worker Node in a Spark cluster?
Answer: You can say that worker nodes are the slave nodes, and the actual processing of your data happens on the worker nodes. A worker node always communicates with the Master node and informs it about the availability of resources. Generally, one worker process is started on each node in your cluster, and it is responsible for running your application on that particular node and also for monitoring that application.
Question 8: What are executors in Spark?
Answer: Each Spark application you submit has its own executor processes, and the executors exist only as long as your application runs. The driver program uses these executors to run tasks on the worker nodes; executors also help in keeping data in memory, and when required the data can be spilled to disk.
Question 9: What is a Task in Spark?
Answer: A task is the unit of work that is sent to an executor running on a worker node. It is a command sent to the executor by the driver program by serializing your function object. It is the responsibility of the executor process to deserialize this function object and execute it on the data that is partitioned (one of the RDD partitions) and exists on that node.
Question 10: Describe some uses of the SparkContext object?
Answer: It is one of the entry points to your Spark cluster; once you have hold of a SparkContext you can create new RDDs from existing data, and you can create accumulators and broadcast variables on that cluster. Since Spark 2.0, the more general entry point for the Spark system is the SparkSession object.
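Below is a minimal sketch, assuming a SparkContext already exists as sc (as in spark-shell); the variable names are illustrative only.
|
val heRDD = sc.parallelize(Seq(1, 2, 3, 4))                   // new RDD from a local collection
val errorCount = sc.longAccumulator("errorCount")             // accumulator (Spark 2.0+ API)
val lookup = sc.broadcast(Map("hadoop" -> 1, "spark" -> 2))   // read-only broadcast variable
|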
Question 11: How many SparkContext objects should you create per JVM?
Answer: You can create as many as you want, but it is always preferred that you create only one SparkContext object per JVM. So if you have already created a SparkContext object in a JVM, you first need to stop it by calling its stop() method before creating a new one. As mentioned previously, when you start the spark shell (REPL), one SparkContext object is created by default and assigned to a variable named sc, so you should not create a new SparkContext object in that shell.
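A short illustrative sketch (application name and master are placeholders): stop the existing context before creating a new one in the same JVM.
|
import org.apache.spark.{SparkConf, SparkContext}
sc.stop()
val newSc = new SparkContext(new SparkConf().setAppName("HadoopExamApp").setMaster("local[2]"))
|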
Question 12: Which object represents the primary entry point into the Spark 2.0 system?
Answer: SparkSession is the primary object you should use from Spark 2.0 onwards, because it has all the capabilities that the other, more specific objects have, such as SQLContext, HiveContext, and SparkContext. In the REPL (the Spark command-line utility), it is available as the spark object.
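A minimal sketch of building a SparkSession in your own application (in spark-shell it is already available as spark); the application name and master are placeholders:
|
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("HadoopExamApp")
  .master("local[2]")
  .getOrCreate()
|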
Question 13: When do you create a new SparkContext object?
Answer: You should consider creating a new SparkContext object whenever you have a standalone Spark application written in Java, Scala, Python, or R.
Question 14: Which script will you use to submit your standalone Spark application?
Answer: The spark-submit script is used to submit a Spark application.
Question 15: When do you use client mode and cluster mode to submit your application?
Answer: There are two deployment modes for a Spark application:
· Client Mode: Use this mode when your gateway machine is collocated with the worker machines. In this mode the driver is launched as part of the spark-submit process and acts as a client to the cluster.
· Cluster Mode: This is used when the machine submitting your application is far away from the worker machines, for example your own laptop or a computer that is not part of the Spark cluster. Here the driver runs inside the cluster. A command-line illustration follows below.
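An illustrative spark-submit invocation for each mode (class and jar names are placeholders):
|
spark-submit --master yarn --deploy-mode client --class com.hadoopexam.MyApp myapp.jar
spark-submit --master yarn --deploy-mode cluster --class com.hadoopexam.MyApp myapp.jar
|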
Question 16: How do you construct a new RDD?
Answer: There are mainly two ways to create an RDD:
· If you have a collection of data, e.g. a Java/Scala collection, you can parallelize it to create an RDD (this is good for testing and prototyping).
| sc.parallelize(Array("Hadoop", "Exam", "Learning", "Resources", "HadoopExam.com")) |
· You can create an RDD that refers to data in an external system such as HDFS, the local filesystem, an RDBMS SQL query output, etc.
| sc.textFile("hadoopexam.txt") |
Question 17: What are the standard ways of passing functions to the Spark framework, using Scala?
Answer: There are mainly two ways in which functions can be passed:
· Anonymous functions (lambda functions): This is a good option when you have to pass some small, quite simple functionality. See the example below, in which we split the data and return the split values.
| val he_training = hadoopexamDataFile.map(he_course => he_course.split(",")) |
· Static singleton methods: You should use this when you need to do some complex operations on the data. If you know Java, these (static) methods are associated with the class and not with an object.
|
object sampleFunc { def dataSplit(s: String): Array[String] = { s.split(",") } }
sampleRDD.flatMap(sampleFunc.dataSplit(_))
|
Question 18: You have a huge dataset, but you want to take out a sample; there is a sample function as below.
| sample(withReplacement, fraction, seed) |
How does the withReplacement argument affect the output?
Answer: Whenever you want to get sample data from a huge dataset, you can use this sample() method. The first argument controls how the elements are selected: if withReplacement is true, sampling is done with replacement, so the same element can be selected more than once and each selection is independent of the others (one draw does not affect another). If it is false, each element can appear in the sample at most once.
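A quick sketch using a placeholder RDD (name, fraction, and seed are illustrative):
|
val hadoopExamRDD = sc.parallelize(1 to 100)
val withRepl    = hadoopExamRDD.sample(true, 0.1, 42L)    // an element may appear more than once
val withoutRepl = hadoopExamRDD.sample(false, 0.1, 42L)   // each element appears at most once
|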
Question 19: What is a PairRDD?
Answer: A PairRDD represents key-value data. You can think of each element as a tuple of two values, like (x, y).
Question 20: What is the main advantage of a PairRDD?
Answer: The main advantage of key-value pairs is that you can operate on the data belonging to a particular key in parallel, for operations like joining, aggregation, etc.
Question 21: Which function (Scala and Python) will you use to create a PairRDD?
Answer: We can use the map() function to create a PairRDD. For example, if each element x has two values, then a PairRDD can be created as below.
| map(x => (x(0), x(1))) |
Question 22: Please explain how combineByKey() works?
Answer: This is an important question; many learners face it during their interview process. Let's first understand that combineByKey() has three arguments, and we need to understand the use of each one:
· createCombiner: It works locally on each partition and turns each key's first value into an initial combiner.
· mergeValue: It merges each subsequent value for a key into that combiner, locally on each node.
· mergeCombiners: It combines the combiners across the nodes to generate the final expected result.
Let's take an example. We have the following dataset (three different keys, nine key-value pairs in total):
(hadoop, 5000), (spark, 7000), (NiFi, 6000), (hadoop, 7000), (spark, 9000), (NiFi, 4000), (hadoop, 7500), (spark, 9500), (NiFi, 4500)
Now we want to calculate the average for each key. We need to sum all the values for the same key, and we also need the count/occurrence of each key. Hence, the intermediate result should be of the form (key, (totalForEachDistinctKey, count)):
(hadoop, (19500, 3))
(spark, (25500, 3))
(NiFi, (14500, 3))
This is what is expected when you execute combineByKey on the given dataset.
Keep in mind that Spark works on a cluster, so each node independently works on a different subset of the data; the data therefore needs to be combined locally on each node first. Let's assume we have a two-node cluster and the data is partitioned as below.
| Node-1 | Node-2 |
| (hadoop, 5000), (spark, 7000), (NiFi, 6000), (hadoop, 7000), (spark, 9000) | (NiFi, 4000), (hadoop, 7500), (spark, 9500), (NiFi, 4500) |
Step-1: On each node the combiner works locally.
On Node-1 the local combiner should generate data like the table below. This is the responsibility of the first argument, createCombiner (lambda x: (x, 1)), where x is the value part of the key-value pair.
| Node-1 | Node-2 |
| (hadoop, (5000,1)) (spark, (7000,1)) (NiFi, (6000,1)) (hadoop, (7000,1)) (spark, (9000,1)) | (NiFi, (4000,1)) (hadoop, (7500,1)) (spark, (9500,1)) (NiFi, (4500,1)) |
Step-2: In the next step, merge all the values locally on each node as below. You can use the following lambda function (mergeValue) for that:
| lambda valuecountpair, val: (valuecountpair[0] + val, valuecountpair[1] + 1) |
| Node-1 | Node-2 |
| (hadoop, (12000,2)) (spark, (16000,2)) (NiFi, (6000,1)) | (NiFi, (8500,2)) (hadoop, (7500,1)) (spark, (9500,1)) |
Step-3: As you can see, all values have now been combined locally on each node. We need to aggregate these values across the nodes. We use mergeCombiners for that, which combines the combiners from all the nodes. The lambda function would look as below:
| lambda valuecountpair, nextvaluecountpair: (valuecountpair[0] + nextvaluecountpair[0], valuecountpair[1] + nextvaluecountpair[1]) |
| (hadoop, (19500,3)) (spark, (25500,3)) (NiFi, (14500,3)) |
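For reference, here is a minimal Scala sketch of the same three functions applied to the example dataset (the final mapValues computes the averages); this is an illustration, not the only way to write it:
|
val fees = sc.parallelize(Seq(
  ("hadoop", 5000), ("spark", 7000), ("NiFi", 6000),
  ("hadoop", 7000), ("spark", 9000), ("NiFi", 4000),
  ("hadoop", 7500), ("spark", 9500), ("NiFi", 4500)))
val sumAndCount = fees.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
val averages = sumAndCount.mapValues { case (total, count) => total.toDouble / count }
|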
Question 23: Which shared variables are provided by the Spark framework?
Answer: In Spark, shared variables are the variables that allow data to be shared globally across the nodes. This is implemented using the following two kinds of variables, each with a different purpose:
· Broadcast variables: Read-only variables cached on each node of the cluster. These variables cannot be updated on individual nodes. They are meant for the same data you want to share across the nodes during data processing.
· Accumulators: This variable can be updated on each individual node; the final value is the aggregate of the values sent by each individual node.
Question 24: Please give us a scenario in which you would use broadcast and accumulator shared variables?
Answer:
Broadcast variables: You can use them as cached data on each node. Whenever you have a small, frequently used dataset that is needed throughout the data processing, you can ask Spark to cache this small dataset on each node using a broadcast variable, and then refer to this cached data during the calculation.
You set the broadcast variable in the driver program, and it is retrieved by the worker nodes on the cluster. Remember, the broadcast variable is retrieved and cached only when the first read request is made.
Accumulators: You can consider them more as global counters. Remember, they are not read-only variables: on each worker node, the executor updates the counter independently. The driver program then accumulates the values from all the worker nodes and generates the aggregated result.
So you can use them when you need to do some counting, for example how many messages were not processed correctly. Using an accumulator, an individual count is generated on each node for the messages that are not processed, and at the end all the counts are accumulated on the driver side, so you get to know how many messages were not processed. See the sketch below.
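An illustrative sketch combining both (the lookup values and the input RDD are placeholders):
|
val courseLookup = sc.broadcast(Map(100 -> "Spark", 101 -> "Hadoop", 102 -> "NiFi"))  // cached on each node
val badRecords   = sc.longAccumulator("badRecords")                                   // global counter
val inputRDD     = sc.parallelize(Seq(100, 101, 999))
val resolved = inputRDD.map { id =>
  courseLookup.value.get(id) match {
    case Some(name) => name
    case None       => badRecords.add(1); "unknown"
  }
}
resolved.count()              // run an action so the accumulator is actually updated
println(badRecords.value)     // the driver reads the aggregated count
|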
Question 25: How do you define the ETL process?
Answer: ETL stands for extraction, transformation, and loading. This is where we create data pipelines for data movement and transformation. In short there are three stages (nowadays the order of the ETL steps can be changed, and sometimes it becomes ELT):
· Extract: You extract data from source systems such as RDBMSs, flat files, social networking feeds, web log files, etc. The data can be in various formats like XML, CSV, JSON, Parquet, or Avro, and the frequency of data retrieval can be defined as daily, hourly, etc.
· Transform: In this step you transform the data into what your downstream system expects. For example, from a text file you can create a JSON file; besides changing file formats, you can also separate valid and invalid data. In this step you typically run many sub-steps to clean your data the way the next step expects.
· Load: This step refers to sending the data to the sink you have defined. In the Hadoop world it could be HDFS or Hive tables; in the case of an RDBMS it could be MySQL or Oracle, and for NoSQL it could be Cassandra or MongoDB.
However, please note that Spark is not an ETL tool by itself, although you can get ETL jobs done entirely using the Spark framework.
Question 26: How do you save data from an RDD to a text file?
Answer: You have to use the RDD's method saveAsTextFile("destination_path"). Similarly, various other methods are available for other file formats.
Question 27: What is a Spark DataFrame and what are its basic properties?
Answer: You can visualize a Spark DataFrame as a table in a relational database. It has the following features:
· It is distributed over the Spark cluster nodes.
· Data is organized in columns.
· It is immutable (to modify it, you have to create a new DataFrame).
· It is held in memory.
· You can apply a schema to the data.
· It gives you a domain-specific language (DSL) for transformations.
· It is evaluated lazily.
In one line: a DataFrame is an immutable distributed collection of data organized into named columns. DataFrames help you take away the RDD's complexity.

Question 28: What are the main differences between DataSet and DataFrame?
Answer: As you may remember, before Spark 1.6 DataFrame and DataSet were separate APIs; they were unified in Spark 2.0. The DataSet API is type-safe and can operate on compiled lambda functions. A DataFrame holds untyped objects, which means syntax errors can be caught at compile time, but any type mismatch can only be caught at run time.
A DataSet, because it holds strongly typed objects, lets both syntax errors and type-mismatch errors be caught at compile time. (If you know Java generics, the concept is easy to understand.)
Question 29: How do you read a JSON file in Spark?
Answer: JSON is semi-structured data, and Spark provides an easy way to read JSON data, as below.
| spark.read.json("hadoopexam.json") |
Question 30: What is a Sequence File?
Answer: SequenceFiles are also key-value pairs, but they keep their keys and values in binary format, and this is one of the most used Hadoop file formats. Spark provides APIs for conveniently using this file format. A SequenceFile contains a header as well; a sync marker helps the reader synchronize to a record boundary from any position in the file. You can enable two types of compression on a SequenceFile: block-level or record-level compression. Recently the Parquet file format has become more popular.
Question 31: What is the use of the sync() and seek() methods of a SequenceFile?
Answer: Using the seek() method, the reader can position the pointer at a given point in the file. However, note that if the position is not a record boundary, then calling next() from there will fail; hence it is always required that you synchronize with the record boundaries.
The sync() method is another way to find a record boundary: as soon as you call sync() on the sequence file reader, it will position itself at the next sync point.
Question 32: Which API method is available in Spark to load a sequence file?
Answer: You can use SparkContext's sequenceFile(path, keyClass, valueClass) method to load a sequence file; both the key and value data types are subclasses of the Hadoop Writable interface. Similarly, to save a sequence file you can use the RDD method below.
rdd.saveAsSequenceFile("path_to_hadoopexam_sequence_file")
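A minimal read sketch (the path and Writable types are illustrative):
|
import org.apache.hadoop.io.{IntWritable, Text}
val seqData = sc.sequenceFile("path_to_hadoopexam_sequence_file", classOf[Text], classOf[IntWritable])
val asScala = seqData.map { case (k, v) => (k.toString, v.get) }   // convert Writables to plain types
|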
Question 33: What is Kryo?
Answer: Kryo is a serialization framework. Java's default serialization mechanism is quite slow, so other serialization frameworks are available and Spark can use them; one of these is Kryo. So when you work with object files, you should consider using Kryo.
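A sketch of switching Spark to the Kryo serializer (the application class registered here is just an example):
|
import org.apache.spark.SparkConf
case class Course(id: Long, name: String)
val conf = new SparkConf()
  .setAppName("HadoopExamApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Course]))
|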
Question 34: Which file systems are supported by Spark out of the box?
Answer: HDFS, S3, and the local file system.
Question 35: When connecting to the HDFS filesystem, which information do you need for the Spark API?
Answer: We mainly need the NameNode URL, port, and file path, for example "hdfs://hadoopexam.com:8020/data/he_file.txt".
Question 36: What do you need to load data from an AWS S3 bucket in Spark?
Answer: You can load data from an AWS (Amazon Web Services) S3 bucket. You need the following three things:
· URL for the file stored in the bucket
· AWS Access Key ID
· AWS Secret Access Key
Once you have this information, you can load data from the S3 bucket as below.
| sc.textFile("s3n://hadoopexam_bucket/data_file.txt") |
You can also pass the keys explicitly:
| sc.textFile("s3n://AWSAccessKey:AWSSecretKey@svr/filepath") |
Question 37: What are the possible ways of working with Spark SQL?
Answer: You can interact with Spark SQL using SQL, the DataFrame API, and the DataSet API. Whatever mechanism you use, the underlying execution engine remains the same. However, since Spark 2.0 the DataSet API is the preferred way.
Question 38: Which SQL dialects are supported by Spark SQL?
Answer: You can use both basic ANSI SQL and HiveQL with Spark SQL, and you can use both of them to read data stored in Hive. You can also use the command line or JDBC/ODBC to interact with the Spark SQL interface.
Question 39: What are the possible sources for creating a DataFrame?
Answer: You can create a DataFrame from a variety of sources: structured data files, Hive tables, external databases (SQL and NoSQL), and existing RDDs, which can be converted to DataFrames.
Question 40: What is the DataSet API?
Answer: DataSet was added in Spark 1.6; it tries to provide the benefits of RDDs (RDDs are strongly typed and can use lambda functions) together with the SQL interface, using the same underlying SQL engine. You can create a DataSet from JVM objects, and once it is constructed you can apply functional transformations on it like map, flatMap, filter, etc.
A DataSet is also a distributed collection. Remember that as of Spark 2.3.0 the DataSet API is only available for Scala and Java; it is not available for Python.
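A minimal Scala sketch, assuming a SparkSession named spark (the case class and values are illustrative):
|
import spark.implicits._
case class Course(id: Long, name: String)
val courses  = Seq(Course(100, "Spark"), Course(101, "Hadoop")).toDS()
val filtered = courses.filter(c => c.name.startsWith("S")).map(c => c.name.toUpperCase)
|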
Question 41: How do you relate SparkContext, SQLContext, and HiveContext?
Answer: SparkContext provides the entry point to the Spark system; to create a SQLContext you need a SparkContext object. HiveContext provides a superset of the functionality provided by the basic SQLContext. However, since Spark 2.0 there is a SparkSession object, which is the preferred way to enter the Spark system; SparkSession unifies all three of SparkContext, SQLContext, and HiveContext.
Question 42: Why do you prefer to use HiveContext?
Answer: You can use HiveContext for all the functionality provided by SQLContext, and it has additional capabilities: you can write queries using HiveQL, use Hive UDFs, and read data directly from Hive tables.
Question 43: To use HiveContext, do you need a Hive setup?
Answer: No, you don't need a Hive setup in place. If there is no Hive setup, you can still use HiveContext as a SQLContext; in fact, HiveContext has been recommended over SQLContext.
Question 44: What are the advantages of using SparkSession (Spark 2.0)?
Answer: SparkSession has built-in support for Hive queries, access to Hive UDFs, and the ability to read data from Hive tables. To use these features you don't need an existing Hive setup.
Question 45: How do you relate DataFrame and DataSet?
Answer: You can think of a DataFrame as a DataSet organized into named columns. In Scala and Java you can represent a DataFrame as a collection of the generic object Row, where Row is an untyped object:
| Scala | DataFrame -> Dataset[Row] |
| Java | Dataset<Row> |
Question 46: Can you please explain DataSet in more detail?
Answer: It is better to take an example to understand this in detail; it is a very important thing to learn. I will take an example in Scala.
Data in the HadoopExam.json file:
|
{"course_id": 100, "course_name": "Spark Training", "Fee": 6000, "Location": "Newyork"}
{"course_id": 101, "course_name": "Hadoop Training", "Fee": 7000, "Location": "Mumbai"}
{"course_id": 102, "course_name": "NiFi Training", "Fee": 8000, "Location": "Pune"}
|
Create a case class which represents the above data, one instance per row:
| case class HadoopExam(course_id: Long, course_name: String, Fee: Long, Location: String) |
Read the JSON file and create the Dataset using the case class HadoopExam; the Dataset is now a collection of JVM Scala objects of type HadoopExam.
| val dataset = spark.read.json("hadoopexam.json").as[HadoopExam] |
Three things happen here under the hood in the code above:
· Spark reads the JSON, infers the schema, and creates a DataFrame.
· At this point, the data is a DataFrame = Dataset[Row], a collection of generic Row objects, since Spark does not yet know the exact type.
· Then Spark converts the Dataset[Row] into Dataset[HadoopExam], type-specific Scala JVM objects, as dictated by the class HadoopExam.
With a Dataset as a collection of Dataset[ElementType] typed objects, you seamlessly get both compile-time safety and a custom view over strongly typed JVM objects. The resulting strongly typed Dataset[T] from the above code can be easily displayed or processed with high-level methods.
Question 47: For which use cases does Apache Spark fit well?
Answer: Spark is good for both interactive and batch processing.
Question 48: Can you describe which Spark projects you have used?
Answer: Spark has several other projects besides Spark Core, as below:
· Spark SQL: This project helps you work with structured data; you can mix SQL queries and the Spark programming API to get your expected results.
· Spark Streaming: It is good for processing streaming data and can help you create fault-tolerant streaming applications. Spark Streaming has been improved, and the new Structured Streaming, which uses the Spark SQL engine, was created; you will find more detail in a later question.
· MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write applications against the Spark machine learning library.
· GraphX: API for graphs and graph-parallel computations.
Question 49: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?
Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers. Similarly, when you run on Spark standalone, the application processes are managed by the Spark Master and Worker nodes.
Question 50: Can you write a simple word count application using Apache Spark, in either Scala or Python, given your hands-on experience?
Answer: You are sometimes asked to write a very simple Spark application, to check whether a person has actually worked on Spark or not. It is not mandatory that the syntax be perfectly correct.
Example in Scala:
|
val hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")
val counts = hadoopExamData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")
|
Example in Python:
|
hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")
counts = hadoopExamData.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")
|
Question 51: How do you compare a MapReduce job and a Spark application?
Answer: Spark has many advantages over a Hadoop MapReduce job; let's describe each model.
MapReduce: The highest-level unit of computation in MapReduce is a job. A job loads data, applies a map function, shuffles it, runs a reduce function, and finally writes the data back to persistent storage.
Spark application: The highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests; a Spark application can therefore consist of more than just a single MapReduce-style job.
MapReduce starts a new process for each task. In contrast, a Spark application can have executor processes running on its behalf even when it is not currently running a job, and multiple tasks can run within the same executor. Combining extremely fast task startup with in-memory data storage results in performance that can be orders of magnitude faster than MapReduce.

Question 52: Please explain the Spark execution model?
Answer: The Spark execution model has the following concepts:
· Driver: An application maps to a single driver process. The driver process manages the job flow, schedules tasks, and is available the entire time the application is running. Typically, this driver process is the same as the client process used to initiate the job, although when run on YARN the driver can run in the cluster. In interactive mode, the shell itself is the driver process.
· Executor: For a single application/driver, a set of executor processes is distributed across the hosts in the cluster. The executors are responsible for performing work, in the form of tasks, as well as for storing any data that you cache. Executor lifetime depends on whether dynamic allocation is enabled. An executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.

Figure 1:To-do
Update this image
· Stage: A stage is a collection of tasks that run the same code, each on a different subset of the data.
Question 53: What is Spark Streaming?
Answer: Spark Streaming is an extension of core Spark that enables scalable, high-throughput, fault-tolerant processing of data streams. Spark Streaming receives input data streams and divides the data into micro-batches; this stream of batches is represented as a DStream.
Question 54: How do you define a DStream, or what is a DStream?
Answer: You can create a DStream from sources like Kafka, Flume, and Kinesis, or by applying operations on other DStreams. Every receiver-based input DStream is associated with a Receiver, which receives the data from the source and stores it in executor memory.
Question 55: What is dynamic allocation?
Answer: Dynamic allocation allows Spark (on YARN) to dynamically scale the cluster resources allocated to your application based on the workload. When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors. When the application becomes idle, its executors are released and can be acquired by other applications.
When Spark dynamic resource allocation is enabled, all available resources may be allocated to the first submitted job, causing subsequent applications to be queued up. To allow applications to acquire resources in parallel, allocate resources to pools, run the applications in those pools, and enable applications running in pools to be preempted.
Question 56: How are Spark Streaming applications impacted by dynamic allocation?
Answer: When dynamic allocation is enabled in Spark, executors are removed when they are idle. However, dynamic allocation is not effective in the case of Spark Streaming. In Spark Streaming, data comes in every batch, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed; if the executor idle timeout is greater than the batch duration, executors are never removed. Hence, it is recommended that you disable dynamic allocation for Spark Streaming by setting "spark.dynamicAllocation.enabled" to false.
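A sketch of how this could be set when building the configuration (the application name is a placeholder); the same property can also be passed via spark-submit --conf:
|
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("HadoopExamStreaming")
  .set("spark.dynamicAllocation.enabled", "false")   // recommended for Spark Streaming jobs
|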
Question 57: When you run a Spark Streaming application in local mode (not on Hadoop YARN), it must have at least two threads; why?
Answer: As discussed previously, when a Spark Streaming application is executed it requires at least two threads: one for receiving the data and one for processing that data.
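A minimal local-mode sketch (names and batch interval are illustrative) showing the two-thread requirement via local[2]:
|
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("HadoopExamStreaming")  // at least 2 threads
val ssc  = new StreamingContext(conf, Seconds(10))
|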
Question 58: How do you enable fault-tolerant data processing in Spark Streaming?
Answer: If the driver host for a Spark Streaming application fails, it can lose data that has been received but not yet processed. To ensure that no data is lost, you can use Spark Streaming recovery: Spark writes incoming data to HDFS as it is received and uses this data to recover state if a failure occurs.
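A recovery-oriented sketch (paths and batch interval are illustrative): checkpointing plus StreamingContext.getOrCreate so the context can be rebuilt from the checkpoint after a failure.
|
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("HadoopExamStreaming")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs://hadoopexam:8020/spark/checkpoints")   // enable checkpointing
  // define input DStreams and transformations here
  ssc
}
val ssc = StreamingContext.getOrCreate("hdfs://hadoopexam:8020/spark/checkpoints", createContext _)
|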
Question 59: When you use Spark Streaming with AWS S3 (or another cloud store), which storage is recommended for checkpoints?
Answer: When using a Spark Streaming application with a cloud service as the underlying storage layer, use ephemeral HDFS on the cluster to store checkpoints, instead of the cloud store such as Amazon S3 or Microsoft ADLS.
Question 60: If you have structured data, which Spark components can you use?
Answer: To work with structured data, we should use Spark SQL with the DataFrame/DataSet API.
Question 61: What can you do with the SQLContext object?
Answer: SQLContext is an entry point for Spark SQL functionality; you create a SQLContext object using a SparkContext. Using SQLContext, you can create a DataFrame from an RDD, a Hive table, or a data source.
Question 62: Should we use SQLContext or HiveContext for SQL functionality?
Answer: You should use HiveContext, whether or not you are accessing data from Hive or Impala. With HiveContext you can access Hive or Impala tables represented in the metastore database.
Question 63: Can Hive SQL syntax be used with Spark SQL?
Answer: Hive and Impala tables and the related SQL syntax are interchangeable in most cases. Because Spark SQL uses the underlying Hive infrastructure, you can write DDL statements, DML statements, and queries using HiveQL syntax. For interactive query performance, you can access the same tables through Impala using impala-shell or the Impala JDBC and ODBC interfaces.
Question 64: In spark-shell, how do you access the HiveContext?
Answer: In spark-shell, a HiveContext is already created and available as sqlContext (an instance, not the SQLContext class). However, in your own Spark application you have to explicitly create the HiveContext object using the SparkContext, as below (PySpark example).
| sqlContext = HiveContext(sc) |
Question 65: What is the minimum requirement for the node from which a Spark application is submitted, if you are using Cloudera infrastructure?
Answer: If you are using a CDH cluster, then the host from which you submit applications, or run spark-shell or pyspark, must have the Hive Gateway role defined in Cloudera Manager and client configurations deployed. Also remember, if the Spark application uses a Hive view, it must have permissions to read that data, otherwise it will get an empty result.
Question 66: Can you use SQL to read data that lives outside a Hive or Impala table?
Answer: If the data files are outside of a Hive or Impala table, you can use SQL directly to read JSON or Parquet data into a DataFrame.
| df = sqlContext.sql("SELECT * FROM json/parquet_file_data_dir") |
Question 67: Which storage systems are supported by Spark? Give a few examples you have used so far.
Answer: Spark can access all the storage sources supported by Hadoop, including the local filesystem, HDFS, HBase, Amazon S3, and Microsoft ADLS.
Question 68: Give examples of file types which are supported by Spark?
Answer: Spark supports many file types, including text files, RCFiles, SequenceFiles, Hadoop InputFormats, Avro, Parquet, and various compression formats of the supported file types.
Question 69: Give sample API calls to read/write compressed files using Spark?
Answer: You can read compressed files using one of the following methods:
· textFile(path)
· hadoopFile(path, inputFormatClass)
You can save compressed files using one of the following methods:
· saveAsTextFile(path, compressionCodecClass="codec_class")
· saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")
where codec_class is one of the classes listed under Compression Types. An example follows below.
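A minimal Scala sketch (paths are illustrative): compressed input is detected from the file extension, and the output codec is passed explicitly.
|
import org.apache.hadoop.io.compress.GzipCodec
val data = sc.textFile("hdfs://hadoopexam:8020/input/he_data.txt.gz")
data.saveAsTextFile("hdfs://hadoopexam:8020/output/he_data_gz", classOf[GzipCodec])
|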
Question 70: How do you access data stored in AWS S3?
Answer: To access data stored in Amazon S3 from Spark applications, you use the Hadoop file APIs for reading and writing RDDs, and you need the S3 bucket URL for that:
· SparkContext.hadoopFile
· JavaHadoopRDD.saveAsHadoopFile
· SparkContext.newAPIHadoopRDD
· JavaHadoopRDD.saveAsNewAPIHadoopFile
Question 71: What are Avro files?
Answer: Avro is a serialization system with binary encoding. One of its best features is that Avro is language independent; however, it is expected that you provide .avro as the file extension. By default Avro data files are not compressed, but it is recommended to enable compression to reduce disk usage and increase read and write performance. Avro data files support Deflate and Snappy compression; Snappy is faster, but Deflate is slightly more compact.
You do not need to specify any configuration to read a compressed Avro data file. However, to write an Avro data file you must specify the type of compression, and how you specify compression depends on the component.
Question 72: What are the limitations when you process Avro files using Spark?
Answer: When Spark reads data, it converts it to data types specific to the Spark framework, so the points below are important when you work with Avro files in Spark (remember that Avro also has a schema for its data):
· Avro enumerated types are erased by Spark - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
· All output will be unions - Spark writes everything as unions of the given type along with a null option.
· The Avro schema will be updated/changed - Spark reads everything into an internal representation. Even if you just read and then write the Avro data, the schema for the output is different and follows the Spark specifications.
· Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
Question 73: How do you read and write Parquet files in Apache Spark?
Answer: Reading and writing Parquet files is quite simple in Spark; you can use SQLContext and DataFrame to read and write Parquet files.
| Read -> sqlContext.read.parquet("parquet file input path") |
| Write -> dataFrame.write.parquet("parquet output file path") |
Question 74: How can you build your application when the code is written in Java or Scala?
Answer: You can use the Maven build tool to build an application written in Java or Scala; for Scala, you can also use SBT.
Question 75: How do you submit Spark applications?
Answer: To submit a Spark application, you have to use the "spark-submit" utility. A common use, submitting to YARN, is shown below.
|
spark-submit \
  --class com.hadoopexam.analytics.WordCount \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.eventLog.dir=hdfs://hadoopexam:8020/user/spark/hadoopexamlog" \
  lib/hadoopexam-example.jar \
  10
|
Question 76: When you submit a Spark application with configuration settings, what precedence order is followed?
Answer: The order of precedence of configuration properties is as follows:
1. Properties passed to SparkConf.
2. Arguments passed to spark-submit, spark-shell, or pyspark.
3. Properties set in spark-defaults.conf.
Question 77: While submitting your Spark application you are getting a "Task not serializable" exception; what is the reason and the resolution?
Answer: Because of a limitation in the way Scala compiles code, some applications with nested definitions running in an interactive shell may encounter a "Task not serializable" exception. It is recommended to submit these as applications, so that the entire jar can be transferred to the worker nodes.
Question 78: What are the advantages of running Spark on the YARN cluster manager?
Answer: There are many advantages of running Spark on the YARN cluster manager:
1. Sharing cluster resources dynamically: You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
2. Scheduling: You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
3. Executors: You choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
4. Security: Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
Question 79: Which steps are followed when you submit a Spark application on the YARN cluster manager?
Answer: Spark orchestrates its operations through the driver program. When the driver program is run, the Spark framework initializes executor processes on the cluster hosts that process your data. The following occurs when you submit a Spark application to a cluster:
1. Launch main method: The driver is launched and invokes the main method of the Spark application.
2. Resource request: The driver requests resources from the cluster manager to launch executors.
3. Launch executors: The cluster manager launches executors on behalf of the driver program.
4. Submitting tasks: The driver runs the application; based on the transformations and actions in the application, the driver sends tasks to the executors.
5. Task execution: Tasks are run on the executors to compute and save results.
6. Effect of dynamic allocation: If dynamic allocation is enabled, executors are released after they have been idle for a specified period.
7. Stop SparkContext: When the driver's main method exits or calls SparkContext.stop, it terminates any outstanding executors and releases resources from the cluster manager.
Question 80: What happens when you run a Spark application on YARN in cluster mode?
Answer: When you submit a Spark application to YARN in cluster mode, the Spark driver runs in the ApplicationMaster on a YARN cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application.

Question 81: When is cluster mode not suitable for a Spark application?
Answer: Cluster mode is not well suited to applications that need to be interactive, for example applications that require user input, such as spark-shell and pyspark. In those cases the driver program needs to run as part of the client process that initiates the Spark application.
Question 82: What is the client deployment mode?
Answer: When an application is submitted in this mode, the Spark driver runs on the host from which the job is submitted. The ApplicationMaster is then responsible only for requesting executor containers from YARN; after the containers start, the client communicates with them to schedule work.

Question 83: How do you run the Spark shell on YARN?
Answer: To run the spark-shell or pyspark client on YARN, use the --master yarn --deploy-mode client flags when you start the application.
Question 84: How can you monitor and debug a Spark application submitted to YARN?
Answer: To obtain information about Spark application behavior, you can look at the YARN logs and the Spark web application UI.
Question 85: What advantages do you get when you choose Java/Scala for your Spark application rather than PySpark?
Answer: Scala also uses the Java JVM, so both have the same advantages; moreover, the Spark framework itself is written in Scala. Spark with Java or Scala offers several advantages: platform independence by running inside the JVM, self-contained packaging of code and its dependencies into single JAR files, and higher performance because Spark itself runs in the JVM. You lose these advantages when using the Spark Python API.
Question 86: What general problem do you see when you need to run PySpark?
Answer: One of the major problems while using PySpark is that managing third-party dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, you must understand that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy, the Spark executors require access to those libraries when they run on remote executors.
Question 87: In which scenarios is Python preferred over Scala and Java for Spark applications?
Answer: Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus. So you could say that when working with machine learning, you may prefer PySpark.
Question 88: If you need a single shared file while running your program on various nodes of the cluster, how can you provide it to your application?
Answer: If you need only a single shared file, which needs to be transferred to each node, you can use the --py-files option while submitting the application with spark-submit and specify the local path to the file. Alternatively, you can use the programmatic option, the sc.addPyFile() function.
Question 89: When you need functionality from multiple files, how do you provide them?
Answer: If you use functionality from multiple Python files, you can create an egg/zip for the package, because the --py-files flag also accepts a path to an egg or zip file.
Question 90: What can be a problem with distributing egg files?
Answer: Sending egg files (collections of Python files) is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client's CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
Question 91: What is a Spark dataset partition?
Answer: A Spark dataset comprises a fixed number of partitions, and each partition contains a number of records.
Question 92: What is done during a shuffle?
Answer: Spark performs a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.
Question 93: You have been given the code snippet below. Do you think data shuffling will be applied to partitions across the nodes?
| sc.textFile("hadoopexam.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc) |
Answer: When the data in one partition does not depend on data in other partitions, and the transformation does not require data from partitions on other nodes, shuffling is not applied. None of the three transformations here (map, flatMap, and filter) requires shuffling data across partitions.
Question 94: Why does having more stages impact the performance of a Spark application?
Answer: If you have written Spark code that has more stages, it will suffer in performance, because at each stage boundary data is persisted (either in cache or on disk) and possibly shuffled across partitions. As you know, wherever these two kinds of I/O (network and disk) come into the picture, there will be a large impact on performance, and stage boundaries in a Spark job cause exactly this.
Question 95: What is the performance impact of setting/changing the number of partitions during a transformation?
Answer: Transformations that can trigger a stage boundary typically accept a numPartitions argument, which specifies into how many partitions to split the data in the child stage. Just as the number of reducers is an important parameter in MapReduce jobs, the number of partitions at stage boundaries can determine an application's performance.
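A short sketch with a placeholder pair RDD (the partition counts are illustrative):
|
val pairRDD = sc.parallelize(Seq(("hadoop", 1), ("spark", 1), ("hadoop", 1)))
val counts  = pairRDD.reduceByKey(_ + _, 200)   // 200 partitions in the child stage after the shuffle
val fewer   = counts.repartition(50)            // explicitly change the partitioning later if needed
|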
Question 96: Which methods should be avoided so that less data shuffling happens across partitions?
Answer: When choosing an arrangement of transformations, minimize the number of shuffles and the amount of data shuffled. Shuffles are expensive operations: all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles, and not all of these transformations are equal.
Question 97: What is the difference between the two Spark statements below, and which is better from a performance perspective?
· rdd.groupByKey().mapValues(_.sum)
· rdd.reduceByKey(_ + _)
Answer: Both operations produce the same result. However, with the first one, as discussed in the previous question, the entire dataset is transferred across the network, while the second one computes local sums for each key in each partition and combines those local sums into larger sums after shuffling.
Question 98: What performance issue will you observe with the code segment below, and what do you suggest as a better alternative?
| rdd.map(kv => (kv._1, Set(kv._2))).reduceByKey(_ ++ _) |
Answer: This uses reduceByKey when the input and output value types are different: a transformation that finds all the unique strings corresponding to each key. You could use map to transform each element into a Set and then combine the Sets with reduceByKey(), but this results in unnecessary object creation, because a new set must be allocated for each record. Instead, use aggregateByKey, which performs the map-side aggregation more efficiently:
|
val zero = collection.mutable.Set[String]()
rdd.aggregateByKey(zero)((set, v) => set += v, (set1, set2) => set1 ++= set2)
|
Question 99: If you have a small dataset which needs to be joined with another, much bigger dataset, what approach will you use?
Answer: Since one dataset is small and the other is very big, we should consider using a broadcast variable, which helps improve overall performance. To avoid shuffles when joining two datasets, you can use broadcast variables: when one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
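An illustrative sketch of such a map-side join (the two RDDs and their contents are placeholders):
|
val smallRDD = sc.parallelize(Seq(("hadoop", "HDFS"), ("spark", "RDD")))
val bigRDD   = sc.parallelize(Seq(("hadoop", 5000), ("spark", 7000), ("NiFi", 6000)))
val smallLookup = sc.broadcast(smallRDD.collectAsMap())   // collect the small dataset on the driver
val joined = bigRDD.flatMap { case (key, value) =>
  smallLookup.value.get(key).map(other => (key, (value, other)))   // lookup instead of a shuffle join
}
|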
Question 100: When is it advantageous to have a shuffle?
Answer: When you are working with a huge volume of data, more processing power is available, and the application is compute intensive, a shuffle can help so that the data can be processed in parallel using all the available CPUs. Another use case is aggregation: if you have a huge volume of data and you want to apply an aggregate function on it, the single thread of the driver will become a bottleneck. You should shuffle data across the nodes and then apply the aggregate functions locally on each node, so that data can be aggregated in parallel first and the final aggregation is done in the driver program.
Question 101: Which two resources are used by a Spark application but managed by neither YARN nor Spark?
Answer: The two main resources that Spark and YARN manage are CPU and memory. Disk and network I/O also affect Spark performance, but neither Spark nor YARN actively manages them.
Question 102: When you deploy Spark on the YARN cluster manager, how does ApplicationMaster memory come into the picture?
Answer: The ApplicationMaster, which is a non-executor container that can request containers from YARN, requires memory and CPU that must be accounted for. In client deployment mode, these default to 1024 MB and one core. In cluster deployment mode, the ApplicationMaster runs the Spark application driver, so consider augmenting its resources with the --driver-memory and --driver-cores flags.
Question 103: What is a Parquet file?
Answer: Parquet is an open-source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes (Spark 2.0 also uses it extensively). Parquet allows compression schemes to be specified on a per-column level and allows adding more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.
Question 104: When you submit your application to the Spark cluster, do you provide the Hadoop and Spark jars with it?
Answer: No, we don't have to include the Spark and Hadoop jars with the application jar, because they are available at runtime.
Question 105: What is the Beeswax application?
Answer: Beeswax is a Hue application with which you can run queries on Hive; you can even create tables, load data, and run and manage Hive queries.
Question 106: How do you define a cluster manager?
Answer: A cluster manager is an external service which helps in acquiring resources on the cluster, such as Spark standalone or YARN.
Question 107: Which common compression techniques are used with Apache Hadoop?
Answer: Commonly used compression tools within Apache Hadoop are gzip, bzip2, Snappy, and LZO.
Question 108: How do you define a dataset in Spark?
Answer: We may have already covered this question; this answer adds to that. A dataset is a collection of records, similar to a relational database table. Records are similar to table rows, but the columns can contain not only strings or numbers, but also nested data structures such as lists, maps, and other records.
Question 109: How do you define the driver in Spark?
Answer: In Apache Spark, the driver is the process that represents an application session. The driver is responsible for converting the application into a directed graph of individual steps to execute on the cluster. There is one driver per application.
Question 110: Define the executor process in Apache Spark?
Answer: An executor is a process that serves a Spark application. An executor runs multiple tasks over its lifetime, and multiple tasks concurrently. A host may have several Spark executors, and there are many hosts running Spark executors for each application.

Question 111: Describe LZO compression?
Answer: LZO is a free, open-source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is the most CPU-efficient of the codecs, using very little CPU. Its compression ratios are not as good as others, but its compression is still significant compared to the uncompressed file sizes. Unlike some other formats, LZO-compressed files are splittable, enabling MapReduce to process splits in parallel.
Question 112: Define the MapReduce algorithm?
Answer: MapReduce is a distributed processing framework for processing and generating large data sets, with an implementation that runs on large clusters of machines.
The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.
The implementation provides an API for configuring and submitting jobs, job scheduling and management services, a library of search, sort, index, inverted index, and word co-occurrence algorithms, and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
Question 113: Define a stage in Spark?
Answer: In Spark, a stage is a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling the data.
Question 114: Define a task in an Apache Spark application?
Answer: A task is a unit of work on a partition of an RDD.
Question 115: What are the main components of the YARN architecture?
Answer: YARN is a general architecture for running distributed applications. It specifies the following components:
· ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
· ApplicationMaster - A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The ApplicationMaster requests containers, which are sized by the resources a task requires to run.
· NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.
· JobHistory Server - Keeps track of completed applications.
The ApplicationMaster negotiates with the ResourceManager for cluster resources, described in terms of a number of containers, each with a certain memory limit, and then runs application-specific processes in those containers. The containers are overseen by the NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
Question 116: What is the ZooKeeper service?
Answer: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.
Question 117: How can you create a DataFrame object?
Answer: There are many ways to create a DataFrame, but the following sources are commonly used:
· Structured (CSV, TSV) data files.
· Hadoop Hive tables.
· RDBMS tables and SQL query output.
· From an RDD.
· Avro and Parquet data files.
· You can also extend it to use your own custom formats.
Question 114: Please
describe something about the Spark DataSet?
Answer: The DataSet
API was added in Spark 1.6. A DataSet provides the benefits of both RDDs and the
SparkSQL optimizer. You can create a DataSet from JVM objects and apply
functional transformations on it using functions like map, filter, etc.
A DataSet is a collection of strongly-typed objects, which are
defined using a user-defined case class.
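A minimal sketch, assuming an existing SparkSession named spark (the case class and values are made up):

case class Employee(name: String, age: Int)

import spark.implicits._   // brings in the encoders needed for case classes
val ds = Seq(Employee("Amit", 30), Employee("Rina", 17)).toDS()

val adultNames = ds.filter(_.age >= 18).map(_.name)   // typed, functional transformations
adultNames.show()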
Question 115: Can you
provide the difference between DataFrame and DataSet?
Answer: As we
have seen previously, a DataFrame can be seen as a DataSet[Row], where Row is a
generic un-typed object, while a DataSet is a collection of strongly-typed
objects specified using a user-defined case class.
Because a DataFrame holds un-typed objects, only syntax errors
can be caught at compile time; any type mismatch can
only be caught at run-time.
A DataSet holds strongly-typed objects, so both syntax
errors and type mismatches can be caught at compile time. (If
you know Java Generics, the concept is easy to understand.)
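A hedged illustration of the difference, continuing with the Employee DataSet ds from the previous example (the misspelled field/column name is intentional):

val df = ds.toDF()                 // the same data as an un-typed DataFrame

// DataFrame: the column name is just a string, so the typo compiles fine
// and fails only at run-time with an AnalysisException.
val dfNames = df.select("namee")

// DataSet: the field is resolved on the case class, so the same typo
// is rejected at compile time.
// val dsNames = ds.map(_.namee)   // does not compile
val dsNames = ds.map(_.name)       // checked by the compiler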
Question 116: What are
case classes, and why do you use them in Spark?
Answer: Case
classes are the Scala way of creating Java POJOs. They implicitly create accessor
(getter) methods for the object members. In Spark they are used to assign a schema
to a DataSet/DataFrame object.
Question 117: What is
the Catalyst optimizer?
Answer: It is a
core part of SparkSQL, which is written in Scala. Among other things, it helps Spark with
·
Schema inference from JSON data
You can say that it helps Spark SQL generate a query
plan, which can easily be converted into a Directed Acyclic Graph (DAG) of RDDs. Once
the DAG is created, it is ready to execute. The optimizer does a lot of work before
producing the optimized query plan; its main purpose is to create an optimized
DAG.
Both the query plans and the optimized query plans are
internally represented as trees. The Catalyst optimizer also has various other
libraries, which help in working on these trees.
Question 118: What
types of plans can be created by the Catalyst optimizer?
Answer: The Catalyst
optimizer can create the following two query plans
·
Logical plan:
It defines the computations on the DataSets, without defining how to carry out
the specific computations.
·
Physical
plan: It defines the computation on the datasets in a form that can be executed to get the
expected result. Generally, multiple physical plans are generated by the
optimizer, and later the cost-based optimizer selects the least costly plan
to execute the query.
Question 119: How do
you print all the plans created by the Catalyst optimizer for a query?
Answer: We have
to use the explain(Boolean) method, something
similar to the below. With explain(true), Spark prints the logical plans as well
as the selected physical plan.

dataframe1.join(dataframe2, "region").selectExpr("count(*)").explain(true)
Question 120: What is
Project Tungsten?
Answer: You can
say that this is one of the largest changes made to Spark's execution engine. It
focuses on CPU and memory efficiency rather than I/O and network, because in Spark CPU
and memory had become the major performance bottlenecks.
Before Spark 2.0, many CPU cycles were wasted:
rather than being used for computation, they were spent reading and writing
intermediate data to the CPU cache.
Project Tungsten helped improve the efficiency of memory
and CPU usage, so that the hardware can be pushed closer to its limits.
Question 121: Do you
see any pain points with regard to Spark DStream?
Answer: There are
some common issues with Spark DStream
·
Timestamp:
It considers the time at which an event entered the Spark system, rather than
the timestamp attached to the event itself.
·
API:
You have to write different code for batch and stream processing.
·
Failure
conditions: The developer has to manage various failure conditions manually.
Question 122: Please
list down some advantages of Structured Streaming compared to DStream?
Answer: As we
have seen in the previous question, DStream has various issues; that is the reason
Structured Streaming was introduced in Apache Spark (a minimal sketch follows the list).
·
It is fast
·
Fault-tolerant
·
Exactly-once stream processing approach
·
Input data can be thought of as an append-only table
(which grows continuously)
·
Trigger: You can specify a trigger, which checks
for new input data at a defined time interval.
·
API: The high-level API is built on the SparkSQL
engine and is tightly integrated with SQL queries and the DataFrame and DataSet APIs.
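A minimal word-count sketch, assuming an existing SparkSession named spark and a hypothetical socket source on localhost:9999:

import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // hypothetical source
  .option("port", 9999)
  .load()

val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))   // the trigger interval mentioned above
  .start()
query.awaitTermination()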
Question 124: Can you
give some examples of special-purpose engines which can work as a data source
for Spark, and also what kind of functionality can they provide?
Answer: Spark
supports the special-purpose engines below
·
ElasticSearch : Good for search
·
Kafka : Good for messaging
·
Redis : Good for caching
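For instance, here is a hedged sketch of using Kafka as a streaming source (the broker address and topic name are made up, and the spark-sql-kafka package is assumed to be on the classpath):

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()

// Kafka delivers records as binary key/value columns, so cast them to strings
val events = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")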
Question 125: You
need to select a small dataset from the stored data; however, it is not recommended
that you use Spark for that. Why?
Answer: You
should not use Spark for such use cases because Spark has to go through
all the stored files and then find your result in them. You should consider using
an RDBMS or some other storage which indexes particular columns of the data, so data
retrieval will be much faster in that case.
Question 126: You
have created a DataFrame from multiple input files and want to save it into a single
output file; what do you use?
Answer: You have
to use select * and the coalesce command
of the DataFrame, for example as below.
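A minimal sketch, assuming the DataFrame df already exists and the output path is hypothetical:

df.coalesce(1)                      // merge all partitions into a single partition
  .write
  .option("header", "true")
  .csv("/output/single-file-dir")   // hypothetical path; Spark writes one part file inside it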
Question 127: You are
accessing data stored in AWS S3 from Spark, and TLS (data encryption) is enabled
on the bucket; what do you need?
Answer: If the S3
bucket is TLS enabled and you are using a custom jssecacerts truststore, make
sure that your truststore includes the root Certificate Authority (CA)
certificates that signed the Amazon S3 certificate.
Question 128: Your
Spark is installed on EC2 instances and is trying to access data stored in an S3
bucket, but you do not want to provide the credentials explicitly; how can
you do that?
Answer: As we
know, both EC2 and S3 are Amazon services, so we can leverage IAM roles (Learn
Amazon AWS Solution Architect from here). This mode of operation
associates the authorization with individual EC2 instances instead of with each
Spark app or the entire cluster.
Run the EC2 instances with instance profiles associated with IAM
roles that have the permissions you want. Requests from a machine with such a
profile authenticate without credentials.
Question 129:
Cloudera also provides a way to store the AWS bucket credentials to provide
system-wide AWS access to a single predefined bucket; what is that?
Answer: Cloudera
recommends that you use the Hadoop Credential Provider to set up AWS access,
because it provides system-wide AWS access to a single predefined bucket
without exposing the secret key in a configuration file or having to specify it
at runtime.
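As a hedged sketch, once the credentials have been stored in a credential provider (the jceks path and bucket name below are made up), Spark only needs to be pointed at the provider rather than at the raw keys:

// Hypothetical provider path, created beforehand with the Hadoop credential CLI
spark.sparkContext.hadoopConfiguration.set(
  "hadoop.security.credential.provider.path",
  "jceks://hdfs@namenode:8020/user/etl/aws.jceks")

val df = spark.read.parquet("s3a://my-predefined-bucket/data/")   // hypothetical bucket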
Question 130: What
are the ways by which AWS bucket access can be controlled?
Answer: AWS
access for users can be set up in two ways. You can either provide a global
credential provider file that allows all Spark users to submit S3
jobs, or have each user submit their own credentials every
time they submit a job.
Question 131: You
have got a complaint from the security department that the AWS bucket access
credentials are visible in a log file; where is the mistake?
Answer: It is
because the credentials are stored in a non-recommended way. You might have used either
of the methods below for providing the credentials to access the S3 bucket
·
Specified the credentials at runtime, using
configuration properties, something like this

sc.hadoopConfiguration.set("fs.s3a.access.key", "...")

·
Or you have configured these credentials in the
core-site.xml file.
Both of the above configurations are not recommended if you
want complete security for your data. Rather, use the Hadoop Credential Provider.







