Question 1: How do you differentiate between Hadoop and Spark?
Answer: Hadoop is more of an ecosystem: it provides a platform for storing data in different formats, a compute engine, cluster management, and a distributed file system (HDFS). Spark, on the other hand, is primarily a compute engine, which integrates well with Hadoop; you can even say Spark works as one of the compute engines for Hadoop. Spark does not have its own storage engine, but it can connect to various storage systems such as HDFS, the local file system, an RDBMS, etc.

Question 2: What is the basic functionality of Spark Core?
Answer: Spark Core is the heart of the entire Spark engine, and it provides the following functionality:
· Managing the memory pool
· Scheduling tasks on the cluster
· Recovering from failed jobs
· Integrating with various storage systems such as RDBMS, HDFS, AWS S3, etc.
· Providing the RDD APIs, which are the basis for the higher-level APIs
Spark Core abstracts the native APIs and lower-level technicalities from the end user.
Question 3: If you have existing queries created using Hive, what changes do you need to make to run them on Spark?
Answer: The Spark SQL module can run queries in Spark, and Spark SQL can also use the Hive metastore, so there is no need to change anything to run Hive queries in Spark SQL; it can even use UDFs defined in Hive.
Question 4: What is the Driver Program in Spark?
Answer: This is the main program of a Spark application; in the case of the REPL, the spark-shell itself is the driver program. The driver program creates the instance of SparkSession (Spark 2.0) or SparkContext. It is the responsibility of the driver program to communicate with the cluster manager to distribute the tasks to the cluster worker nodes.

Question 5: What is the role of the cluster manager, and which cluster managers are supported by Spark?
Answer: The cluster manager helps in managing the entire cluster of worker nodes and communicates with the driver node as well. Spark supports the following three cluster managers:
· YARN: Yet Another Resource Negotiator (part of the Hadoop ecosystem)
· Mesos
· Standalone cluster manager: This comes with Spark itself and is good for testing and POCs.
Question 6: Which daemon processes are part of the standalone cluster manager?
Answer: As mentioned, the standalone cluster manager is not good for production use cases; you should consider it only for testing and POC purposes. When you start Spark with the standalone cluster manager, it creates two daemon processes:
· Master (runs on the master node)
· Worker (runs on each worker node)
Question 7: How do you define a Worker Node in a Spark cluster?
Answer: You can say that worker nodes are the slave nodes, and the actual processing of your data happens on the worker nodes. A worker node always communicates with the Master node and informs it about the availability of resources. Generally, one worker process is started on each node in your cluster, and it is responsible for running your application on that particular node and also for monitoring that application.
Question 8: What are executors in Spark?
Answer: Each Spark application you submit has its own executor processes, and the executors exist only as long as your application runs. The driver program uses these executors to run tasks on the worker nodes; executors also help in keeping data in memory, and when required the data can be spilled to disk.
Question 9: What is a Task in Spark?
Answer: A task is the unit of work that is sent to an executor running on a worker node. It is a command sent to the executor by the driver program by serializing your function object. It is the responsibility of the executor process to deserialize this function object and execute it on the data that is partitioned (one of the RDD partitions) and exists on that node.
Question 10: Describe some uses of the SparkContext object?
Answer: It is one of the entry points to your Spark cluster; once you have hold of a SparkContext you can create new RDDs from existing data, and you can create accumulators and broadcast variables on that cluster. Since Spark 2.0, the more general entry point for the Spark system is the SparkSession object.
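Below is a minimal sketch, assuming a SparkContext already exists as sc (as in spark-shell); the variable names are illustrative only.
|
val heRDD = sc.parallelize(Seq(1, 2, 3, 4))                   // new RDD from a local collection
val errorCount = sc.longAccumulator("errorCount")             // accumulator (Spark 2.0+ API)
val lookup = sc.broadcast(Map("hadoop" -> 1, "spark" -> 2))   // read-only broadcast variable
|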
Question 11: How many SparkContext objects should you create per JVM?
Answer: You can create as many as you want, but it is always preferred that you create only one SparkContext object per JVM. So if you have already created a SparkContext object in a JVM, you first need to stop it by calling its stop() method before creating a new one. As mentioned previously, when you start the spark shell (REPL), one SparkContext object is created by default and assigned to a variable named sc, so you should not create a new SparkContext object in that shell.
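A short illustrative sketch (application name and master are placeholders): stop the existing context before creating a new one in the same JVM.
|
import org.apache.spark.{SparkConf, SparkContext}
sc.stop()
val newSc = new SparkContext(new SparkConf().setAppName("HadoopExamApp").setMaster("local[2]"))
|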
Question 12: Which object represents the primary entry point into the Spark 2.0 system?
Answer: SparkSession is the primary object you should use from Spark 2.0 onwards, because it has all the capabilities that the other, more specific objects have, such as SQLContext, HiveContext, and SparkContext. In the REPL (the Spark command-line utility), it is available as the spark object.
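A minimal sketch of building a SparkSession in your own application (in spark-shell it is already available as spark); the application name and master are placeholders:
|
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("HadoopExamApp")
  .master("local[2]")
  .getOrCreate()
|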
Question 13: When do you create a new SparkContext object?
Answer: You should consider creating a new SparkContext object whenever you have a standalone Spark application written in Java, Scala, Python, or R.
Question 14: Which script will you use to submit your standalone Spark application?
Answer: The spark-submit script is used to submit a Spark application.
Question 15: When do you use client mode and cluster mode to submit your application?
Answer: There are two deployment modes for a Spark application:
· Client Mode: Use this mode when your gateway machine is collocated with the worker machines. In this mode the driver is launched as part of the spark-submit process and acts as a client to the cluster.
· Cluster Mode: This is used when the machine submitting your application is far away from the worker machines, for example your own laptop or a computer that is not part of the Spark cluster. Here the driver runs inside the cluster. A command-line illustration follows below.
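An illustrative spark-submit invocation for each mode (class and jar names are placeholders):
|
spark-submit --master yarn --deploy-mode client --class com.hadoopexam.MyApp myapp.jar
spark-submit --master yarn --deploy-mode cluster --class com.hadoopexam.MyApp myapp.jar
|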
Question 16: How do you construct a new RDD?
Answer: There are mainly two ways to create an RDD:
· If you have a collection of data, e.g. a Java/Scala collection, you can parallelize it to create an RDD (this is good for testing and prototyping).
| sc.parallelize(Array("Hadoop", "Exam", "Learning", "Resources", "HadoopExam.com")) |
· You can create an RDD that refers to data in an external system such as HDFS, the local filesystem, an RDBMS SQL query output, etc.
| sc.textFile("hadoopexam.txt") |
Question 17: What are the standard ways of passing functions to the Spark framework, using Scala?
Answer: There are mainly two ways in which functions can be passed:
· Anonymous functions (lambda functions): This is a good option when you have to pass some small, quite simple functionality. See the example below, in which we split the data and return the split values.
| val he_training = hadoopexamDataFile.map(he_course => he_course.split(",")) |
· Static singleton methods: You should use this when you need to do some complex operations on the data. If you know Java, these (static) methods are associated with the class and not with an object.
|
object sampleFunc { def dataSplit(s: String): Array[String] = { s.split(",") } }
sampleRDD.flatMap(sampleFunc.dataSplit(_))
|
Question 18: You have a huge dataset, but you want to take out a sample; there is a sample function as below.
| sample(withReplacement, fraction, seed) |
How does the withReplacement argument affect the output?
Answer: Whenever you want to get sample data from a huge dataset, you can use this sample() method. The first argument controls how the elements are selected: if withReplacement is true, sampling is done with replacement, so the same element can be selected more than once and each selection is independent of the others (one draw does not affect another). If it is false, each element can appear in the sample at most once.
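A quick sketch using a placeholder RDD (name, fraction, and seed are illustrative):
|
val hadoopExamRDD = sc.parallelize(1 to 100)
val withRepl    = hadoopExamRDD.sample(true, 0.1, 42L)    // an element may appear more than once
val withoutRepl = hadoopExamRDD.sample(false, 0.1, 42L)   // each element appears at most once
|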
Question 19: What is a PairRDD?
Answer: A PairRDD represents key-value data. You can think of each element as a tuple of two values, like (x, y).
Question 20: What is the main advantage of a PairRDD?
Answer: The main advantage of key-value pairs is that you can operate on the data belonging to a particular key in parallel, for operations like joining, aggregation, etc.
Question 21: Which function (Scala and Python) will you use to create a PairRDD?
Answer: We can use the map() function to create a PairRDD. For example, if each element x has two values, then a PairRDD can be created as below.
| map(x => (x(0), x(1))) |
Question 22: Please explain how combineByKey() works?
Answer: This is an important question; many learners face it during their interview process. Let's first understand that combineByKey() has three arguments, and we need to understand the use of each one:
· createCombiner: It works locally on each partition and turns each key's first value into an initial combiner.
· mergeValue: It merges each subsequent value for a key into that combiner, locally on each node.
· mergeCombiners: It combines the combiners across the nodes to generate the final expected result.
Let's take an example. We have the following dataset (three different keys, nine key-value pairs in total):
(hadoop, 5000), (spark, 7000), (NiFi, 6000), (hadoop, 7000), (spark, 9000), (NiFi, 4000), (hadoop, 7500), (spark, 9500), (NiFi, 4500)
Now we want to calculate the average for each key. We need to sum all the values for the same key, and we also need the count/occurrence of each key. Hence, the intermediate result should be of the form (key, (totalForEachDistinctKey, count)):
(hadoop, (19500, 3))
(spark, (25500, 3))
(NiFi, (14500, 3))
This is what is expected when you execute combineByKey on the given dataset.
Keep in mind that Spark works on a cluster, so each node independently works on a different subset of the data; the data therefore needs to be combined locally on each node first. Let's assume we have a two-node cluster and the data is partitioned as below.
| Node-1 | Node-2 |
| (hadoop, 5000), (spark, 7000), (NiFi, 6000), (hadoop, 7000), (spark, 9000) | (NiFi, 4000), (hadoop, 7500), (spark, 9500), (NiFi, 4500) |
Step-1: On each node the combiner works locally.
On Node-1 the local combiner should generate data like the table below. This is the responsibility of the first argument, createCombiner (lambda x: (x, 1)), where x is the value part of the key-value pair.
| Node-1 | Node-2 |
| (hadoop, (5000,1)) (spark, (7000,1)) (NiFi, (6000,1)) (hadoop, (7000,1)) (spark, (9000,1)) | (NiFi, (4000,1)) (hadoop, (7500,1)) (spark, (9500,1)) (NiFi, (4500,1)) |
Step-2: In the next step, merge all the values locally on each node as below. You can use the following lambda function (mergeValue) for that:
| lambda valuecountpair, val: (valuecountpair[0] + val, valuecountpair[1] + 1) |
| Node-1 | Node-2 |
| (hadoop, (12000,2)) (spark, (16000,2)) (NiFi, (6000,1)) | (NiFi, (8500,2)) (hadoop, (7500,1)) (spark, (9500,1)) |
Step-3: As you can see, all values have now been combined locally on each node. We need to aggregate these values across the nodes. We use mergeCombiners for that, which combines the combiners from all the nodes. The lambda function would look as below:
| lambda valuecountpair, nextvaluecountpair: (valuecountpair[0] + nextvaluecountpair[0], valuecountpair[1] + nextvaluecountpair[1]) |
| (hadoop, (19500,3)) (spark, (25500,3)) (NiFi, (14500,3)) |
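For reference, here is a minimal Scala sketch of the same three functions applied to the example dataset (the final mapValues computes the averages); this is an illustration, not the only way to write it:
|
val fees = sc.parallelize(Seq(
  ("hadoop", 5000), ("spark", 7000), ("NiFi", 6000),
  ("hadoop", 7000), ("spark", 9000), ("NiFi", 4000),
  ("hadoop", 7500), ("spark", 9500), ("NiFi", 4500)))
val sumAndCount = fees.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
val averages = sumAndCount.mapValues { case (total, count) => total.toDouble / count }
|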
Question 23: Which shared variables are provided by the Spark framework?
Answer: In Spark, shared variables are the variables that allow data to be shared globally across the nodes. This is implemented using the following two kinds of variables, each with a different purpose:
· Broadcast variables: Read-only variables cached on each node of the cluster. These variables cannot be updated on individual nodes. They are meant for the same data you want to share across the nodes during data processing.
· Accumulators: This variable can be updated on each individual node; the final value is the aggregate of the values sent by each individual node.
Question 24: Please give us a scenario in which you would use broadcast and accumulator shared variables?
Answer:
Broadcast variables: You can use them as cached data on each node. Whenever you have a small, frequently used dataset that is needed throughout the data processing, you can ask Spark to cache this small dataset on each node using a broadcast variable, and then refer to this cached data during the calculation.
You set the broadcast variable in the driver program, and it is retrieved by the worker nodes on the cluster. Remember, the broadcast variable is retrieved and cached only when the first read request is made.
Accumulators: You can consider them more as global counters. Remember, they are not read-only variables: on each worker node, the executor updates the counter independently. The driver program then accumulates the values from all the worker nodes and generates the aggregated result.
So you can use them when you need to do some counting, for example how many messages were not processed correctly. Using an accumulator, an individual count is generated on each node for the messages that are not processed, and at the end all the counts are accumulated on the driver side, so you get to know how many messages were not processed. See the sketch below.
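An illustrative sketch combining both (the lookup values and the input RDD are placeholders):
|
val courseLookup = sc.broadcast(Map(100 -> "Spark", 101 -> "Hadoop", 102 -> "NiFi"))  // cached on each node
val badRecords   = sc.longAccumulator("badRecords")                                   // global counter
val inputRDD     = sc.parallelize(Seq(100, 101, 999))
val resolved = inputRDD.map { id =>
  courseLookup.value.get(id) match {
    case Some(name) => name
    case None       => badRecords.add(1); "unknown"
  }
}
resolved.count()              // run an action so the accumulator is actually updated
println(badRecords.value)     // the driver reads the aggregated count
|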
Question 25: How do you define the ETL process?
Answer: ETL stands for extraction, transformation, and loading. This is where we create data pipelines for data movement and transformation. In short there are three stages (nowadays the order of the ETL steps can be changed, and sometimes it becomes ELT):
· Extract: You extract data from source systems such as RDBMSs, flat files, social networking feeds, web log files, etc. The data can be in various formats like XML, CSV, JSON, Parquet, or Avro, and the frequency of data retrieval can be defined as daily, hourly, etc.
· Transform: In this step you transform the data into what your downstream system expects. For example, from a text file you can create a JSON file; besides changing file formats, you can also separate valid and invalid data. In this step you typically run many sub-steps to clean your data the way the next step expects.
· Load: This step refers to sending the data to the sink you have defined. In the Hadoop world it could be HDFS or Hive tables; in the case of an RDBMS it could be MySQL or Oracle, and for NoSQL it could be Cassandra or MongoDB.
However, please note that Spark is not an ETL tool by itself, although you can get ETL jobs done entirely using the Spark framework.
Question 26: How do you save data from an RDD to a text file?
Answer: You have to use the RDD's method saveAsTextFile("destination_path"). Similarly, various other methods are available for other file formats.
Question 27: What is a Spark DataFrame and what are its basic properties?
Answer: You can visualize a Spark DataFrame as a table in a relational database. It has the following features:
· It is distributed over the Spark cluster nodes.
· Data is organized in columns.
· It is immutable (to modify it, you have to create a new DataFrame).
· It is held in memory.
· You can apply a schema to the data.
· It gives you a domain-specific language (DSL) for transformations.
· It is evaluated lazily.
In one line: a DataFrame is an immutable distributed collection of data organized into named columns. DataFrames help you take away the RDD's complexity.

Question 28: What are the main differences between DataSet and DataFrame?
Answer: As you may remember, before Spark 1.6 DataFrame and DataSet were separate APIs; they were unified in Spark 2.0. The DataSet API is type-safe and can operate on compiled lambda functions. A DataFrame holds untyped objects, which means syntax errors can be caught at compile time, but any type mismatch can only be caught at run time.
A DataSet, because it holds strongly typed objects, lets both syntax errors and type-mismatch errors be caught at compile time. (If you know Java generics, the concept is easy to understand.)
Question 29: How do you read a JSON file in Spark?
Answer: JSON is semi-structured data, and Spark provides an easy way to read JSON data, as below.
| spark.read.json("hadoopexam.json") |
Question 30: What is a Sequence File?
Answer: SequenceFiles are also key-value pairs, but they keep their keys and values in binary format, and this is one of the most used Hadoop file formats. Spark provides APIs for conveniently using this file format. A SequenceFile contains a header as well; a sync marker helps the reader synchronize to a record boundary from any position in the file. You can enable two types of compression on a SequenceFile: block-level or record-level compression. Recently the Parquet file format has become more popular.
Question 31: What is the use of the sync() and seek() methods of a SequenceFile?
Answer: Using the seek() method, the reader can position the pointer at a given point in the file. However, note that if the position is not a record boundary, then calling next() from there will fail; hence it is always required that you synchronize with the record boundaries.
The sync() method is another way to find a record boundary: as soon as you call sync() on the sequence file reader, it will position itself at the next sync point.
Question 32: Which API method is available in Spark to load a sequence file?
Answer: You can use SparkContext's sequenceFile(path, keyClass, valueClass) method to load a sequence file; both the key and value data types are subclasses of the Hadoop Writable interface. Similarly, to save a sequence file you can use the RDD method below.
rdd.saveAsSequenceFile("path_to_hadoopexam_sequence_file")
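A minimal read sketch (the path and Writable types are illustrative):
|
import org.apache.hadoop.io.{IntWritable, Text}
val seqData = sc.sequenceFile("path_to_hadoopexam_sequence_file", classOf[Text], classOf[IntWritable])
val asScala = seqData.map { case (k, v) => (k.toString, v.get) }   // convert Writables to plain types
|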
Question 33: What is Kryo?
Answer: Kryo is a serialization framework. Java's default serialization mechanism is quite slow, so other serialization frameworks are available and Spark can use them; one of these is Kryo. So when you work with object files, you should consider using Kryo.
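A sketch of switching Spark to the Kryo serializer (the application class registered here is just an example):
|
import org.apache.spark.SparkConf
case class Course(id: Long, name: String)
val conf = new SparkConf()
  .setAppName("HadoopExamApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Course]))
|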
Question 34: Which file systems are supported by Spark out of the box?
Answer: HDFS, S3, and the local file system.
Question 35: When connecting to the HDFS filesystem, which information do you need for the Spark API?
Answer: We mainly need the NameNode URL, port, and file path, for example "hdfs://hadoopexam.com:8020/data/he_file.txt".
Question 36: What do you need to load data from an AWS S3 bucket in Spark?
Answer: You can load data from an AWS (Amazon Web Services) S3 bucket. You need the following three things:
· URL for the file stored in the bucket
· AWS Access Key ID
· AWS Secret Access Key
Once you have this information, you can load data from the S3 bucket as below.
| sc.textFile("s3n://hadoopexam_bucket/data_file.txt") |
You can also pass the keys explicitly:
| sc.textFile("s3n://AWSAccessKey:AWSSecretKey@svr/filepath") |
Question 37: What are the possible ways of working with Spark SQL?
Answer: You can interact with Spark SQL using SQL, the DataFrame API, and the DataSet API. Whatever mechanism you use, the underlying execution engine remains the same. However, since Spark 2.0 the DataSet API is the preferred way.
Question 38: Which SQL dialects are supported by Spark SQL?
Answer: You can use both basic ANSI SQL and HiveQL with Spark SQL, and you can use both of them to read data stored in Hive. You can also use the command line or JDBC/ODBC to interact with the Spark SQL interface.
Question 39: What are the possible sources for creating a DataFrame?
Answer: You can create a DataFrame from a variety of sources: structured data files, Hive tables, external databases (SQL and NoSQL), and existing RDDs, which can be converted to DataFrames.
Question 40: What is the DataSet API?
Answer: DataSet was added in Spark 1.6; it tries to provide the benefits of RDDs (RDDs are strongly typed and can use lambda functions) together with the SQL interface, using the same underlying SQL engine. You can create a DataSet from JVM objects, and once it is constructed you can apply functional transformations on it like map, flatMap, filter, etc.
A DataSet is also a distributed collection. Remember that as of Spark 2.3.0 the DataSet API is only available for Scala and Java; it is not available for Python.
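A minimal Scala sketch, assuming a SparkSession named spark (the case class and values are illustrative):
|
import spark.implicits._
case class Course(id: Long, name: String)
val courses  = Seq(Course(100, "Spark"), Course(101, "Hadoop")).toDS()
val filtered = courses.filter(c => c.name.startsWith("S")).map(c => c.name.toUpperCase)
|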
Question 41: How do you relate SparkContext, SQLContext, and HiveContext?
Answer: SparkContext provides the entry point to the Spark system; to create a SQLContext you need a SparkContext object. HiveContext provides a superset of the functionality provided by the basic SQLContext. However, since Spark 2.0 there is a SparkSession object, which is the preferred way to enter the Spark system; SparkSession unifies all three of SparkContext, SQLContext, and HiveContext.
Question 42: Why do you prefer to use HiveContext?
Answer: You can use HiveContext for all the functionality provided by SQLContext, and it has additional capabilities: you can write queries using HiveQL, use Hive UDFs, and read data directly from Hive tables.
Question 43: To use HiveContext, do you need a Hive setup?
Answer: No, you don't need a Hive setup in place. If there is no Hive setup, you can still use HiveContext as a SQLContext; in fact, HiveContext has been recommended over SQLContext.
Question 44: What are the advantages of using SparkSession (Spark 2.0)?
Answer: SparkSession has built-in support for Hive queries, access to Hive UDFs, and the ability to read data from Hive tables. To use these features you don't need an existing Hive setup.
Question 45: How do you relate DataFrame and DataSet?
Answer: You can think of a DataFrame as a DataSet organized into named columns. In Scala and Java you can represent a DataFrame as a collection of the generic object Row, where Row is an untyped object:
| Scala | DataFrame -> Dataset[Row] |
| Java | Dataset<Row> |
Question 46: Can you please explain DataSet in more detail?
Answer: It is better to take an example to understand this in detail; it is a very important thing to learn. I will take an example in Scala.
Data in the HadoopExam.json file:
|
{"course_id": 100, "course_name": "Spark Training", "Fee": 6000, "Location": "Newyork"}
{"course_id": 101, "course_name": "Hadoop Training", "Fee": 7000, "Location": "Mumbai"}
{"course_id": 102, "course_name": "NiFi Training", "Fee": 8000, "Location": "Pune"}
|
Create a case class which represents the above data, one instance per row:
| case class HadoopExam(course_id: Long, course_name: String, Fee: Long, Location: String) |
Read the JSON file and create the Dataset using the case class HadoopExam; the Dataset is now a collection of JVM Scala objects of type HadoopExam.
| val dataset = spark.read.json("hadoopexam.json").as[HadoopExam] |
Three things happen here under the hood in the code above:
· Spark reads the JSON, infers the schema, and creates a DataFrame.
· At this point, the data is a DataFrame = Dataset[Row], a collection of generic Row objects, since Spark does not yet know the exact type.
· Then Spark converts the Dataset[Row] into Dataset[HadoopExam], type-specific Scala JVM objects, as dictated by the class HadoopExam.
With a Dataset as a collection of Dataset[ElementType] typed objects, you seamlessly get both compile-time safety and a custom view over strongly typed JVM objects. The resulting strongly typed Dataset[T] from the above code can be easily displayed or processed with high-level methods.
Question 47: For which use cases does Apache Spark fit well?
Answer: Spark is good for both interactive and batch processing.
Question 48: Can you describe which Spark projects you have used?
Answer: Spark has several other projects besides Spark Core, as below:
· Spark SQL: This project helps you work with structured data; you can mix SQL queries and the Spark programming API to get your expected results.
· Spark Streaming: It is good for processing streaming data and can help you create fault-tolerant streaming applications. Spark Streaming has been improved, and the new Structured Streaming, which uses the Spark SQL engine, was created; you will find more detail in a later question.
· MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write applications against the Spark machine learning library.
· GraphX: API for graphs and graph-parallel computations.
Question 49: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?
Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers. Similarly, when you run on Spark standalone, the application processes are managed by the Spark Master and Worker nodes.
Question 50: Can you write a simple word count application using Apache Spark, in either Scala or Python, given your hands-on experience?
Answer: You are sometimes asked to write a very simple Spark application, to check whether a person has actually worked on Spark or not. It is not mandatory that the syntax be perfectly correct.
Example in Scala:
|
val hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")
val counts = hadoopExamData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")
|
Example in Python:
|
hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")
counts = hadoopExamData.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")
|
Question 51: How do you compare a MapReduce job and a Spark application?
Answer: Spark has many advantages over a Hadoop MapReduce job; let's describe each model.
MapReduce: The highest-level unit of computation in MapReduce is a job. A job loads data, applies a map function, shuffles it, runs a reduce function, and finally writes the data back to persistent storage.
Spark application: The highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests; a Spark application can therefore consist of more than just a single MapReduce-style job.
MapReduce starts a new process for each task. In contrast, a Spark application can have executor processes running on its behalf even when it is not currently running a job, and multiple tasks can run within the same executor. Combining extremely fast task startup with in-memory data storage results in performance that can be orders of magnitude faster than MapReduce.

Question 52: Please explain the Spark execution model?
Answer: The Spark execution model has the following concepts:
· Driver: An application maps to a single driver process. The driver process manages the job flow, schedules tasks, and is available the entire time the application is running. Typically, this driver process is the same as the client process used to initiate the job, although when run on YARN the driver can run in the cluster. In interactive mode, the shell itself is the driver process.
· Executor: For a single application/driver, a set of executor processes is distributed across the hosts in the cluster. The executors are responsible for performing work, in the form of tasks, as well as for storing any data that you cache. Executor lifetime depends on whether dynamic allocation is enabled. An executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.

Figure 1:To-do
Update this image
· Stage: A stage is a collection of tasks that run the same code, each on a different subset of the data.
Question 53: What is Spark Streaming?
Answer: Spark Streaming is an extension of core Spark that enables scalable, high-throughput, fault-tolerant processing of data streams. Spark Streaming receives input data streams and divides the data into micro-batches; this stream of batches is represented as a DStream.
Question 54: How do you define a DStream, or what is a DStream?
Answer: You can create a DStream from sources like Kafka, Flume, and Kinesis, or by applying operations on other DStreams. Every receiver-based input DStream is associated with a Receiver, which receives the data from the source and stores it in executor memory.
Question 55: What is dynamic allocation?
Answer: Dynamic allocation allows Spark (on YARN) to dynamically scale the cluster resources allocated to your application based on the workload. When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors. When the application becomes idle, its executors are released and can be acquired by other applications.
When Spark dynamic resource allocation is enabled, all available resources may be allocated to the first submitted job, causing subsequent applications to be queued up. To allow applications to acquire resources in parallel, allocate resources to pools, run the applications in those pools, and enable applications running in pools to be preempted.
Question 56: How are Spark Streaming applications impacted by dynamic allocation?
Answer: When dynamic allocation is enabled in Spark, executors are removed when they are idle. However, dynamic allocation is not effective in the case of Spark Streaming. In Spark Streaming, data comes in every batch, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed; if the executor idle timeout is greater than the batch duration, executors are never removed. Hence, it is recommended that you disable dynamic allocation for Spark Streaming by setting "spark.dynamicAllocation.enabled" to false.
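A sketch of how this could be set when building the configuration (the application name is a placeholder); the same property can also be passed via spark-submit --conf:
|
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("HadoopExamStreaming")
  .set("spark.dynamicAllocation.enabled", "false")   // recommended for Spark Streaming jobs
|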
Question 57: When you run a Spark Streaming application in local mode (not on Hadoop YARN), it must have at least two threads; why?
Answer: As discussed previously, when a Spark Streaming application is executed it requires at least two threads: one for receiving the data and one for processing that data.
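A minimal local-mode sketch (names and batch interval are illustrative) showing the two-thread requirement via local[2]:
|
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("HadoopExamStreaming")  // at least 2 threads
val ssc  = new StreamingContext(conf, Seconds(10))
|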
Question 58: How do you enable fault-tolerant data processing in Spark Streaming?
Answer: If the driver host for a Spark Streaming application fails, it can lose data that has been received but not yet processed. To ensure that no data is lost, you can use Spark Streaming recovery: Spark writes incoming data to HDFS as it is received and uses this data to recover state if a failure occurs.
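A recovery-oriented sketch (paths and batch interval are illustrative): checkpointing plus StreamingContext.getOrCreate so the context can be rebuilt from the checkpoint after a failure.
|
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("HadoopExamStreaming")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs://hadoopexam:8020/spark/checkpoints")   // enable checkpointing
  // define input DStreams and transformations here
  ssc
}
val ssc = StreamingContext.getOrCreate("hdfs://hadoopexam:8020/spark/checkpoints", createContext _)
|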
Question 59: When you use Spark Streaming with AWS S3 (or another cloud store), which storage is recommended for checkpoints?
Answer: When using a Spark Streaming application with a cloud service as the underlying storage layer, use ephemeral HDFS on the cluster to store checkpoints, instead of the cloud store such as Amazon S3 or Microsoft ADLS.
Question 60: If you have structured data, which Spark components can you use?
Answer: To work with structured data, we should use Spark SQL with the DataFrame/DataSet API.
Question 61: What can you do with the SQLContext object?
Answer: SQLContext is an entry point for Spark SQL functionality; you create a SQLContext object using a SparkContext. Using SQLContext, you can create a DataFrame from an RDD, a Hive table, or a data source.
Question 62: Should we use SQLContext or HiveContext for SQL functionality?
Answer: You should use HiveContext, whether or not you are accessing data from Hive or Impala. With HiveContext you can access Hive or Impala tables represented in the metastore database.
Question 63: Can Hive SQL syntax be used with Spark SQL?
Answer: Hive and Impala tables and the related SQL syntax are interchangeable in most cases. Because Spark SQL uses the underlying Hive infrastructure, you can write DDL statements, DML statements, and queries using HiveQL syntax. For interactive query performance, you can access the same tables through Impala using impala-shell or the Impala JDBC and ODBC interfaces.
Question 64: In spark-shell, how do you access the HiveContext?
Answer: In spark-shell, a HiveContext is already created and available as sqlContext (an instance, not the SQLContext class). However, in your own Spark application you have to explicitly create the HiveContext object using the SparkContext, as below (PySpark example).
| sqlContext = HiveContext(sc) |
Question 65: What is the minimum requirement for the node from which a Spark application is submitted, if you are using Cloudera infrastructure?
Answer: If you are using a CDH cluster, then the host from which you submit applications, or run spark-shell or pyspark, must have the Hive Gateway role defined in Cloudera Manager and client configurations deployed. Also remember, if the Spark application uses a Hive view, it must have permissions to read that data, otherwise it will get an empty result.
Question 66: Can you use SQL to read data that lives outside a Hive or Impala table?
Answer: If the data files are outside of a Hive or Impala table, you can use SQL directly to read JSON or Parquet data into a DataFrame.
| df = sqlContext.sql("SELECT * FROM json/parquet_file_data_dir") |
Question 67: Which storage systems are supported by Spark? Give a few examples you have used so far.
Answer: Spark can access all the storage sources supported by Hadoop, including the local filesystem, HDFS, HBase, Amazon S3, and Microsoft ADLS.
Question 68: Give examples of file types which are supported by Spark?
Answer: Spark supports many file types, including text files, RCFiles, SequenceFiles, Hadoop InputFormats, Avro, Parquet, and various compression formats of the supported file types.
Question 69: Give sample API calls to read/write compressed files using Spark?
Answer: You can read compressed files using one of the following methods:
· textFile(path)
· hadoopFile(path, inputFormatClass)
You can save compressed files using one of the following methods:
· saveAsTextFile(path, compressionCodecClass="codec_class")
· saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")
where codec_class is one of the classes listed under Compression Types. An example follows below.
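A minimal Scala sketch (paths are illustrative): compressed input is detected from the file extension, and the output codec is passed explicitly.
|
import org.apache.hadoop.io.compress.GzipCodec
val data = sc.textFile("hdfs://hadoopexam:8020/input/he_data.txt.gz")
data.saveAsTextFile("hdfs://hadoopexam:8020/output/he_data_gz", classOf[GzipCodec])
|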
Question 70: How do you access data stored in AWS S3?
Answer: To access data stored in Amazon S3 from Spark applications, you use the Hadoop file APIs for reading and writing RDDs, and you need the S3 bucket URL for that:
· SparkContext.hadoopFile
· JavaHadoopRDD.saveAsHadoopFile
· SparkContext.newAPIHadoopRDD
· JavaHadoopRDD.saveAsNewAPIHadoopFile
Question 71: What are Avro files?
Answer: Avro is a serialization system with binary encoding. One of its best features is that Avro is language independent; however, it is expected that you provide .avro as the file extension. By default Avro data files are not compressed, but it is recommended to enable compression to reduce disk usage and increase read and write performance. Avro data files support Deflate and Snappy compression; Snappy is faster, but Deflate is slightly more compact.
You do not need to specify any configuration to read a compressed Avro data file. However, to write an Avro data file you must specify the type of compression, and how you specify compression depends on the component.
Question 72: What are the limitations when you process Avro files using Spark?
Answer: When Spark reads data, it converts it to data types specific to the Spark framework, so the points below are important when you work with Avro files in Spark (remember that Avro also has a schema for its data):
· Avro enumerated types are erased by Spark - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
· All output will be unions - Spark writes everything as unions of the given type along with a null option.
· The Avro schema will be updated/changed - Spark reads everything into an internal representation. Even if you just read and then write the Avro data, the schema for the output is different and follows the Spark specifications.
· Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
Question 73: How do you read and write Parquet files in Apache Spark?
Answer: Reading and writing Parquet files is quite simple in Spark; you can use SQLContext and DataFrame to read and write Parquet files.
| Read -> sqlContext.read.parquet("parquet file input path") |
| Write -> dataFrame.write.parquet("parquet output file path") |
Question 74: How can you build your application when the code is written in Java or Scala?
Answer: You can use the Maven build tool to build an application written in Java or Scala; for Scala, you can also use SBT.
Question 75: How do you submit Spark applications?
Answer: To submit a Spark application, you have to use the "spark-submit" utility. A common use, submitting to YARN, is shown below.
|
spark-submit \
  --class com.hadoopexam.analytics.WordCount \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.eventLog.dir=hdfs://hadoopexam:8020/user/spark/hadoopexamlog" \
  lib/hadoopexam-example.jar \
  10
|
Question 76: When you submit a Spark application with configuration settings, what precedence order is followed?
Answer: The order of precedence of configuration properties is as follows:
1. Properties passed to SparkConf.
2. Arguments passed to spark-submit, spark-shell, or pyspark.
3. Properties set in spark-defaults.conf.
Question 77: While submitting your Spark application you are getting a "Task not serializable" exception; what is the reason and the resolution?
Answer: Because of a limitation in the way Scala compiles code, some applications with nested definitions running in an interactive shell may encounter a "Task not serializable" exception. It is recommended to submit these as applications, so that the entire jar can be transferred to the worker nodes.
Question 78: What are the advantages of running Spark on the YARN cluster manager?
Answer: There are many advantages of running Spark on the YARN cluster manager:
1. Sharing cluster resources dynamically: You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
2. Scheduling: You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
3. Executors: You choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
4. Security: Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
Question 79: Which steps are followed when you submit a Spark application on the YARN cluster manager?
Answer: Spark orchestrates its operations through the driver program. When the driver program is run, the Spark framework initializes executor processes on the cluster hosts that process your data. The following occurs when you submit a Spark application to a cluster:
1. Launch main method: The driver is launched and invokes the main method of the Spark application.
2. Resource request: The driver requests resources from the cluster manager to launch executors.
3. Launch executors: The cluster manager launches executors on behalf of the driver program.
4. Submitting tasks: The driver runs the application; based on the transformations and actions in the application, the driver sends tasks to the executors.
5. Task execution: Tasks are run on the executors to compute and save results.
6. Effect of dynamic allocation: If dynamic allocation is enabled, executors are released after they have been idle for a specified period.
7. Stop SparkContext: When the driver's main method exits or calls SparkContext.stop, it terminates any outstanding executors and releases resources from the cluster manager.
Question 80: What happens when you run a Spark application on YARN in cluster mode?
Answer: When you submit a Spark application to YARN in cluster mode, the Spark driver runs in the ApplicationMaster on a YARN cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application.

Question 81: When is cluster mode not suitable for a Spark application?
Answer: Cluster mode is not well suited to applications that need to be interactive, for example applications that require user input, such as spark-shell and pyspark. In those cases the driver program needs to run as part of the client process that initiates the Spark application.
Question 82: What is the client deployment mode?
Answer: When an application is submitted in this mode, the Spark driver runs on the host from which the job is submitted. The ApplicationMaster is then responsible only for requesting executor containers from YARN; after the containers start, the client communicates with them to schedule work.

Question 83: How do you run the Spark shell on YARN?
Answer: To run the spark-shell or pyspark client on YARN, use the --master yarn --deploy-mode client flags when you start the application.
Question 84: How can you monitor and debug a Spark application submitted to YARN?
Answer: To obtain information about Spark application behavior, you can look at the YARN logs and the Spark web application UI.
Question 85: What advantages do you get when you choose Java/Scala for your Spark application rather than PySpark?
Answer: Scala also uses the Java JVM, so both have the same advantages; moreover, the Spark framework itself is written in Scala. Spark with Java or Scala offers several advantages: platform independence by running inside the JVM, self-contained packaging of code and its dependencies into single JAR files, and higher performance because Spark itself runs in the JVM. You lose these advantages when using the Spark Python API.
Question 86: What general problem do you see when you need to run PySpark?
Answer: One of the major problems while using PySpark is that managing third-party dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, you must understand that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy, the Spark executors require access to those libraries when they run on remote executors.
Question 87: In which scenarios is Python preferred over Scala and Java for Spark applications?
Answer: Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus. So you could say that when working with machine learning, you may prefer PySpark.
Question 88: If you need a single shared file while running your program on various nodes of the cluster, how can you provide it to your application?
Answer: If you need only a single shared file, which needs to be transferred to each node, you can use the --py-files option while submitting the application with spark-submit and specify the local path to the file. Alternatively, you can use the programmatic option, the sc.addPyFile() function.
Question 89: When you need functionality from multiple files, how do you provide them?
Answer: If you use functionality from multiple Python files, you can create an egg/zip for the package, because the --py-files flag also accepts a path to an egg or zip file.
Question 90: What can be a problem with distributing egg files?
Answer: Sending egg files (collections of Python files) is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client's CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
Question 91: What is a Spark dataset partition?
Answer: A Spark dataset comprises a fixed number of partitions, and each partition contains a number of records.
Question 92: What is done during a shuffle?
Answer: Spark performs a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.
Question 93: You have been given the code snippet below. Do you think data shuffling will be applied to partitions across the nodes?
| sc.textFile("hadoopexam.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc) |
Answer: When the data in one partition does not depend on data in other partitions, and the transformation does not require data from partitions on other nodes, shuffling is not applied. None of the three transformations here (map, flatMap, and filter) requires shuffling data across partitions.
Question 94: Why does having more stages impact the performance of a Spark application?
Answer: If you have written Spark code that has more stages, it will suffer in performance, because at each stage boundary data is persisted (either in cache or on disk) and possibly shuffled across partitions. As you know, wherever these two kinds of I/O (network and disk) come into the picture, there will be a large impact on performance, and stage boundaries in a Spark job cause exactly this.
Question 95: What is the performance impact of setting/changing the number of partitions during a transformation?
Answer: Transformations that can trigger a stage boundary typically accept a numPartitions argument, which specifies into how many partitions to split the data in the child stage. Just as the number of reducers is an important parameter in MapReduce jobs, the number of partitions at stage boundaries can determine an application's performance.
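A short sketch with a placeholder pair RDD (the partition counts are illustrative):
|
val pairRDD = sc.parallelize(Seq(("hadoop", 1), ("spark", 1), ("hadoop", 1)))
val counts  = pairRDD.reduceByKey(_ + _, 200)   // 200 partitions in the child stage after the shuffle
val fewer   = counts.repartition(50)            // explicitly change the partitioning later if needed
|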
Question 96: Which methods should be avoided so that less data shuffling happens across partitions?
Answer: When choosing an arrangement of transformations, minimize the number of shuffles and the amount of data shuffled. Shuffles are expensive operations: all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles, and not all of these transformations are equal.
Question 97: What is the difference between the two Spark statements below, and which is better from a performance perspective?
· rdd.groupByKey().mapValues(_.sum)
· rdd.reduceByKey(_ + _)
Answer: Both operations produce the same result. However, with the first one, as discussed in the previous question, the entire dataset is transferred across the network, while the second one computes local sums for each key in each partition and combines those local sums into larger sums after shuffling.
Question 98: What performance issue will you observe with the code segment below, and what do you suggest as a better alternative?
| rdd.map(kv => (kv._1, Set(kv._2))).reduceByKey(_ ++ _) |
Answer: This uses reduceByKey when the input and output value types are different: a transformation that finds all the unique strings corresponding to each key. You could use map to transform each element into a Set and then combine the Sets with reduceByKey(), but this results in unnecessary object creation, because a new set must be allocated for each record. Instead, use aggregateByKey, which performs the map-side aggregation more efficiently:
|
val zero = collection.mutable.Set[String]()
rdd.aggregateByKey(zero)((set, v) => set += v, (set1, set2) => set1 ++= set2)
|
Question 99: If you have a small dataset which needs to be joined with another, much bigger dataset, what approach will you use?
Answer: Since one dataset is small and the other is very big, we should consider using a broadcast variable, which helps improve overall performance. To avoid shuffles when joining two datasets, you can use broadcast variables: when one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
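An illustrative sketch of such a map-side join (the two RDDs and their contents are placeholders):
|
val smallRDD = sc.parallelize(Seq(("hadoop", "HDFS"), ("spark", "RDD")))
val bigRDD   = sc.parallelize(Seq(("hadoop", 5000), ("spark", 7000), ("NiFi", 6000)))
val smallLookup = sc.broadcast(smallRDD.collectAsMap())   // collect the small dataset on the driver
val joined = bigRDD.flatMap { case (key, value) =>
  smallLookup.value.get(key).map(other => (key, (value, other)))   // lookup instead of a shuffle join
}
|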
Question 100: When is it advantageous to have a shuffle?
Answer: When you are working with a huge volume of data, more processing power is available, and the application is compute intensive, a shuffle can help so that the data can be processed in parallel using all the available CPUs. Another use case is aggregation: if you have a huge volume of data and you want to apply an aggregate function on it, the single thread of the driver will become a bottleneck. You should shuffle data across the nodes and then apply the aggregate functions locally on each node, so that data can be aggregated in parallel first and the final aggregation is done in the driver program.
Question 101: Which two resources are used by a Spark application but managed by neither YARN nor Spark?
Answer: The two main resources that Spark and YARN manage are CPU and memory. Disk and network I/O also affect Spark performance, but neither Spark nor YARN actively manages them.
Question 102: When you deploy Spark on the YARN cluster manager, how does ApplicationMaster memory come into the picture?
Answer: The ApplicationMaster, which is a non-executor container that can request containers from YARN, requires memory and CPU that must be accounted for. In client deployment mode, these default to 1024 MB and one core. In cluster deployment mode, the ApplicationMaster runs the Spark application driver, so consider augmenting its resources with the --driver-memory and --driver-cores flags.
Question 103: What is a Parquet file?
Answer: Parquet is an open-source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes (Spark 2.0 also uses it extensively). Parquet allows compression schemes to be specified on a per-column level and allows adding more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.
Question 104: When you submit your application to the Spark cluster, do you provide the Hadoop and Spark jars with it?
Answer: No, we don't have to include the Spark and Hadoop jars with the application jar, because they are available at runtime.
Question 105: What is the Beeswax application?
Answer: Beeswax is a Hue application with which you can run queries on Hive; you can even create tables, load data, and run and manage Hive queries.
Question 106: How do you define a cluster manager?
Answer: A cluster manager is an external service which helps in acquiring resources on the cluster, such as Spark standalone or YARN.
Question 107: Which common compression techniques are used with Apache Hadoop?
Answer: Commonly used compression tools within Apache Hadoop are gzip, bzip2, Snappy, and LZO.
Question 108: How do you define a dataset in Spark?
Answer: We may have already covered this question; this answer adds to that. A dataset is a collection of records, similar to a relational database table. Records are similar to table rows, but the columns can contain not only strings or numbers, but also nested data structures such as lists, maps, and other records.
Question 109: How do you define the driver in Spark?
Answer: In Apache Spark, the driver is the process that represents an application session. The driver is responsible for converting the application into a directed graph of individual steps to execute on the cluster. There is one driver per application.
Question 110: Define the executor process in Apache Spark?
Answer: An executor is a process that serves a Spark application. An executor runs multiple tasks over its lifetime, and multiple tasks concurrently. A host may have several Spark executors, and there are many hosts running Spark executors for each application.

Question 111: Describe LZO compression?
Answer: LZO is a free, open-source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is the most CPU-efficient of the codecs, using very little CPU. Its compression ratios are not as good as others, but its compression is still significant compared to the uncompressed file sizes. Unlike some other formats, LZO-compressed files are splittable, enabling MapReduce to process splits in parallel.
Question 112: Define the MapReduce algorithm?
Answer: MapReduce is a distributed processing framework for processing and generating large data sets, with an implementation that runs on large clusters of machines.
The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.
The implementation provides an API for configuring and submitting jobs, job scheduling and management services, a library of search, sort, index, inverted index, and word co-occurrence algorithms, and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
Question 113: Define a stage in Spark?
Answer: In Spark, a stage is a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling the data.
Question 114: Define a task in an Apache Spark application?
Answer: A task is a unit of work on a partition of an RDD.
Question 115: What are the main components of the YARN architecture?
Answer: YARN is a general architecture for running distributed applications. It specifies the following components:
· ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
· ApplicationMaster - A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The ApplicationMaster requests containers, which are sized by the resources a task requires to run.
· NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.
· JobHistory Server - Keeps track of completed applications.
The ApplicationMaster negotiates with the ResourceManager for cluster resources, described in terms of a number of containers, each with a certain memory limit, and then runs application-specific processes in those containers. The containers are overseen by the NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
Question 116: What is the ZooKeeper service?
Answer: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.
Question 117: How can you create a DataFrame object?
Answer: There are many ways to create a DataFrame, but the following sources are commonly used:
· Structured (CSV, TSV) data files.
· Hadoop Hive tables.
· RDBMS tables and SQL query output.
· From an RDD.
· Avro and Parquet data files.
· You can also extend it to use your own custom formats.
Question 114: Please
describe something about the Spark DataSet?
Answer: The DataSet
API was added in Spark 1.6. A DataSet provides the benefits of both RDDs and the
SparkSQL optimizer. You can create a DataSet from JVM objects and apply
functional transformations on it using functions like map, filter, etc.
A DataSet is a collection of strongly-typed objects, which are
defined using a user-defined case class.
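A minimal sketch, assuming an existing SparkSession named spark (the case class and values are made up):

case class Employee(name: String, age: Int)

import spark.implicits._   // brings in the encoders needed for case classes
val ds = Seq(Employee("Amit", 30), Employee("Rina", 17)).toDS()

val adultNames = ds.filter(_.age >= 18).map(_.name)   // typed, functional transformations
adultNames.show()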
Question 115: Can you
provide the difference between DataFrame and DataSet?
Answer: As we
have seen previously, a DataFrame can be seen as a DataSet[Row], where Row is a
generic un-typed object, while a DataSet is a collection of strongly-typed
objects specified using a user-defined case class.
Because a DataFrame holds un-typed objects, only syntax errors
can be caught at compile time; any type mismatch can
only be caught at run-time.
A DataSet holds strongly-typed objects, so both syntax
errors and type mismatches can be caught at compile time. (If
you know Java Generics, the concept is easy to understand.)
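A hedged illustration of the difference, continuing with the Employee DataSet ds from the previous example (the misspelled field/column name is intentional):

val df = ds.toDF()                 // the same data as an un-typed DataFrame

// DataFrame: the column name is just a string, so the typo compiles fine
// and fails only at run-time with an AnalysisException.
val dfNames = df.select("namee")

// DataSet: the field is resolved on the case class, so the same typo
// is rejected at compile time.
// val dsNames = ds.map(_.namee)   // does not compile
val dsNames = ds.map(_.name)       // checked by the compiler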
Question 116: What are
case classes, and why do you use them in Spark?
Answer: Case
classes are the Scala way of creating Java POJOs. They implicitly create accessor
(getter) methods for the object members. In Spark they are used to assign a schema
to a DataSet/DataFrame object.
Question 117: What is
the Catalyst optimizer?
Answer: It is a
core part of SparkSQL, which is written in Scala. Among other things, it helps Spark with
·
Schema inference from JSON data
You can say that it helps Spark SQL generate a query
plan, which can easily be converted into a Directed Acyclic Graph (DAG) of RDDs. Once
the DAG is created, it is ready to execute. The optimizer does a lot of work before
producing the optimized query plan; its main purpose is to create an optimized
DAG.
Both the query plans and the optimized query plans are
internally represented as trees. The Catalyst optimizer also has various other
libraries, which help in working on these trees.
Question 118: What
types of plans can be created by the Catalyst optimizer?
Answer: The Catalyst
optimizer can create the following two query plans
·
Logical plan:
It defines the computations on the DataSets, without defining how to carry out
the specific computations.
·
Physical
plan: It defines the computation on the datasets in a form that can be executed to get the
expected result. Generally, multiple physical plans are generated by the
optimizer, and later the cost-based optimizer selects the least costly plan
to execute the query.
Question 119: How do
you print all the plans created by the Catalyst optimizer for a query?
Answer: We have
to use the explain(Boolean) method, something
similar to the below. With explain(true), Spark prints the logical plans as well
as the selected physical plan.

dataframe1.join(dataframe2, "region").selectExpr("count(*)").explain(true)
Question 120: What is
Project Tungsten?
Answer: You can
say that this is one of the largest changes made to Spark's execution engine. It
focuses on CPU and memory efficiency rather than I/O and network, because in Spark CPU
and memory had become the major performance bottlenecks.
Before Spark 2.0, many CPU cycles were wasted:
rather than being used for computation, they were spent reading and writing
intermediate data to the CPU cache.
Project Tungsten helped improve the efficiency of memory
and CPU usage, so that the hardware can be pushed closer to its limits.
Question 121: Do you
see any pain points with regard to Spark DStream?
Answer: There are
some common issues with Spark DStream
·
Timestamp:
It considers the time at which an event entered the Spark system, rather than
the timestamp attached to the event itself.
·
API:
You have to write different code for batch and stream processing.
·
Failure
conditions: The developer has to manage various failure conditions manually.
Question 122: Please
list down some advantages of Structured Streaming compared to DStream?
Answer: As we
have seen in the previous question, DStream has various issues; that is the reason
Structured Streaming was introduced in Apache Spark (a minimal sketch follows the list).
·
It is fast
·
Fault-tolerant
·
Exactly-once stream processing approach
·
Input data can be thought of as an append-only table
(which grows continuously)
·
Trigger: You can specify a trigger, which checks
for new input data at a defined time interval.
·
API: The high-level API is built on the SparkSQL
engine and is tightly integrated with SQL queries and the DataFrame and DataSet APIs.
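A minimal word-count sketch, assuming an existing SparkSession named spark and a hypothetical socket source on localhost:9999:

import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // hypothetical source
  .option("port", 9999)
  .load()

val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))   // the trigger interval mentioned above
  .start()
query.awaitTermination()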
Question 124: Can you
give some examples of special-purpose engines which can work as a data source
for Spark, and also what kind of functionality can they provide?
Answer: Spark
supports the special-purpose engines below
·
ElasticSearch : Good for search
·
Kafka : Good for messaging
·
Redis : Good for caching
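For instance, here is a hedged sketch of using Kafka as a streaming source (the broker address and topic name are made up, and the spark-sql-kafka package is assumed to be on the classpath):

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()

// Kafka delivers records as binary key/value columns, so cast them to strings
val events = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")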
Question 125: You
need to select a small dataset from the stored data; however, it is not recommended
that you use Spark for that. Why?
Answer: You
should not use Spark for such use cases because Spark has to go through
all the stored files and then find your result in them. You should consider using
an RDBMS or some other storage which indexes particular columns of the data, so data
retrieval will be much faster in that case.
Question 126: You
have created a DataFrame from multiple input files and want to save it into a single
output file; what do you use?
Answer: You have
to use select * and the coalesce command
of the DataFrame, for example as below.
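A minimal sketch, assuming the DataFrame df already exists and the output path is hypothetical:

df.coalesce(1)                      // merge all partitions into a single partition
  .write
  .option("header", "true")
  .csv("/output/single-file-dir")   // hypothetical path; Spark writes one part file inside it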
Question 127: You are
accessing data stored in AWS S3 from Spark, and TLS (data encryption) is enabled
on the bucket; what do you need?
Answer: If the S3
bucket is TLS enabled and you are using a custom jssecacerts truststore, make
sure that your truststore includes the root Certificate Authority (CA)
certificates that signed the Amazon S3 certificate.
Question 128: Your
Spark is installed on EC2 instances and is trying to access data stored in an S3
bucket, but you do not want to provide the credentials explicitly; how can
you do that?
Answer: As we
know, both EC2 and S3 are Amazon services, so we can leverage IAM roles (Learn
Amazon AWS Solution Architect from here). This mode of operation
associates the authorization with individual EC2 instances instead of with each
Spark app or the entire cluster.
Run the EC2 instances with instance profiles associated with IAM
roles that have the permissions you want. Requests from a machine with such a
profile authenticate without credentials.
Question 129:
Cloudera also provides a way to store the AWS bucket credentials to provide
system-wide AWS access to a single predefined bucket; what is that?
Answer: Cloudera
recommends that you use the Hadoop Credential Provider to set up AWS access,
because it provides system-wide AWS access to a single predefined bucket
without exposing the secret key in a configuration file or having to specify it
at runtime.
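As a hedged sketch, once the credentials have been stored in a credential provider (the jceks path and bucket name below are made up), Spark only needs to be pointed at the provider rather than at the raw keys:

// Hypothetical provider path, created beforehand with the Hadoop credential CLI
spark.sparkContext.hadoopConfiguration.set(
  "hadoop.security.credential.provider.path",
  "jceks://hdfs@namenode:8020/user/etl/aws.jceks")

val df = spark.read.parquet("s3a://my-predefined-bucket/data/")   // hypothetical bucket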
Question 130: What
are the ways by which AWS bucket access can be controlled?
Answer: AWS
access for users can be set up in two ways. You can either provide a global
credential provider file that allows all Spark users to submit S3
jobs, or have each user submit their own credentials every
time they submit a job.
Question 131: You
have got a complaint from the security department that the AWS bucket access
credentials are visible in a log file; where is the mistake?
Answer: It is
because the credentials are stored in a non-recommended way. You might have used either
of the methods below for providing the credentials to access the S3 bucket
·
Specified the credentials at runtime, using
configuration properties, something like this

sc.hadoopConfiguration.set("fs.s3a.access.key", "...")

·
Or you have configured these credentials in the
core-site.xml file.
Both of the above configurations are not recommended if you
want complete security for your data. Rather, use the Hadoop Credential Provider.







