UNIVERSITÉ DE SOUSSE
École Nationale d’Ingénieurs de Sousse Academic Year 2019-2020
Big Data Exam
Program: GTE2
Last name: ………………………… First name: ……….………………
Group: …… Room: …… Seat No.: ……………
=================================================================================
For each question, circle the correct answer(s).
1. What are the 4Vs of Big Data? (Please select the FOUR that apply)
A) Veracity
B) Velocity
C) Variety
D) Value
E) Volume
F) Visualization
2. What is meant by Data at rest?
A) Data in a file that has expired
B) Encrypted data
C) Data that is not changing
D) A file that has been processed by Hadoop
3. What are the three types of Big Data? (Please select the THREE that apply)
A) Natural Language
B) Semi-structured
C) Graph-based
D) Structured
E) Machine-Generated
F) Unstructured
4. Which two factors in a Hadoop cluster increase performance most significantly? (Please select the
TWO that apply)
A) solid state disks
B) immediate failover of failed disks
C) data redundancy on management nodes
D) high speed networking between nodes
E) parallel reading of large data files
F) large number of small data files
5. What is the default number of replicas in a Hadoop system?
A) 5
B) 4
C) 3
6. True or False: At least 2 NameNodes are required for a standalone Hadoop cluster.
A) TRUE
B) FALSE
7. Which computing technology provides Hadoop's high performance? (Select one answer)
A) Online Analytical Processing
B) Parallel Processing
C) RAID-0
D) Online Transactional Processing
8. Centralized handling of job control flow is one of the limitations of MR v1.
A) TRUE
B) FALSE
9. The Job Tracker in MR1 is replaced by which component(s) in YARN?
A) ResourceMaster
B) ApplicationMaster
C) ApplicationManager
D) ResourceManager
10. What is an advantage of the ORC file format?
A. Efficient compression
B. Big SQL can exploit advanced features
C. Supported by multiple I/O engines
D. Data interchange outside Hadoop
11. What is the default directory in HDFS where tables are stored?
A. /apps/hive/warehouse/schema
B. /apps/hive/warehouse/
C. /apps/hive/warehouse/data
D. /apps/hive/warehouse/bigsql
12. Which three programming languages are directly supported by Apache Spark?
A) Scala
B) C++
C) C#
D) Java
E) Python
F) .NET
13. Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among all the
applications in the system?
A. ScheduleManager
B. ApplicationMaster
C. JobMaster
D. ResourceManager
14. Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop datastore
and is particularly good for "sparse data"?
A. MapReduce
B. HBase
C. Spark
D. Ambari
15. Which statement is true about Hortonworks Data Platform (HDP)?
A. It is a Hadoop distribution based on a centralized architecture with YARN at its core.
B. It is a powerful platform for managing large volumes of structured data.
C. It is engineered and developed by IBM's BigInsights team.
D. It is designed specifically for IBM Big Data customers.
16. What are two primary limitations of MapReduce v1?
A) Workloads limited to MapReduce
B) Resource utilization
C) Scalability
D) TaskTrackers can be a bottleneck to MapReduce jobs
E) Number of TaskTrackers limited to 1,000
17. Which statement is true about MapReduce v1 APIs?
A) MapReduce v1 APIs define how MapReduce jobs are executed.
B) MapReduce v1 APIs are implemented by applications which are largely independent of the
execution environment.
C) MapReduce v1 APIs cannot be used with YARN.
D) MapReduce v1 APIs provide a flexible execution environment to run MapReduce.
18. Apache Spark provides a single, unifying platform for which three of the following types of
operations?
A. graph operations
B. record locking
C. batch processing
D. machine learning
E. ACID transactions
19. Which statement is true about the Combiner phase of the MapReduce architecture?
A. It determines the size and distribution of data split in the Map phase.
B. It reduces the amount of data that is sent to the Reducer task nodes.
C. It aggregates all input data before it goes through the Map phase.
D. It is performed after the Reducer phase to produce the final output.
20. Which component of the Spark Unified Stack allows developers to intermix structured database
queries with Spark's programming language?
A. Mesos
B. Spark SQL
C. Java
D. MLlib
21. Which Spark Core function provides the main element of Spark API?
A. RDD
B. MLlib
C. YARN
D. Mesos
22. Which two factors in a Hadoop cluster increase performance most significantly?
A. large number of small data files
B. data redundancy on management nodes
C. high-speed networking between nodes
D. solid state disks
E. immediate failover of failed disks
F. parallel reading of large data files
23. Hadoop uses which two Google technologies as its foundation?
A. YARN
B. Google File System
C. Ambari
D. HBase
E. MapReduce
24. Under the MapReduce v1 programming model, which shows the proper order of the full set of
MapReduce phases?
A. Map -> Split -> Reduce -> Combine
B. Map -> Combine -> Reduce -> Shuffle
C. Split -> Map -> Combine -> Reduce
D. Map -> Combine -> Shuffle -> Reduce
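The full MapReduce word-count pipeline can be simulated in a few lines of plain Python. This is an illustrative sketch only, not Hadoop code; the function name and structure are invented for the example:

```python
from collections import defaultdict

def mapreduce_word_count(text, n_splits=2):
    # Split: divide the input into independent chunks.
    lines = text.splitlines()
    size = max(1, len(lines) // n_splits)
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]

    # Map + Combine: each mapper emits (word, 1) pairs, and a combiner
    # pre-aggregates the counts locally before any data transfer.
    combined = []
    for split in splits:
        local = defaultdict(int)
        for line in split:
            for word in line.split():
                local[word] += 1  # map output, combined locally
        combined.append(local)

    # Shuffle: group the partial counts by key across all mappers.
    shuffled = defaultdict(list)
    for local in combined:
        for word, count in local.items():
            shuffled[word].append(count)

    # Reduce: sum the partial counts for each key.
    return {word: sum(counts) for word, counts in shuffled.items()}
```

The combiner step also illustrates question 19: each mapper sends at most one pair per distinct word into the shuffle, shrinking the data sent to the reducers.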
25. Under the YARN/MRv2 framework, which daemon is tasked with negotiating with the
NodeManager(s) to execute and monitor tasks?
A. ApplicationMaster
B. TaskManager
C. ResourceManager
D. JobMaster
26. What is the default directory in HDFS where tables are stored?
A. /apps/hive/warehouse/bigsql
B. /apps/hive/warehouse/
C. /apps/hive/warehouse/data
D. /apps/hive/warehouse/schema
27. Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop
datastore and is particularly good for "sparse data"?
A. Ambari
B. HBase
C. MapReduce
D. Spark
28. Which component of a Hadoop system is the primary cause of poor performance?
A. network
B. disk latency
C. CPU
D. RAM
29. The number of map tasks is determined by:
A) The input data
B) The output data
C) The cluster
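As background to question 29: Hadoop creates roughly one map task per input split, and by default one split per HDFS block. A back-of-the-envelope sketch (the function name is illustrative; a 128 MB block size is assumed):

```python
import math

def estimated_map_tasks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Rough estimate: one input split, hence one map task, per HDFS block."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)
```

For example, a 1 GB file with 128 MB blocks yields 8 splits, hence about 8 map tasks.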
30. What are the benefits of using Spark? (Please select the THREE that apply)
A) Generality
B) Versatility
C) Speed
D) Ease of use
31. Resilient Distributed Dataset (RDD) is the primary abstraction of Spark.
A) TRUE
B) FALSE
32. Which Spark RDD operations create a directed acyclic graph through lazy execution?
A) Actions
B) Map-Reduce
C) Count
D) Transformations
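Spark evaluates lazily: operations are merely recorded until a result is actually demanded. Python generators give a rough analogy of that behavior (an analogy only, not Spark's RDD API; all names are invented for the example):

```python
log = []

def double_all(data):
    # Like a lazy pipeline stage: builds a recipe, executes nothing yet.
    for x in data:
        log.append(x)  # side effect, so we can observe *when* work happens
        yield x * 2

pipeline = double_all(double_all([1, 2, 3]))  # two chained stages, still lazy
assert log == []                              # no work has been done yet

result = list(pipeline)  # forcing a result triggers the whole pipeline
```

Only the final `list(...)` call, playing the role of an action, makes the chained stages run; `result` comes out as `[4, 8, 12]`.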
33. What would you need to do in a Spark application that you would not need to do in a Spark shell
to start using Spark?
A) Extract the necessary libraries to load the SparkContext
B) Export the necessary libraries to load the SparkContext
C) Delete the necessary libraries to load the SparkContext
D) Import the necessary libraries to load the SparkContext
34. True or False: A NoSQL database is designed for those who do not want to use SQL.
A) TRUE
B) FALSE
35. Which database is a columnar storage database?
A) SQL
B) Hive
C) HBase
36. Which file format has the highest performance?
A) Sequence
B) ORC
C) Parquet
D) Delimited
37. What is an advantage of the ORC file format?
A. Efficient compression
B. Data interchange outside Hadoop
C. Supported by multiple I/O engines
D. Big SQL can exploit advanced features
38. You are creating a new table and need to format it with parquet. Which partial SQL statement
would create the table in parquet format?
A. STORED AS parquet
B. CREATE AS parquetfile
C. STORED AS parquetfile
D. CREATE AS parquet
39. Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or
other databases?
A. Sqoop
B. HBase
C. Accumulo
D. Oozie
40. Which data encoding format supports exact storage of all data in binary representations?
A. Parquet
B. RCFile
C. SequenceFiles
D. Flat