DS QCM BigData 2021

This document is an examination paper for a Big Data course at the Université de Sousse, covering various topics related to Big Data, Hadoop, and Spark. It consists of multiple-choice questions that assess knowledge on concepts such as the 4Vs of Big Data, Hadoop components, and programming languages supported by Apache Spark. The questions also explore the advantages of different file formats and the performance factors in Hadoop clusters.


UNIVERSITÉ DE SOUSSE

École Nationale d'Ingénieurs de Sousse Academic year 2019-2020

Big Data Exam (DS)
Program: GTE2
Name: ………………………… First name: ……….………………
Group: …… Room: …… Seat no.: ……………
=================================================================================
For each question circle the right answer(s).

1. What are the 4Vs of Big Data? (Please select the FOUR that apply)
A) Veracity
B) Velocity
C) Variety
D) Value
E) Volume
F) Visualization

2. What is meant by Data at rest?
A) Data in a file that has expired
B) Encrypted data
C) Data that is not changing
D) A file that has been processed by Hadoop

3. What are the three types of Big Data? (Please select the THREE that apply)
A) Natural Language
B) Semi-structured
C) Graph-based
D) Structured
E) Machine-Generated
F) Unstructured

4. Which two factors in a Hadoop cluster increase performance most significantly? Select the TWO
answers that apply
A) solid state disks
B) immediate failover of failed disks
C) data redundancy on management nodes
D) high speed networking between nodes
E) parallel reading of large data files
F) large number of small data files

5. What is the default number of replicas in a Hadoop system?


A) 5
B) 4
C) 3
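
For background, the replica count in a Hadoop deployment is governed by the `dfs.replication` property in `hdfs-site.xml`; the value shipped by default is 3:

```xml
<!-- hdfs-site.xml: default number of block replicas per file -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```
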
6. True or False: At least 2 Name Nodes are required for a standalone Hadoop cluster.
A) TRUE
B) FALSE

7. Which computing technology provides Hadoop's high performance? (Select ONE answer)
A) Online Analytical Processing
B) Parallel Processing
C) RAID-0
D) Online Transactional Processing

8. Centralized handling of job control flow is one of the limitations of MR v1.
A) TRUE
B) FALSE

9. The Job Tracker in MR1 is replaced by which component(s) in YARN?


A) ResourceMaster
B) ApplicationMaster
C) ApplicationManager
D) ResourceManager

10. What is an advantage of the ORC file format?


A. Efficient compression
B. Big SQL can exploit advanced features
C. Supported by multiple I/O engines
D. Data interchange outside Hadoop

11. What is the default directory in HDFS where tables are stored?
A. /apps/hive/warehouse/schema
B. /apps/hive/warehouse/
C. /apps/hive/warehouse/data
D. /apps/hive/warehouse/bigsql

12. Which three programming languages are directly supported by Apache Spark?
A) Scala
B) C++
C) C#
D) Java
E) Python
F) .NET
13. Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among all the
applications in the system?
A. ScheduleManager
B. ApplicationMaster
C. JobMaster
D. ResourceManager

14. Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop datastore
and is particularly good for "sparse data"?
A. MapReduce
B. HBase
C. Spark
D. Ambari

15. Which statement is true about Hortonworks Data Platform (HDP)?


A. It is a Hadoop distribution based on a centralized architecture with YARN at its core.
B. It is a powerful platform for managing large volumes of structured data.
C. It is engineered and developed by IBM's BigInsights team.
D. It is designed specifically for IBM Big Data customers.

16. What are two primary limitations of MapReduce v1?


A) Workloads limited to MapReduce
B) Resource utilization
C) Scalability
D) TaskTrackers can be a bottleneck to MapReduce jobs
E) Number of TaskTrackers limited to 1,000

17. Which statement is true about MapReduce v1 APIs?


A) MapReduce v1 APIs define how MapReduce jobs are executed.
B) MapReduce v1 APIs are implemented by applications which are largely independent of the
execution environment.
C) MapReduce v1 APIs cannot be used with YARN.
D) MapReduce v1 APIs provide a flexible execution environment to run MapReduce.

18. Apache Spark provides a single, unifying platform for which three of the following types of
operations?
A. graph operations
B. record locking
C. batch processing
D. machine learning
E. ACID transactions

19. Which statement is true about the Combiner phase of the MapReduce architecture?
A. It determines the size and distribution of data split in the Map phase.
B. It reduces the amount of data that is sent to the Reducer task nodes.
C. It aggregates all input data before it goes through the Map phase.
D. It is performed after the Reducer phase to produce the final output.
20. Which component of the Spark Unified Stack allows developers to intermix structured database
queries with Spark's programming language?
A. Mesos
B. Spark SQL
C. Java
D. Mllib

21. Which Spark Core function provides the main element of the Spark API?
A. RDD
B. MLlib
C. YARN
D. Mesos

22. Which two factors in a Hadoop cluster increase performance most significantly?
A. large number of small data files
B. data redundancy on management nodes
C. high-speed networking between nodes
D. solid state disks
E. immediate failover of failed disks
F. parallel reading of large data files

23. Hadoop uses which two Google technologies as its foundation?


A. YARN
B. Google File System
C. Ambari
D. HBase
E. MapReduce

24. Under the MapReduce v1 programming model, which shows the proper order of the full set of
MapReduce phases?
A. Map -> Split -> Reduce -> Combine
B. Map -> Combine -> Reduce -> Shuffle
C. Split -> Map -> Combine -> Reduce
D. Map -> Combine -> Shuffle -> Reduce
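
As background on the phase terminology in this question, the conventional order — split, map, combine, shuffle, reduce — can be sketched in plain Python (a single-machine simulation, not Hadoop itself; the combine step is the local pre-aggregation that cuts the data shuffled to reducers):

```python
from collections import defaultdict

# Split: the input is divided into independent splits (here, two "blocks")
splits = ["big data big", "data big data"]

# Map: emit a (word, 1) pair for every word in a split
def map_phase(split):
    return [(w, 1) for w in split.split()]

# Combine: local pre-aggregation of each mapper's output, reducing
# the amount of data sent over the network to the reducers
def combine_phase(pairs):
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return list(local.items())

# Shuffle: group values by key across all mappers' outputs
def shuffle_phase(all_pairs):
    grouped = defaultdict(list)
    for pairs in all_pairs:
        for k, v in pairs:
            grouped[k].append(v)
    return grouped

# Reduce: final aggregation per key
def reduce_phase(grouped):
    return {k: sum(vs) for k, vs in grouped.items()}

mapped = [map_phase(s) for s in splits]
combined = [combine_phase(m) for m in mapped]
result = reduce_phase(shuffle_phase(combined))
# result == {"big": 3, "data": 3}
```
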

25. Under the YARN/MRv2 framework, which daemon is tasked with negotiating with the
NodeManager(s) to execute and monitor tasks?
A. ApplicationMaster
B. TaskManager
C. ResourceManager
D. JobMaster

26. What is the default directory in HDFS where tables are stored?
A. /apps/hive/warehouse/bigsql
B. /apps/hive/warehouse/
C. /apps/hive/warehouse/data
D. /apps/hive/warehouse/schema
27. Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop
datastore and is particularly good for "sparse data"?
A. Ambari
B. HBase
C. MapReduce
D. Spark

28. Which component of a Hadoop system is the primary cause of poor performance?
A. network
B. disk latency
C. CPU
D. RAM

29. The number of map tasks is determined by
A) The input data
B) The output data
C) The cluster

30. What are the benefits of using Spark? (Please select the THREE that apply)
A) Generality
B) Versatility
C) Speed
D) Ease of use

31. Resilient Distributed Dataset (RDD) is the primary abstraction of Spark.


A) TRUE
B) FALSE

32. Which Spark RDD operations create an acyclic graph through lazy execution?
A) Actions
B) Map-Reduce
C) Count
D) Transformations
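
To illustrate the transformation/action distinction behind this question, here is a toy pure-Python sketch (not the Spark API; `ToyRDD` is an invented class): transformations only append to a recorded lineage, and nothing executes until an action such as `count` runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    and the pipeline only executes when an action is invoked."""
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []   # recorded transformations (the DAG)

    # Transformations: return a new ToyRDD, execute nothing yet
    def map(self, f):
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._lineage + [("filter", p)])

    # Actions: walk the recorded lineage and actually compute
    def collect(self):
        out = list(self._data)
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 8)
# Nothing has executed yet; only the lineage of two steps exists.
print(rdd.count())   # the action triggers execution -> prints 5
```
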

33. What would you need to do in a Spark application that you would not need to do in a Spark shell
to start using Spark?
A) Extract the necessary libraries to load the SparkContext
B) Export the necessary libraries to load the SparkContext
C) Delete the necessary libraries to load the SparkContext
D) Import the necessary libraries to load the SparkContext

34. True or False: A NoSQL database is designed for those who do not want to use SQL.
A) TRUE
B) FALSE

35. Which database is a columnar storage database?


A) SQL
B) Hive
C) HBase
36. Which file format has the highest performance?
A) Sequence
B) ORC
C) Parquet
D) Delimited

37. What is an advantage of the ORC file format?


A. Efficient compression
B. Data interchange outside Hadoop
C. Supported by multiple I/O engines
D. Big SQL can exploit advanced features

38. You are creating a new table and need to format it with parquet. Which partial SQL statement
would create the table in parquet format?
A. STORED AS parquet
B. CREATE AS parquetfile
C. STORED AS parquetfile
D. CREATE AS parquet
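
For background, the `STORED AS` clause is how Hive DDL selects the on-disk file format (Hive 0.13+ accepts `STORED AS PARQUET`; older releases used the `PARQUETFILE` keyword). A minimal example, with a hypothetical table and columns:

```sql
-- Hypothetical table; the STORED AS clause selects the Parquet format
CREATE TABLE sales (id INT, amount DOUBLE)
STORED AS PARQUET;
```
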

39. Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or
other databases?
A. Sqoop
B. HBase
C. Accumulo
D. Oozie

40. Which data encoding format supports exact storage of all data in binary representations?
A. Parquet
B. RCFile
C. SequenceFiles
D. Flat
