Big Data Processing Concepts Lecture 10:
Chapter 6 Part 1 Questions
1. What is the primary purpose of the MapReduce programming
model?
A) Data visualization
B) Sequential data processing
C) Parallel data processing
D) Network configuration
Answer: C) Parallel data processing
2. In MapReduce, what do both input and output data consist of?
A) Tables and graphs
B) Key-Value pairs
C) Images and videos
D) Text documents
Answer: B) Key-Value pairs
3. Which function in MapReduce is responsible for dividing input data
into partitions?
A) Reduce
B) Shuffle
C) Map
D) Sort
Answer: C) Map
4. What is the role of the Reduce function in MapReduce?
A) Encrypt data
B) Aggregate data
C) Visualize data
D) Store data
Answer: B) Aggregate data
5. What is the purpose of the shuffle phase in MapReduce?
A) Delete unnecessary data
B) Transfer and merge sorted key-value pairs
C) Encrypt data for security
D) Visualize data
Answer: B) Transfer and merge sorted key-value pairs
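The shuffle can be pictured as merging each mapper's sorted key-value pairs so that all values for one key end up together before any reducer runs. A minimal Python sketch (the mapper outputs below are invented for illustration):

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Merge key-value pairs from all mappers, grouping values by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:          # one list of (key, value) per mapper
        for key, value in pairs:
            grouped[key].append(value)
    # Reducers then receive keys in sorted order with all values merged
    return {k: grouped[k] for k in sorted(grouped)}

# Two mappers emitted partial word counts
m1 = [("cat", 1), ("dog", 1)]
m2 = [("cat", 1)]
print(shuffle([m1, m2]))   # {'cat': [1, 1], 'dog': [1]}
```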
6. Which of the following is NOT a component of Apache Hadoop?
A) HDFS
B) YARN
C) MapReduce
D) Oracle Database
Answer: D) Oracle Database
7. What does HDFS stand for in the Hadoop ecosystem?
A) Hadoop Data File System
B) Hadoop Distributed File System
C) Hadoop Distributed Framework System
D) Hadoop Data Framework System
Answer: B) Hadoop Distributed File System
8. What is the role of YARN in Hadoop?
A) Data encryption
B) Resource management
C) Data visualization
D) Network security
Answer: B) Resource management
9. Which of the following ensures fault tolerance in Hadoop?
A) Data replication
B) Data encryption
C) Data visualization
D) Data compression
Answer: A) Data replication
10. In MapReduce, what does the Map function output?
A) Final results
B) Encrypted data
C) User interface components
D) Intermediate key-value pairs
Answer: D) Intermediate key-value pairs
11. What is data locality in the context of MapReduce?
A) Encrypting data before processing
B) Storing data in remote locations
C) Processing data where it is stored
D) Visualizing data in charts
Answer: C) Processing data where it is stored
12. Which of the following is a benefit of using MapReduce?
A) Improved user interface design
B) Increased data redundancy
C) Scalability and parallelism
D) Reduced data security
Answer: C) Scalability and parallelism
13. How does Hadoop handle large datasets efficiently?
A) By storing data in a single location
B) By encrypting all data
C) By distributing tasks across multiple nodes
D) By visualizing data in real-time
Answer: C) By distributing tasks across multiple nodes
14. What is the purpose of input splits in MapReduce?
A) To delete data
B) To define the work for each Map task
C) To encrypt data
D) To visualize data
Answer: B) To define the work for each Map task
15. Which phase in MapReduce involves grouping intermediate
key-value pairs by key?
A) Map
B) Reduce
C) Sort
D) Encrypt
Answer: C) Sort
16. What is the default replication factor in HDFS?
A) 3
B) 1
C) 2
D) 5
Answer: A) 3
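The factor is configurable per cluster (and even per file); in `hdfs-site.xml` it is controlled by the `dfs.replication` property:

```xml
<!-- hdfs-site.xml: default number of replicas kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```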
17. In the MapReduce workflow, what follows the Map phase?
A) Data visualization
B) Reduce phase
C) Sort and Shuffle phase
D) Data encryption
Answer: C) Sort and Shuffle phase
18. Which of the following is a key difference between Hadoop and
MapReduce?
A) Hadoop is a programming model; MapReduce is a framework
B) Hadoop is a framework; MapReduce is a programming model
C) Both are frameworks
D) Both are programming models
Answer: B) Hadoop is a framework; MapReduce is a programming
model
19. What does the term "commodity hardware" refer to in
Hadoop?
A) Expensive, specialized hardware
B) Hardware used for encryption
C) Hardware used for data visualization
D) Affordable, commonly available hardware
Answer: D) Affordable, commonly available hardware
20. Which of the following is NOT a feature of Apache Hadoop?
A) Fault tolerance
B) Scalability
C) Distributed storage
D) Real-time data visualization
Answer: D) Real-time data visualization
21. What is the primary goal of the Reduce function in
MapReduce?
A) Visualize data
B) Aggregate and process data
C) Store data securely
D) Encrypt data
Answer: B) Aggregate and process data
22. Which component of Hadoop is responsible for job
scheduling?
A) HDFS
B) YARN
C) MapReduce
D) SQL Server
Answer: B) YARN
23. What is the key advantage of using MapReduce for big data
processing?
A) Parallel processing and scalability
B) Enhanced data encryption
C) Simplified user interface
D) Reduced data storage costs
Answer: A) Parallel processing and scalability
24. How does Hadoop achieve data redundancy?
A) By compressing data
B) By deleting duplicate data
C) By replicating data across nodes
D) By encrypting data
Answer: C) By replicating data across nodes
25. Which phase in MapReduce is responsible for reducing
network overhead?
A) Map
B) Sort
C) Shuffle
D) Encrypt
Answer: C) Shuffle
26. What is the purpose of the master node in a MapReduce
workflow?
A) To schedule and assign tasks
B) To store data
C) To encrypt data
D) To process data
Answer: A) To schedule and assign tasks
27. Which of the following describes the Map function's output?
A) Encrypted data
B) Intermediate key-value pairs
C) Final results
D) User interface components
Answer: B) Intermediate key-value pairs
28. What happens during the sorting phase in MapReduce?
A) Data is encrypted
B) Data is visualized
C) Data is deleted
D) Data is grouped by key
Answer: D) Data is grouped by key
29. Why is MapReduce considered a divide-and-conquer
approach?
A) It divides data into small tasks and conquers them in parallel
B) It encrypts data before processing
C) It deletes unnecessary data
D) It visualizes data in charts
Answer: A) It divides data into small tasks and conquers them in parallel
30. What is the role of worker nodes in a MapReduce cluster?
A) To store data
B) To process assigned tasks
C) To encrypt data
D) To visualize data
Answer: B) To process assigned tasks
31. Which of the following is an example of a MapReduce
operation?
A) User interface design
B) Network security
C) Word count
D) Data encryption
Answer: C) Word count
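Word count is the canonical MapReduce example: the Map function emits `(word, 1)` for every word, the shuffle groups the pairs by word, and the Reduce function sums each group. A single-process Python sketch of the whole pipeline (real MapReduce would run the phases in parallel across nodes):

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: aggregate all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    grouped = defaultdict(list)                  # shuffle: group values by key
    for i, line in enumerate(lines):
        for word, one in map_fn(i, line):
            grouped[word].append(one)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(word_count(["the cat", "the dog"]))   # {'the': 2, 'cat': 1, 'dog': 1}
```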
32. What is the significance of intermediate key-value pairs in
MapReduce?
A) They are used for encryption
B) They are the final results
C) They are processed by the Reduce function
D) They are deleted after use
Answer: C) They are processed by the Reduce function
33. Which of the following best describes Hadoop's architecture?
A) Encrypted storage and processing
B) Centralized storage and processing
C) Visualized storage and processing
D) Distributed storage and processing
Answer: D) Distributed storage and processing
34. What is the primary function of the MapReduce library in a
user program?
A) Split input files and start tasks
B) Encrypt data
C) Visualize data
D) Delete unnecessary files
Answer: A) Split input files and start tasks
35. How does MapReduce handle large datasets?
A) By storing them in a single location
B) By distributing them across multiple nodes
C) By encrypting them
D) By visualizing them
Answer: B) By distributing them across multiple nodes
36. What is the purpose of the partitioning function in
MapReduce?
A) To encrypt data
B) To visualize data
C) To delete unnecessary data
D) To divide data into regions
Answer: D) To divide data into regions
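The default partitioning function described in the original MapReduce paper is simply a hash of the key modulo the number of reducers, which guarantees that every pair sharing a key reaches the same reducer. A sketch:

```python
def partition(key, num_reducers):
    """Default-style partitioner: hash the key into one of R regions,
    so all pairs with the same key land on the same reducer."""
    return hash(key) % num_reducers

R = 4
assert partition("cat", R) == partition("cat", R)  # same key, same region
assert 0 <= partition("dog", R) < R                # always a valid region
```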
37. Which of the following is a characteristic of commodity
hardware used in Hadoop?
A) Common availability
B) Encrypted storage
C) High cost
D) Specialized components
Answer: A) Common availability
38. What does the MapReduce framework do after all map tasks
are completed?
A) Encrypts the data
B) Sorts and shuffles intermediate data
C) Deletes unnecessary files
D) Visualizes the results
Answer: B) Sorts and shuffles intermediate data
39. How does the Reduce function in MapReduce produce the final
output?
A) By encrypting data
B) By visualizing data
C) By aggregating values for each key
D) By deleting unnecessary data
Answer: C) By aggregating values for each key
40. Which of the following is NOT a phase in the MapReduce
workflow?
A) Map
B) Encrypt
C) Shuffle
D) Reduce
Answer: B) Encrypt
Big Data Processing Concepts Lecture 11:
Chapter 6 Part 2 Questions
1. What assumption does the default scheduler in original Hadoop
MapReduce make about computing nodes?
A) They are heterogeneous
B) They are homogeneous
C) They are always idle
D) They have equal processing power
Answer: B) They are homogeneous
2. Why is an efficient scheduling mechanism critical in MapReduce?
A) It minimizes data replication
B) It reduces network latency
C) It enhances runtime performance
D) It simplifies code structure
Answer: C) It enhances runtime performance
3. What is a significant challenge when implementing iterative
algorithms in MapReduce?
A) They require more memory
B) They are complex to implement in a single job
C) They cannot be parallelized
D) They are not supported by Hadoop
Answer: B) They are complex to implement in a single job
4. What does the original MapReduce model primarily focus on?
A) Real-time processing
B) Batch-oriented offline processing
C) Interactive processing
D) Data streaming
Answer: B) Batch-oriented offline processing
5. What hardware capability is often underutilized in original
MapReduce?
A) Disk space
B) Network bandwidth
C) Multi-core CPUs and GPUs
D) Memory
Answer: C) Multi-core CPUs and GPUs
6. What is a major challenge for participants in MapReduce clusters?
A) High data transfer speeds
B) Complex configuration parameters
C) Lack of available resources
D) Limited application support
Answer: B) Complex configuration parameters
7. What authentication mechanisms does the original MapReduce
runtime provide?
A) Password-based authentication
B) Token-based and Kerberos-based
C) OAuth
D) Biometric authentication
Answer: B) Token-based and Kerberos-based
8. What is YARN primarily designed to improve?
A) Resource negotiation and scheduling
B) Data storage
C) User interface
D) Data processing speed
Answer: A) Resource negotiation and scheduling
9. What role does the Resource Manager play in YARN?
A) It executes MapReduce jobs
B) It monitors data integrity
C) It performs data transformations
D) It schedules resources for applications
Answer: D) It schedules resources for applications
10. How does YARN achieve backward compatibility?
A) By rewriting all existing applications
B) By incorporating MapReduce as a framework
C) By using a different programming model
D) By limiting resource requests
Answer: B) By incorporating MapReduce as a framework
11. What does the Application Master do in YARN?
A) It stores data
B) It runs MapReduce jobs directly
C) It negotiates resources from the Resource Manager
D) It monitors network traffic
Answer: C) It negotiates resources from the Resource Manager
12. Which of the following is NOT a component of YARN?
A) Resource Manager
B) Node Manager
C) Data Node
D) Application Master
Answer: C) Data Node
13. What type of resource requests can applications make in
YARN?
A) Only CPU requests
B) Generic resource requests
C) Only memory requests
D) Only disk space requests
Answer: B) Generic resource requests
14. Which of the following best describes the resource model in
YARN?
A) General and flexible
B) Fixed and rigid
C) Simple and straightforward
D) Complex and inefficient
Answer: A) General and flexible
15. What is a primary advantage of YARN over the original
Hadoop MapReduce?
A) Improved data storage
B) Simplified programming model
C) Decreased job execution time
D) Enhanced resource utilization
Answer: D) Enhanced resource utilization
16. What is a key limitation of the original MapReduce model?
A) It cannot handle large datasets
B) It is not suitable for offline processing
C) It struggles with real-time processing
D) It requires expensive hardware
Answer: C) It struggles with real-time processing
17. In YARN, what does the Resource Manager optimize for?
A) Job execution speed
B) Cluster utilization
C) Data integrity
D) User experience
Answer: B) Cluster utilization
18. What is one of the main challenges of using MapReduce in
cloud environments?
A) High costs
B) Lack of scalability
C) Complex authentication and authorization
D) Limited data storage
Answer: C) Complex authentication and authorization
19. Which scheduling algorithm can be plugged into the Resource
Manager?
A) Round-robin scheduling
B) FIFO scheduling
C) Fair scheduling
D) All of the above
Answer: D) All of the above
20. What is the primary function of the Node Manager in YARN?
A) To manage data storage
B) To execute and monitor containers
C) To negotiate resource requests
D) To schedule jobs
Answer: B) To execute and monitor containers
21. Which of the following statements about MapReduce
applications is true?
A) They are designed for single-user environments.
B) They can only run on local machines.
C) They are typically used in cluster environments.
D) They do not require configuration.
Answer: C) They are typically used in cluster environments.
22. What is a significant drawback of the original MapReduce
when dealing with high-speed data streams?
A) It is designed for batch processing.
B) It requires too much memory.
C) It cannot handle large datasets.
D) It is not compatible with cloud computing.
Answer: A) It is designed for batch processing.
23. What does the term "straggler tasks" refer to in MapReduce?
A) Tasks that complete quickly
B) Tasks that fail completely
C) Tasks that are not executed
D) Tasks that are delayed or take longer than expected
Answer: D) Tasks that are delayed or take longer than expected
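Hadoop mitigates stragglers with speculative execution: a backup copy of a slow task is launched, and whichever copy finishes first wins. A toy simulation of why this helps (the task times below are invented):

```python
def job_finish_time(task_times, backup_time=None):
    """A MapReduce job finishes only when its slowest task does.
    With speculative execution, a backup copy of the straggler runs in
    backup_time seconds, and the faster of the two copies counts."""
    times = list(task_times)
    if backup_time is not None:
        i = max(range(len(times)), key=times.__getitem__)  # the straggler
        times[i] = min(times[i], backup_time)
    return max(times)

tasks = [10] * 9 + [60]                          # one task lags far behind
print(job_finish_time(tasks))                    # 60: straggler dominates
print(job_finish_time(tasks, backup_time=12))    # 12: backup copy wins
```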
24. What is the primary purpose of the Application Master in
YARN?
A) To execute MapReduce jobs
B) To monitor system performance
C) To negotiate and manage resources for applications
D) To store application data
Answer: C) To negotiate and manage resources for applications
25. Which of the following is a challenge with the original
MapReduce's approach to data processing?
A) It is too simple.
B) It requires too much manual intervention.
C) It cannot handle iterative tasks efficiently.
D) It is not scalable.
Answer: C) It cannot handle iterative tasks efficiently.
26. What does YARN stand for?
A) Yet Another Resource Network
B) Yet Another Resource Negotiator
C) Your Application Resource Network
D) Your Application Resource Negotiator
Answer: B) Yet Another Resource Negotiator
27. In YARN, what does the Resource Manager primarily focus
on?
A) Executing jobs
B) Managing data
C) Scheduling resources
D) Monitoring applications
Answer: C) Scheduling resources
28. Which component of YARN is responsible for executing the
actual tasks?
A) Resource Manager
B) Node Manager
C) Application Master
D) Job Tracker
Answer: B) Node Manager
29. What is a common optimization strategy for improving
MapReduce performance?
A) Reducing data replication
B) Using fewer nodes
C) Limiting resource requests
D) Simulating MapReduce contexts
Answer: D) Simulating MapReduce contexts
30. Which of the following is a new service introduced with
YARN?
A) Job Tracker
B) Data Node
C) Task Tracker
D) Resource Manager
Answer: D) Resource Manager
31. What type of algorithms does YARN support for scheduling?
A) Fixed algorithms only
B) Dynamic and pluggable algorithms
C) Simple algorithms only
D) No algorithms
Answer: B) Dynamic and pluggable algorithms
32. How does YARN handle resource requests from applications?
A) Randomly assigns resources
B) Based on a first-come, first-served basis
C) Through negotiation with the Resource Manager
D) Automatically assigns maximum resources
Answer: C) Through negotiation with the Resource Manager
33. What is a potential advantage of using GPUs in a MapReduce
context?
A) They can handle parallel tasks more efficiently
B) They simplify the programming model
C) They enhance data storage capabilities
D) They are not useful in MapReduce
Answer: A) They can handle parallel tasks more efficiently
34. What is a limitation of the original MapReduce regarding job
execution?
A) It does not support large datasets.
B) All tasks are executed linearly.
C) It cannot run multiple jobs simultaneously.
D) It lacks a user interface.
Answer: B) All tasks are executed linearly.
35. Which of the following best describes the resource negotiation
process in YARN?
A) It is a manual process.
B) It requires user intervention.
C) It is non-existent.
D) It is automated and efficient.
Answer: D) It is automated and efficient.
36. What is one of the primary goals of YARN?
A) To improve cluster utilization
B) To eliminate the need for a Resource Manager
C) To reduce the complexity of MapReduce
D) To increase job execution time
Answer: A) To improve cluster utilization
37. What is a key feature of the Application Master in YARN?
A) It runs on the client machine.
B) It does not interact with the Resource Manager.
C) It is responsible for monitoring resource consumption.
D) It stores application data.
Answer: C) It is responsible for monitoring resource consumption.
38. What type of processing does the original MapReduce model
excel at?
A) Real-time processing
B) Interactive processing
C) Streaming processing
D) Batch processing
Answer: D) Batch processing
39. Which of the following is NOT a characteristic of YARN?
A) Scalability
B) Resource management
C) Simplicity
D) User agility
Answer: C) Simplicity
40. What is a major benefit of using YARN for resource
management?
A) It eliminates the need for a scheduler.
B) It allows for better resource allocation and scheduling.
C) It requires less hardware.
D) It simplifies the programming model.
Answer: B) It allows for better resource allocation and scheduling.
Processing Systems for Big Data Lecture 12:
Chapter 6 Part 3 Questions
1. What are the four main paradigms of processing systems for big
data?
A) Continuous Processing, Real-Time Processing, Event Processing, Batch
Processing
B) Stream Processing, Event Processing, Real-Time Processing, Offline
Processing
C) Continuous Processing, Batch Processing, Data Warehousing, Event
Processing
D) Real-Time Processing, Batch Processing, Data Mining, Data Lakes
Answer: A) Continuous Processing, Real-Time Processing, Event
Processing, Batch Processing
2. What characterizes continuous processing systems?
A) They require all data to be available before processing.
B) They process data as it arrives without waiting.
C) They operate only on historical data.
D) They prioritize low throughput.
Answer: B) They process data as it arrives without waiting.
3. Which of the following is a key characteristic of real-time
processing?
A) It processes data in batches.
B) It ensures data is processed immediately or within tight deadlines.
C) It can tolerate significant delays.
D) It is designed for unbounded data streams.
Answer: B) It ensures data is processed immediately or within tight
deadlines.
4. What type of processing guarantees must be met in hard real-time
systems?
A) Deadlines may be missed occasionally.
B) Processing is optional.
C) Processing can be delayed indefinitely.
D) Deadlines must always be met.
Answer: D) Deadlines must always be met.
5. What is a primary use case for event processing systems?
A) Historical data analysis
B) Fraud detection and anomaly detection
C) Data warehousing
D) Batch job scheduling
Answer: B) Fraud detection and anomaly detection
6. Which of the following tools is commonly associated with continuous
processing?
A) Apache Kafka Streams
B) Apache Hadoop
C) Apache Spark
D) Apache Hive
Answer: A) Apache Kafka Streams
7. What is the main focus of event-driven systems?
A) Periodic data processing
B) Processing data in large batches
C) Responding to specific events as they occur
D) Storing data for later analysis
Answer: C) Responding to specific events as they occur
8. Which processing model is optimized for efficiency and scalability
rather than low latency?
A) Continuous Processing
B) Real-Time Processing
C) Event Processing
D) Batch Processing
Answer: D) Batch Processing
9. What is a defining feature of complex event processing (CEP)?
A) It detects patterns or sequences of events over time.
B) It processes data in fixed intervals.
C) It only processes historical data.
D) It requires manual intervention for each event.
Answer: A) It detects patterns or sequences of events over time.
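A CEP rule watches a stream for a pattern unfolding over time rather than examining events one by one. A minimal sketch, assuming a made-up fraud rule (three failed logins by the same user within a 60-second window):

```python
from collections import deque

def detect_burst(events, threshold=3, window=60):
    """CEP-style pattern: alert when `threshold` events from the same
    user fall within `window` seconds. Events are (timestamp, user)
    and arrive in time order."""
    recent = {}                      # user -> deque of recent timestamps
    alerts = []
    for t, user in events:
        q = recent.setdefault(user, deque())
        q.append(t)
        while q and t - q[0] > window:
            q.popleft()              # drop events outside the window
        if len(q) >= threshold:
            alerts.append((t, user))
    return alerts

events = [(0, "bob"), (10, "bob"), (20, "bob"), (200, "bob")]
print(detect_burst(events))          # [(20, 'bob')]
```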
10. Which tool is known for true real-time processing?
A) Apache Hadoop
B) Apache Flink
C) Apache Hive
D) Apache Pig
Answer: B) Apache Flink
11. What distinguishes true real-time processing from near real-
time processing?
A) True real-time processing has higher latency.
B) Near real-time processing provides instant results.
C) True real-time processing has minimal latency.
D) Near real-time processing is faster.
Answer: C) True real-time processing has minimal latency.
12. Which of the following factors can impact real-time
performance?
A) Data volume
B) System design
C) Latency tolerance
D) All of the above
Answer: D) All of the above
13. What is a primary characteristic of batch processing systems?
A) They process data continuously.
B) They work on a finite dataset available all at once.
C) They prioritize low latency.
D) They are event-driven.
Answer: B) They work on a finite dataset available all at once.
14. Which programming model is commonly associated with batch
processing?
A) Event-Driven Model
B) Dataflow Model
C) MapReduce Model
D) Stream Processing Model
Answer: C) MapReduce Model
15. What is the main purpose of Apache Kafka?
A) To provide a distributed real-time processing platform
B) To store large datasets
C) To perform batch processing
D) To analyze historical data
Answer: A) To provide a distributed real-time processing platform
16. In Kafka architecture, what role do producers play?
A) They consume messages from topics.
B) They send messages to Kafka.
C) They manage the partitions.
D) They coordinate the brokers.
Answer: B) They send messages to Kafka.
17. What is a Kafka topic?
A) A type of message format
B) A server that processes messages
C) A consumer group
D) A mailbox that holds messages
Answer: D) A mailbox that holds messages
18. How does Kafka maintain low latency?
A) By using high-level abstractions
B) Through zero-copy I/O
C) By limiting the number of producers
D) By compressing messages
Answer: B) Through zero-copy I/O
19. What is the function of Kafka brokers?
A) To store and manage data partitions
B) To produce messages
C) To read messages from topics
D) To coordinate consumers
Answer: A) To store and manage data partitions
20. Which component in Kafka architecture coordinates the
brokers, producers, and consumers?
A) Producer
B) Consumer
C) Zookeeper
D) Topic
Answer: C) Zookeeper
21. What is an example of a soft real-time application?
A) Medical devices
B) Autonomous vehicles
C) Video streaming
D) Flight control systems
Answer: C) Video streaming
22. Which of the following best describes event correlation?
A) Processing events in batches
B) Ignoring unrelated events
C) Storing events for future analysis
D) Linking events based on time or context
Answer: D) Linking events based on time or context
23. What is the primary goal of real-time processing systems?
A) To analyze historical data
B) To ensure immediate processing of data
C) To batch process large datasets
D) To store data for later use
Answer: B) To ensure immediate processing of data
24. Which of the following is NOT a tool used for batch
processing?
A) Apache Hadoop
B) Apache Flink
C) Apache Spark
D) Apache Beam
Answer: B) Apache Flink
25. What type of data does continuous processing typically
handle?
A) Static data
B) Historical data
C) Unbounded data streams
D) Archived data
Answer: C) Unbounded data streams
26. In which scenario would you primarily use batch processing?
A) Monitoring live traffic conditions
B) Analyzing historical sales data
C) Detecting fraud in real-time transactions
D) Responding to user interactions
Answer: B) Analyzing historical sales data
27. What is a common application of complex event processing
(CEP)?
A) Fraud detection
B) Data storage
C) Data compression
D) Batch job scheduling
Answer: A) Fraud detection
28. Which of the following statements about Apache Kafka is
true?
A) It is primarily a batch processing tool.
B) It is designed for high latency.
C) It operates as a distributed messaging system.
D) It does not support real-time data streams.
Answer: C) It operates as a distributed messaging system.
29. What is the main advantage of using an event-driven model?
A) It processes data in fixed intervals.
B) It allows for immediate responses to events.
C) It requires less memory.
D) It is simpler to implement than other models.
Answer: B) It allows for immediate responses to events.
30. Which of the following is a key characteristic of low-latency
systems?
A) They process data in large batches.
B) They work with historical data only.
C) They require extensive buffering.
D) They prioritize immediate processing.
Answer: D) They prioritize immediate processing.
31. What is the role of the consumer in Kafka?
A) To send messages to topics
B) To read messages from topics
C) To manage data partitions
D) To coordinate brokers
Answer: B) To read messages from topics
32. What does a Kafka partition do?
A) It splits a topic into smaller parts for scalability.
B) It stores all messages in one location.
C) It manages the consumers.
D) It compresses messages for storage.
Answer: A) It splits a topic into smaller parts for scalability.
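Partitioning by message key is what lets Kafka scale while preserving per-key ordering: the same key always maps to the same partition, and different partitions can be consumed in parallel. A toy model of that assignment (illustrative only; the real Kafka client uses a murmur2 hash, not Python's `hash`):

```python
NUM_PARTITIONS = 3

def assign_partition(key: str) -> int:
    """Key-based assignment: identical keys always share a partition."""
    return hash(key) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in [("user1", "click"), ("user2", "view"), ("user1", "buy")]:
    partitions[assign_partition(key)].append((key, value))

# All of user1's messages sit, in order, in a single partition
p = assign_partition("user1")
assert partitions[p] == [("user1", "click"), ("user1", "buy")]
```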
33. What is a potential drawback of real-time processing systems?
A) They cannot handle large datasets.
B) They are slow to respond.
C) They require strict timing guarantees.
D) They are easier to implement than batch systems.
Answer: C) They require strict timing guarantees.
34. What is an example of a hard real-time application?
A) Stock market analysis
B) Video streaming
C) Medical monitoring systems
D) Social media trend tracking
Answer: C) Medical monitoring systems
35. Which programming model is used in Apache Kafka for
processing streams?
A) Batch Processing Model
B) Event-Driven Model
C) MapReduce Model
D) Dataflow Model
Answer: B) Event-Driven Model
36. What is a key benefit of using Apache Flink?
A) It is only for batch processing.
B) It requires extensive configuration.
C) It cannot handle event processing.
D) It supports both batch and stream processing.
Answer: D) It supports both batch and stream processing.
37. In Kafka, what is the role of Zookeeper?
A) To produce messages
B) To store data
C) To coordinate brokers and manage metadata
D) To read messages from topics
Answer: C) To coordinate brokers and manage metadata
38. What type of processing is best suited for applications that
require immediate feedback?
A) Batch Processing
B) Continuous Processing
C) Event Processing
D) Real-Time Processing
Answer: D) Real-Time Processing
39. What does the term "latency tolerance" refer to in real-time
systems?
A) The maximum amount of time data can be delayed
B) The ability to process data in batches
C) The requirement for low throughput
D) The need for strict deadlines
Answer: A) The maximum amount of time data can be delayed
40. What is the primary goal of using event correlation in event
processing?
A) To process data in batches
B) To identify relationships between events
C) To store events for future analysis
D) To ignore unrelated events
Answer: B) To identify relationships between events
Data Warehouses and Data Lakes Lecture 13:
Questions
1. Who introduced the concept of data warehouses?
A) Microsoft researchers
B) IBM researchers Barry Devlin and Paul Murphy
C) Google engineers
D) Oracle developers
Answer: B) IBM researchers Barry Devlin and Paul Murphy
2. What is a primary purpose of a data warehouse?
A) To store unstructured data
B) To support management decisions through data analytics
C) To handle real-time data processing
D) To serve as a transactional database
Answer: B) To support management decisions through data analytics
3. Which of the following best describes a data warehouse?
A) A real-time data processing system
B) A repository for unprocessed raw data
C) A transactional processing system
D) A subject-oriented, nonvolatile, integrated collection of data
Answer: D) A subject-oriented, nonvolatile, integrated collection of data
4. What does the process of compiling information into a data
warehouse refer to?
A) Data extraction
B) Data warehousing
C) Data mining
D) Data cleansing
Answer: B) Data warehousing
5. What type of processing does a data warehouse primarily support?
A) Online Analytical Processing (OLAP)
B) Online Transaction Processing (OLTP)
C) Real-time processing
D) Batch processing
Answer: A) Online Analytical Processing (OLAP)
6. Which of the following is a key characteristic of data lakes?
A) They store data in a structured format.
B) They require complex data transformation.
C) They allow storage of raw, unprocessed data.
D) They are primarily used for transactional processing.
Answer: C) They allow storage of raw, unprocessed data.
7. What is the main difference between a data warehouse and a data
lake?
A) Data warehouses store raw data, while data lakes store processed data.
B) Data lakes are more structured than data warehouses.
C) Data warehouses store processed data, while data lakes store raw data.
D) Data lakes do not support analytics.
Answer: C) Data warehouses store processed data, while data lakes store
raw data.
8. What type of data does a data lake typically handle?
A) Only structured data
B) Only unstructured data
C) Structured, semi-structured, and unstructured data
D) Only processed data
Answer: C) Structured, semi-structured, and unstructured data
9. Which of the following best describes the architecture of a data
lake?
A) Hierarchical and structured
B) Flat and flexible
C) Rigid and predefined
D) Centralized and transactional
Answer: B) Flat and flexible
10. What is an advantage of using a data lake?
A) Requires specialized expertise for all users
B) Supports complex data transformations
C) Allows for scalability and flexibility in data storage
D) Always stores data in a cleaned format
Answer: C) Allows for scalability and flexibility in data storage
11. Which statement best describes the ETL process?
A) It extracts, transforms, and loads data into data warehouses.
B) It is used exclusively in data lakes.
C) It is unnecessary for data warehousing.
D) It is a method for real-time data processing.
Answer: A) It extracts, transforms, and loads data into data warehouses.
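The three ETL stages can be sketched in a few lines of Python; the source records, cleaning rules, and in-memory "warehouse" below are invented for illustration:

```python
def extract():
    """Extract: pull raw records from a source (hard-coded here)."""
    return [{"name": " alice ", "sales": "1200"},
            {"name": "bob", "sales": "950"}]

def transform(records):
    """Transform: clean and type the data to fit the warehouse schema."""
    return [{"name": r["name"].strip().title(), "sales": int(r["sales"])}
            for r in records]

def load(records, warehouse):
    """Load: append the cleaned rows to the warehouse table."""
    warehouse.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Alice', 'sales': 1200}, {'name': 'Bob', 'sales': 950}]
```

Data lakes, by contrast, typically defer the transform step: raw data is loaded first and shaped only when it is read (sometimes called ELT).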
12. What is a disadvantage of data lakes?
A) They can only handle structured data.
B) They require significant upfront financial investment.
C) They can lead to data swamps if not managed properly.
D) They do not allow for data scalability.
Answer: C) They can lead to data swamps if not managed properly.
13. Which of the following is a typical use case for a data
warehouse?
A) Real-time fraud detection
B) Historical data analysis for business intelligence
C) Storing raw sensor data
D) Social media analysis
Answer: B) Historical data analysis for business intelligence
14. What is the primary focus of a data warehouse?
A) Data storage
B) Data processing
C) Decision support through analytics
D) Data collection
Answer: C) Decision support through analytics
15. Which of the following statements is true regarding data lakes?
A) They require data to be structured before storage.
B) They are ideal for machine learning and big data analysis.
C) They eliminate the need for data preprocessing.
D) They are primarily used for transactional applications.
Answer: B) They are ideal for machine learning and big data analysis.
16. What type of expertise is typically required to analyze data in a
data lake?
A) Basic familiarity with data presentation
B) No expertise is required
C) Knowledge of OLAP tools
D) Specialized skills in data science and analytics
Answer: D) Specialized skills in data science and analytics
17. Which of the following best describes the data stored in a data
warehouse?
A) Raw and unprocessed
B) Processed and filtered
C) Semi-structured
D) Only transactional
Answer: B) Processed and filtered
18. What is a common challenge associated with data lakes?
A) High cost of storage
B) Difficulty in managing unstructured data
C) Lack of scalability
D) Limited data types supported
Answer: B) Difficulty in managing unstructured data
19. Which of the following is NOT a characteristic of a data
warehouse?
A) Subject-oriented
B) Time-variant
C) Raw data storage
D) Nonvolatile
Answer: C) Raw data storage
20. What do data lakes primarily enable organizations to do?
A) Analyze large volumes of raw data for insights
B) Store only structured data
C) Perform complex data transformations
D) Ensure data is always cleansed before analysis
Answer: A) Analyze large volumes of raw data for insights
21. What is a key benefit of using data warehouses for business
intelligence?
A) They allow for immediate data processing.
B) They provide a structured approach to data analysis.
C) They eliminate the need for data governance.
D) They only store historical data.
Answer: B) They provide a structured approach to data analysis.
22. Which of the following is a common tool used for data
warehousing?
A) Apache Hadoop
B) Apache Kafka
C) Amazon Redshift
D) Apache Spark
Answer: C) Amazon Redshift
23. What is a significant difference in the data structure between a
data warehouse and a data lake?
A) Data lakes are more structured than data warehouses.
B) Data warehouses store data in processed form, while data lakes store raw
data.
C) Data warehouses only support structured data.
D) Data lakes require predefined schemas.
Answer: B) Data warehouses store data in processed form, while data
lakes store raw data.
24. Which of the following best describes the data ingestion
process in data lakes?
A) It requires extensive data cleansing.
B) It is limited to structured data only.
C) It involves strict ETL processes.
D) It is often more flexible and less structured.
Answer: D) It is often more flexible and less structured.
25. What is the primary advantage of separating storage from
computation in data lakes?
A) It reduces costs and increases scalability.
B) It simplifies data ingestion.
C) It eliminates the need for data scientists.
D) It ensures all data is processed immediately.
Answer: A) It reduces costs and increases scalability.
26. Which of the following is a disadvantage of a data warehouse?
A) Inability to store unstructured data
B) High cost of storage
C) Complexity of data management
D) Lack of real-time processing capabilities
Answer: D) Lack of real-time processing capabilities
27. What is the main purpose of using OLAP in a data warehouse?
A) To process transactions in real-time
B) To support complex analytical queries
C) To store raw data
D) To perform data cleaning
Answer: B) To support complex analytical queries
28. Which of the following statements about data lakes is true?
A) They are designed for structured data only.
B) They require extensive data preprocessing.
C) They are used primarily for transactional processing.
D) They allow for diverse data types and formats.
Answer: D) They allow for diverse data types and formats.
29. What is a common feature of data lake architecture?
A) Strict schema enforcement
B) Flat storage structure
C) High-level data abstraction
D) Transactional consistency
Answer: B) Flat storage structure
30. Which of the following is a key characteristic of data stored in
a data warehouse?
A) It is organized for easy access and analysis.
B) It is always raw and unprocessed.
C) It is typically stored in a flat format.
D) It lacks metadata.
Answer: A) It is organized for easy access and analysis.
31. What is the primary role of metadata in a data lake?
A) To restrict data access
B) To enhance data quality
C) To provide context and facilitate data discovery
D) To enforce data governance
Answer: C) To provide context and facilitate data discovery
32. Which of the following is NOT a benefit of data warehouses?
A) Improved decision-making capabilities
B) Simplified data access for non-technical users
C) Real-time data processing
D) Enhanced data quality and consistency
Answer: C) Real-time data processing
33. What type of data analysis is typically performed in data
lakes?
A) Only historical analysis
B) Complex and exploratory analysis
C) Transactional analysis
D) Simple reporting
Answer: B) Complex and exploratory analysis
34. Which of the following best describes the data governance
challenges associated with data lakes?
A) They are easier to manage than data warehouses.
B) They require strict adherence to schemas.
C) They can lead to data quality issues if not properly managed.
D) They eliminate the need for data governance entirely.
Answer: C) They can lead to data quality issues if not properly managed.
35. What is the primary function of a data lake?
A) To process transactions
B) To store and analyze large volumes of raw data
C) To provide structured data for reporting
D) To enforce data security
Answer: B) To store and analyze large volumes of raw data
36. Which of the following statements is true regarding data
warehouses and data lakes?
A) Both are used interchangeably.
B) Data lakes are more suitable for structured data.
C) Data warehouses are optimized for analytics, while data lakes are
optimized for storage.
D) Data lakes require no data management.
Answer: C) Data warehouses are optimized for analytics, while data lakes
are optimized for storage.
37. What is a potential risk of using data lakes without proper
management?
A) Data redundancy
B) Data swamps due to poor data quality
C) Increased operational costs
D) Limited data access
Answer: B) Data swamps due to poor data quality
38. Which of the following is a common tool used for data lake
implementation?
A) Microsoft SQL Server
B) Amazon S3
C) Oracle Database
D) MySQL
Answer: B) Amazon S3
39. What is a primary goal of data governance in the context of
data lakes?
A) To ensure all data is processed in real-time
B) To maintain data quality and compliance
C) To eliminate the need for data scientists
D) To restrict data access to a few users
Answer: B) To maintain data quality and compliance
40. What is the primary characteristic of data stored in a data lake
compared to a data warehouse?
A) Data lakes store processed data; data warehouses store raw data.
B) Data lakes are limited to structured data; data warehouses are not.
C) Data lakes are more expensive to maintain than data warehouses.
D) Data lakes store raw data; data warehouses store processed data.
Answer: D) Data lakes store raw data; data warehouses store processed data.
Lecture 14: Data Warehouse and Data Lake
Architecture Part 1
1. What is the primary purpose of a data warehouse?
A. To store unstructured data for real-time analytics
B. To process transactional data in real-time
C. To store and manage historical data for analytical purposes
D. To replace operational databases
Answer: C
2. Which of the following is NOT a characteristic of a data warehouse?
A. Subject-oriented
B. Real-time updates
C. Time-variant
D. Non-volatile
Answer: B
3. What is the main difference between a data warehouse and a data
lake?
A. Data lakes store structured data, while data warehouses store unstructured
data
B. Data lakes store raw data, while data warehouses store processed data
C. Data lakes are OLAP-based, while data warehouses are OLTP-based
D. Both store raw data but differ in storage formats
Answer: B
4. Which of the following is NOT a layer in the three-tier data
warehouse architecture?
A. Bottom tier
B. Middle tier
C. Data lake tier
D. Top tier
Answer: C
5. What is a major disadvantage of the single-tier architecture?
A. High data redundancy
B. It cannot separate analytical and transactional processing
C. It is overly complex
D. It cannot handle metadata effectively
Answer: B
6. Which architecture uses a staging area to cleanse data before
loading it into the warehouse?
A. Single-tier
B. Two-tier
C. Three-tier
D. Multi-tier
Answer: B
7. What is the role of the middle tier in a three-tier architecture?
A. To store raw data
B. To act as an OLAP server for analytical processing
C. To manage metadata
D. To load data into the warehouse
Answer: B
8. Which tier in the three-tier architecture is responsible for user
interaction?
A. Middle tier
B. Bottom tier
C. Top tier
D. Staging area
Answer: C
9. What is the first step in the ETL process?
A. Data cleansing
B. Extraction
C. Transformation
D. Loading
Answer: B
10. During the transformation phase of ETL, what happens to the
data?
A. It is loaded into the database
B. It is converted into a standard format
C. It is extracted from source systems
D. It is partitioned for OLAP queries
Answer: B
11. What is the purpose of the loading phase in ETL?
A. To clean data
B. To extract data
C. To store transformed data into the data warehouse
D. To analyze data
Answer: C
12. Which of the following is NOT a function of ETL tools?
A. Data extraction
B. Data visualization
C. Data transformation
D. Data loading
Answer: B
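The extract, transform, load sequence from questions 9 to 12 can be sketched end to end. This is a minimal illustration, not a production pipeline: the CSV source string, the `fact_orders` table, and the normalization rules (uppercase currency, numeric amounts) are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from an operational system.
SOURCE_CSV = "order_id,amount,currency\n1,10.50,usd\n2,7.25,USD\n"

def extract(raw):
    """Extraction: read rows out of the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transformation: convert values into a standard format
    (integer IDs, float amounts, uppercase currency codes)."""
    return [(int(r["order_id"]), float(r["amount"]), r["currency"].upper())
            for r in rows]

def load(rows, conn):
    """Loading: store the transformed rows in a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone())
# (2, 17.75)
```

Note the order matters: extraction happens first (question 9), transformation standardizes formats (question 10), and loading writes the result into the warehouse (question 11).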
13. What does metadata describe in a data warehouse?
A. The OLAP server's configuration
B. The structure, source, and usage of data
C. The staging area processes
D. The query tools used
Answer: B
14. Why is metadata critical in a data warehouse?
A. It manages the staging area
B. It defines how data is updated and processed
C. It replaces the ETL process
D. It provides user-friendly interfaces for querying
Answer: B
15. What type of metadata defines the source and target of data in
ETL processes?
A. Operational metadata
B. Business metadata
C. Technical metadata
D. Process metadata
Answer: C
16. Which operation is NOT typically supported by OLAP tools?
A. Slicing
B. Dicing
C. Indexing
D. Drilling
Answer: C
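The OLAP operations named in question 16 can be demonstrated on a tiny in-memory "cube". The cube contents and function names are invented for illustration; real OLAP servers perform these operations on multidimensional storage, not Python lists.

```python
# Minimal in-memory cube: each cell holds dimension values plus a sales measure.
cube = [
    {"year": 2023, "region": "EU", "product": "A", "sales": 100},
    {"year": 2023, "region": "US", "product": "A", "sales": 150},
    {"year": 2024, "region": "EU", "product": "B", "sales": 200},
    {"year": 2024, "region": "US", "product": "B", "sales": 250},
]

def slice_cube(cells, dim, value):
    """Slicing: fix one dimension at a single value."""
    return [c for c in cells if c[dim] == value]

def dice_cube(cells, **ranges):
    """Dicing: keep a sub-cube by restricting several dimensions at once."""
    return [c for c in cells
            if all(c[d] in allowed for d, allowed in ranges.items())]

def drill(cells, *dims):
    """Drilling: aggregate the sales measure at a chosen granularity."""
    totals = {}
    for c in cells:
        key = tuple(c[d] for d in dims)
        totals[key] = totals.get(key, 0) + c["sales"]
    return totals

print(len(slice_cube(cube, "year", 2023)))   # 2 cells for 2023
print(drill(cube, "region"))                 # {('EU',): 300, ('US',): 400}
```

Indexing, by contrast, is a physical storage optimization handled by the underlying database, which is why it is not counted among the OLAP operations here.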
17. What is the purpose of query tools in a data warehouse?
A. To perform ETL operations
B. To interact with the data warehouse and retrieve insights
C. To manage the OLAP server
D. To perform metadata management
Answer: B
18. Which tool is used to discover patterns and correlations in
large datasets?
A. Query tools
B. Reporting tools
C. Data mining tools
D. Metadata tools
Answer: C
19. What is the role of APIs in the top tier of a data warehouse?
A. To cleanse data
B. To enable external tools to interact with the data warehouse
C. To perform metadata management
D. To execute OLAP operations
Answer: B
20. What is the core foundation of a data warehouse environment?
A. Metadata
B. RDBMS database
C. OLAP server
D. Query tools
Answer: B
21. Which database type is optimized for analytical queries in data
warehouses?
A. NoSQL databases
B. Relational databases (RDBMS)
C. Multidimensional databases (MDDBs)
D. Parallel databases
Answer: C
22. What is a limitation of traditional RDBMS for data
warehousing?
A. Poor optimization for large analytical queries
B. Lack of metadata support
C. Inability to handle small transactions
D. Lack of scalability
Answer: A
23. What is the main purpose of parallel database systems in data
warehousing?
A. To manage metadata
B. To distribute data processing across multiple servers
C. To perform ETL operations
D. To replace OLAP tools
Answer: B
24. Which of the following is a feature of a data lake?
A. Stores only structured data
B. Supports raw data storage
C. Optimized for OLAP queries
D. Requires ETL before storing data
Answer: B
25. What is a key difference between a data warehouse and a data
lake?
A. Data lakes store processed data
B. Data warehouses are schema-on-read
C. Data lakes are schema-on-read
D. Data warehouses store raw data
Answer: C
True/False Questions
1. Data warehouses are optimized for transactional processing.
False
(They are optimized for analytical processing.)
2. ETL tools are used to extract, transform, and load data into the data
warehouse.
True
3. The middle tier in a three-tier architecture is responsible for user
interaction.
False
(The top tier handles user interaction.)
4. Metadata in a data warehouse defines the structure and usage of
data.
True
5. OLAP tools support slicing, dicing, and indexing operations.
False
(OLAP tools do not support indexing.)
6. Data lakes store only structured data.
False
(Data lakes store structured, semi-structured, and unstructured data.)
7. The bottom tier in a three-tier architecture is responsible for data
cleansing and loading.
True
8. Data mining tools are used to automate the discovery of patterns in
data.
True
9. Traditional RDBMS systems are optimized for large-scale analytical
queries.
False
(They are optimized for transactional queries.)
10. A two-tier architecture is more scalable than a three-tier
architecture.
False
(A three-tier architecture is more scalable.)
Lecture 15: Data Warehouse and Data Lake
Architecture Part 2
1. What is the primary advantage of a schema-on-read approach in
data lakes?
A. It enforces strict data governance
B. It allows flexibility for varied use cases
C. It improves query performance
D. It eliminates the need for metadata
Answer: B
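The schema-on-read idea from question 1 can be shown with a few JSON records. The raw lines below are invented sample data: nothing validates them on write, and each consumer projects only the fields it needs at query time, which is the flexibility the answer refers to.

```python
import json

# Raw records land in the lake as-is; no schema is enforced on write.
raw_lines = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "ads"}',  # extra field is fine
]

def read_with_schema(lines, fields):
    """Schema-on-read: apply a schema (a field projection) only when reading."""
    out = []
    for line in lines:
        rec = json.loads(line)
        # Missing fields become None instead of failing ingestion.
        out.append({f: rec.get(f) for f in fields})
    return out

print(read_with_schema(raw_lines, ["user", "clicks"]))
# [{'user': 'ann', 'clicks': 3}, {'user': 'bob', 'clicks': 7}]
```

A schema-on-write system would have rejected the second record for its unexpected `referrer` field; here both records are stored and each reader decides what shape it wants.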
2. What type of data is NOT typically stored in a data lake?
A. Structured data
B. Semi-structured data
C. Unstructured data
D. Fully transformed data
Answer: D
3. Which of the following best describes the layered architecture of a
data lake?
A. A single repository for all data types
B. Zones to manage the data lifecycle, ensuring governance and accessibility
C. A fully normalized database structure
D. A flat file system for raw data storage
Answer: B
4. What is the role of decoupled compute and storage in a data lake?
A. It ensures faster data ingestion
B. It separates data transformation from data visualization
C. It allows independent scaling of compute and storage resources
D. It eliminates the need for ELT processes
Answer: C
5. What is the primary difference between ELT and ETL processes?
A. ELT transforms data before loading it into the data lake
B. ELT loads raw data into the lake and transforms it later
C. ELT is used only for structured data
D. ELT does not involve data transformation
Answer: B
6. In the ELT process, where is raw data first loaded?
A. Standardized layer
B. Cleansed layer
C. Raw data layer
D. Application layer
Answer: C
7. What type of transformations are typically performed during the
loading phase in ELT?
A. Heavy transformations, such as denormalization
B. Light transformations, such as column selection or PII hashing
C. No transformations are performed during loading
D. Both heavy and light transformations
Answer: B
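Question 7's "light transformation, such as PII hashing" during the ELT loading phase can be sketched as below. The record format, field names, and salt are assumptions for illustration; the point is that heavy reshaping is deferred while sensitive values are masked before they land in the lake.

```python
import hashlib
import json

def hash_pii(value, salt="demo-salt"):
    """Light transform applied while loading: replace a PII value with a
    one-way hash so the raw zone never exposes the original."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def load_record(raw, pii_fields=("email",)):
    """Load a raw JSON record, hashing PII fields but leaving everything
    else untouched for later, heavier transformation steps."""
    rec = json.loads(raw)
    for field in pii_fields:
        if field in rec:
            rec[field] = hash_pii(rec[field])
    return rec

rec = load_record('{"email": "ann@example.com", "clicks": 3}')
print(rec["clicks"], rec["email"] != "ann@example.com")  # 3 True
```

Denormalization and other heavy transformations would then run later, inside the lake, against data that is already safe to retain.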
8. What is the purpose of the Cleansed layer in a data lake?
A. To store raw data in its native format
B. To transform raw data into consumable datasets
C. To provide a sandbox for data scientists
D. To archive historical data
Answer: B
9. Which layer in a data lake is also known as the ingestion layer?
A. Raw data layer
B. Standardized data layer
C. Cleansed layer
D. Application layer
Answer: A
10. What is the primary function of the Standardized data layer?
A. To store data in its native format
B. To improve performance during data transfer to the curated layer
C. To provide a secure layer for production applications
D. To archive historical data
Answer: B
11. Which layer is also referred to as the trusted layer or
production layer?
A. Raw data layer
B. Application layer
C. Sandbox data layer
D. Cleansed layer
Answer: B
12. Where do machine learning models typically interact with data
in a data lake?
A. Sandbox data layer
B. Cleansed layer
C. Application layer
D. Standardized data layer
Answer: C
13. What is the purpose of the sandbox data layer in a data lake?
A. To store raw data
B. To enrich data with external sources for experimentation
C. To provide secure access for production applications
D. To archive historical data
Answer: B
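The layered zones from questions 8 to 13 are often realized as key prefixes on an object store. This is a hypothetical layout sketch, not a standard: the zone names follow the lecture's layers, while the path components (source system, dataset, date) are assumptions chosen so per-zone access policies can be applied by prefix.

```python
# Hypothetical object-store key layout; zone names follow the lecture's layers.
ZONES = ["raw", "standardized", "cleansed", "application", "sandbox", "archive"]

def lake_key(zone, source, dataset, date, filename):
    """Build an object key that encodes zone, source system, dataset, and
    ingestion date, so governance tools can grant or deny access by prefix
    (e.g. end users never get the raw/ prefix)."""
    assert zone in ZONES, f"unknown zone: {zone}"
    return f"{zone}/{source}/{dataset}/{date}/{filename}"

print(lake_key("raw", "crm", "contacts", "2024-05-01", "part-000.json"))
# raw/crm/contacts/2024-05-01/part-000.json
```

Data then moves from the `raw` prefix through `cleansed` to `application` as it is transformed, with `sandbox` reserved for experimentation and `archive` for historical retention.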
14. Why are security mechanisms in data lakes different from
relational databases?
A. Data lakes do not require encryption
B. Data lakes store only unstructured data
C. Data lakes lack the comprehensive security features of relational databases
D. Data lakes do not support user authentication
Answer: C
15. What is the role of governance in a data lake?
A. To enforce schema-on-read policies
B. To monitor and log operations for analysis
C. To eliminate the need for metadata
D. To secure raw data in the ingestion layer
Answer: B
16. What does metadata in a data lake describe?
A. The format of raw data
B. The purpose, structure, and usage of data
C. The security policies applied to the data lake
D. The orchestration tools used in ELT processes
Answer: B
17. Which layer in a data lake is responsible for archiving
historical data?
A. Sandbox layer
B. Raw data layer
C. Archive layer
D. Application layer
Answer: C
18. What is the purpose of the offload area in a data lake?
A. To store metadata
B. To reduce the ETL load on relational data warehouses
C. To manage machine learning models
D. To store cleansed data for production applications
Answer: B
19. Which tool is typically required to orchestrate ELT processes
in a data lake?
A. OLAP server
B. Metadata management tool
C. Orchestration tool
D. Query tool
Answer: C
20. What is a key challenge of implementing a data lake
architecture?
A. Managing schema-on-write
B. Ensuring data governance and security
C. Scaling compute and storage independently
D. Storing structured data
Answer: B
True/False Questions
1. Data lakes use a schema-on-write approach, similar to traditional
databases.
False
(Data lakes use schema-on-read.)
2. The raw data layer in a data lake allows direct access to end users.
False
(End users are not granted access to raw data.)
3. ELT processes in data lakes load data before transforming it.
True
4. The sandbox layer in a data lake is used for production applications.
False
(It is used for experimentation and analysis.)
5. Metadata is optional in a data lake architecture.
False
(Metadata is essential for managing and understanding data.)
6. The application layer is also known as the trusted layer.
True
7. Data lakes cannot store structured data.
False
(Data lakes can store structured, semi-structured, and unstructured data.)
8. Security is less of a concern in data lakes compared to relational
databases.
False
(Security is a critical concern in data lakes.)
9. The standardized layer in a data lake is mandatory in all
implementations.
False
(It is optional in most implementations.)
10. Data lakes are typically built on scalable storage platforms like
Hadoop or Amazon S3.
True