Department of Artificial Intelligence And Data Science
BIG DATA ANALYTICS (CCS334)
QUESTION BANK
UNIT 1 - TWO MARKS QUESTION AND ANSWERS
Introduction to big data – convergence of key trends – unstructured data – industry examples of
bigdata – web analytics – big data applications– big data technologies – introduction to Hadoop
–open source technologies – cloud and big data – mobile business intelligence – Crowd sourcing
analytics – inter and trans firewall analytics.
1. What do you mean by Big Data?
Big Data is a collection of data that is huge in volume, yet growing exponentially
with time.
It is a data with so large a size and complexity that none of the traditional data
management tools can store it or process it efficiently.
2. Name the types of Big Data?
There are three main types of Big Data
Structured data
Semi-structured data
Unstructured data
3. List out the characteristics of Big Data?
Big Data can be described by the following characteristics:
Volume
Variety
Velocity
Variability
4. What is the advantage of Big Data?
Big data has several advantages for businesses and organizations, including:
Improved decision-making
Enhanced customer insights
Improved operational efficiency
Competitive advantage
5. What do you mean by unstructured data?
Unstructured data is data that does not have a well-defined data model or structure
It is typically text-based data, but it can also include multimedia data such as images,
videos, and audio.
Unstructured data is generated from a variety of sources such as social media, email,
mobile devices, sensors, and web logs.
Some examples of unstructured data include:
Social media data
Emails
Web content
Audio and video files
6. Difference between structured and unstructured data.
Structured Data | Unstructured Data
Has a clearly defined schema and is easily searchable and organized | Has no clear structure or schema and is often difficult to search and organize
Examples: relational databases, spreadsheets | Examples: text documents, social media posts, images, videos
Typically stored in a tabular format | Can be stored in a variety of formats such as text, JSON, XML, binary, etc.
Can be processed using traditional data processing tools like SQL | Often requires specialized tools such as natural language processing, machine learning, and computer vision
Typically smaller in size and easier to manage | Can be extremely large and difficult to manage
Changes are often slow and predictable | Changes can be rapid and unpredictable
Usually uniform in format and easier to analyze | Can be very diverse in format and harder to analyze
Valuable for traditional data analysis and reporting | Valuable for gaining insights into customer sentiment, social media trends, and other areas where traditional data may not provide enough context
7. Define Web Analytics.
Web analytics is the process of analyzing and measuring website traffic and visitor
behavior to improve the overall effectiveness of a website.
It involves the collection, measurement, and analysis of website data in order to
understand user behavior, and identify areas for improvement.
Web analytics helps website owners make data-driven decisions to optimize
website performance, increase traffic, and improve user experience.
It involves various techniques such as data mining, user profiling, clickstream
analysis, and web metrics to gain insights into website traffic, visitor behavior, and
website performance.
8. What are the data collection metrics in web analytics?
The data collected can include metrics such as:
Pageviews
Unique visitors
Bounce rate
Session duration
Conversion rate
9. List out the types of web analytics.
There are two main types of web analytics
On-site analytics
Off-site analytics
10. Name some benefits of Web analytics.
Understanding visitor behavior
Improving website design
Measuring marketing performance
Increasing website traffic
11. List out some applications of Bigdata.
Here are some applications of big data:
Business Analytics
Healthcare
Finance
Government
Retail
Manufacturing
Energy
Education
Transportation
Media and Entertainment
12. Name some Bigdata technologies.
Here are some popular Big Data technologies:
Hadoop
Spark
Hive
HBase
MongoDB
Zookeeper
13. What is Hadoop?
Hadoop is an open-source, distributed processing framework that enables the storage
and processing of large volumes of data on a cluster of commodity hardware.
It provides a scalable and fault-tolerant platform for processing big data. Hadoop
consists of two main components: Hadoop Distributed File System (HDFS) and Yet
Another Resource Negotiator (YARN).
14. Explain the core components of Hadoop.
o Hadoop is an open-source framework intending to store and process big data in a
distributed manner.
o Hadoop's Essential Components:
1) HDFS (Hadoop Distributed File System) -Hadoop's key storage system is HDFS. The
extensive data is stored on HDFS. It is mainly devised for storing massive datasets in
commodity hardware.
2) Hadoop MapReduce-The responsible layer of Hadoop for data processing is
MapReduce. There are two stages of processing: Map and Reduce. In simple terms, Map
is a stage where data blocks are read and made available to the executors
(computers/nodes/ containers) for processing. Reduce is a stage where all processed data
is collected and collated.
3) YARN - YARN is the framework used for processing in Hadoop. It handles resource
management and provides multiple data processing engines such as real-time streaming,
data science, and batch processing.
Features of Hadoop:
Hadoop assists not only in storing data but also in processing big data. It is the most reliable
way to handle significant data hurdles. Some salient features of Hadoop are:
a. Distributed Processing - Hadoop enables distributed processing of data, i.e., quicker
processing. In Hadoop HDFS, the data is stored in a distributed manner and processed in
parallel, and MapReduce is responsible for the parallel processing.
b. Open Source-Hadoop is independent of cost as it is an open-source framework.
Changes are allowed in the source code as per the user's requirements.
c. Fault Tolerance - Hadoop is highly fault-tolerant. By default, for every block, it
creates three replicas at distinct nodes. This number of replicas can be modified
according to the requirement. So, we can retrieve the data from a different node if one
of the nodes fails. The discovery of node failure and restoration of data is made
automatically.
d. Scalability - Hadoop works with commodity hardware, and new machines can be added to
the cluster promptly as the data grows.
e. Reliability - The data in Hadoop is stored safely on the cluster, independent of any single
machine, so the data stored in the Hadoop ecosystem is not affected by machine breakdowns.
15. What do you mean by HDFS?
Hadoop Distributed File System (HDFS) is a distributed file system that is
designed to run on commodity hardware.
It is a core component of the Hadoop framework and provides a distributed
storage system for large data sets
HDFS is designed to handle very large files with streaming data access patterns,
and to provide high-throughput access to data
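For illustration only (not part of the prescribed answer), the sketch below shows how a client
typically reads a file from HDFS through the Hadoop FileSystem API; the HDFS path used here is
hypothetical.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured file system (HDFS)
        InputStream in = null;
        try {
            in = fs.open(new Path("/user/data/sample.txt"));  // hypothetical HDFS path
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}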
16. What do you mean by YARN?
YARN (Yet Another Resource Negotiator) is one of the core components of
Hadoop, responsible for managing resources and scheduling tasks across a
Hadoop cluster
It was introduced in Hadoop 2.x as an improvement over the earlier version
of MapReduce, which suffered from scalability and flexibility issues
YARN separates the job scheduling and resource management functions of
MapReduce into two separate daemons, the Resource Manager (RM) and the
Node Manager (NM), respectively.
17. List out the benefits of the Hadoop
Hadoop offers several benefits in the world of big data processing, including:
a. Scalability
b. Fault tolerance
c. Cost-effective
d. Processing speed
e. Flexibility
f. Data storage
g. Integration
h. Open-source
18. Define Open-source technology.
Open-source technologies refer to software or computer programs that have their
source code available to the public, allowing anyone to access, modify, and
distribute it.
This means that users can see and edit the code, which can result in greater
collaboration and innovation in software development.
The open-source movement started in the late 1990s and has since become a
significant force in the tech industry.
Many popular software tools and platforms, including Linux, Apache, MySQL,
and WordPress, are open source.
19. How cloud technology impacts the bigdata?
Cloud technology has a significant impact on big data in the following ways:
Scalability
Cost-Effectiveness
Accessibility
Flexibility
Agility
Security
Reliability
Innovation
20. What do you mean by cloud computing?
Cloud computing is the delivery of computing services including servers, storage,
databases, networking, software, analytics, and intelligence over the Internet ("the
cloud") to offer faster innovation, flexible resources, and economies of scale.
The services provided by cloud computing can be categorized into three main
models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and
Software as a Service (SaaS).
21. List out the features of Cloud Computing.
Scalability
Elasticity
Resource Pooling
Self service
Low Costs.
Fault Tolerance
22. What are the issues in using cloud services?
Some important cloud services issues are as listed:
Data Security
Performance
Understanding Big Data
Compliance
Legal Issues
Costs
23. Define mobile business intelligence.
Mobile business intelligence (Mobile BI) is a type of business intelligence that
enables the access and analysis of business data on mobile devices such as
smartphones and tablets.
It allows users to access relevant business data, including key performance
indicators, reports, and dashboards, from any location at any time, giving them
the ability to make informed decisions on the go.
Mobile BI also enables users to share and collaborate on data insights with
others, providing real-time updates and notifications.
Mobile BI leverages the power of cloud computing and mobile technology to
make data-driven decision-making faster, more accurate, and more efficient.
24. Justify the need for Business intelligence.
Mobile phones' data storage capacity has grown in tandem with their use. You are
expected to make decisions and act quickly in this fast-paced environment.
The number of businesses receiving assistance in such a situation is growing by
the day.
To expand your business or boost your business productivity, mobile BI can help,
and it works with both small and large businesses.
25. What are the advantages of Business Intelligence?
Simple access
Competitive advantage
Simple decision-making
Increase Productivity
26. Define crowd sourcing.
Crowdsourcing is the collection of information, opinions, or work from a group
of people, usually sourced via the Internet.
Crowdsourcing work allows companies to save time and money while tapping
into people with different skills or thoughts from all over the world.
27. What are the types of crowd sourcing?
a. Wisdom of the crowd
b. Crowd creation
c. Crowd voting
d. Crowdfunding
28. What is inter firewall analytics?
Inter Firewall Analytics is a type of security analytics that focuses on monitoring
and analyzing traffic flowing between different zones or segments of a network
that are separated by firewalls.
The goal is to identify and prevent potential threats that may be hiding within the
traffic.
29. What do you mean by trans firewall analytics?
Trans Firewall Analytics is a type of network security analytics that focuses on
monitoring and analyzing network traffic passing through an organization's
perimeter firewall(s).
The main purpose of trans firewall analytics is to identify and prevent network
threats and attacks such as malware, viruses, phishing attempts, and other types of
cyber threats that try to penetrate an organization's network.
Trans Firewall Analytics involves monitoring and analyzing network traffic logs
generated by the firewall.
30. Difference between inter and trans firewall analytics.
Inter-Firewall | Trans-Firewall
Firewall analytics performed between two or more firewalls | Firewall analytics performed within a single firewall
Spans multiple firewalls or security domains | Limited to a single firewall or security domain
Captures and analyzes traffic between firewalls, including ingress and egress traffic | Captures and analyzes traffic within a single firewall, including traffic between different network zones or segments
Example tools: FireEye, Palo Alto Networks, Cisco ASA | Example tools: Suricata, Snort, pfSense
PART - B
1. Explain about introduction to big data.
2. Explain about convergence of key trends.
3. Explain unstructured data and industry examples of big data.
4. Explain web analytics and big data applications.
5. Explain big data technologies.
6. Explain introduction to Hadoop.
7. Explain open source technologies and cloud and big data.
8. Explain mobile business intelligence and crowd sourcing analytics.
9. Explain inter and trans firewall analytics.
UNIT 2 TWO MARKS QUESTIONS
Introduction to NoSQL – aggregate data models – key-value and document data
models –relationships – graph databases – schemaless databases – materialized views
– distribution models – master-slave replication – consistency - Cassandra Cassandra
data model –Cassandra examples – Cassandra clients
1. What is NOSQL?
NoSQL Database is a non-relational Data Management System, that does not require a
fixed schema. NoSQL database stands for "Not Only SQL" or "Not SQL." NoSQL
database technology stores information in JSON documents instead of columns and rows
used by relational databases. NoSQL databases are widely used in real-time web
applications and big data, because their main advantages are high scalability and high
availability.
2. What are the features of NoSQL?
Availability
Flexibility
Scalability
Distributed
Highly functional
High performance
3. What are the types of NOSQL databases?
Document databases: It store data in documents similar to JSON (JavaScript
Object Notation) objects. Each document contains pairs of fields and values. The
values can typically be a variety of types including things like strings, numbers,
booleans, arrays, or objects.
Key-value databases: are a simpler type of database where each item contains
keys and values.
Wide-column stores or Column Family Data stores: store data in tables, rows,
and dynamic columns.
Graph databases: store data in nodes and edges. Nodes typically store
information about people, places, and things, while edges store information about
the relationships between the nodes.
4. Difference between SQL and NOSQL?
SQL | NoSQL
SQL databases are primarily called RDBMS or relational databases | NoSQL databases are primarily called non-relational or distributed databases
SQL databases are table based | NoSQL databases can be document based, key-value pairs, or graph databases
SQL databases have a fixed, predefined schema | NoSQL databases have a dynamic schema
Vertically scalable (scale up with a larger server) | Horizontally scalable (scale out across commodity servers)
Follows ACID (atomicity, consistency, isolation, durability) | Follows CAP (consistency, availability, partition tolerance)
Transactions are supported | Transaction support is limited
Joins are required | Joins are not required
Best suited for complex queries | Not so good for complex queries
A mix of open-source like Postgres and MySQL, and commercial like Oracle Database | Mostly open-source
5. List the advantages of NOSQL
NoSQL databases simplify application development, particularly for interactive
real-time web applications, such as those using a REST API and web services.
These databases provide flexibility for data that has not been normalized, which
requires a flexible data model, or has different properties for different data
entities
They offer scalability for larger data sets, which are common in analytics and
artificial intelligence (AI) applications.
NoSQL databases are better suited for cloud, mobile, social media and bi data
requirements.
They are designed for specific use cases and are easier to use than general
purpose relational or SQL databases for those types of applications.
6. List the disadvantages of NoSQL
Each NoSQL database has its own syntax for querying and managing data.
Lack of a rigid database schema and constraints removes the data integrity
safeguards that are built into relational and SQL database systems.
A schema with some sort of structure is required in order to use the data.
Because most NoSQL databases use the eventual consistency model, they do not
provide the same level of data consistency as SQL databases.
The data will not be consistent, which means they are not well-suited for
transactions that require immediate integrity, such as banking and ATM
transactions.
There are no comprehensive industry standards as with relational and SQL
DBMS offerings.
Lack of ACID properties
Lack of JOINS.
7. Difference between Cassandra and MySQL
Cassandra | MySQL
Apache Cassandra is a type of NoSQL database | MySQL is a type of relational database
Cassandra was developed by the Apache Software Foundation and released in July 2008 | MySQL was developed by Oracle and released in May 1995
Apache Cassandra is written in Java | MySQL is written in C and C++
Cassandra does not provide ACID properties; it provides only the AID properties | MySQL provides ACID properties
A read operation in Cassandra takes O(1) complexity | A read operation in MySQL takes O(log(n)) complexity
There is no foreign key in Cassandra, so it does not provide the concept of referential integrity | MySQL has foreign keys, so it supports the concept of referential integrity
8. Difference between Cassandra and RDBMS
Cassandra | RDBMS
Cassandra handles unstructured data | RDBMS handles structured data
Cassandra provides a flexible schema to store the data | RDBMS provides a fixed schema to store the data
In Cassandra, keyspaces are used to store the tables; the keyspace is the outermost container of Cassandra | In RDBMS, databases are used to store the tables; the database is the outermost container of the relational database management system
In Cassandra, tables are represented as nested key-value pairs | In RDBMS, tables are represented as arrays of rows and columns
In Cassandra, entities are represented through tables and columns | In RDBMS, entities are represented through tables
In Cassandra, relationships are represented by collections | In RDBMS, joining is supported through the concept of a foreign key join
In Cassandra, a column is a storage unit | In RDBMS, an attribute of the table is represented through a column
In Cassandra, rows are represented as the replication unit | In RDBMS, rows are represented as the actual data
9. What is Apache Cassandra?
Apache Cassandra is a free and open-source distributed NoSQL database management
system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure.
10. What is CQLSH? And why is it used?
cqlsh is a command-line shell that enables users to communicate with the Cassandra database through CQL (Cassandra Query Language).
By using cqlsh, you can do the following:
o Define a schema
o Insert data, and
o Execute a query
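As an illustration, a few CQL statements of the kind typically issued from cqlsh; the keyspace,
table, and column names below are made up for this example.

-- define a schema
CREATE KEYSPACE school WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
USE school;
CREATE TABLE students (roll_no int PRIMARY KEY, name text, dept text);
-- insert data
INSERT INTO students (roll_no, name, dept) VALUES (1, 'Ajay', 'CSE');
-- execute a query
SELECT * FROM students WHERE roll_no = 1;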
11. What are Clusters in Cassandra?
The outermost structure in Cassandra is the cluster. A cluster is a container for Keyspaces.
It is sometimes called the ring, because Cassandra assigns data to nodes in the cluster by
arranging them in a ring. Each node holds a replica for a different range of data.
12. What is a Keyspace in Cassandra?
A keyspace is the outermost container for data in Cassandra. Like a relational database, a
keyspace has a name and a set of attributes that define keyspace- wide behaviour. The
keyspace is used to group Column families together.
13. What is a Column Family?
A column family is a container for an ordered collection of rows, each of which is itself an
ordered collection of columns. We can freely add any column to any column family at any
time, depending on your needs. The comparator value indicates how columns will be
sorted when they are returned to you in a query.
14. What is a Row in Cassandra? and What are the different elements of it?
A row is a collection of sorted columns. It is the smallest unit that stores related data in
Cassandra. Any component of a Row can store data or metadata.
The different elements/parts of a row are the,
Row Key
Column Keys
Column Values
15. Name some features of Apache Cassandra.
High scalability
High fault tolerance
Flexible data storage
Easy data distribution
Tunable consistency
Efficient writes
Cassandra Query Language (CQL)
16. List some of the components of Cassandra.
Some components of Cassandra are:
Table
Node
Cluster
Data Centre
Commit log
Bloom Filter
Memtable
SSTable
17. Write some advantages of Cassandra.
These are the advantages of Cassandra:
Since data can be replicated to several nodes, Cassandra is fault tolerant
Cassandra can handle a large set of data. Cassandra provides high scalability.
18. Define commit log.
It is a mechanism that is used to recover data in case the database crashes. Every operation
that is carried out is saved in the commit log. Using this, the data can be recovered.
19. Define composite key.
Composite keys include row key and column name. They are used to define column family
with a concatenation of data of different type.
20. Define SSTable.
SSTable stands for Sorted String Table. It is an immutable data file on disk to which the contents of MemTables are regularly flushed.
21. What is memtable?
Memtable is in-memory/write-back cache space containing content in key column format.
In memtable, data is sorted by key, and each ColumnFamily has a distinct memtable that
retrieves column data via key. It stores the writes until it is full, and is then flushed out.
22. How the SSTable is different from other relational tables?
SSTables do not allow any further addition or removal of data items once written. For
each SSTable, Cassandra creates three separate files: a partition index, a partition
summary, and a bloom filter.
23. What is data replication in Cassandra?
Data replication is an electronic copying of data from a database in one computer or server
to a database in another so that all users can share the same level of information.
Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The
replication strategy decides the nodes where replicas are placed
PART - B
1. Explain about introduction to NoSQL and aggregate data models.
2. Explain key-value and document data models.
3. Explain relationships and graph databases.
4. Explain schemaless databases and materialized views.
5. Explain distribution models and master-slave replication.
6. Explain consistency and Cassandra.
7. Explain the Cassandra data model and Cassandra examples.
8. Explain Cassandra clients.
UNIT-III MAP REDUCE APPLICATIONS
MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of
MapReduce job run – classic Map-reduce – YARN – failures in classic Map-reduce and YARN –
job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output
formats.
PART-A
1. What do you mean by MapReduce?
MapReduce is a programming model and software framework that allows for the
processing of large data sets in parallel across multiple computers.
The MapReduce model breaks down a large dataset into smaller chunks and then
processes those chunks in parallel across a distributed computing environment. Two
functions in MapReduce are map() and reduce().
2. What are the advantages of using MapReduce with Hadoop?
Advantages of MapReduce
Advantage | Description
Flexible | Hadoop MapReduce programming can access and operate on different types of structured and unstructured data
Parallel Processing | MapReduce programming divides tasks so that they can be executed in parallel
Resilient | It is fault tolerant; it quickly recognizes faults and then applies a quick recovery solution implicitly
Scalable | Hadoop is a highly scalable platform that can store as well as distribute large data sets across plenty of servers
Cost-effective | The high scalability of Hadoop also makes it a cost-effective solution for ever-growing data storage needs
Simple | It is based on a simple programming model
Secure | Hadoop MapReduce aligns with HDFS and HBase security measures
Speed | It uses a distributed file system for storage and processes even large sets of unstructured data in minutes
3. Explain what is distributed Cache in MapReduce Framework?
Distributed Cache is an important feature provided by the MapReduce framework. When you
want to share some files across all nodes in a Hadoop cluster, the Distributed Cache is used. The
files could be executable JAR files or simple properties files.
4. Explain what is the function of MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key
go to the same reducer, which eventually helps in the even distribution of the map output over
the reducers.
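To make this concrete, here is a minimal custom-partitioner sketch against the Hadoop Java API; it
mirrors what Hadoop's default HashPartitioner does, and the key/value types are chosen only for the
example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every occurrence of a key to the same reducer by hashing the key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take it modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}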
5. Mention what are the main configuration parameters that the user needs to specify to run a
MapReduce job?
The user of the MapReduce framework needs to specify
Job's input locations in the distributed file system
Job's output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
6. List out the key concepts related to MapReduce.
Job
Task
JobTracker
TaskTracker
Map()
Reduce()
Data Locality
7. Explain Map () and Reduce() function.
Map()
Map Task in MapReduce is performed using the Map() function.
This part of MapReduce is responsible for processing one or more chunks of data and
producing the output results.
Reduce()
The next part/component/stage of the MapReduce programming model is the Reduce()
function.
This part of MapReduce is responsible for consolidating the results produced by each
of the Map() functions/tasks.
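The following is a minimal word-count sketch of the two functions using the Hadoop Java API; the
class names and types are chosen only for this example and are not taken from the syllabus.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): reads one line of input at a time and emits a (word, 1) pair for every word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce(): receives each word together with all of its 1s and emits the (word, total count) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}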
8. Difference between job Tracker and Task Tracker.
Job Tracker | Task Tracker
Central coordination and management of MapReduce jobs | Execution and monitoring of individual tasks
Coordinates job scheduling, progress monitoring, failure management, and job-level management | Executes individual tasks assigned by the Job Tracker and reports progress to the Job Tracker
Manages MapReduce jobs at the job level | Manages individual tasks at the task level
Single instance per Hadoop cluster | Multiple instances per Hadoop cluster
Single point of failure | Fault-tolerant through replication and recovery mechanisms
Receives job requests from clients, assigns tasks to Task Trackers, monitors progress, and manages failures and re-scheduling of tasks | Executes tasks assigned by the Job Tracker and reports progress and status back to it
Manages job-specific configurations and coordinates multiple tasks across different Task Trackers | Executes the individual tasks assigned to it by the Job Tracker
Applicable in earlier versions of Hadoop (prior to Hadoop 2.x) | Not applicable in Hadoop 2.x and later versions
9. What are the stages in MapReduce workflow?
Input
Split
Map
Combine
Shuffle & Sort
Reduce
Output
10. What do you mean by MRUnit?
MRUnit is a Java-based testing framework that is specifically designed for testing
MapReduce programs in Hadoop.
It provides a set of tools and utilities for writing unit tests for MapReduce jobs, which
enables developers to test their MapReduce code in an isolated and controlled
environment.
MRUnit allows developers to write test cases for MapReduce programs using familiar
JUnit testing syntax, making it easy to incorporate MapReduce testing into existing
Java development workflows.
11. List down the steps to write unit test with MRUnit.
Set up your test environment.
Define your input and expected output
Write your test case
Run your test case
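Putting these steps together, here is a hedged MRUnit test sketch for the kind of word-count mapper
shown earlier in this unit; the mapper class and the test values are assumptions made for this example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Set up the test environment: wrap the mapper under test in a MapDriver.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void mapperEmitsOneCountPerWord() throws Exception {
        // Define the input and the expected output, then run the test case.
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}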
12. What are the key components of MapReduce development that ensure the correctness of
MapReduce Program?
The two key components of MapReduce development that are used to ensure the correctness
of a MapReduce program are
Test data
local tests
13. What is Test data?
Test data is a set of input data that is used to test a MapReduce program.
The test data should cover a range of possible scenarios and edge cases to ensure that
the program handles all possible inputs correctly.
The test data should be representative of real-world scenarios that the program is
expected to handle.
To ensure that the MapReduce program works as expected, developers must define the
expected output for each input scenario.
14. What is the use of Local test?
Development and debugging
Faster feedback loop
Cost-effective
Controlled environment
Flexibility
15. Write the benefits of using test data and local test in MR development.
Early bug detection
Improved code quality
Faster feedback
Easier refactoring
16. List the step involved in using test data and local test in MapReduce.
Define the test data
Define the expected output
Write unit tests
Write integration tests
Automate the testing process
Run end-to-end tests
17. What do you mean by Shuffle and Sort?
Shuffling is the process by which the mapper's intermediate output is transferred to the
reducer.
Each reducer gets one or more keys and their associated values, based on the number of reducers.
The intermediate key-value pairs generated by the mapper are sorted automatically by key.
In the sort phase, merging and sorting of the map output takes place.
Shuffling and sorting in Hadoop occur simultaneously.
18. Infer your knowledge about YARN.
YARN is a cluster management technology that provides resource management and
scheduling capabilities for Hadoop.
It decouples the resource management and job scheduling functions of Hadoop from
the data processing functions, enabling more flexible efficient cluster utilization.
YARN consists of a ResourceManager that manages cluster resources and an
Application Master that manages the execution of individual applications
It provides a centralized framework for managing resources and scheduling tasks,
which makes it easier to scale and manage large clusters.
19. List out the failures in MapReduce and YARN.
Failures in Classic MapReduce:
Node Failures
Job Failures
Network Failures
Disk Failures
Failures in YARN:
Resource Manager Failures
Node Manager Failures
Application Master Failures
Network Failures
Disk Failures
20. List out the features of YARN.
Scalability
Flexibility
Resource management
Fault tolerance
Scalable data processing
Data locality
Job scheduling
Security
Monitoring and diagnostics
Extensibility
21. Name the types of Hadoop Scheduler.
There are three schedulers in Hadoop
FIFO scheduler
Capacity scheduler
Fair scheduler
22. List out the types of MapReduce.
Traditional MapReduce
MapReduce 2(YARN-based)
23. What do you mean by Task Execution?
Task execution refers to the process of running individual tasks or computations on a
computing system or framework.
In the context of distributed computing frameworks like Hadoop, task execution
involves the parallel processing of data across multiple machines or nodes in a
distributed cluster.
24. What do you mean by Job Scheduling?
Job scheduling refers to the process of allocating computing resources and
determining the order in which tasks or jobs are executed in a distributed computing
system.
It involves managing the submission, prioritization, and execution of jobs or tasks on a
cluster of machines to ensure efficient and optimal resource utilization.
PART-B
1. Explain MapReduce workflows.
2. Explain unit tests with MRUnit and test data and local tests.
3. Explain the anatomy of a MapReduce job run.
4. Explain classic MapReduce.
5. Explain YARN and failures in classic MapReduce and YARN.
6. Explain shuffle and sort and task execution.
7. Explain MapReduce types and input formats and output formats.
UNIT IV - TWO MARKS QUESTION AND ANSWERS
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming –
Hadooppipes –design of Hadoop distributed file system (HDFS) – HDFS concepts –
Java interface – data flow –Hadoop I/O – data integrity – compression – serialization –
Avro – file-based data structures -Cassandra – Hadoop integration
1. What are the two main files stored in HDFS?
Text files
Binary files.
Text files are stored as plain text files and are the most common type of files used in
Hadoop. Text files can be read and processed by Hadoop applications using standard
input/output libraries.
Binary files, on the other hand, are stored as raw data in the HDFS cluster. Examples
of binary files include image files, audio files, or video files. Binary files can be
processed by Hadoop applications using specialized input/output libraries, such as the
Hadoop Sequence File format or the Hadoop Avro format.
2. What are the steps in analyzing data with Hadoop?
The process of analyzing data with Hadoop typically involves the following steps:
Data ingestion
Data preparation
Data processing
Data analysis
Data visualization
3. What is data visualization?
Data visualization refers to the use of graphical representations, such as charts, graphs,
and maps, to visually present data and information in a clear and understandable manner.
It involves the use of visual elements, such as colors, shapes, and patterns, to represent
data points or patterns in data, making it easier for users to interpret complex information
and identify meaningful insights.
4. What do you mean by scaling out?
Scaling out, or horizontal scaling, involves adding servers for parallel computing.
The scale out technique is a long-term solution, as more and more servers may be added
when needed.
But going from one monolithic system to this type of cluster may be a difficult, although
extremely effective solution.
5. What are the bottlenecks that can be solved with scaling?
High CPU Usage
Low Memory
High disk usage
6. Infer your knowledge on streaming.
Hadoop Streaming is another feature/utility of the Hadoop ecosystem that enables users to
execute a MapReduce job from an executable script as Mapper and Reducers.
Hadoop Streaming is often confused with real-time streaming, but it's simply a utility that
runs an executable script in the MapReduce framework.
The executable script may contain code to perform real-time ingestion of data.
The basic functionality of the Hadoop Streaming utility is that it takes the supplied
executable scripts as the Mapper and Reducer, creates a MapReduce job, submits the
job to the cluster, and monitors the job until it completes.
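A typical (hypothetical) invocation looks like the following; the streaming JAR location, the input
and output directories, and the script names are placeholders that vary by installation.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/data/input \
    -output /user/data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py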
7. What is Hadoop pipes?
Hadoop Pipes is a C++ API that enables developers to write MapReduce applications in
C++.
It is an alternative to Hadoop's native Java API; unlike Hadoop Streaming, which uses standard
input and output, Pipes uses sockets as the channel between the task tracker and the C++ processes.
The Hadoop Pipes API allows external C++ programs to be used as both map and reduce
functions. This means that existing C++ code can be leveraged in a Hadoop MapReduce
job.
8. Name some key concepts of Hadoop.
Blocks
Namenode
Datanodes
Rack Awareness
Replication
9. What is Java Interface?
In Hadoop, Java interfaces are used to define the contracts that classes must implement in
order to work with Hadoop components.
Hadoop is written in Java, and interfaces are used extensively throughout the framework.
An interface in Hadoop defines a set of methods that a class must implement in order to
work with the corresponding Hadoop component.
10. Write down the basic dataflow of Hadoop system.
Capture Big Data
Process and Structure
Distribute Results
Feedback and Retain
11. What do you mean by Data Integrity?
Data integrity means that data should remain accurate and consistent across its storing,
processing, and retrieval operations.
To ensure that no data is lost or corrupted during persistence and processing, Hadoop
maintains stringent data integrity constraints.
Every read/write operation that occurs on disks, and even more so over the network, is prone to
errors, and the volume of data that Hadoop handles only aggravates the situation.
12. Infer your knowledge on Compression.
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a
requirement.
There are many obvious benefits of file compression rightly used by Hadoop.
It economizes storage requirements and is a must-have capability to speed up data
transmission over the network and disks. For example, gzip, bzip2, LZO, zip, and so forth are
often used.
13. What is serialization?
The process that turns structured objects into a stream of bytes is called serialization. It is
specifically required for data transmission over the network or for persisting raw data on
disks.
Deserialization is just the reverse process, where a stream of bytes is transformed into a
structured object. It is particularly required when objects have to be reconstructed from the
raw bytes.
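As a hedged sketch, a custom Writable in Java: write() serializes the object's fields into a byte
stream and readFields() deserializes them back; the class and field names are invented for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TemperatureRecord implements Writable {
    private int year;
    private float temperature;

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeFloat(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        temperature = in.readFloat();
    }
}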
14. What do you mean by AVRO?
Apache Avro is a data serialization system that is used to exchange data between different
systems.
It is based on a schema-based serialization technique and is used in various big data
processing frameworks like Hadoop, Spark, and Flink.
Avro is a language-neutral data serialization system that uses JSON or binary encoding
for compactness and fast processing.
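For illustration, a small Avro schema written in JSON; the record and field names are made up for
this example.

{
  "type": "record",
  "name": "Student",
  "namespace": "example.avro",
  "fields": [
    {"name": "roll_no", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "dept", "type": ["null", "string"], "default": null}
  ]
}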
15. What are the commonly used file-based data structures in Hadoop?
SequenceFile
Avro file
RCFile
ORC file
Parquet file
16. List out the features of Cassandra.
Flexible data storage
Easy data distribution
Elastic scalability
Fast writes
Always-on architecture
Fast linear-scale performance
Transaction support
17. Infer your knowledge on Hadoop Integration.
Hadoop architecture is designed to be easily integrated with other systems.
Integration is very important because, although we can process the data efficiently in
Hadoop, we should also be able to send that result to another system to move the data
to another level.
Data has to be integrated with other systems to achieve interoperability and flexibility.
PART-B
1. Explain Data format and analyzing data with Hadoop
2. Explain scaling out and Hadoop streaming and Hadoop pipes
3. Explain the design of the Hadoop distributed file system (HDFS)
4. Explain HDFS concepts and Java interface
5. Explain data flow and Hadoop I/O
6. Explain data integrity and compression
7. Explain serialization and Avro , file-based data structures
8. Explain Cassandra and Hadoop integration.
UNIT-V - TWO MARK QUESTIONS AND ANSWERS
Hbase – data model and implementations – Hbase clients – Hbase examples – praxis
Pig– Grunt – pig data model – Pig Latin – developing and testing Pig Latin scripts.Hive
– data types and file formats – HiveQL data definition – HiveQL data manipulation –
HiveQLqueries.
1.Explain What is HBase?
HBase is a column-oriented database management system which runs on top of HDFS (Hadoop
Distribute File System). HBase is not a relational data store, and it does not support structured
query language like SQL. In HBase, a master node regulates the cluster and region servers to store
portions of the tables and operates the work on the data.
2. Explain why to use HBase?
High capacity storage system
Distributed design to cater large tables
Column-Oriented Stores
Horizontally Scalable
High performance & Availability
The base goal of HBase is millions of columns, thousands of versions, and billions of rows
Unlike HDFS (Hadoop Distribute File System), it supports random real time CRUD
operations.
3. Mention what are the key components of HBase?
HBase architecture consists mainly of following components
Zookeeper: It does the co-ordination work between the client and HBase Master
HBase Master: HBase Master monitors the Region Server
Region Server: Region Server monitors the Region
Region: It contains an in-memory data store (MemStore) and HFile
Catalog Tables: Catalog tables consist of ROOT and META.
4. Explain what is the row key?
Row key is defined by the application. As the combined key is pre-fixed by the rowkey, it enables
the application to define the desired sort order. It also allows logical grouping of cells and makes
sure that all cells with the same rowkey are co-located on the same server.
5.Difference between HBase and HDFS?
HBase | HDFS
Low latency operations | High latency operations
Random reads and writes | Write once, read many times
Accessed through shell commands, client API in Java, REST, Avro or Thrift | Primarily accessed through MR (MapReduce) jobs
Both storage and processing can be performed | It is only for storage
6. Mention the difference between HBase and Relational Database?
Here are some important differences between Apache HBase and Relational Database:
HBase | Relational Database
It is schema-less | It is a schema-based database
It is a column-oriented data store | It is a row-oriented data store
It is used to store de-normalized data | It is used to store normalized data
It contains sparsely populated tables | It contains thin tables
Automated partitioning is done in HBase | There is no such provision or built-in support for partitioning
7. What are the key components of Apache HBase?
The key components of HBase are HBase Master, RegionServer, Zookeeper, Region, and Catalog
Tables.
HMaster: It is similar to NameNode of Hadoop which manages Region Servers.
RegionServer: A table can be divided into several regions and those regions are served to
the clients by a Region Server.
ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed system. It maintains
the health of the server by communicating through sessions.
Region: It holds an in-memory data store (MemStore) and HFile.
Catalog Tables: It holds ROOT and META.
8. What is the use of HBase HMaster?
Main responsibilities of a master are:
a. Coordinating the region servers
b. Admin functions
9. What is the data model of Apache HBase?
HBase data model contains below.
List of tables.
Every table has column families and rows.
Row key acts as a Primary key in the table.
HBase tables use this Primary Key for access.
Every column qualifier denotes an attribute corresponding to the object present
in the cell.
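To make the model concrete, here are a few HBase shell commands of the kind used to create and read
such a table; the table, column family, and row names are made up for this example.

create 'students', 'info'                        # table 'students' with one column family 'info'
put 'students', 'row1', 'info:name', 'Ravi'      # write a cell: row key, column qualifier, value
put 'students', 'row1', 'info:city', 'Coimbatore'
get 'students', 'row1'                           # read one row by its row key
scan 'students'                                  # read all rows in the table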
10. What is the difference between RDBMS and HBase?
RDBMS | HBase
It uses tables as databases | It uses regions as databases
File systems supported are FAT, NTFS, and EXT | The file system supported is HDFS
To store logs, RDBMS uses commit logs | Apache HBase uses Write-Ahead Logs (WAL) to store logs
The reference system used is a coordinate system | The reference system used is ZooKeeper
Uses the primary key | Uses the row key
Partitioning is supported | Sharding is supported
The data model of RDBMS is rows and columns | The data model of HBase is rows, columns, column families, and cells
11. What are the features of Apache HBase?
Apache HBase provides the following sets of features that make it a solid non- relational database.
Apache HBase provides linear and modular scalability.
It provides read and writes operations consistently.
It provides the automatic facility to configure the table sharding.
Apache HBase RegionServers support automatic failover.
A client can interact with Apache HBase using the Java API.
It can be extensible for the JRuby-based (JIRB) shell as well.
12. What are some major advantages of Apache HBase?
The major advantages of Apache HBase are as below.
Apache HBase is great for analytics in association with Hadoop MapReduce.
Apache HBase can store billions of rows and can process them as well.
It supports scaling out in coordination with the Hadoop file system even on commodity
hardware.
HBase is fault tolerant.
The Apache Foundation provides it license-free.
It is very flexible on schema design/no fixed schema.
HBase can be integrated with Hive for SQL-like queries, which is better for DBAs who
are more familiar with SQL queries.
It provides Auto-sharding.
It provides the feature of auto-failover.
It provides a simple client interface.
HBase provides row-level atomicity, that is, the PUT operation will either write or fail.
13. List the features of pig Latin?
Pig Latin includes operators for many of the traditional data operations (join,
sort, filter, etc.)
Pig Latin is extensible so that users can develop their own functions for reading,
processing, and writing data.
A Pig Latin script is made up of a series of operations, or transformations, that are applied to
the input data to produce output.
Pig Latin programs can be executed either in Interactive mode through Grunt shell or in
Batch mode via Pig Latin Scripts.
14. Difference between MapReduce and Pig
MapReduce | Apache Pig
1. It is a low-level data processing paradigm | 1. It is a high-level data flow platform
2. Complex Java implementations | 2. No complex Java implementations
3. Does not provide nested data types | 3. Provides nested data types like tuples, bags, and maps
4. Performing data operations is a humongous task | 4. Provides many built-in operators to support data operations
15. What are the different ways of executing Pig script?
There are three ways to execute the Pig script:
Grunt Shell: This is Pig's interactive shell provided to execute all Pig Scripts.
Script File: Write all the Pig commands in a script file and execute the Pig script file. This
is executed by the Pig Server.
Embedded Script: If some functions are unavailable in built-in operators, we can
programmatically create User Defined Functions (UDF) to bring that functionality using
other languages like Java, Python, Ruby, etc. and embed it in the Pig Latin Script file.
Then, execute that script file.
16. List the applications of pig.
Pig scripts are used for exploring massive databases.
Pig and Pig Latin provide support for ad-hoc queries across huge data sets.
Pig scripts aid in the development of massive data set processing methods.
Pig is required for the processing of time-sensitive data loads.
Pig scripts are used to collect massive volumes of data in search logs and web crawls.
17. What are the types of data models in Pig?
It consists of the four types of data models as follows:
Atom: It is an atomic data value that is stored as a string. The primary use of this model is
that it can be used both as a number and as a string.
Tuple: It is an ordered set of fields.
Bag: It is a collection of tuples.
Map: It is a set of key/value pairs.
18. How does the user communicate with shell in Apache Pig?
Users interact with HDFS or any local file system through Grunt, which is the Apache Pig's
communicative shell. To initiate Grunt, users need to invoke Apache Pig with no command,
as follows:
Executing the command "pig -x local" will show the prompt - grunt>
Pig Latin scripts can run either in local mode or in cluster mode by setting up the
configuration in PIG_CLASSPATH.
For exiting from the Grunt shell, users need to press CTRL+D or just type quit.
19. What is Pigstorage?
Loads or stores relations using field delimited text format.
Each line is broken into fields using a configurable field delimiter (defaults to a tab character) to be
stored in the tuple's fields. It is the default storage when none is specified.
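A short Pig Latin sketch that uses PigStorage for both loading and storing; the file paths and field
names below are hypothetical.

-- load a comma-delimited file into tuples with a declared schema
customers = LOAD '/user/data/customers.csv' USING PigStorage(',')
            AS (customer_id:int, name:chararray, country_id:int);
-- group the tuples into bags and count the customers per country
grouped = GROUP customers BY country_id;
counts  = FOREACH grouped GENERATE group AS country_id, COUNT(customers) AS num_customers;
-- store the result using a tab delimiter
STORE counts INTO '/user/data/customer_counts' USING PigStorage('\t');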
20. What is Grunt Shell?
Grunt Shell is an interactive shell: each command is executed immediately and its result, whether
a success or a failure, is shown then and there.
21. Explain the different ways to run pig scripts?
Pig scripts can be run in three different ways, all of them compatible with local
and Hadoop modes:
Script: A file containing Pig Latin commands is identified by the .pig suffix (for
example, file_x.pig). These commands are interpreted by Pig and executed in sequential
order.
Grunt: Grunt is a command parser. If you type Pig Latin on the grunt command line,
Grunt will execute the command for you. This is quite helpful for prototyping and "what
if" scenarios.
Embedded: Pig programs can be executed as part of a Java program.
22. What is the Difference Between Pig & SQL?
Pig | SQL
Pig is procedural | SQL is declarative
Nested relational data model | Flat relational data model
Schema is optional | Schema is required
Scan-centric analytic workloads | OLTP + OLAP workloads
Limited query optimization | Significant opportunity for query optimization
23. What are the operations supported by Pig?
Pig Latin has a very rich syntax. It supports operators for the following operations:
Loading and storing of data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
24. What is the difference between Pig Latin and Pig Engine?
Pig Latin is a scripting language, similar to Perl, used to search large data sets. It is composed of a
sequence of transformations and operations that are applied to the input data to produce output.
The Pig engine is the environment in which Pig Latin programs are executed. It translates Pig
Latin operators into MapReduce jobs.
25. What is pig storage?
Pig has a built-in load function called pig storage. In addition, whenever we wish to import data
from a file system into the Pig, we can use Pig storage.
26. What are the Pig execution environment modes?
Apache Pig scripts can be executed in three ways: interactive mode, batch mode, and embedded
mode.
27. Explain the features of Pig and Pig Latin?
Apache Pig and Pig Latin come with the following features.
Rich set of operators: It has a variety of operators for performing operations like join,
sort, filter, etc.
Handles all kinds of data: Apache Pig examines various data types, organized and
unstructured. The findings are saved in HDFS.
User-Defined Functions (UDFs): Pig allows you to write user-defined functions in other
programming languages, such as Java, and then invoke or embed them in Pig Scripts.
Extensibility: Users can create their functions to read, process, and write data using
current operators.
Ease of programming: Pig Latin is comparable to SQL, and writing a Pig script is simple
if you know SQL.
Optimization opportunities: The jobs in Apache Pig optimize their execution
automatically, so programmers need to focus on language semantics.
28. List the limitations of Hive?
Apache Hive has the following list of limitations.
OLTP Processing issues: Apache Hive is not intended for online transaction processing
(OLTP).
No Updates and Deletes: Hive does not support updates and deletes; however, it does
support overwriting or appending data.
Limited Subquery Support: Hive offers only limited support for subqueries.
No Support for Materialized View: Hive does not support Materialized view.
High Latency: The latency of Apache Hive queries is generally very high.
29. What is the difference between Apache Hive and Apache Pig?
Hive | Pig
Hive is used for structured data | Pig is majorly used for semi-structured data
Hive requires a well-defined schema | In Pig, the schema is optional
Hive is a declarative language with a syntax very similar to SQL | Pig is procedural in nature
Hive is mainly used for reporting | Pig is mainly used for programming
Hive is usually used on the server side of the Hadoop cluster | Pig is usually used on the client side of the Hadoop cluster
Hive is more like SQL | Pig is more verbose
30. Define the difference between Hive and HBase?
HBase | Hive
HBase is built on top of HDFS | Hive is a data warehousing infrastructure
HBase operations run in real time on its database | Hive queries are executed as MapReduce jobs internally
Provides low latency access to single rows from huge datasets | Provides high latency for huge datasets
Provides random access to data | Does not provide random access; it is meant for batch-style queries
31. Explain what is Hive?
Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File System
(HDFS). It is a data warehouse framework for querying and analysis of data that is stored in
HDFS. Hive is open-source software that lets programmers analyze large data sets on Hadoop.
32. What is Hive QL?
1. Support SQL-like query language called HiveQL for select, join, aggregate, union all, and
subquery in the from clause.
2. Support DDL statements such as CREATE table with serialization format, partitioning, and
bucketing columns.
3. Command to load data from external sources and INSERT into HIVE tables.
4. Do not support UPDATE and DELETE.
5. Support multi-table INSERT.
6. Support user-defined column transformation (UDF) and aggregation (UDAF) function written
in Java.
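As an illustration of these capabilities, here is a short HiveQL sequence with a DDL statement, a
data load, and a query; the table and file names are hypothetical.

CREATE TABLE IF NOT EXISTS students (roll_no INT, name STRING, dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE students;

SELECT dept, COUNT(*) AS strength
FROM students
GROUP BY dept;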
33. When to use Hive?
Hive is useful when making data warehouse applications
When you are dealing with static data instead of dynamic data
When application is on high latency (high response time)
When a large data set is maintained
When we are using queries instead of scripting
34. Mention what are the different modes of Hive?
Depending on the size of data nodes in Hadoop, Hive can operate in two modes. These modes are,
Local mode
Map reduce mode
35. Mention key components of Hive Architecture?
Key components of Hive Architecture includes,
User Interface
Compiler
Metastore
Driver
Execute Engine
36. Mention what Hive is composed of?
Hive consists of 3 main parts,
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
37. Mention what Hive query processor does?
The Hive query processor converts a HiveQL query into a graph of MapReduce jobs together with the
execution-time framework, so that the jobs can be executed in the order of their dependencies.
38. Mention what are the components of a Hive query processor?
The components of a Hive query processor include,
Logical Plan Generation
Physical Plan Generation
Execution Engine
Operators
UDF's and UDAF's
Optimizer
Parser
Semantic Analyzer
Type Checking
39. What is a metastore in Hive?
Metastore in Hive stores the metadata information using an RDBMS and an open-source ORM
(Object Relational Model) layer called DataNucleus, which converts the object representation into
relational schema and vice versa.
40. What is a partition in Hive?
Hive organizes tables into partitions for grouping similar type of data together based on a column
or partition key. Each Table can have one or more partition keys to identify a particular partition.
Physically, a partition is nothing but a sub-directory in the table directory.
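For example (hypothetical table and file names), a partitioned table in HiveQL, where each value of
the partition key becomes its own sub-directory.

CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/tmp/sales_2023.csv' INTO TABLE sales PARTITION (year = 2023);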
41. What is indexing and why do we need it?
A Hive index is a Hive query optimization technique. Basically, we use it to speed up access to a
column or set of columns in a Hive database. With an index, the database system does not need to
read all rows in the table to find the selected data.
42. Mention what is the difference between order by and sort by in Hive?
SORT BY will sort the data within each reducer. You can use any number of reducers
for SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer.
Thus, ORDER BY in Hive uses a single reducer.
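A small HiveQL illustration of the two clauses (the table and column names are made up):

SELECT name, marks FROM students ORDER BY marks DESC;   -- total ordering through a single reducer
SELECT name, marks FROM students SORT BY marks DESC;    -- ordering only within each reducer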
PART-B
1. Explain HBase and its data model and implementations.
2. Explain HBase clients and HBase examples.
3. Explain praxis and Pig.
4. Explain Grunt and the Pig data model.
5. Explain Pig Latin and developing and testing Pig Latin scripts.
6. Explain Hive data types and file formats.
7. Explain HiveQL data definition and HiveQL data manipulation.
8. Explain HiveQL queries.
Reg. No. :
Question Paper Code : 50417
B.E./B.Tech. DEGREE EXAMINATIONS, APRIL/MAY 2024.
Fifth/Sixth Semester
Computer Science and Engineering
CCS 334 –– BIG DATA ANALYTICS
(Common to Computer Science and Design/Computer Science and Engineering
(Artificial Intelligence and Machine Learning) Computer and Communication
Engineering/Electrical and Electronics Engineering/Artificial Intelligence and Data
Science/Computer Science and Business Systems/Information Technology)
(Regulations 2021)
Time : Three hours Maximum : 100 marks
Answer ALL questions.
PART A — (10 × 2 = 20 marks)
1. What characteristics will define a dataset as big data?
2. Analytic professionals need permissions to utilize the enterprise data
warehouse. In such case, suggest an alternate mechanism that is ideal for data
exploration.
3. What are NoSQL databases? Give example.
4. Why Cassandra data model is very popular among developers?
5. What is the role of mini reducer in Map reduce?
6. How YARN supports the notion of resource reservation?
7. List out the applications for which HDFS does not work well.
8. Mention the necessity for serialization in Hadoop and present the default
serialization framework supported by Hadoop.
9. Write a short note on HiveQL queries.
10. Mention the data types in Hive.
PART B — (5 × 13 = 65 marks)
11. (a) Discuss the activities in various phases of big data analytics life cycle by
considering the case study of stock market prediction system.
Or
(b) Consider a scenario of recommender system based on previous searches
made by user in social media. Explain how big data analytics lifecycle be
applied for this scenario.
12. (a) What is the need for NoSQL databases? Explain the types of NoSQL
databases with example.
Or
(b) Explain the features of Cassandra. List out the database components of
Cassandra. What is CQLSH and why is it used?
13. (a) (i) Write a pseudo code and map reduce code to count words present in
the file. (6)
(ii) Write in detail about testing techniques in MapReduce work
flow. (7)
Or
(b) (i) Write a Map reduce code using Python or Java in Hadoop to find
palindrome words from a given list of words in a file. Also write the
steps to execute the program. (7)
(ii) Explain about MapReduce input/output formats with examples. (6)
14. (a) Explain in detail about HDFS with neat diagram.
Or
(b) Compare and Contrast Hadoop with RDBMS for performing large scale
batch analysis.
15. (a) Write a user defined function in Pig Latin which performs the following
using the sample dataset provided:
(i) Assume the provided dataset is an excel sheet, Read the countries
and customer data separately and specify the resulting data
structure. (3)
(ii) Out of all the countries available, find the Asian countries. (3)
(iii) Find customers who belong to Asia. (2)
(iv) For those customers, find their country names. (2)
(v) Sort the results in ascending order and save them to a file. (3)
Sample dataset is shown below:
Customer_id Customer Name Gender City Country_id
101 Ajay M Kabul 1
102 Badri M New Delhi 3
103 Carolyn F Nairobi 4
104 Daniel M Cape Town 5
105 Edwin M Ottawa 6
106 Fathima F Chicago 7
107 Ganga F Islamabad 2
Country_id Country Name Country Region
1 Afghanistan Asia
5 South Africa Africa
2 Pakistan Asia
4 Kenya Africa
3 India Asia
6 Canada North America
7 United States North America
Or
(b) Write commands to create the following table in hbase and write
commands to perform the following in hbase.
Row Key | Name | Age | City
Row 1 | Ravi | 36 | Coimbatore
Row 2 | Udaya | 37 | Salem
Row 3 | Rama | 40 | Ooty
(i) Write command to update the age in row 2 to 20? (3)
(ii) Show all rows with a value age above 35? (2)
(iii) Add gender information for all the rows in the table. (3)
(iv) Count the number of entries in the table and print the count. (3)
(v) Write a command to drop the table. (2)
PART C — (1 × 15 = 15 marks)
16. (a) The Indian government has decided to use big data analytics to optimize
bus transport management. The State Bus Transport Authority division
has planned to collect and disseminate real-time data to identify the
causes of transport delays. For this purpose the appropriate data is
collected from various sources such as bus timetables, inductive-loop
traffic detectors, closed-circuit television cameras and GPS updates from
the city buses. This allows traffic controllers to see the current status of
the entire bus network. Elaborate on the different types of data that are
being generated in this scenario, devise a big data Ecosystem for Bus
transport and also explain the key roles for the new big data ecosystem
with a diagram.
Or
(b) A health researcher wants to predict “VO2max”, an indicator of fitness
and health. Normally, to perform this procedure requires expensive
laboratory equipment, as well as requiring individuals to exercise to their
maximum (i.e., until they can no longer continue exercising due to
physical exhaustion). This can put off individuals who are not very
active/fit and those who might be at higher risk of ill health (e.g., older
unfit subjects). For these reasons, it has been desirable to find a way of
predicting an individual’s VO2max based on attributes that can be
measured more easily and cheaply. To this end, a researcher recruited
participants to perform a maximum VO2max test, but also recorded their
“age”, “weight” and “heart rate”. The researcher wants to store the
recorded data of size 2.5GB in a Big data environment. Assume the
default block size to be 128 MB with the replication factor as 3. Calculate
the number of blocks needed for storing this dataset in HDFS. Illustrate
and explain the sequence of events on how to use the methods provided
by FileSystem API while reading a file.
————––––——