Big Data Question Bank
Answer Pattern:
1. Introduction statement
2. Relevant explanation
3. Example
4. Diagram
5. Anything more you want to add
Q1. Explain the concept of big data. Compare and contrast with data
warehouse.
Answer:
The term Big Data is used increasingly almost everywhere – online and offline – and it is no longer related only to computers.
It comes under the blanket term Information Technology, which is now part of almost all other technologies, fields of study, and businesses.
Big Data itself is not a big deal; the hype surrounding it, however, is big enough to confuse you. This answer looks at what Big Data is. Netflix, for example, famously used its data – or rather, its Big Data – to better serve its customers' needs.
What is Big Data
The data lying in your company's servers was just data until yesterday – sorted and filed. Then the term Big Data became popular, and suddenly the data in your company is Big Data.
The term covers every piece of data your organization has stored so far. It includes data stored in the cloud and even the URLs that you bookmarked. Your company may not have digitized all of the data, and you may not have structured all of it, but all the digital and paper, structured and unstructured data with your company is now Big Data.
In short, all the data – whether or not it is categorized – present in your servers is collectively called BIG DATA. All of this data can be used to get different results using different types of analysis.
It is not necessary that every analysis uses all of the data; different analyses use different parts of the BIG DATA to produce the results and predictions needed.
Big Data is essentially the data that you analyse for results that you can use for predictions and other purposes. Once the term Big Data is used, your company or organization is suddenly working with top-level information technology to deduce different types of results from the same data that it stored, intentionally or unintentionally, over the years.
How big is Big Data
Essentially, all the data combined is Big Data, but many researchers agree that Big
Data – as such – cannot be manipulated using normal spreadsheets and regular tools
of database management.
It needs special analysis tools such as Hadoop (discussed later in this question bank) so that all the data can be analysed in one go (which may include iterations of analysis).
That said, one could also argue that the data held by any organization – big or small, organized or unorganized – is Big Data for that organization, and that the organization may choose its own tools to analyse it.
Traditionally, people analysing data created separate data sets based on one or more common fields so that the analysis became easier. With Big Data there is no need to create such subsets: we now have tools that can analyse the data irrespective of how huge it is, and these tools often categorize the data even as they analyse it.
Related Question: What is the relation between a data warehouse and big data? Explain with a suitable example.
Answer:
A data warehouse holds cleaned, structured, historical data only, whereas big data covers that historical data plus current, fast-arriving data (for example, streams from IoT devices), whether structured or unstructured. For example, a retailer's data warehouse stores past sales by store and month for reporting, while its big data platform also ingests live clickstream and sensor data to analyse what is happening right now.
Q3. Explain one application each from Manufacturing and Service Industry.
Answer:
McKinsey and Company offers a big data use case in pharmaceutical manufacturing. A
biopharmaceutical company was using live, genetically engineered cells and monitoring
200 variables to track the purity of its manufacturing process for vaccines and blood
components. However, two batches of the same substance manufactured using
identical processes showed a yield variation from 50 to 100 percent. The inconsistency
in capacity and quality could attract regulatory attention.
The project team segmented its manufacturing processes into clusters of activity. Using
big data analytics the team assessed process interdependencies and identified nine
parameters that had a direct impact on vaccine yield. By modifying target processes the
company was able to increase vaccine production by 50 percent resulting in savings
between $5 and $10 million annually.
Tata Consultancy Services cites the case of a $2 billion company that generates most
of its revenue by manufacturing products to order.
Using big data analytics, this company was able to analyze the behavior of repeat customers, which is critical to understanding how to deliver goods in a timely and profitable manner.
Much of the analysis centered on making sure strong contracts were in place. The company was also able to shift to lean manufacturing and to determine which products were viable and which ones needed to be scrapped.
Intel has been harnessing big data for its processor manufacturing for some time. The
chipmaker has to test every chip that comes off its production line. That normally means
running each chip through 19,000 tests.
Using big data for predictive analytics Intel was able to significantly reduce the number
of tests required for quality assurance. Starting at the wafer level, Intel analyzed data
from the manufacturing process to cut down test time and focus on specific tests.
The result was a savings of $3 million in manufacturing costs for a single line of Intel
Core processors. By expanding big data use in its chip manufacturing, the company
expects to save an additional $30 million.
In the communications, media and entertainment industry, consumers expect rich media on demand, in different formats and on a variety of devices, which creates its own big data challenges and opportunities.
A case in point is the Wimbledon Championships, which leverages big data to deliver detailed sentiment analysis on the tennis matches to TV, mobile, and web users in real time.
Spotify, an on-demand music service, uses Hadoop big data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.
Answer:
HDFS
HDFS is the storage system of the Hadoop framework. It is a distributed file system that can conveniently run on commodity hardware to process unstructured data. Because it is built to run on commodity hardware, HDFS is designed to be highly fault-tolerant.
The same data is stored in multiple locations and in the event of one storage location
failing to provide the required data, the same data can be easily fetched from another
location. It owes its existence to the Apache Nutch project but today is a top level
Apache Hadoop project.
HDFS is a major constituent of Hadoop along with Hadoop YARN, Hadoop MapReduce
and Hadoop Common.
HDFS is a highly scalable and reliable storage system for the big data platform Hadoop. Working closely with Hadoop YARN for data processing and data analytics, it improves the data management layer of the Hadoop cluster, making it efficient enough to process big data concurrently. HDFS also works in close coordination with HBase. Some of the highlights that make this technology special:
• Scaling out – HDFS scales out (adding more nodes) rather than scaling up, and can do so without downtime.
TOOLS
MongoDB
How it Works:
MongoDB stores data using a flexible document data model that is similar to JSON.
Documents contain one or more fields, including arrays, binary data and sub-
documents. Fields can vary from document to document.
MongoDB can be used as a file system with load balancing and data replication
features over multiple machines for storing files.
Key features of MongoDB include:
1. Ad hoc queries
2. Indexing
3. Replication
4. Load balancing
5. Aggregation
6. Server-side JavaScript execution
7. Capped collections
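To make the document model concrete, here is a minimal Python sketch using the PyMongo driver; the connection string, the database name "shop", and the field names are assumptions for illustration, not part of the original text.

from pymongo import MongoClient  # pip install pymongo

# Connect to a (hypothetical) local MongoDB server.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents are JSON-like, and fields can vary from document to document.
db.customers.insert_one({"name": "Jane", "city": "Pune", "tags": ["repeat", "premium"]})
db.customers.insert_one({"name": "Ravi", "city": "Mumbai"})  # no "tags" field here

# An index plus an ad hoc query over the flexible documents.
db.customers.create_index("city")
for doc in db.customers.find({"city": "Pune"}):
    print(doc["name"])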
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing
data summarization, query, and analysis. Hive gives an SQL-like interface to query data
stored in various databases and file systems that integrate with Hadoop.
How it Works:
Hive has three main functions: data summarization, query, and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 file system.
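As a rough sketch of this SQL-like interface, the snippet below submits a HiveQL query from Python through the PyHive client; the host, port, and the "sales" table with its region and amount columns are assumptions for illustration. Hive itself translates the query into MapReduce jobs behind the scenes.

from pyhive import hive  # pip install 'pyhive[hive]'

# Connect to a (hypothetical) HiveServer2 instance.
conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# A HiveQL query over data stored in HDFS (assumed table "sales").
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")

for region, total in cursor.fetchall():
    print(region, total)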
Q5. What is Predictive analytics? Explain with suitable example
Answer:
Predictive analytics refers to using historical data, machine learning, and artificial
intelligence to predict what will happen in the future. This historical data is fed into a
mathematical model that considers key trends and patterns in the data. The model is
then applied to current data to predict what will happen next.
Using the information from predictive analytics can help companies—and business
applications—suggest actions that can affect positive operational changes. Analysts can
use predictive analytics to foresee if a change will help them reduce risks, improve
operations, and/or increase revenue. At its heart, predictive analytics answers the
question, “What is most likely to happen based on my current data, and what can I do to
change that outcome?”
For many companies, predictive analytics is nothing new. But it is increasingly used by
various industries to improve everyday business operations and achieve a competitive
differentiation.
In practice, predictive analytics can take a number of different forms. Take these
scenarios for example.
Consider a yoga studio that has implemented a predictive analytics model. The system
may identify that ‘Jane’ will most likely not renew her membership and suggest an
incentive that is likely to get her to renew based on historical data. The next time Jane
comes into the studio, the system will prompt an alert to the membership relations staff
to offer her an incentive or talk with her about continuing her membership. In this
example, predictive analytics can be used in real time to remedy customer churn before
it takes place.
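As a rough sketch of how such a churn model might be built, the snippet below trains a simple classifier with scikit-learn on historical membership data; the CSV file name and the feature columns are hypothetical placeholders, not part of the original example.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data: one row per member, with a "renewed" label (1 = renewed, 0 = churned).
data = pd.read_csv("membership_history.csv")
features = data[["visits_per_month", "months_as_member", "classes_booked"]]
target = data["renewed"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
model = LogisticRegression().fit(X_train, y_train)

# Score a current member ("Jane") on her recent activity to estimate her renewal probability.
jane = pd.DataFrame([[2, 14, 1]], columns=features.columns)
print("Renewal probability:", model.predict_proba(jane)[0][1])

A low predicted probability would trigger the kind of alert and incentive described above.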
Send marketing campaigns to customers who are most likely to buy:
only has a $5,000 budget for an upsell marketing campaign and you have three million
customers, you obviously can’t extend a 10 percent discount to each customer.
Predictive analytics and business intelligence can help forecast the customers who
have the highest probability of buying your product, then send the coupon to only those
people to optimize revenue.
Improve customer service by planning appropriately:
Businesses can better predict demand using advanced analytics and business
intelligence. For example, consider a hotel chain that wants to predict how many
customers will stay in a certain location this weekend so they can ensure they have
enough staff and resources to handle demand.
Q6. Case Study - ABC Ltd. is a maker of boutique leather articles and has been in business for the last 20 years. It implemented a CRM system 5 years ago and has transferred into it all the sales and customer data since inception. As a big data consultant, chart out the data -> information lifecycle for the organization and suggest a suitable advertisement mix based on suitable assumptions (stating them).
Q7. List the components of Hadoop, explain its use.
Answer:
HDFS Components:
1. NameNode
2. DataNode
NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset; it stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. The file system namespace it manages consists of files and directories.
DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS and performs read and write operations as requested by clients.
Each block replica on a DataNode is represented by two files on the local file system: one for the data itself and one for the block's metadata, which includes checksums.
At start-up, each DataNode connects to its NameNode and performs a handshake, during which the namespace ID and the DataNode's software version are verified. If a mismatch is found, the DataNode shuts down automatically.
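For illustration only, here is a minimal sketch of writing and reading a file through the NameNode using the third-party Python hdfs (WebHDFS) client; the NameNode URL, user, and path are assumptions.

from hdfs import InsecureClient  # pip install hdfs

# The client talks to the NameNode; the NameNode decides which DataNodes hold the blocks.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a small file into HDFS (HDFS replicates its blocks across DataNodes).
with client.write("/user/hadoop/demo.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello hdfs\n")

# Read it back and list the directory.
with client.read("/user/hadoop/demo.txt", encoding="utf-8") as reader:
    print(reader.read())
print(client.list("/user/hadoop"))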
Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that
process the vast amount of structured and unstructured data stored in the Hadoop
Distributed File system.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Hadoop MapReduce
Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two
phases:
• Map phase
• Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: a map function and a reduce function.
The map function takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
The reduce function takes the output of the map phase as its input and combines those data tuples based on the key, modifying the value of the key accordingly.
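The classic word-count example illustrates the two functions. The sketch below is a simplified, single-machine Python version written in the Hadoop Streaming spirit (the mapper reads lines and emits key/value pairs, the reducer aggregates the values for each key); on a real cluster, Hadoop would run many mappers and reducers in parallel and perform the shuffle between them.

import sys
from itertools import groupby

def mapper(lines):
    # Map: break each line into (word, 1) key/value pairs.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: combine all values that share the same key.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Usage (local simulation): cat input.txt | python wordcount.py
    for word, count in reducer(mapper(sys.stdin)):
        print(word, count)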
Features of MapReduce
• Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
Answer:
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data
across highly scalable Hadoop clusters.
Answer:
A subroutine is a section of code that can be re-used several times in the same program. It is separate from the main code and has to be 'called'. In a game of Mario you could imagine a subroutine as the part of the level that is reached by travelling down a pipe: it is away from the main level/program, and once you have gone through it you return to the main program again (you can also revisit it several times).
Subroutines are designed to be repeated, and reusing them keeps programs shorter, easier to test, and easier to maintain.
There are two types of subroutines, procedures and functions. A procedure just
executes commands, such as printing something a certain number of times. A function
produces information by receiving data from the main program and returning a value
back to the main program. For example, a function could take the radius of a sphere
from the main program and then calculate a sphere’s area and return the value of the
area back to the main program. A function generally requires parameters to work –
these are the values to be transferred from the main program to the subroutine.
In short, subroutines are of two types: one that returns a value (a function) and one that does not return a value (a procedure).
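A short Python sketch of the two kinds of subroutine described above (the names and the sphere example are illustrative):

import math

def greet(times):
    # Procedure: just executes commands, returns no value.
    for _ in range(times):
        print("Hello!")

def sphere_surface_area(radius):
    # Function: receives data (the radius) and returns a value to the caller.
    return 4 * math.pi * radius ** 2

greet(3)                         # calling the procedure
area = sphere_surface_area(2.0)  # calling the function and using its result
print("Surface area:", area)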
Answer:
In a computer program there are often sections of the program that we want to re-use or
repeat. Chunks of instructions can be given a name - they are
called functions and procedures.
Algorithms can be broken down into procedures or functions. This saves time by only
having to execute (call) the function when it is required, instead of having to type out the
whole instruction set.
Programming languages have a set of pre-defined (also known as built-in) functions and
procedures. If the programmer makes their own ones, they are custom-made or user-
defined.
Procedures or functions?
In a program for drawing shapes, the program could ask the user what shape to draw.
The instructions for drawing a square could be captured in a procedure. The algorithm
for this action could be a set of tasks, such as these:
Repeat the next two steps four times:
• Draw a line of length n.
• Turn right by 90 degrees.
If this were a computer program, this set of instructions could be given the name
'square' and this sequence would be executed by running (calling) that procedure.
A function could calculate the VAT due on goods sold. The algorithm for this function could be: take the total price of the sale, calculate the VAT due on it (for example 20% of the price), and return that value.
If this were a computer program, this set of instructions could be given a name such as 'calculate_vat'. The function would then return the value as VAT, which is then used elsewhere.
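Expressed in Python, the two algorithms above might look like the sketch below; the turtle-graphics drawing and the 20% VAT rate are assumptions made only to keep the example runnable.

import turtle

def square(n):
    # Procedure 'square': repeat the two steps four times.
    for _ in range(4):
        turtle.forward(n)  # draw a line of length n
        turtle.right(90)   # turn right by 90 degrees

def calculate_vat(price, rate=0.20):
    # Function 'calculate_vat': takes the sale price and returns the VAT due.
    return price * rate

square(100)                 # call the procedure by name
vat = calculate_vat(250.0)  # call the function and use the returned value
print("VAT due:", vat)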
Q12. What are the various types of decision making models
in data analytics and how are they related to the MIS, DSS and Expert
Systems?
Answer:
There are 4 types of analytics. Here, we start with the simplest one and move towards the more sophisticated ones. As it happens, the more complex an analysis is, the more value it brings.
Descriptive analytics
Descriptive analytics summarizes raw historical data to answer the question "What happened?". For example, a manufacturer was able to decide on focus product categories based on the analysis of revenue, monthly revenue per product group, income by product group, and the total quantity of metal parts produced per month.
Descriptive analytics juggles raw data from multiple data sources to give valuable
insights into the past. However, these findings simply signal that something is wrong or
right, without explaining why. For this reason, highly data-driven companies do not
content themselves with descriptive analytics only, and prefer combining it with other
types of data analytics.
Diagnostic analytics
At this stage, historical data can be measured against other data to answer the question
of why something happened. Thanks to diagnostic analytics, there is a possibility to drill
down, to find out dependencies and to identify patterns. Companies go for diagnostic
analytics, as it gives in-depth insights into a particular problem. At the same time, a
company should have detailed information at their disposal, otherwise data collection
may turn out to be individual for every issue and time-consuming.
Consider examples from different industries: a healthcare provider compares patients' response to a promotional campaign in different regions, and a retailer drills sales down into subcategories. In another BI project in the healthcare industry, customer segmentation coupled with several filters (such as diagnoses and prescribed medications) allowed measuring the risk of hospitalization.
Predictive analytics
Predictive analytics tells what is likely to happen. It uses the findings of descriptive and
diagnostic analytics to detect tendencies, clusters and exceptions, and to predict future
trends, which makes it a valuable tool for forecasting. Despite numerous advantages
that predictive analytics brings, it is essential to understand that forecasting is just an
estimate, the accuracy of which highly depends on data quality and stability of the
situation, so it requires a careful treatment and continuous optimization.
Thanks to predictive analytics and the proactive approach it enables, a telecom
company, for instance, can identify the subscribers who are most likely to reduce their
spend, and trigger targeted marketing activities to remediate; a management team can
weigh the risks of investing in their company’s expansion based on cash flow analysis
and forecasting. One of our case studies describes how advanced data
analytics allowed a leading FMCG company to predict what they could expect after
changing brand positioning.
Prescriptive analytics
Prescriptive analytics goes a step further: it uses the outputs of the other three types to recommend what action to take. In the context of decision making, these levels of analytics are supported by information systems such as MIS, DSS, expert systems and executive information systems (EIS):
• DSS will often include modelling tools in them, where various alternative
scenarios can be modelled and compared.
• Typically will also support tactical level management, but sometimes are used at
other levels
• EIS support a range of decision making, but more often than not, this tends to be
unstructured
• EIS support the executive level of management, often used to formulate high level
strategic decisions impacting on the direction of the organization
• These systems will usually have the ability to extract summary data from internal
systems, along with external data that provides intelligence on the environment of the
organization
• Generally these systems work by providing a user friendly interface into other systems,
both internal and external to the organization
Related questions:
1. difference between dss, mis and expert systems.
2. What are the decision making models scenario modelling, Goal seek and Data Table.
Answer: MIS / Data Table – gives only the report and tells people who can provide the information.
DSS – data is presented and some support for decision making is provided (e.g. scenario modelling, Goal Seek).
Expert System – performs the calculation or reasoning and returns a result.
Q13. Explain with a suitable example the various tasks for a business
analyst and the required skills for data analysis in a business
environment.
Answer:
Business analysts can hone their skills through executive education programs and eventually earn the Certified Business Analysis Professional (CBAP) certification from the International Institute of Business Analysis.
When writing your resume, list relevant skills. Don’t assume hiring supervisors know you
have what they want.
When you find a job that appeals to you, read the job description thoroughly and
research the company. That way, you will know what to highlight in your cover letter,
based on what the business values.
The interviewer will want you to elaborate on the skills you bring to the table, so choose three or four that relate to the position and be ready to share a few stories that showcase your qualifications. It may also help to review lists of skills by job title and by type of skill.
Core Skills
A number of skills are beneficial for business analysts, but there are a handful of
abilities that are absolutely necessary.
Communicating
Business analysts spend a significant amount of time interacting with clients, users,
management, and developers. Therefore, being an effective communicator is key. You
will be expected to facilitate work meetings, ask the right questions, and actively listen
to your colleagues to take in new information and build relationships. A project's
success may revolve around your ability to communicate things like project
requirements, changes, and testing results. In your interview, focus on your ability to
communicate proficiently in person, on conference calls, in meetings both digitally and
otherwise, and through email. Consider having an example ready that demonstrates
how being an effective communicator has served former employers well.
Problem-Solving
Every project you work on is, at its core, developing a solution to a problem. Business
analysts work to build a shared understanding of problems, outline the parameters of
the project, and determine potential solutions.
Negotiating
Business analysts sit between stakeholders with competing priorities – clients, users, management, and developers – and must negotiate to reach solutions that are acceptable to all parties while keeping the project on track.
Critical Thinking
Business analysts must assess multiple choices before leading the team toward a
solution. Effectively doing so requires a critical review of data, documentation, user
input surveys, and workflow. They ask probing questions until every issue is evaluated
in its entirety to determine the best conflict resolution.
General Skills
Besides the core skills, employers also will be looking for more general skills and
attributes:
Computer Skills: As a business analyst, you’ll need to be able to use many types of
software, from the popular Microsoft Office Suite to less common packages like
SharePoint, Visio, and Software Design Tools. You will need to stay abreast of new
developments in IT as well.
Analytical Skills: Of course, a business analyst needs strong analytical skills and tools, from efficiently designing and implementing processes to forecasting and gap analysis.
Q14. Explain with suitable example the concept of Internal and
External Data sources for performing data analysis in the business
environment.
Answer:
Internal data is information generated from within the business, covering areas such as
operations, maintenance, personnel, and finance. External data comes from the market,
including customers and competitors. It’s things like statistics from surveys,
questionnaires, research, and customer feedback.
Research has shown that business analysts consider data generated internally to be
more valuable. According to one survey, “About 65% of respondents rank internal data
as more important than data collected outside the company.”
Both kinds of data are helpful. Internal data helps you run your business and optimize
your operations. External data helps you better understand your customer base and the
competitive landscape. You need a clear view of both to have truly insightful business
intelligence.
Various types of data are very useful for business reports, and in business reports, you
will quickly come across things like revenue (money earned in a given period, usually a
year), turnover (people who left the organization in a given period), and many others.
There are a variety of data available when one is constructing a business report. We
may categorize data in the following manner:
Internal
Employee headcount
Employee demographics (e.g., sex, ethnicity, marital status)
Financials (e.g., revenue, profit, cost of goods sold, margin, operating ratio)
External
Market statistics from surveys and questionnaires
Competitor and industry research
Customer feedback
Internal and external business or organizational data come in two main categories:
qualitative and quantitative.
Qualitative data are data that are generally non-numeric and require context, time, or
variance to have meaning or utility.
Examples: taste, energy, sentiments, emotions
Quantitative data are data that are numeric and therefore largely easier to understand.
Both types of data are useful for business report writing. Usually a report will feature as
much “hard” quantitative data as possible, typically in the form of earnings or revenue,
headcount, and other numerical data available. Most organizations keep a variety of
internal quantitative data.
Qualitative data, such as stories, case studies, or narratives about processes or events,
are also very useful, and provide context. We may consider that a good report will have
both types of data, and a good report writer will use both types of data to build a picture
of information for their readers.
Q15. What is granularity (Explain it along the lines of roll-up and drill-
down) of data and how does it affect the data → Information cycle?
Answer:
When designing the data warehouse, one of the most basic concepts is that of storing
data at the lowest level of granularity. By storing data at the lowest level of granularity,
the data can be reshaped to meet different needs – of the finance department, of the
marketing department, of the sales department, and so forth. Granular data can be
summarized, aggregated, broken into many different subsets and so forth. There are
indeed many good reasons for storing data in the data warehouse at the lowest level of
granularity.
And why does data need to be broken into low levels of granularity? The answer is that
most data warehouse data comes from transactions. And typically, transactions contain
data that is very denormalized. Denormalized data is at a high level of granularity.
Let’s take a look at a typical transaction.
All of the data that has been brought to bear on the transaction is natural and normal.
Naturally enough, the data in the transaction focuses on the transaction itself. At the
same time, the data in the transaction is very denormalized.
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The Roll-up operation can be
performed in 2 ways
1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of grouping things
based on their order or level.
Consider the following example:
• The cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they become 2000 after the roll-up.
• In this aggregation, the data moves up the location hierarchy from city to country.
• In the roll-up process, one or more dimensions may also be removed; in this example, the Quarter dimension is removed.
2) Drill-down
In drill-down data is fragmented into smaller parts. It is the opposite of the rollup
process. It can be done via
• Moving down the concept hierarchy
• Increasing a dimension
Consider the same example:
• Quarter Q1 is drilled down into the months January, February, and March, and the corresponding sales are also recorded.
• In this example, the month dimension is added.
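The same roll-up and drill-down can be sketched with pandas. The city totals (440 and 1560) match the example above; the monthly split of those totals is assumed for illustration.

import pandas as pd

# Fact data at a low level of granularity: one row per city per month.
sales = pd.DataFrame({
    "country": ["USA"] * 6,
    "city":    ["New Jersey", "Los Angeles"] * 3,
    "quarter": ["Q1"] * 6,
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "amount":  [150, 500, 140, 520, 150, 540],
})

# Roll-up: climb the location hierarchy (city -> country) and drop the time dimensions.
print(sales.groupby("country")["amount"].sum())             # USA: 2000

# Drill-down: move down the time hierarchy from quarter to month.
print(sales.groupby(["quarter", "month"])["amount"].sum())  # Q1 split into Jan, Feb, Mar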
Granularity means the level of detail of your data within the data structure. In a
typical Data Warehouse one might find very detailed data (such as
seconds, single product, one specific attribute) and aggregated data (such
as total number of, monthly orders, all products).
The finer the granularity of a fact table, the more data (or, in a spreadsheet, rows) you will have. The granularity of your data also determines what kind of information you can get out of the stored data: to aggregate data, it must be stored at a sufficiently fine granularity. A weekly report, for example, can only be generated when time-related data is stored at least at "week" level, and preferably at "day" level.
Q16. Difference between a transactional system and a big data system.
Answer:
A transactional system (TS) handles day-to-day operational transactions on structured data and typically caters to B2C interactions, whereas a big data system (BdS) processes large volumes of varied, fast-arriving data for analysis and typically caters to B2B and enterprise-wide needs.
Answer:
The Cloud enables the "as-a-Service" pattern by abstracting away the challenges and complexity through a scalable and elastic self-service platform. The big data requirement is similar: the distributed processing of massive data is abstracted from the end users.
Improved analysis
With the advancement of Cloud technology, big data analysis has improved, producing better results. Hence, companies prefer to perform big data analysis in the Cloud. Moreover, the Cloud helps to integrate data from numerous sources.
Simplified Infrastructure
Big data analysis places a tremendous strain on infrastructure, as the data comes in large volumes, at varying speeds and in varying types, which traditional infrastructure usually cannot keep up with. Because Cloud computing provides flexible infrastructure that can be scaled according to the needs of the moment, such workloads are easy to manage.
Both big data and Cloud technology deliver value to organizations by reducing the cost of ownership. The pay-per-use model of the Cloud turns CAPEX into OPEX, while Apache's open-source model cuts the licensing cost of big data software that would otherwise cost millions to build or buy. The Cloud enables customers to do big data processing without large-scale big data resources of their own. Hence, both big data and Cloud technology drive costs down for the enterprise and bring value to it.
Data security and privacy are two major concerns when dealing with enterprise data. When an application is hosted on a Cloud platform, its open environment and limited user control make security a primary concern. On the other hand, being open source, big data solutions like Hadoop use a lot of third-party services and infrastructure. Hence, system integrators nowadays offer Private Cloud solutions that are elastic and scalable and that also leverage scalable distributed processing.
Besides that, Cloud data is stored and processed in a central location, commonly known as the Cloud storage server. Along with this, the service provider and the customer sign a service level agreement (SLA) to build trust between them, and if required the provider also applies advanced levels of security control.
Q18. What are IOT Devices and how they are related to Big Data and
Cloud Technologies?
Answer:
In order to understand the relationship between big data, IoT and cloud computing, we
might need to rearrange the order. The interconnection that would then be established
would paint the bigger picture for you to understand.
First off, IoT is an ecosystem of interconnected devices. Basically, it is a network of devices, each with a specific IP address, capable of generating, transmitting and receiving data without human intervention. IoT is the abbreviation of "Internet of Things". This makes one wonder, "Where does all this data get processed?"
This is where big data steps in. Big data is the term coined for data sets so humongous that traditional tools cannot handle them – such as the trillions of data units generated by IoT devices – and for the technologies that process them. Contrary to the common misconception, big data is not some sort of database, but a software ecosystem. This leads to the next question: "What about the infrastructure and the expenses involved in setting up such massive data-processing machines?"
The solution to that is cloud computing. With cloud computing, you are just a click away from accessing your data from anywhere in the world, within a second or even less. This not only saves the space needed for infrastructure but also cuts down on the expense of maintaining it.
And this is how IoT, big data and cloud computing are connected.