Big Data and Its Impact On Data Warehousing
The “big data” movement has taken the information technology world by storm. Fueled by open source projects emanating from the Apache Software Foundation, the big data movement offers a cost-effective way for organizations to process and store large volumes of any type of data: structured, semi-structured and unstructured.

By Wayne Eckerson
Despite Problems, Big Data Makes it Huge

The hype and reality of the big data movement are reaching a crescendo. It’s clear that Hadoop and NoSQL technologies are gaining a foothold in corporate computing environments. But big data software and computing paradigms are still in their infancy and must clear many hurdles before organizations trust them to handle serious data and application workloads.

Most leading big data vendors now count hundreds of customers. Big data is no longer the province of Internet and media companies with large Web properties; companies in nearly every industry are jumping on the big data bandwagon. These include energy, pharmaceuticals, utilities, telecommunications, insurance, retail, financial services and government.

For example, Vestas Wind Systems, a leading wind turbine maker, uses BigInsights to model larger volumes of weather data so it can pinpoint the optimal placement of wind turbines. And a financial services customer uses Hadoop to improve the accuracy of its fraud models by addressing much larger volumes of transaction data.

Big Data Drivers

Hadoop clearly fills an unmet need in many organizations. Given its open source roots, Hadoop provides a more cost-effective way to analyze large volumes of data compared with traditional relational database management systems (RDBMSes). It’s also better suited to processing unstructured data, such as audio, video or images, and semi-structured data, such as Web server log data for tracking customer behavior on social media sites. For years, leading-edge companies have struggled in vain to figure out an optimal way to analyze this type of data in [...]
[...] companies are hiring several people with related skills to cobble together one complete “data scientist.”

[...] driver, and a new JavaScript framework for MapReduce to the Apache Foundation.
[...] applications run on Hadoop, including traditional products centering on business intelligence, extract, transform and load (ETL) and DBMSes. And commercial vendors benefit if their existing tools have a new source of data to connect to and plumb. It’s a big new market whose sweet-tasting honey attracts a hive full of bees.

Why invest in proprietary tools? But customers are already asking whether data warehouses and BI tools will eventually be folded into Hadoop environments or the reverse. Why spend millions of dollars on a new analytical RDBMS if you can do that processing without paying a dime in license costs using Hadoop? Why spend hundreds of thousands of dollars on data integration tools if your data scientists can turn Hadoop into a huge data staging and transformation layer? Why invest in traditional BI and reporting tools if your power users can exploit Hadoop using freely available programs such as Java, Python, Pig, Hive or HBase?

The Future is Cloudy

Right now, it’s too early to divine the future of the big data movement and predict winners and losers. It’s possible that in the future all data management and analysis will run entirely on open source platforms and tools. But it’s just as likely that commercial vendors will co-opt (or outright buy) open source products and functionality and use them as pipelines to magnify sales of their commercial products.

More than likely, we’ll get a mélange of open source and commercial capabilities. After all, 30 years after the mainframe revolution, mainframes are still a mainstay at many corporations. In IT, nothing ever dies; it just finds its niche in an evolutionary ecosystem.
Two Markets for Big Data: Comparing Value Propositions

There are two types of big data in the market today. There is open source software, centered largely on Hadoop, which eliminates up-front licensing costs for managing and processing large volumes of data. And then there are new analytical engines, including appliances and column stores, which provide significantly higher price-performance than general-purpose relational databases. Both sets of big data software deliver higher return on investment than previous generations of data management technology, but in vastly different ways.

Hadoop

Free software. Hadoop is an open source distributed file system available through the Apache Software Foundation that is capable of storing and processing large volumes of data in parallel across a grid of commodity servers. Hadoop emanated from large Internet providers, such as Google and Yahoo, which needed a cost-effective way to build search indexes. They knew that traditional relational databases would be prohibitively expensive and technically unwieldy, so they came up with a low-cost alternative they built themselves and eventually gave to the Apache Software Foundation so others could benefit from their innovations.

Today many companies are implementing Hadoop software from Apache as well as third-party providers, such as IBM, Cloudera, Hortonworks and EMC. Developers see Hadoop as a cost-effective way to get their arms around large volumes of data that they’ve never been able to do much with before.

Many companies use Hadoop to store, process and analyze large volumes of Web server log data so they can get a better feel for the browsing
and shopping behavior of their online customers. Before, companies outsourced the analysis of their clickstream data or simply let it fall on the floor since they didn’t have a way to process it in a timely and cost-effective way. Companies are also turning to Hadoop to process more structured data to improve analytical models.

Data agnostic. Besides being free to implement, the other major advantage of big data software is that it is data agnostic. It can handle any type of data. Unlike a data warehouse or traditional relational database, Hadoop doesn’t require administrators to model or transform data before they load it. With Hadoop, you don’t define a structure for the data; you simply load and go. This significantly reduces the cost of preparing data for analysis compared with what happens in a data warehouse. Most experts assert that 60% to 80% of the cost of building a data warehouse, which can run into the tens of millions of dollars, involves extracting, transforming and loading data. Hadoop virtually eliminates this cost.

As a result, many companies are using Hadoop as a general-purpose staging area and archive for all their data. So a telecommunications company can store 12 months of call detail records instead of aggregating that data in the data warehouse and rolling the details to offline storage. With Hadoop, they can keep all their data online and eliminate the cost of data archival systems. They can also let power users query Hadoop data directly if they want to access the raw data or can’t wait for the aggregates to be loaded into the data warehouse.

Hidden costs. Of course, nothing in technology is ever free. When it comes to processing data, you either pay the piper up front, as in the data warehousing world, or at query time, as in the Hadoop world. Before querying Hadoop data, a developer needs to understand the structure of the data and all of its anomalies. With a clean, well-understood, homogeneous data set, this is not difficult. But most corporate data doesn’t fit that description. So a Hadoop developer ends up playing the role of a data warehousing developer at query time, interrogating the data and making sure its format and content match their expectations. Querying Hadoop today is a “buyer beware” environment.

Moreover, to run big data software, you still need to purchase, install and manage commodity servers (unless you run your big data environment in the cloud, say through Amazon Web Services). While each server may not cost a lot, the price adds up.

But what’s more costly is the expertise and software required to administer Hadoop and manage grids of commodity servers. Hadoop is still bleeding-edge technology and few people have the skills or experience to run it efficiently in a production environment. These folks are hard to find, and they don’t come cheap. The Apache Software Foundation admits that Hadoop’s latest release is equivalent to version 1.0 software. So even the experts have a lot to learn, since the technology is evolving at a rapid pace. But nonetheless, Hadoop and its NoSQL brethren have opened up a vast new frontier for organizations to profit from their data.

Analytical Platforms

The other type of big data predates Hadoop and NoSQL variants by several years. This version of big data is less a “movement” than an extension of existing relational database technology optimized for query processing. These analytical platforms span a range of technology, from appliances and columnar databases to shared-nothing, massively parallel processing (MPP) databases. The common thread among them is that most are read-only environments that deliver exceptional price-performance compared with general-purpose relational databases originally designed to run transaction processing applications.

Teradata laid the groundwork for the analytical platform market when it launched the first analytical appliance in the early 1980s. Sybase was also an early forerunner, shipping the first columnar database in the mid-1990s. IBM Netezza kicked the current market into high gear in 2003 when it unveiled a popular analytical appliance, and was soon followed by dozens of startups. Recognizing the opportunity, all the big names in software and hardware (Oracle, IBM, Hewlett-Packard and SAP) subsequently jumped into the market, either by building or buying technology, to provide purpose-built analytical systems to new and existing customers.

Although the price tag of these systems often exceeds $1 million, customers find that the exceptional price-performance delivers significant business value, in both tangible and intangible form. For example, Virginia-based XO Communications recovered $3 million in lost revenue from a new revenue assurance application it built on an analytical appliance, even before it had paid for the system. It subsequently built or migrated a dozen applications to run on the new purpose-built system, testifying to its value.

Kelley Blue Book in Irvine, Calif., purchased an analytical appliance to run its data warehouse, which was experiencing performance issues, giving the provider of online automobile valuations a competitive edge. For instance, the new system reduces the time needed to process hundreds of millions of automobile valuations from one week to one day. Kelley Blue Book now uses the system to analyze its Web advertising business and deliver dynamic pricing for its Web ads.

Challenges. Given the up-front costs of analytical platforms, organizations usually undertake a thorough evaluation of these systems before jumping aboard.

First, a company must determine whether an analytical platform outperforms its existing data warehouse database to a degree that it warrants migration and retraining costs. This requires a proof of concept in which the customer tests the systems in its own data center using its own data across a range of queries.

The good news is that the new analytical platforms usually deliver jaw-dropping performance for most queries tested. In fact, many customers don’t believe the initial results and rerun the queries to make sure that the results are valid.

Second, companies must choose from more than two dozen analytical platforms on the market today. For instance, they must decide whether to purchase an appliance or a software-only system, a columnar database or an MPP database, or an on-premises system or a Web service. Evaluating these options takes time, and many companies create a short list that doesn’t always contain comparable products.

Finally, companies must decide what role an analytical platform will play in their data warehousing architectures. Should it serve as the data warehousing platform? If so, does it handle multiple workloads easily or is it a one-trick pony? If the latter, what applications and data sets make sense to offload to the new system? How do you rationalize having two data warehousing environments instead of one?

Today we find that companies that have tapped out their SQL Server or MySQL data warehouses often replace them with analytical platforms to get better performance. But companies that have implemented an enterprise data warehouse on Oracle, Teradata or IBM often find that analytical platforms are best used when they sit alongside the data warehouse so they can handle new applications or existing analytical workloads offloaded to them. This architecture helps organizations avoid a costly upgrade to a data warehousing platform, which might easily exceed the cost of purchasing an analytical platform.

The big data movement consists of two separate but interrelated markets: one for Hadoop and open source data management software and the other for purpose-built SQL databases optimized for query processing. Hadoop avoids most of the up-front licensing and loading costs endemic to traditional relational database systems. But since the technology is still immature, there are hidden costs that have thus far kept many Hadoop implementations experimental in nature. On the other hand, analytical platforms are a more proven technology but impose significant up-front licensing fees and potential migration costs. Companies wading into the waters of the big data stream need to carefully evaluate their options.
Chapter 3: Categorizing Big Data Processing Systems

Faced with an expanding analytical ecosystem, BI managers need to make many technology choices. Perhaps the most difficult involves selecting a data processing system to power a variety of analytical applications (see Chapter 4, “The New Analytical Ecosystem: Making Way for Big Data”).

In the past, these types of decisions revolved around selecting one of a handful of leading relational database management systems to power a data warehouse or data mart. Often, the choice boiled down to internal politics as much as technical functionality.

Today the options aren’t as straightforward, although politics may still play a role. Instead of selecting a single data management product, BI managers may need to select multiple platforms to outfit an expanding analytical ecosystem. And rather than evaluating four or five alternatives for each platform, the BI manager is faced with dozens of viable options in each category. The once lazy database market is now a beehive of activity.

Staying abreast of all the new products, partnerships and technological advances is now a full-time job. Industry analysts who make a living sifting through products in emerging markets are needed now more than ever. Most analysts will tell you that the first step in selecting an analytical platform is to understand the broad categories of products in the marketplace, and then make finer distinctions from there (see Figure 1).

At a high level, there are four categories of analytical processing systems that are available today: transactional RDBMSes, analytical platforms, Hadoop distributions and NoSQL databases. The following describes those categories and can be used as a starting point when creating a short list of products during a product evaluation process:
Figure 1. Database/Platform Positioning

1. Transactional RDBM Systems

Transactional RDBMSes were originally designed to support transaction processing applications, although most have been retrofitted with various types of indexes, join paths and custom SQL bolt-ons to make them more palatable to analytical processing. There are two types of transactional RDBMSes: enterprise and departmental.

• Enterprise hubs. The traditional enterprise RDBMSes, such as those from IBM, Oracle and Sybase, are best suited as data warehousing hubs that feed a variety of downstream, end-user-facing systems, but don’t handle query traffic directly. Although retrofitted with analytical capabilities, these systems often hit performance and scalability walls when used for query processing along with other workloads and are expensive to upgrade and replace. Thus, many customers now use these “gray-bearded” data warehousing systems [...]
[...] to SAP, which is betting its business on HANA, an in-memory database for transactional and analytical processing, and is evangelizing the need for in-memory systems. Another contender in this space is Kognitio. Many RDBMSes are better exploiting memory for caching results and processing queries.

• Columnar. Columnar databases, such as SAP’s Sybase IQ, Hewlett-Packard’s Vertica, ParAccel, Infobright, Exasol, Calpont and Sand, offer fast performance for many types of queries because of the way these systems store and compress data: by columns instead of rows. Column storage and processing is fast becoming an RDBMS feature rather than a distinct subcategory of products.

3. Hadoop Distributions

Hadoop is an open source software project run within The Apache Software Foundation for processing data-intensive applications in a distributed environment with built-in parallelism and failover. The most important parts of Hadoop are the Hadoop Distributed File System (HDFS), which stores data in files on a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on HDFS. The open source community is building numerous additional components to turn Hadoop into an enterprise-caliber data processing environment. The collection of these components is called a Hadoop distribution. Leading providers of Hadoop distributions include Cloudera, IBM, EMC, Amazon, Hortonworks and MapR.

Today, in most customer installations, Hadoop serves as a staging area and online archive for unstructured and semi-structured data, as well as an analytical sandbox for data scientists who query Hadoop files directly before the data is aggregated or loaded into the data warehouse. But this could change. Hadoop will play an increasingly important role in the analytical ecosystem at most companies, either working in concert with an enterprise DW or assuming most duties of one.

4. NoSQL Databases

NoSQL, shorthand for “not only SQL,” is the name given to a broad set of databases whose only common thread is that they don’t require SQL to process data, although some support both SQL and non-SQL forms of data processing. There are many types of NoSQL databases, and the list grows longer every month. These specialized systems are built using either proprietary or open source components or a mix of both. In most cases, they are designed to overcome the limitations of traditional RDBMSes [...]
The New Analytical Ecosystem: Making Way for Big Data

The big data revolution has arrived, and it’s transforming long-established data warehousing architectures into vibrant, multifaceted analytical ecosystems.

Gone are the days when all analytical processing first passes through a data warehouse or data mart (or their less sanctified spreadmart or data shadow system brethren). Now data winds its way to users through a plethora of corporate data structures, each tailored to the type of content it contains and the type of user who wants to consume it.

Figure 1 depicts a reference architecture for the new analytical ecosystem that has the fingerprints of big data all over it. The objects in blue represent the traditional data warehousing environment, while those in pink represent new architectural elements made possible by big data technologies, namely Hadoop, NoSQL databases, high-performance analytical engines (analytical appliances, MPP databases, in-memory databases) and interactive, in-memory visualization tools.

Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that companies can’t cost-effectively store and process in SQL engines (e.g., call detail records in a telecommunications company). From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.

Today data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using SQL-based data integration and query tools.

Figure 1. The New Analytical Ecosystem (BI Architecture 2020)

Harmonizing Opposites

The big data revolution is not only about analyzing large volumes and new sources of data; it’s also about balancing data alignment and consistency with flexible, ad hoc exploration. As such, the new analytical ecosystem features both top-down and bottom-up data flows that meet all business requirements for reporting and analysis.

The top-down world. Here source data is processed, refined and stamped with a predefined data structure (typically a dimensional model) and then consumed by casual users using SQL-based reporting and analysis tools. In this domain, IT developers create data and semantic models so business users can get answers to known questions and executives can track performance of predefined metrics.
Design precedes access. The top-down world also takes great pains to align data along conformed dimensions and deliver clean, accurate data. The goal is to deliver a consistent view of the business entities so users can spend their time making decisions instead of arguing about the origins and validity of data artifacts.

The underworld. Creating a uniform view of the business from heterogeneous sets of data is not easy. It takes time, money and patience, often more than most departmental heads and business analysts are willing to tolerate. They often abandon the top-down world for the underworld of spreadmarts and data shadow systems. Using whatever tools are readily available and cheap, these data-hungry users create their own views of the business. Eventually, they spend more time collecting and integrating data than analyzing it, undermining their productivity and a consistent view of business information.

The bottom-up world. The new analytical ecosystem brings these prodigal data users back into the fold. It carves out space within the enterprise environment for true ad hoc exploration and promotes the rapid development of analytical applications using in-memory departmental tools. In a bottom-up environment, users can’t anticipate the questions they will ask on a daily or weekly basis or the data they’ll need to answer those questions. Often, the data they need doesn’t yet exist in the data warehouse.

The new analytical ecosystem supports three analytical sandboxes that enable power users to explore corporate and local data on their own terms: (1) Hadoop, (2) virtual partitions inside a data warehouse and (3) specialized analytical databases that offload data or analytical processing from the data warehouse or handle new untapped sources of data, such as Web server logs or machine data. The new environment also gives department heads the ability to create and use dashboards built with in-memory visualization tools that point both to a corporate data warehouse and other independent sources.

Combining top-down and bottom-up worlds is not easy. BI professionals need to assiduously guard data semantics while opening access to data. For their part, business users need to commit to adhering to corporate data standards in exchange for getting the [...]