Big Data and Its Impact on Data Warehousing

By Wayne Eckerson

The “big data” movement has taken the information technology world by storm. Fueled by open source projects emanating from the Apache Foundation, the big data movement offers a cost-effective way for organizations to process and store large volumes of any type of data: structured, semi-structured and unstructured.

1 Despite Problems, Big Data Makes it Huge
2 Two Markets for Big Data: Comparing Value Propositions
3 Categorizing Big Data Processing Systems
4 The New Analytical Ecosystem: Making Way for Big Data
chapter 1

Despite Problems, Big Data Makes it Huge

The hype and reality of the big data movement are reaching a crescendo. It’s clear that Hadoop and NoSQL technologies are gaining a foothold in corporate computing environments. But big data software and computing paradigms are still in their infancy and must clear many hurdles before organizations trust them to handle serious data and application workloads.

Most leading big data vendors now count hundreds of customers. Big data is no longer the province of Internet and media companies with large Web properties; companies in nearly every industry are jumping on the big data bandwagon. These include energy, pharmaceuticals, utilities, telecommunications, insurance, retail, financial services and government.

For example, Vestas Wind Systems, a leading wind turbine maker, uses BigInsights to model larger volumes of weather data so it can pinpoint the optimal placement of wind turbines. And a financial services customer uses Hadoop to improve the accuracy of its fraud models by addressing much larger volumes of transaction data.

Big Data Drivers

Hadoop clearly fills an unmet need in many organizations. Given its open source roots, Hadoop provides a more cost-effective way to analyze large volumes of data compared with traditional relational database management systems (RDBMSes). It’s also better suited to processing unstructured data, such as audio, video or images, and semi-structured data, such as Web server log data for tracking customer behavior on social media sites. For years, leading-edge companies have struggled to find an optimal way to analyze this type of data in traditional data warehousing environments, without much luck.

Finally, Hadoop is a load-and-go environment: Administrators can dump the data into Hadoop without having to convert it into a particular structure. Then users (or data scientists) can analyze the data using whatever tools they want, which today are typically programming languages such as Java, Python or Ruby. This type of data management paradigm appeals to application developers and analysts, who often feel straitjacketed by top-down, IT-driven architectures and SQL-based tool sets.

Speed Bumps

But Hadoop is not a data management panacea. It’s clearly at or near the apogee of its hype cycle right now, and its many warts will disillusion all but bleeding- and leading-edge adopters.

For starters, Hadoop is still wet behind the ears. The Apache Software Foundation just released the equivalent of version 1.0. So there are plenty of basic things missing from the environment, such as security, a metadata catalog, data quality, backups, and monitoring and control. Moreover, it’s a batch processing environment, not terribly efficient in the way it exploits a clustered environment. Hadoop knockoffs, like MapR, which embed proprietary technology underneath Hadoop application programming interfaces, claim up to five-fold faster performance on half as many nodes.

To run a Hadoop environment, you need to get software from a mishmash of Apache projects, with razzle-dazzle names like Flume, Sqoop, Oozie, Pig, Hive and ZooKeeper. These independent projects often contain competing functionality, have separate release schedules and aren’t always tightly integrated. And each project evolves rapidly. That’s why there is a healthy market for Hadoop distributions that package these components into a reasonable set of implementable software.

But the biggest complaint among big data advocates is the current lack of data scientists to build Hadoop applications. These wunderkinds combine a rare set of skills: statistics and math; data, process and domain knowledge; and computer programming. Unfortunately, developers have little data and domain experience and data experts don’t know how to program. So there is a severe shortage of talent. Some companies are hiring several people with related skills to cobble together one complete “data scientist.”

Evolution

One good thing about the big data movement is that it evolves fast. There are Apache projects to address most of the shortcomings of Hadoop. One promising project is Hive, which provides SQL-like access to Hadoop, although it’s stuck in a batch processing paradigm. Another is HBase, which overcomes Hadoop’s latency issues, but is designed for fast row-based reads and writes to support high-performance transactional applications. Both create table-like structures on top of Hadoop files.

Many commercial vendors have jumped into the fray, marrying proprietary technology with open source software to turn Hadoop into a more corporate-friendly compute environment. Vendors such as Zettaset, EMC Greenplum and Oracle have launched appliances that embed Hadoop with commercial software to offer customers the best of both worlds. Many BI and data integration vendors, such as Talend, now connect to Hadoop and can move data back and forth seamlessly. Some even create and run MapReduce jobs in Hadoop using their standard visual development environments. Even Microsoft has jumped into the fray, offering its port of Hadoop to Windows Server, an ODBC-to-Hive driver and a new JavaScript framework for MapReduce to the Apache Foundation.

Cooperation or Competition?

Although vendors are quick to rally behind big data, there is some measure of desperation in the move. Established software vendors stand to lose significant revenue if Hadoop evolves without them and gains robust data management and analytical functionality that cannibalizes their existing products. They either need to generate sufficient revenue from new big data products or circumscribe Hadoop so that it plays a subservient role to their existing products. Most vendors are hedging their bets and playing both options, especially database vendors, who perhaps have the most to lose.

Both sides are playing nice and are eager to partner and work together. Hadoop vendors benefit as more applications run on Hadoop, including traditional products centering on business intelligence, extract, transform and load (ETL) and DBMSes. And commercial vendors benefit if their existing tools have a new source of data to connect to and plumb. It’s a big new market whose sweet-tasting honey attracts a hive full of bees.

Why invest in proprietary tools?

But customers are already asking whether data warehouses and BI tools will eventually be folded into Hadoop environments or the reverse. Why spend millions of dollars on a new analytical RDBMS if you can do that processing without paying a dime in license costs using Hadoop? Why spend hundreds of thousands of dollars on data integration tools if your data scientists can turn Hadoop into a huge data staging and transformation layer? Why invest in traditional BI and reporting tools if your power users can exploit Hadoop using freely available languages and tools such as Java, Python, Pig, Hive or HBase?

The Future is Cloudy

Right now, it’s too early to divine the future of the big data movement and predict winners and losers. It’s possible that in the future all data management and analysis will run entirely on open source platforms and tools. But it’s just as likely that commercial vendors will co-opt (or outright buy) open source products and functionality and use them as pipelines to magnify sales of their commercial products.

More than likely, we’ll get a mélange of open source and commercial capabilities. After all, 30 years after the mainframe revolution, mainframes are still a mainstay at many corporations. In IT, nothing ever dies; it just finds its niche in an evolutionary ecosystem.


chapter 2

Two Markets for Big Data: Comparing Value Propositions

There are two types of big data in the market today. There is open source software, centered largely on Hadoop, which eliminates up-front licensing costs for managing and processing large volumes of data. And then there are new analytical engines, including appliances and column stores, which provide significantly higher price-performance than general-purpose relational databases. Both sets of big data software deliver higher return on investment than previous generations of data management technology, but in vastly different ways.

Hadoop

Free software. Hadoop is an open source distributed file system available through the Apache Software Foundation that is capable of storing and processing large volumes of data in parallel across a grid of commodity servers. Hadoop emanated from large Internet providers, such as Google and Yahoo, which needed a cost-effective way to build search indexes. They knew that traditional relational databases would be prohibitively expensive and technically unwieldy, so they came up with a low-cost alternative they built themselves and eventually gave to the Apache Software Foundation so others could benefit from their innovations.

Today many companies are implementing Hadoop software from Apache as well as third-party providers, such as IBM, Cloudera, Hortonworks and EMC. Developers see Hadoop as a cost-effective way to get their arms around large volumes of data that they’ve never been able to do much with before.

Many companies use Hadoop to store, process and analyze large volumes of Web server log data so they can get a better feel for the browsing and shopping behavior of their online customers. Before, companies outsourced the analysis of their clickstream data or simply let it fall on the floor since they didn’t have a way to process it in a timely and cost-effective way. Companies are also turning to Hadoop to process more structured data to improve analytical models.

Data agnostic. Besides being free to implement, the other major advantage of big data software is that it is data agnostic. It can handle any type of data. Unlike a data warehouse or traditional relational database, Hadoop doesn’t require administrators to model or transform data before they load it. With Hadoop, you don’t define a structure for the data; you simply load and go. This significantly reduces the cost of preparing data for analysis compared with what happens in a data warehouse. Most experts assert that 60% to 80% of the cost of building a data warehouse, which can run into the tens of millions of dollars, involves extracting, transforming and loading data. Hadoop virtually eliminates this cost.

As a result, many companies are using Hadoop as a general-purpose staging area and archive for all their data. So a telecommunications company can store 12 months of call detail records instead of aggregating that data in the data warehouse and rolling the details to offline storage. With Hadoop, they can keep all their data online and eliminate the cost of data archival systems. They can also let power users query Hadoop data directly if they want to access the raw data or can’t wait for the aggregates to be loaded into the data warehouse.

Hidden costs. Of course, nothing in technology is ever free. When it comes to processing data, you either pay the piper up front, as in the data warehousing world, or at query time, as in the Hadoop world. Before querying Hadoop data, a developer needs to understand the structure of the data and all of its anomalies. With a clean, well-understood, homogeneous data set, this is not difficult. But most corporate data doesn’t fit that description. So a Hadoop developer ends up playing the role of a data warehousing developer at query time, interrogating the data and making sure its format and content match their expectations. Querying Hadoop today is a “buyer beware” environment.

Moreover, to run big data software, you still need to purchase, install and manage commodity servers (unless you run your big data environment in the cloud, say through Amazon Web Services). While each server may not cost a lot, the price adds up.

But what’s more costly is the expertise and software required to administer Hadoop and manage grids of commodity servers. Hadoop is still bleeding-edge technology and few people have the skills or experience to run it efficiently in a production environment. These folks are hard to find, and they don’t come cheap. The Apache Software Foundation admits that Hadoop’s latest release is equivalent to version 1.0 software. So even the experts have a lot to learn, since the technology is evolving at a rapid pace. Nonetheless, Hadoop and its NoSQL brethren have opened up a vast new frontier for organizations to profit from their data.

Analytical Platforms

The other type of big data predates Hadoop and NoSQL variants by several years. This version of big data is less a “movement” than an extension of existing relational database technology optimized for query processing. These analytical platforms span a range of technology, from appliances and columnar databases to shared-nothing, massively parallel processing databases. The common thread among them is that most are read-only environments that deliver exceptional price-performance compared with general-purpose relational databases originally designed to run transaction processing applications.

Teradata laid the groundwork for the analytical platform market when it launched the first analytical appliance in the early 1980s. Sybase was also an early forerunner, shipping the first columnar database in the mid-1990s. IBM Netezza kicked the current market into high gear in 2003 when it unveiled a popular analytical appliance, and was soon followed by dozens of startups. Recognizing the opportunity, all the big names in software and hardware (Oracle, IBM, Hewlett-Packard and SAP) subsequently jumped into the market, either by building or buying technology, to provide purpose-built analytical systems to new and existing customers.

Although the price tag of these systems often exceeds $1 million, customers find that the exceptional price-performance delivers significant business value, in both tangible and intangible form. For example, Virginia-based XO Communications recovered $3 million in lost revenue from a new revenue assurance application it built on an analytical appliance, even before it had paid for the system. It subsequently built or migrated a dozen applications to run on the new purpose-built system, testifying to its value.

Kelley Blue Book in Irvine, Calif., purchased an analytical appliance to run its data warehouse, which was experiencing performance issues, giving the provider of online automobile valuations a competitive edge. For instance, the new system reduces the time needed to process hundreds of millions of automobile valuations from one week to one day. Kelley Blue Book now uses the system to analyze its Web advertising business and deliver dynamic pricing for its Web ads.

Challenges. Given the up-front costs of analytical platforms, organizations usually undertake a thorough evaluation of these systems before jumping aboard.

First, a company must determine whether an analytical platform outperforms its existing data warehouse database to a degree that warrants migration and retraining costs. This requires a proof of concept in which the customer tests the systems in its own data center using its own data across a range of queries.

The good news is that the new analytical platforms usually deliver jaw-dropping performance for most queries tested. In fact, many customers don’t believe the initial results and rerun the queries to make sure that the results are valid.

Second, companies must choose from more than two dozen analytical platforms on the market today. For instance, they must decide whether to purchase an appliance or a software-only system, a columnar database or an MPP database, or an on-premises system or a Web service. Evaluating these options takes time, and many companies create a short list that doesn’t always contain comparable products.

Finally, companies must decide what role an analytical platform will play in their data warehousing architectures. Should it serve as the data warehousing platform? If so, does it handle multiple workloads easily or is it a one-trick pony? If the latter, what applications and data sets make sense to offload to the new system? How do you rationalize having two data warehousing environments instead of one?

Today we find that companies that have tapped out their SQL Server or MySQL data warehouses often replace them with analytical platforms to get better performance. But companies that have implemented an enterprise data warehouse on Oracle, Teradata or IBM often find that analytical platforms are best used when they sit alongside the data warehouse so they can handle new applications or existing analytical workloads offloaded to them. This architecture helps organizations avoid a costly upgrade to a data warehousing platform, which might easily exceed the cost of purchasing an analytical platform.

The big data movement consists of two separate but interrelated markets: one for Hadoop and open source data management software and the other for purpose-built SQL databases optimized for query processing. Hadoop avoids most of the up-front licensing and loading costs endemic to traditional relational database systems. But since the technology is still immature, there are hidden costs that have thus far kept many Hadoop implementations experimental in nature. On the other hand, analytical platforms are a more proven technology but impose significant up-front licensing fees and potential migration costs. Companies wading into the waters of the big data stream need to carefully evaluate their options.
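The “load and go” trade-off described under “Hidden costs” in this chapter can be illustrated with a small sketch. It is not Hadoop code: it is a single-process Python illustration, and the log format, the field names and the helpers `parse_line` and `successful_hits_per_path` are all hypothetical. The point it tries to show is that raw data is stored without any structure, so the structure has to be imposed, and the anomalies handled, by whoever writes the query.

```python
import re
from collections import Counter

# Hypothetical raw Web server log lines, stored exactly as they arrived
# ("load and go"): nothing checked their structure at load time.
RAW_LINES = [
    "10.0.0.1 2012-03-01T10:15:00 GET /products/42 200",
    "10.0.0.2 2012-03-01T10:15:03 GET /cart 200",
    "corrupted line without the expected fields",
    "10.0.0.1 2012-03-01T10:16:10 GET /checkout 500",
    "10.0.0.3 2012-03-01T10:17:45 GET /products/42 200",
]

# The structure is imposed only now, at query time (assumed format).
LINE_PATTERN = re.compile(
    r"^(?P<ip>\S+) (?P<timestamp>\S+) (?P<method>[A-Z]+) "
    r"(?P<path>\S+) (?P<status>\d{3})$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the format."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def successful_hits_per_path(lines):
    """Count HTTP 200 requests per path and report how many lines were rejected."""
    hits = Counter()
    rejected = 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            # The query author, not the loader, has to deal with bad data.
            rejected += 1
            continue
        if record["status"] == 200:
            hits[record["path"]] += 1
    return hits, rejected


hits, rejected = successful_hits_per_path(RAW_LINES)
print(dict(hits), "rejected lines:", rejected)
```

In a data warehouse, the equivalent of `parse_line` would run once, in the ETL process, before any query is written; here every query that touches the raw lines has to repeat that work or trust someone else’s version of it.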
chapter 3

Categorizing Big Data Processing Systems

Faced with an expanding analytical ecosystem, BI managers need to make many technology choices. Perhaps the most difficult involves selecting a data processing system to power a variety of analytical applications (see Chapter 4, “The New Analytical Ecosystem: Making Way for Big Data”).

In the past, these types of decisions revolved around selecting one of a handful of leading relational database management systems to power a data warehouse or data mart. Often, the choice boiled down to internal politics as much as technical functionality.

Today the options aren’t as straightforward, although politics may still play a role. Instead of selecting a single data management product, BI managers may need to select multiple platforms to outfit an expanding analytical ecosystem. And rather than evaluating four or five alternatives for each platform, the BI manager is faced with dozens of viable options in each category. The once lazy database market is now a beehive of activity.

Staying abreast of all the new products, partnerships and technological advances is now a full-time job. Industry analysts who make a living sifting through products in emerging markets are needed now more than ever. Most analysts will tell you that the first step in selecting an analytical platform is to understand the broad categories of products in the marketplace, and then make finer distinctions from there (see figure 1).

At a high level, there are four categories of analytical processing systems available today: transactional RDBMSes, analytical platforms, Hadoop distributions and NoSQL databases. The following describes those categories and can be used as a starting point when creating a short list of products during a product evaluation process:

figure 1. Database/Platform Positioning

OLTP databases
  Products: Oracle, DB2, SQL Server
  Uses:
  - Transaction systems
  - Enterprise data warehouse hub

Analytical platforms
  Products: Netezza, Vertica, Exadata, Teradata appliances
  Uses:
  - Enterprise data warehouse to replace MySQL or SQL Server in fast-growing companies
  - Analytical data marts to offload the DW
  - Free-standing analytical sandboxes (big data, extreme performance)

Hadoop
  Products: Cloudera, EMC, IBM, Hortonworks
  Uses:
  - Online data archive for all data (but mostly unstructured)
  - Staging area to feed the data warehouse
  - Analytical system when you want to query all the raw data (HBase, Hive)
  - Analytical system when you can’t wait until data is modeled and put in the data warehouse (HBase, Hive)

NoSQL
  Products: Cassandra, MongoDB, MarkLogic, Aster Data
  Uses:
  - Key value pair databases for rapid data capture and analysis
  - Document databases for high-performance application transactions
  - Graph systems that capture relationships among entities
  - Search databases for querying structured and unstructured data
  - Hybrid SQL-MapReduce databases
for Big Data 1. Transactional RDBM Systems those from IBM, Oracle and Syb-
Transactional RDBMSes were origi- ase, are best suited as data ware-
nally designed to support trans- housing hubs that feed a variety of
action processing applications, downstream, end-user-facing sys-
although most have been retrofitted tems, but don’t handle query traf-
with various types of indexes, join fic directly. Although retrofitted
paths and custom SQL bolt-ons to with analytical capabilities, these
make them more palatable to analyt- systems often hit performance
ical processing. There are two types and scalability walls when used for
of transactional RDBMSes: enter- query processing along with other
prise and departmental. workloads and are expensive to
upgrade and replace. Thus, many
11Enterprise hubs. The traditional customers now use these “gray-
enterprise RDBMSes, such as bearded” data warehousing sys-

BIG DATA AND ITS IMPACT ON DATA WAREHOUSING  11


chapter 3:  Categorizing Big Data Processing Systems

tems as hubs to feed operational 11MPP database. Massively parallel


data stores, data marts, enterprise processing (MPP) databases with
reporting systems, analytical sand- strong mixed workload utilities
boxes and various analytical and make good enterprise data ware-
transactional applications. houses for analytically minded or-
ganizations. Teradata was the first
11Departmental marts. A number on the block with such a system,
Despite
of companies use Microsoft SQL but it now has many competitors,
Problems, Server or MySQL as data marts fed including EMC Greenplum and
Big Data
Makes it Huge
by an enterprise data warehouse or Microsoft’s Parallel Data Ware-
as standalone data warehouses for housing option, which are relative
a business unit or small and me- upstarts compared to the 30-year
Two Markets
dium-sized business (SMB). Like old Teradata.
for Big Data: their enterprise brethren, these
Comparing Value
Propositions
systems also often hit the wall 11Analytical appliance. These pur-
when usage, data volumes or query pose-built analytical systems
complexity increases rapidly. A come as an integrated hardware-
Categorizing
fast-growing business unit or SMB software combination tuned for
Big Data often replaces these transactional analytical workloads. Analytical
Processing
Systems
RDBMSes with analytical appli- appliances come in many shapes,
ances (see below) which provide sizes and configurations. Some,
the same or greater level of sim- like IBM Netezza, EMC Greenplum
The New
plicity and ease of management as and Oracle Exadata, are more gen-
Analytical SQL Server or MySQL. eral-purpose analytical machines
Ecosystem:
Making Way
that can serve as replacements for
for Big Data 2. Analytical Platforms most data warehouses. Others,
Analytic platforms represent the first such as those from Teradata, are
wave of big data systems (see Chapter geared to specific analytical work-
2, “Two Markets for Big Data: Com- loads and can deliver extremely
paring Value Propositions”). These fast performance or manage super
are purpose-built SQL-based systems large data volumes.
designed to provide superior price-
performance for analytical workloads 11In-memory systems. If you are
compared with transactional RDBM- looking for raw performance, there
Ses. There are many types of analyti- is nothing better than a system that
cal platforms. Most are being used as lets you put all your data into mem-
data warehousing replacements or ory. These systems will soon be-
standalone analytical systems. come more commonplace, thanks

BIG DATA AND ITS IMPACT ON DATA WAREHOUSING  12


chapter 3:  Categorizing Big Data Processing Systems

11to SAP, which is betting its busi- components to turn Hadoop into an
ness on HANA, an in-memory enterprise-caliber, data processing
database for transactional and environment. The collection of these
analytical processing, and is evan- components is called a Hadoop dis-
gelizing the need for in-memory tribution. Leading providers of Ha-
systems. Another contender in this doop distributions include Cloudera,
space is Kognitio. Many RDBM- IBM, EMC, Amazon, Hortonworks
Despite
Ses are better exploiting memory and MapR.
Problems, for caching results and processing Today, in most customer instal-
Big Data
Makes it Huge
queries. lations, Hadoop serves as a staging
area and online archive for unstruc-
11Columnar. Columnar databases, tured and semi-structured data, as
Two Markets
such as SAP’s Sybase IQ, Hewlett- well as an analytical sandbox for
for Big Data: Packard’s Vertica, ParAccel, Info- data scientists who query Hadoop
Comparing Value
Propositions
bright, Exasol, Calpont and Sand files directly before the data is ag-
offer fast performance for many gregated or loaded into the data
types of queries because of the warehouse. But this could change.
Categorizing
way these systems store and com- Hadoop will play an increasingly im-
Big Data press data—by columns instead portant role in the analytical eco-
Processing
Systems
of rows. Column storage and pro- system at most companies, either
cessing is fast becoming a RDBMS working in concert with an enter-
feature rather than a distinct sub- prise DW or assuming most duties
The New
category of products. of one.
Analytical
Ecosystem:
Making Way
3. Hadoop Distributions 4. NoSQL Databases
for Big Data Hadoop is an open source software NoSQL—shorthand for “not only
project run within The Apache Soft- SQL”—is the name given to a broad
ware Foundation for processing set of databases whose only common
data-intensive applications in a dis- thread is that they don’t require SQL
tributed environment with built-in to process data, although some sup-
parallelism and failover. The most port both SQL and non-SQL forms
important parts of Hadoop are the of data processing. There are many
Hadoop Distributed File System types of NoSQL databases, and the
(HDFS), which stores data in files on list grows longer every month. These
a cluster of servers, and MapReduce, specialized systems are built using
a programming framework for build- either proprietary and open source
ing parallel applications that run on components or a mix of both. In most
HDFS. The open source commu- cases, they are designed to overcome
nity is building numerous additional the limitations of traditional RDBMes

BIG DATA AND ITS IMPACT ON DATA WAREHOUSING  13


chapter 3:  Categorizing Big Data Processing Systems

to handle unstructured and semi-structured data. Here’s a partial listing of NoSQL systems:

• Key value pair databases. These systems store data as a simple record structure consisting of a key and content. They are used for operational applications that involve large volumes of data, flexible data structures and fast transactions. Leading key value pair databases include Cassandra, HBase and Basho Riak.

• Document stores. These systems specialize in storing, parsing and processing application objects, typically using a lightweight structure, such as JSON. Like key value databases, document stores are used for high-volume transaction processing. Leaders here include MongoDB and Couchbase.

• SQL MapReduce. These systems allow users to use SQL to invoke MapReduce jobs running inside the database or associated file system. Teradata’s Aster Data and EMC Greenplum support these capabilities.

• Graph systems. These databases store associations among entities, making them popular among social media companies that need to track different connections among people.

• Unified information access. These systems, such as those from Attivio, MarkLogic and Splunk, use more of a search storage and query paradigm to query structured and unstructured data.

• Other. There are many other NoSQL databases that vary in how they store and process data or the types of applications they are designed to support.

The above four categories represent just the start of a broader categorization of data processing systems geared to analytical workloads. This is a fast-changing field. With the multiplicity of choices available today, BI professionals need to understand the differences between data management offerings so they can position themselves properly in the new analytical ecosystem.
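To make the differences among the first few NoSQL categories concrete, the sketch below models the same small set of facts three ways using plain Python structures: as key value pairs, as JSON-style documents and as a graph of associations. It is an illustration only, under the assumption that in-memory dictionaries, lists and sets can stand in for real stores. The names (`kv_store`, `doc_store`, `friends_of` and the sample users) are hypothetical and do not reflect the API of Cassandra, MongoDB or any other product named in this chapter.

```python
import json

# Key value pair model: an opaque value looked up by a single key.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ana", "city": "Boston"})
kv_store["user:2"] = json.dumps({"name": "Raj", "city": "Austin"})


def kv_get(key):
    """Fetch by key only; the store does not interpret the content."""
    return kv_store.get(key)


# Document model: the store understands the structure (here, JSON-style
# dicts), so records can vary in shape and be filtered by field.
doc_store = [
    {"_id": 1, "name": "Ana", "city": "Boston", "tags": ["bi", "hadoop"]},
    {"_id": 2, "name": "Raj", "city": "Austin"},  # no "tags" field
]


def find_docs(field, value):
    """Return documents whose field equals value; missing fields never match."""
    return [d for d in doc_store if d.get(field) == value]


# Graph model: associations among entities are stored directly.
edges = {("Ana", "Raj"), ("Raj", "Lee"), ("Lee", "Ana"), ("Lee", "Kim")}


def friends_of(person):
    """Treat each edge as an undirected connection between two people."""
    outgoing = {b for a, b in edges if a == person}
    incoming = {a for a, b in edges if b == person}
    return sorted(outgoing | incoming)


if __name__ == "__main__":
    print(kv_get("user:1"))
    print(find_docs("city", "Austin"))
    print(friends_of("Lee"))
```

The point of the comparison is the access path: the key value model can only answer "give me the record for this key," the document model can filter on fields inside the record, and the graph model answers questions about connections without any join logic.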

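The MapReduce framework described under Hadoop Distributions splits a job into a map step that emits key/value pairs, a shuffle that groups the pairs by key, and a reduce step that aggregates each group. The sketch below imitates those three phases in a single Python process, counting HTTP status codes in a few made-up Web server log lines. It assumes nothing about Hadoop's actual Java API: the function names and the log format are hypothetical, and a real job would run the map and reduce steps in parallel across a cluster over files stored in HDFS.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<client> <path> <status>".
LOG_LINES = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cart 500",
    "10.0.0.1 /cart 200",
    "10.0.0.3 /home 404",
    "10.0.0.2 /home 200",
]


def map_phase(line):
    """Map: emit one (status, 1) pair per log line."""
    _client, _path, status = line.split()
    yield status, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence and return {status: count}."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

Because each call to `map_phase` looks at one line in isolation and each call to `reduce_phase` looks at one key in isolation, both steps can be spread across many servers, which is the property that lets this style of program scale to large volumes of log and machine data.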


chapter 4

The New Analytical Ecosystem:
Making Way for Big Data

The big data revolution has arrived, and it’s transforming long-established data warehousing architectures into vibrant, multifaceted analytical ecosystems.

Gone are the days when all analytical processing first passed through a data warehouse or data mart (or their less sanctified spreadmart or data shadow system brethren). Now data winds its way to users through a plethora of corporate data structures, each tailored to the type of content it contains and the type of user who wants to consume it.

Figure 1 depicts a reference architecture for the new analytical ecosystem that has the fingerprints of big data all over it. The objects in blue represent the traditional data warehousing environment, while those in pink represent new architectural elements made possible by big data technologies, namely Hadoop, NoSQL databases, high-performance analytical engines (analytical appliances, MPP databases, in-memory databases) and interactive, in-memory visualization tools.

figure 1.
BI Architecture 2020: The New Analytical Ecosystem
(Diagram labels: operational systems (structured data); machine data; Web data; audio/video data; external data; documents and text; extract, transform and load (batch, near real time, or real time); streaming/CEP engine; Hadoop cluster; data warehouse; dept. data mart; BI server; virtual sandboxes; in-memory BI sandbox; free-standing sandbox; analytical platform or nonrelational database; casual user; power user; top-down architecture; bottom-up architecture.)

Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that companies can’t cost-effectively store and process in SQL engines (e.g., call detail records in a telecommunications company). From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.

Today data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using SQL-based data integration and query tools.

Harmonizing Opposites
The big data revolution is not only about analyzing large volumes and new sources of data; it’s also about balancing data alignment and consistency with flexible, ad hoc exploration. As such, the new analytical ecosystem features both top-down and bottom-up data flows that meet all business requirements for reporting and analysis.

The top-down world. Here source data is processed, refined and stamped with a predefined data structure, typically a dimensional model, and then consumed by casual users using SQL-based reporting and analysis tools. In this domain, IT developers create data and semantic models so business users can get answers to known questions and executives can track performance of predefined metrics.
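As a rough illustration of the top-down pattern, the sketch below builds a tiny dimensional model (one fact table and one dimension table) in an in-memory SQLite database and answers a known question with a predefined SQL query. The table names, columns and figures are invented for the example, and SQLite merely stands in for whatever SQL engine hosts the data warehouse.

```python
import sqlite3


def build_warehouse():
    """Create a minimal star schema; the structure is fixed before any query runs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            category   TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            amount     REAL NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)",
        [(1, "Turbines"), (2, "Sensors")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 1, 100.0), (2, 1, 50.0), (3, 2, 25.0)],
    )
    conn.commit()
    return conn


def revenue_by_category(conn):
    """A predefined metric: total sales amount per product category."""
    rows = conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
        ORDER BY d.category
    """)
    return rows.fetchall()


if __name__ == "__main__":
    print(revenue_by_category(build_warehouse()))
```

The schema and the metric query are both written by developers ahead of time; the business user only runs the report. That ordering is what distinguishes this world from the ad hoc exploration discussed later in the chapter.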


Design precedes access. The top-down world also takes great pains to align data along conformed dimensions and deliver clean, accurate data. The goal is to deliver a consistent view of the business entities so users can spend their time making decisions instead of arguing about the origins and validity of data artifacts.

The underworld. Creating a uniform view of the business from heterogeneous sets of data is not easy. It takes time, money and patience, often more than most departmental heads and business analysts are willing to tolerate. They often abandon the top-down world for the underworld of spreadmarts and data shadow systems. Using whatever tools are readily available and cheap, these data-hungry users create their own views of the business. Eventually, they spend more time collecting and integrating data than analyzing it, undermining their productivity and a consistent view of business information.

The bottom-up world. The new analytical ecosystem brings these prodigal data users back into the fold. It carves out space within the enterprise environment for true ad hoc exploration and promotes the rapid development of analytical applications using in-memory departmental tools. In a bottom-up environment, users can’t anticipate the questions they will ask on a daily or weekly basis or the data they’ll need to answer those questions. Often, the data they need doesn’t yet exist in the data warehouse.

The new analytical ecosystem supports three analytical sandboxes that enable power users to explore corporate and local data on their own terms: (1) Hadoop, (2) virtual partitions inside a data warehouse and (3) specialized analytical databases that offload data or analytical processing from the data warehouse or handle new untapped sources of data, such as Web server logs or machine data. The new environment also gives department heads the ability to create and use dashboards built with in-memory visualization tools that point both to a corporate data warehouse and other independent sources.

Combining top-down and bottom-up worlds is not easy. BI professionals need to assiduously guard data semantics while opening access to data. For their part, business users need to commit to adhering to corporate data standards in exchange for getting the


keys to the kingdom. To succeed, organizations need robust data governance programs and lots of communication among all parties.

The big data revolution brings major enhancements to the BI landscape. First and foremost, it introduces new technologies, such as Hadoop, that make it possible for organizations to cost-effectively consume and analyze large volumes of semi-structured data. Second, it complements traditional, top-down data-delivery methods with more flexible, bottom-up approaches that promote ad hoc exploration and rapid application development.

Wayne Eckerson has more than 15 years’ experience in data warehousing, business intelligence (BI) and performance management. He has conducted numerous in-depth research studies and wrote the best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business. He is a keynote speaker and blogger and conducts workshops on business analytics, performance dashboards and business intelligence. Eckerson served as director of education and research at The Data Warehousing Institute, where he oversaw the company’s content and training programs and chaired its BI Executive Summit.

Eckerson is director of research at TechTarget, where he writes a weekly blog called Wayne’s World, which focuses on industry trends and examines best practices in the application of BI. He is also president of BI Leader Consulting and founder of BI Leadership Forum, a network of BI directors who exchange ideas about best practices in BI and educate the larger BI community. Email him at [email protected].

Big Data and its Impact on Data Warehousing is a joint publication of BeyeNETWORK, SearchDataManagement.com and SearchBusinessAnalytics.com.

Hannah Smalltree
Editorial Director

Jason Sparapani
Managing Editor, E-Publications

Wayne Eckerson
Director of Research

Linda Koury
Director of Online Design

Mike Bolduc
Publisher
[email protected]

Ed Laplante
Director of Sales
[email protected]

© 2012 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.
