Turn Big Data into Big Value
ABSTRACT
Some of today’s most successful companies achieve game-changing business advantages by capturing, analyzing, and
acting upon vast amounts of diverse, fast-moving “big data.” This paper describes three usage models that can help you
implement a flexible and efficient big data infrastructure to realize competitive advantages in your own business. It also
describes Intel innovations in silicon, systems, and software that can help you to deploy these and other big data solutions
with optimal performance, cost, and energy efficiency.
[Figure 1 chart: semi/unstructured data volumes grow roughly 400 percent between 2006 and 2020.]
Figure 1. Current and forecasted growth of big data. Source: Philippe Botteri of Accel Partners, Feb. 2013.
White Paper: Turn Big Data into Big Value
While a tsunami is destructive, big data holds tremendous potential value. With the right tools and strategies, businesses can extract insights that deliver game-changing competitive advantages. A number of public and private organizations do that today.

• Retailers analyze social media trends in real time to offer the hottest products to the most likely buyers, and they do this at volumes and with levels of granularity that have never before been possible.

Because the value of big data stretches across vast amounts of complex, fast-moving content, deriving meaningful insights often requires extensive mining and deep analysis that go beyond traditional Business Intelligence (BI) queries and reports. Machine learning, statistical modeling, graph algorithms, and other emerging techniques can unveil valuable, actionable insights that deliver significant competitive advantages.
Usage Model 1—ETL using Apache Hadoop*

Like traditional data, big data must be extracted from external sources, transformed into structures that fit operational needs, and loaded into a database for storage and management. Traditional ETL solutions cannot handle the demands of poly-structured data, so Hadoop software has emerged as the de facto platform for addressing this need (Figure 2).

The distributed storage and processing environment of a Hadoop cluster works well for big data ETL. Hadoop breaks up incoming streams into pieces and applies simple operations in parallel to rapidly process large amounts of data. It supports all data types and can operate across tens, hundreds, or even thousands of servers to provide massive scalability. The Hadoop Distributed File System (HDFS) stores the results on low-cost storage devices directly attached to each server in the cluster—ready for immediate uploading to the enterprise data warehouse or unstructured data stores.

Hadoop can process poly-structured data for analysis, even when that data is not predefined. In other words, Hadoop supports a Schema on Read model as opposed to the Schema on Write model used in traditional ETL processes. This enables Hadoop to load large amounts of data in a short time, making that data quickly available for analysis, visualization, and other uses.

Infrastructure Considerations

Dual-socket servers based on the Intel® Xeon® processor E5 family provide an optimal balance of capability versus cost for most Hadoop deployments. These servers offer more cores, cache, and memory capacity than previous-generation servers. They also provide up to twice the I/O bandwidth with 30 percent lower I/O latency.1 These resources sustain high throughput for larger numbers of data-intensive tasks executing in parallel.

Lightweight, I/O-bound workloads, such as simple data sorting operations, may not require the full processing power of the Intel Xeon processor E5 family. Such workloads run economically on high-density, low-power servers based on the Intel® Xeon® processor E3 family or the Intel® Atom™ processor-based System on a Chip (Intel Atom SoC). With power envelopes as low as 6 watts, the 64-bit x86-based Intel Atom SoC provides unprecedented density and energy efficiency in a server-class processor.
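The Schema on Read approach can be illustrated with a small sketch in plain Python (not Hadoop itself): records are stored exactly as they arrive, and a schema is imposed only when a query reads them. The record formats and field names below are invented for illustration.

```python
import json

# Store raw, poly-structured records as-is. A Schema on Write system
# would reshape or reject them at load time; Schema on Read does not.
raw_store = [
    '{"user": "alice", "clicks": 3}',               # JSON event
    'bob,7',                                         # CSV event
    '{"user": "carol", "clicks": 5, "geo": "US"}',   # JSON with extra field
]

def read_clicks(record):
    """Impose a (user, clicks) schema at read time, per record format."""
    if record.lstrip().startswith("{"):
        d = json.loads(record)
        return d["user"], int(d["clicks"])
    user, clicks = record.split(",")
    return user, int(clicks)

total = sum(clicks for _, clicks in map(read_clicks, raw_store))
print(total)  # → 15
```

Because no structure is enforced at load time, new record formats can be ingested immediately; only the read-time parser needs to change.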
[Figure 2 diagram: data sources (CRM, ERP, web site traffic, social media, sensor logs) feed an ETL offload layer built on Flume, Sqoop, Pig/MapReduce, and HDFS, which extracts, transforms, and loads data into the data warehouse for OLAP analysis, data mining, reporting, and data science.]
Figure 2. Using Apache Hadoop,* organizations can ingest, process, and export massive amounts of diverse data at scale.
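The "simple operations in parallel" that Hadoop applies can be sketched as classic MapReduce word counting, shown here in plain Python running in a single process; on a real cluster, the map and reduce phases would run in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain

# Word count in the MapReduce style used by Hadoop: a mapper emits
# (word, 1) pairs, a shuffle groups pairs by key, and a reducer sums.
def mapper(line):
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

lines = ["big data big value", "turn big data"]
grouped = shuffle(chain.from_iterable(mapper(l) for l in lines))
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # → {'big': 3, 'data': 2, 'value': 1, 'turn': 1}
```

Each phase touches only its own slice of the data, which is what lets Hadoop scale the same logic across thousands of servers.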
All servers in a Hadoop cluster require substantial memory and a relatively large number of storage drives to meet the demands of data-intensive Hadoop workloads. Sufficient memory is required to support high throughput for the many operations performed in parallel. Multiple storage drives (two or more per core) deliver the aggregate I/O throughput needed to avoid storage bottlenecks. Storage performance improves considerably with at least one Intel® Solid State Drive (Intel® SSD) in each server node.

By processing data near where it is stored, Hadoop greatly reduces the need for high-volume data movement. Nevertheless, fast data import and export requires sufficient network bandwidth. In most cases, each rack of servers should use a 10 Gigabit Ethernet (10 GbE) switch, and each rack-level switch should connect to a 40 GbE cluster-level switch. As data volumes, workloads, and clusters grow, it may be necessary to interconnect multiple cluster-level switches or even to uplink to another level of switching infrastructure.

For more detailed information, see the Intel white paper, "Extract, Transform & Load (ETL) Big Data with Apache Hadoop*," posted in the Intel Developer Zone at software.intel.com.

Usage Model 2—Interactive Queries

Businesses looking to implement a powerful and cost-effective big data platform should consider combining a large-scale SQL data warehouse with a Hadoop cluster. The cluster can quickly ingest and process large, diverse, and fast-moving data streams. Appropriate data sets can then be loaded into the data warehouse for ad hoc SQL queries, analysis, and reports. Users also can query multi-structured data sets that reside in the Hadoop cluster using software such as Apache HBase,* Spark,* Shark,* SAP HANA,* Apache Cassandra,* MongoDB,* Tao,* Neo4J,* Apache Drill,* or Impala.* This hybrid strategy offers a foundation for faster, deeper insights than either solution alone can achieve.

Similar processes apply whether you use a traditional data warehouse or a more modern system designed for larger volumes and faster data streams: gather data from external sources, then cleanse and format the data to fit into the warehouse data model. This can be done prior to loading the data into the warehouse, or it can be done on the fly as streaming data sources are fed into the warehouse.

With the data loaded, analysis can begin. Modern data warehouses support ad hoc queries, enabling on-demand access to data with any meaningful combination of values. This contrasts with more traditional data warehouses that generate only pre-defined reports based on known relationships.
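The ad hoc query model can be sketched with an in-memory SQLite database standing in for the warehouse; the table and column names below are illustrative only, not any particular warehouse schema.

```python
import sqlite3

# In-memory stand-in for a warehouse fact table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 120.0), ("EMEA", "gadget", 80.0),
     ("APAC", "widget", 200.0)],
)

# Ad hoc query: any meaningful combination of values, composed on
# demand rather than baked into a pre-defined report.
rows = con.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE product = ? GROUP BY region ORDER BY region",
    ("widget",),
).fetchall()
print(rows)  # → [('APAC', 200.0), ('EMEA', 120.0)]
```

A pre-defined report would ship only fixed queries like this one; an ad hoc system lets analysts vary the filter, grouping, and aggregation freely at query time.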
Self-healing features proactively and reactively repair known errors and also reduce the likelihood of future errors by acting automatically based on configurable error thresholds. Intel works extensively with hardware, operating system, virtual machine monitor (VMM), and application vendors to help ensure tight integration throughout the hardware and software stack.

As data volumes skyrocket, new strategies help scale data storage capacity more efficiently and cost-effectively, both within and beyond the data warehouse. The following strategies can work together to meet diverse needs at lower total cost.

• Scale-out storage architectures deliver affordable high capacity and support federation across private and hybrid clouds. These solutions scale dynamically, and you can provision them faster than traditional storage systems. They also help to improve data management efficiency.

• Low-latency, proximity storage is a good fit for data-intensive applications that perform better when co-located with the data storage devices. Examples include business processes, decision support analyses, and high-performance computing workloads, as well as collaborative processes, applications, and web infrastructure running on virtualized servers.

• Centralized storage aggregated as logical pools in storage area networks (SANs) supports high-performance business databases. When optimized for affordable capacity rather than high performance, centralized solutions provide efficient storage for backup, archive, and object store requirements.

Higher storage efficiency can help to contain costs in the face of rapid data growth. Many storage vendors integrate Intel Xeon processors into their storage solutions to support advanced data management functions that help to improve efficiency. According to IDC's June 2013 Worldwide Storage and Virtualized x86 Environments 2013–2017 Forecast, about 80 percent of worldwide, enterprise-class storage solutions for corporations, cloud, and HPC run on Intel architecture. Look for storage platforms that support data-efficiency technologies, including:

• Intelligent tiering to optimize performance versus cost by automatically moving "hot" data to faster storage devices and "cold" data to higher-capacity, lower-cost drives. With this approach, a small number of high-speed drives, such as Intel® SSD 710 Series SATA drives, can deliver substantial performance improvements at relatively low cost.

Loading data sets into data warehouses quickly and efficiently enables analytics applications to provide business insights in a timely manner. Efficient ETL processing is one component of the solution. Another is a fast and efficient network to drive the growing business value of analytics throughout the enterprise. Intel® Ethernet products integrate technologies to address these requirements.

• Near-native performance in virtualized environments. Virtualization improves infrastructure flexibility and utilization—important for containing costs as big data solutions grow. Intel® Virtualization Technology for connectivity (Intel® VT-c) helps to reduce I/O bottlenecks and improve overall server performance in virtualized environments. Its Virtual Machine Device Queues (VMDQ) technology offloads traffic sorting and routing to dedicated silicon in the network adapter. Its PCI-SIG Single Root I/O Virtualization (SR-IOV) technology allows a single Intel® Ethernet Server Adapter port to support multiple, isolated connections to virtual machines.

• Unified 10 GbE networking. Consolidating data center traffic onto a single, high-bandwidth network helps to reduce cost and complexity and provides the performance and scalability needed to address rapidly growing needs. Intel Ethernet Converged Network Adapters support Fibre Channel over Ethernet (FCoE) and iSCSI to simplify implementation and reduce costs when consolidating local area network (LAN) and storage area network (SAN) traffic.

• Simpler, faster connections to iSCSI SANs. Intel Ethernet Converged Network Adapters and Intel Ethernet Server Adapters provide hardware-based iSCSI acceleration to improve performance. They also take advantage of native iSCSI initiators integrated into leading operating systems to simplify iSCSI deployment and configuration in both native and virtualized networks.

For more detailed information, see the Intel SQL Data Warehousing Usage Model white paper.
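The intelligent tiering idea described above (hot data on fast drives, cold data on capacity drives) can be sketched as a simple placement policy. The threshold and tier names below are invented for illustration, not any vendor's actual policy.

```python
# Toy tiering policy: objects accessed at least HOT_THRESHOLD times in
# the observation window are promoted to the fast (SSD) tier; the rest
# stay on high-capacity, lower-cost drives (HDD).
HOT_THRESHOLD = 10

def assign_tier(access_counts):
    """Map object name -> 'ssd' or 'hdd' based on recent access count."""
    return {
        name: ("ssd" if count >= HOT_THRESHOLD else "hdd")
        for name, count in access_counts.items()
    }

placement = assign_tier({"orders.db": 42, "archive-2011.tar": 1})
print(placement)  # → {'orders.db': 'ssd', 'archive-2011.tar': 'hdd'}
```

Real tiering engines also weigh recency, object size, and migration cost, but the core decision is this kind of per-object classification re-run on a schedule.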
Usage Model 3—Predictive Analytics on the Hadoop Platform

Predictive analytics extracts higher value from data by capturing relationships from past events and using them to predict future outcomes (Figure 3). Retailers use predictive analytics to deliver more compelling offers to individual customers, healthcare organizations use it to select best-fit treatment protocols, and financial services organizations use it to increase investment returns and reduce risk.

Although predictive analytics can aid in strategic business planning, its greatest value may come from tactical guidance at the point of decision and operational guidance at the point of execution. Centralized teams of data scientists, database administrators, and software developers work together to provide customized solutions for the most critical business operations. As businesses integrate this capability more widely into their operations, they must provide optimized decision tools for a wider range of users and automated systems.

Predictive analysis falls into two main categories: regression and machine learning.

• Regression techniques compare current data with historical models to forecast the most probable outcome.

• Machine learning uses artificial intelligence with little or no human intervention. The system analyzes a representational data set to extract relationships, and it generalizes from that to make predictions based on new data. Optical character recognition (OCR) is a classic example, but new applications exploit big data across a wide range of scenarios.

Intel IT began its own trailblazing big data analytics effort in 2010 and recommends combining the two usage models already discussed in this paper to create a hybrid analytics infrastructure (Figure 4).

1. Deploy a data warehouse appliance based on an MPP architecture to perform complex predictive analytics quickly on large data sets. A number of vendors have incorporated the Intel Xeon processor E7 family into blade-based appliances that deliver the required performance at relatively low cost. These systems fit into existing enterprise BI solutions and provide integrated support for advanced analytics tools and applications, such as R, an open-source statistical computing language that is popular among data scientists.

2. Add a Hadoop cluster for fast, scalable, and affordable ETL for the data warehouse. Hadoop also runs other data processing and analytics functions that perform well in a distributed processing environment. The Hadoop ecosystem offers a growing variety of tools and components to address these needs.

Infrastructure Considerations

To provide maximum flexibility, the data warehouse and the Hadoop cluster should use a high-speed data loader and link together using 10 GbE or another high-bandwidth networking technology. This allows you to move data quickly between the two environments, so you can use the most effective analytics techniques based on specific data types, workloads, and business needs.
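As a concrete illustration of the regression category, here is a minimal ordinary least-squares fit and forecast in plain Python; the data points are invented, and production systems would use a statistics library rather than hand-rolled math.

```python
# Fit y = slope*x + intercept by ordinary least squares over historical
# observations, then forecast the outcome at the next point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

forecast = slope * 5.0 + intercept  # predict the outcome at x = 5
print(round(forecast, 2))
```

The "historical model" here is just the fitted line; richer regression techniques add more predictors and non-linear terms, but the forecast step is the same.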
[Figure 3 chart: descriptive analytics asks "What happened?" (hindsight), diagnostic analytics asks "Why did it happen?" (insight), and predictive analytics provides foresight; business value and difficulty both increase along the path from information to optimization.]
Figure 3. According to Gartner, the difficulty and business value of analytics both increase as the focus moves from hindsight to foresight.
Figure 4. Intel IT’s big data platform provides a flexible foundation for analytics—including predictive analytics—by using a high-speed data loader to connect a
massively parallel processing (MPP) data warehouse appliance with clusters of industry-standard servers running Hadoop software.
Creating a Better Foundation for Big Data Analytics

As big data technologies and solutions advance, Intel products and technologies help speed up innovation throughout the ecosystem. By working with hardware, software, and service providers to ensure broad support, Intel helps businesses integrate these new capabilities more simply and affordably on a standards-based, connected, managed, and secure architecture.

Processor Advances for Performance and Security

Intel processor advances deliver increasing performance and value for next-generation big data solutions. Ongoing improvements in per-thread performance, parallel execution, I/O throughput, memory capacity, and energy efficiency help businesses address rapidly growing needs using affordable, mainstream computing systems.

Intel also integrates advanced security technologies that protect data more effectively, so you can integrate sensitive data into your big data analytics environment. Current security technologies in Intel Xeon processors provide the following advantages.

• Strong workload isolation on trusted infrastructure. Intel® Trusted Execution Technology (Intel® TXT) and Intel® Virtualization Technology (Intel® VT) help to protect systems and software more effectively in virtualized and cloud environments. Intel VT provides silicon-assisted workload isolation. Intel TXT can establish trusted infrastructure pools by ensuring that Intel® Xeon® processor-based servers boot only into "known good states."

• Fast, low-overhead data encryption. Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) provides hardware acceleration for encryption to protect data in latency-sensitive analytics environments without sacrificing performance. Intel performance tests show that Intel AES-NI can accelerate encryption performance in a Hadoop cluster by up to 5.3x and decryption performance by up to 19.8x when used in combination with the Intel Distribution for Apache Hadoop software (Intel Distribution).2 Intel Xeon processors and the upcoming Intel Atom SoC support Intel AES-NI.

New Tools and Optimized Software

Intel works both independently and in collaboration with leading software vendors and the open-source community to provide optimized software stacks and services for big data analytics. These efforts help to deliver new and advanced functionality throughout the big data ecosystem. They also help to ensure the best possible performance for big data applications running on Intel architecture. Intel also delivers software products that help address some of the most critical needs within the big data ecosystem.

• Performance benchmarking for Hadoop clusters and applications. The Intel® HiBench suite includes 10 benchmarks that IT organizations and software vendors use to measure performance for specific, common tasks, such as sorting and word counting, and for more comprehensive real-world functions, such as web searching, machine learning, and data analytics. Intel engineers use the Intel HiBench suite to help with upstream Hadoop optimizations for Intel architecture as well as with Java* optimizations for Hadoop.
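A benchmark of the kind HiBench automates can be reduced to a tiny sketch: generate a representative workload, time the task, and report throughput. This toy harness runs an in-memory sort in plain Python; it is not part of the actual Intel HiBench suite, which runs Java workloads on a Hadoop cluster.

```python
import random
import time

# Micro-benchmark in the spirit of a sort benchmark: time how long it
# takes to sort a batch of random records. Seeded for repeatable data.
def time_sort(n_records, seed=0):
    rng = random.Random(seed)
    records = [rng.random() for _ in range(n_records)]
    start = time.perf_counter()
    records.sort()
    elapsed = time.perf_counter() - start
    return elapsed, records

elapsed, records = time_sort(100_000)
print(f"sorted {len(records)} records in {elapsed:.4f} s")
```

Real benchmark suites add warm-up runs, multiple trials, and cluster-wide coordination, but the measure-a-representative-task core is the same.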
To learn about Intel IT strategies and best practices for implementing big data analytics, read the Intel IT white paper, “Mining Big Data in
the Enterprise for Better Business Intelligence.”
1. The claim of up to 32% reduction in I/O latency is based on Intel internal measurements of the average time for an I/O device read to local system memory under idle conditions for the Intel® Xeon® processor E5-2600 product family versus the Intel® Xeon® processor 5600 series. 8 GT/s and 128b/130b encoding in the PCIe 3.0 specification enable double the interconnect bandwidth over the PCIe 2.0 specification. For more information, read the PCI-SIG* press release, "PCI-SIG releases PCI Express 3.0 Specification."
2. For details, see the Intel solution brief, "Fast, Low-Overhead Encryption for Apache Hadoop*." Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Atom are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Printed in USA 0713/DF/HBD/PDF Please Recycle 329261-001US