HADOOP/BIG DATA
About Big Data
Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.
When dealing with such large datasets, organizations face difficulty creating, manipulating, and managing big data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.
A primary goal for looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that is generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.
Big data spans three dimensions:
Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.
Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
Customer challenges for securing Big Data
Awareness & Understanding are lacking
Customers are not actively talking about security concerns, and they need help understanding the threats in a big data environment.
Company policies & laws add complexity
Main considerations: synchronizing retention and disposition policies across jurisdictions, and moving data across countries. Customers need help navigating these frameworks and changes to them.
Storage Efficiency challenges for Big Data
Deduplication
Challenge: In most instances, data is random and inconsistent rather than duplicated. Opportunity: There is a need for more intelligent identification of data.
Compression
Challenge: Compression normally happens instead of deduplication, yet it will compress duplicated data regardless. Opportunity: There is a need for an automated way to both de-duplicate and then compress.
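To make that opportunity concrete, here is a minimal, hypothetical Java sketch of the idea: identify duplicate blocks by content hash first, then compress only the unique blocks. The block handling, SHA-256 hash, and GZIP codec are illustrative assumptions, not a description of any particular storage product.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.Base64;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.zip.GZIPOutputStream;

    public class DedupThenCompress {
        // Deduplicate blocks by SHA-256 content hash, then gzip only the unique blocks.
        public static byte[] dedupAndCompress(List<byte[]> blocks)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            Set<String> seen = new HashSet<>();
            List<byte[]> unique = new ArrayList<>();
            for (byte[] block : blocks) {
                String fingerprint = Base64.getEncoder().encodeToString(sha.digest(block));
                if (seen.add(fingerprint)) {      // keep only the first copy of each block
                    unique.add(block);
                }
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                for (byte[] block : unique) {
                    gzip.write(block);            // compress after deduplication, not instead of it
                }
            }
            return buffer.toByteArray();
        }
    }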
About Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. As a solution for big data, it deals with the complexities of high volume, velocity, and variety of data, and it enables applications to work with thousands of nodes and petabytes of data. It is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
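As an illustration of the MapReduce programming model, below is a minimal word-count job using the standard org.apache.hadoop.mapreduce API. The input and output paths are placeholders, and the sketch omits the job tuning a real cluster would need.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /data/input
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /data/output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }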
About Apache Hadoop Software Library
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Market Drivers for Apache Hadoop
Business drivers: high-value projects that require the use of more data; the belief that there is great ROI in mastering big data.
Financial drivers: the growing cost of data systems as a percentage of IT spend; the cost advantage of commodity hardware plus open source; enables departmental-level big data strategies.
Trend
The OLD WAY
Operational systems keep only current records and a short history.
Analytics systems keep only conformed/cleaned/digested data.
Unstructured data is locked away in operational silos.
Archives are offline: inflexible, and new questions require system redesigns.
The New Trend
Keep raw data in Hadoop for a long time.
Able to produce a new analytics view on demand.
Keep a new copy of data that was previously locked in silos.
Can run new reports and experiments directly, at low incremental cost.
New products/services can be added very quickly.
Agile outcomes justify the new infrastructure.
Hadoop is a part of a larger framework of related technologies
HDFS: Hadoop Distributed File System
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to big data (a minimal client sketch follows this list).
Hive: Data warehouse system that provides a SQL-like interface. Data structure can be projected ad hoc onto unstructured underlying data.
Pig: A platform for manipulating and analyzing large data sets, with a high-level language for analysts.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
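To illustrate the random, real-time read/write access HBase promises, here is a minimal, hypothetical Java client sketch. The table name, column family, and row key are placeholder assumptions, and the cluster configuration is taken from whatever hbase-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseQuickstart {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {  // "events" is a placeholder table
                // Random write: one cell in column family "d", qualifier "clicks"
                Put put = new Put(Bytes.toBytes("user#42"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("17"));
                table.put(put);

                // Random, real-time read of the same row
                Result row = table.get(new Get(Bytes.toBytes("user#42")));
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"))));
            }
        }
    }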
Organizations using Hadoop
Hadoop developer: core contributor since Hadoop's infancy; project lead for the Hadoop Distributed File System
Facebook (Hadoop, Hive, Scribe)
Yahoo! (Hadoop in Yahoo Search)
Veritas (San Point Direct, Veritas File System)
IBM Transarc (Andrew File System)
UW Computer Science Alumni (Condor Project)
Why Is Hadoop Needed?
Need to process multi-petabyte datasets
It is expensive to build reliability into each application, and nodes fail every day
Failure is expected rather than exceptional, and the number of nodes in a cluster is not constant
Need for common infrastructure: efficient, reliable, open source (Apache License)
The above goals are the same as Condor's, but workloads are I/O-bound rather than CPU-bound
Hadoop is particularly useful when:
Complex information processing is needed
Unstructured data needs to be turned into structured data
Queries can't be reasonably expressed using SQL
Algorithms are heavily recursive
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
Machine learning is involved
Data sets are too large to fit in database RAM or on disk, or require too many cores (tens of TB up to PB)
Data value does not justify the expense of constant real-time availability, such as archives or special-interest information, which can be moved to Hadoop and remain available at lower cost
Results are not needed in real time
Fault tolerance is critical
Significant custom coding would otherwise be required to handle job scheduling
Hadoop is being used as a:
Staging layer: The most common use of Hadoop in enterprise environments is as an ETL staging layer: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (see the sketch after this list).
Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
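As a hedged illustration of the staging/ETL pattern above, the sketch below is a map-only Hadoop job that filters raw log lines before they are loaded elsewhere. The "ERROR" marker, paths, and plain-text input format are placeholder assumptions, not a prescription for any particular pipeline.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogFilterEtl {
        // Map-only job: pass through only the log lines we care about; no reduce phase.
        public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                if (line.toString().contains("ERROR")) {   // "ERROR" marker is an illustrative assumption
                    ctx.write(line, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log filter etl");
            job.setJarByClass(LogFilterEtl.class);
            job.setMapperClass(FilterMapper.class);
            job.setNumReduceTasks(0);                      // map-only: filtered output goes straight to HDFS
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }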
Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.
What Data Projects is Hadoop Driving?
Are Companies Adopting Hadoop?
More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
More than twice as many Hadoop users report being able to create new products and services, and to enjoy cost savings, compared with those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.
87% of Hadoop users are performing or planning new types of analyses with large-scale data.
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.
Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).
More than two-thirds of Hadoop users perform advanced analysis, such as data mining or algorithm development and testing.
Hadoop at LinkedIn
LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
Hadoop at Foursquare
Foursquare was having problems handling the huge amount of data it collects. Its business development managers, venue specialists, and upper management needed access to the data in order to inform important decisions.
To enable easy access to the data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running on Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.
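Foursquare connects through the Ruby Thrift client; as a hedged illustration of the same idea (issuing SQL-like queries against Hive from application code), here is a minimal Java sketch using the Hive JDBC driver instead. The host, port, credentials, table, and query are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the Hive JDBC driver
            // HiveServer2 endpoint; host/port and the "checkins" table are placeholders
            String url = "jdbc:hive2://hive-server.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT venue_id, COUNT(*) AS visits FROM checkins GROUP BY venue_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("venue_id") + "\t" + rs.getLong("visits"));
                }
            }
        }
    }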
Hadoop at Orbitz
Orbitz needed an infrastructure that provides: long-term storage of large data sets; open access for developers and business analysts; ad-hoc querying of data; and rapid deployment of reporting applications.
They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.
HDFS Architecture
[Figure: HDFS architecture. A Namenode maintains metadata (file name and replica count, e.g. /home/foo/data, 6) and serves clients' metadata operations; Datanodes store file blocks, handle block operations, and replicate blocks across racks (Rack 1, Rack 2); clients read from and write to Datanodes directly.]
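To make the client read/write path in the figure concrete, below is a minimal sketch using the HDFS FileSystem Java API. It assumes fs.defaultFS points at a reachable cluster; the file path and replication factor are illustrative, not taken from the figure.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/home/foo/data/example.txt");   // illustrative path

                // Write: the client asks the Namenode for block locations, then streams to Datanodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.writeUTF("hello hdfs");
                }
                fs.setReplication(file, (short) 3);              // request a replication factor for the file

                // Read: block locations come from the Namenode; data is read from Datanodes.
                try (FSDataInputStream in = fs.open(file)) {
                    System.out.println(in.readUTF());
                }
            }
        }
    }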