HADOOP/BIG DATA
About Big Data
Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.
When dealing with such large datasets, organizations face difficulty creating, manipulating, and managing big data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.
A primary goal for looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that is generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.
Big data spans three dimensions:
Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.
Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
Customer challenges for securing Big Data
Awareness & Understanding are lacking
Customers are not actively talking about security concerns, and they need help understanding the threats in a big data environment.
Company policies & laws add complexity
Main considerations: synchronizing retention and disposition policies across jurisdictions, and moving data across countries. Customers need help navigating these frameworks and changes to them.
Storage Efficiency challenges for Big Data
Deduplication
Challenge: In most instances, data is random and inconsistent rather than duplicated. Opportunity: There is a need for more intelligent identification of data.
Compression
Challenge: Compression normally happens instead of deduplication, yet it will compress duplicated data regardless. Opportunity: There is a need for an automated way to both de-duplicate and then compress.
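To make that opportunity concrete, here is a minimal, hypothetical Java sketch of the idea: identify duplicate blocks by content hash first, then compress only the unique blocks. The block handling, SHA-256 hash, and GZIP codec are illustrative assumptions, not a description of any particular storage product.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.Base64;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.zip.GZIPOutputStream;

    public class DedupThenCompress {
        // Deduplicate blocks by SHA-256 content hash, then gzip only the unique blocks.
        public static byte[] dedupAndCompress(List<byte[]> blocks)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            Set<String> seen = new HashSet<>();
            List<byte[]> unique = new ArrayList<>();
            for (byte[] block : blocks) {
                String fingerprint = Base64.getEncoder().encodeToString(sha.digest(block));
                if (seen.add(fingerprint)) {      // keep only the first copy of each block
                    unique.add(block);
                }
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                for (byte[] block : unique) {
                    gzip.write(block);            // compress after deduplication, not instead of it
                }
            }
            return buffer.toByteArray();
        }
    }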
About Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. As a solution for big data, it deals with the complexities of high volume, velocity, and variety of data, and it enables applications to work with thousands of nodes and petabytes of data. It is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
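As an illustration of the MapReduce programming model, below is a minimal word-count job using the standard org.apache.hadoop.mapreduce API. The input and output paths are placeholders, and the sketch omits the job tuning a real cluster would need.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /data/input
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /data/output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }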
About Apache Hadoop Software Library
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Market Drivers for Apache Hadoop
Business drivers: high-value projects that require the use of more data; the belief that there is great ROI in mastering big data.
Financial drivers: the growing cost of data systems as a percentage of IT spend; the cost advantage of commodity hardware plus open source; enables departmental-level big data strategies.
Trend
The OLD WAY
Operational systems keep only current records and a short history.
Analytics systems keep only conformed/cleaned/digested data.
Unstructured data is locked away in operational silos.
Archives are offline: inflexible, and new questions require system redesigns.
The New Trend
Keep raw data in Hadoop for a long time.
Able to produce a new analytics view on demand.
Keep a new copy of data that was previously locked in silos.
Can run new reports and experiments directly, at low incremental cost.
New products/services can be added very quickly.
Agile outcomes justify the new infrastructure.
Hadoop is a part of a larger framework of related technologies
HDFS: Hadoop Distributed File System
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to big data (a minimal client sketch follows this list).
Hive: Data warehouse system that provides a SQL-like interface. Data structure can be projected ad hoc onto unstructured underlying data.
Pig: A platform for manipulating and analyzing large data sets, with a high-level language for analysts.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
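To illustrate the random, real-time read/write access HBase promises, here is a minimal, hypothetical Java client sketch. The table name, column family, and row key are placeholder assumptions, and the cluster configuration is taken from whatever hbase-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseQuickstart {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {  // "events" is a placeholder table
                // Random write: one cell in column family "d", qualifier "clicks"
                Put put = new Put(Bytes.toBytes("user#42"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("17"));
                table.put(put);

                // Random, real-time read of the same row
                Result row = table.get(new Get(Bytes.toBytes("user#42")));
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"))));
            }
        }
    }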
Organizations using Hadoop
Hadoop developer: core contributor since Hadoop's infancy; project lead for the Hadoop Distributed File System
Facebook (Hadoop, Hive, Scribe)
Yahoo! (Hadoop in Yahoo Search)
Veritas (San Point Direct, Veritas File System)
IBM Transarc (Andrew File System)
UW Computer Science Alumni (Condor Project)
Why Is Hadoop Needed?
Need to process multi-petabyte datasets
It is expensive to build reliability into each application, and nodes fail every day
Failure is expected rather than exceptional, and the number of nodes in a cluster is not constant
Need for common infrastructure: efficient, reliable, open source (Apache License)
The above goals are the same as Condor's, but workloads are I/O-bound rather than CPU-bound
Hadoop is particularly useful when:
Complex information processing is needed
Unstructured data needs to be turned into structured data
Queries can't be reasonably expressed using SQL
Algorithms are heavily recursive
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
Machine learning is involved
Data sets are too large to fit in database RAM or on disk, or require too many cores (tens of TB up to PB)
Data value does not justify the expense of constant real-time availability, such as archives or special-interest information, which can be moved to Hadoop and remain available at lower cost
Results are not needed in real time
Fault tolerance is critical
Significant custom coding would otherwise be required to handle job scheduling
Hadoop is being used as a:
Staging layer: The most common use of Hadoop in enterprise environments is as an ETL staging layer: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (see the sketch after this list).
Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
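As a hedged illustration of the staging/ETL pattern above, the sketch below is a map-only Hadoop job that filters raw log lines before they are loaded elsewhere. The "ERROR" marker, paths, and plain-text input format are placeholder assumptions, not a prescription for any particular pipeline.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogFilterEtl {
        // Map-only job: pass through only the log lines we care about; no reduce phase.
        public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                if (line.toString().contains("ERROR")) {   // "ERROR" marker is an illustrative assumption
                    ctx.write(line, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log filter etl");
            job.setJarByClass(LogFilterEtl.class);
            job.setMapperClass(FilterMapper.class);
            job.setNumReduceTasks(0);                      // map-only: filtered output goes straight to HDFS
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }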
Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.
What Data Projects is Hadoop Driving?
Are Companies Adopting Hadoop?
More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
More than twice as many Hadoop users report being able to create new products and services, and to enjoy cost savings, compared with those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.
87% of Hadoop users are performing or planning new types of analyses with large-scale data.
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.
Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).
More than two-thirds of Hadoop users perform advanced analysis, such as data mining or algorithm development and testing.
Hadoop at LinkedIn
LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
Hadoop at Foursquare
Foursquare was having problems handling the huge amount of data it collects. Its business development managers, venue specialists, and upper management needed access to the data in order to inform important decisions.
To enable easy access to the data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running on Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.
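Foursquare connects through the Ruby Thrift client; as a hedged illustration of the same idea (issuing SQL-like queries against Hive from application code), here is a minimal Java sketch using the Hive JDBC driver instead. The host, port, credentials, table, and query are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the Hive JDBC driver
            // HiveServer2 endpoint; host/port and the "checkins" table are placeholders
            String url = "jdbc:hive2://hive-server.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT venue_id, COUNT(*) AS visits FROM checkins GROUP BY venue_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("venue_id") + "\t" + rs.getLong("visits"));
                }
            }
        }
    }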
Hadoop at Orbitz
Orbitz needed an infrastructure that provides: long-term storage of large data sets; open access for developers and business analysts; ad-hoc querying of data; and rapid deployment of reporting applications.
They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.
HDFS Architecture
[Figure: HDFS architecture. A Namenode maintains metadata (file name and replica count, e.g. /home/foo/data, 6) and serves clients' metadata operations; Datanodes store file blocks, handle block operations, and replicate blocks across racks (Rack 1, Rack 2); clients read from and write to Datanodes directly.]
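To make the client read/write path in the figure concrete, below is a minimal sketch using the HDFS FileSystem Java API. It assumes fs.defaultFS points at a reachable cluster; the file path and replication factor are illustrative, not taken from the figure.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/home/foo/data/example.txt");   // illustrative path

                // Write: the client asks the Namenode for block locations, then streams to Datanodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.writeUTF("hello hdfs");
                }
                fs.setReplication(file, (short) 3);              // request a replication factor for the file

                // Read: block locations come from the Namenode; data is read from Datanodes.
                try (FSDataInputStream in = fs.open(file)) {
                    System.out.println(in.readUTF());
                }
            }
        }
    }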