www.luxoft.com 
DWH & Big Data 
Odessa 
Vladimir Slobodianiuk 
Date: 2014
Agenda
1 Big Data: what is it
2 Hadoop vs RDBMS: pros and cons
3 Hadoop & Enterprise architecture
4 Hadoop as ETL engine
5 Case Studies
Big Data: what is it
Current state 
• Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
Limitations & Problems 
• Big data is difficult to work with using most relational databases; it requires instead massively parallel software running on tens, hundreds, or even thousands of servers
• eBay.com uses two data warehouses at 7.5 petabytes
• Walmart handles more than 1 million customer transactions every hour
• Facebook handles 50 billion photos from its user base
• In 2012, the Obama administration announced the Big Data Research and Development Initiative
Hadoop vs RDBMS
CORE HADOOP - MapReduce 
In 2004, Google published a paper on a process called MapReduce 
• Distributed computing framework
• Processes large jobs in parallel across many nodes and combines the results
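The map, shuffle, and reduce steps behind this idea can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the names `mapper`, `reducer`, and `run_job` are hypothetical, and each inner list in `splits` merely stands in for the share of input that one node would process.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(splits):
    """Run map, shuffle, and reduce over a list of input splits."""
    # Map: each split is processed independently, so a real cluster
    # could hand every split to a different node.
    mapped = chain.from_iterable(
        mapper(line) for split in splits for line in split
    )

    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: combine the grouped values into the final result.
    return dict(reducer(key, values) for key, values in groups.items())


# Two hypothetical input splits, one per imaginary node.
splits = [
    ["big data is big"],
    ["hadoop processes big data"],
]
word_counts = run_job(splits)
print(word_counts)
```

Only the shuffle needs to see data from more than one split; the map and reduce calls work on independent pieces, which is what allows the job to be spread across many nodes.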
Hadoop Structure 
• HDFS is a distributed file system designed to run on commodity hardware
• HBase stores data rows in labelled tables (a sortable key and an arbitrary number of columns)
• Hive provides data summarization, query, and analysis (SQL-like interface)
• Pig is a platform for analyzing large data sets, built around a high-level language (Pig Latin)
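To make the HBase bullet more concrete, here is a rough in-memory sketch of a table with a sortable row key and an arbitrary set of columns per row. It is not the HBase API: the `SparseTable` class, its methods, and the sample keys are all hypothetical and only illustrate the data model.

```python
from bisect import bisect_left, insort


class SparseTable:
    """Toy table: rows are kept in key order, each row has its own columns."""

    def __init__(self):
        self._keys = []  # row keys, always sorted
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Set one cell; new rows are inserted at their sorted position."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of one row, or an empty dict if it does not exist."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, row) pairs with start <= key < stop, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = SparseTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Alice")
table.put("user#001", "info:city", "Odessa")  # only this row has a city column
table.put("order#17", "total", "42")

# Range scan over every key that starts with "user#".
user_rows = list(table.scan("user#", "user#~"))
print(user_rows)
```

Because the keys are sorted, a range scan over a key prefix touches only neighbouring rows, and rows are free to carry different columns.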
Hadoop vs RDBMS
RDBMS
• Performance for relational data
• Machine query optimization
• Mature workload management
• High-concurrency interactive query processing
Hadoop
• Schema-less model
• Human query optimization
• Ability to create complex dataflows with multiple inputs and outputs
• Parallelization of many analytic functions
How might this change in the future
• Query optimization improvements in Hive: statistics, better join ordering, more join types, etc.
• Startup time improvements: simpler query plans to pass out
• Runtime performance improvements
Hadoop & Enterprise architecture
Classic architecture approach
Hadoop & Enterprise architecture
Case Study 1 
Hadoop as ETL Data Quality tool 
PROBLEM
Traditional tools show poor performance in exception handling and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and processes it using data cleansing workflows.
BENEFITS
• Reduced TCO (commodity hardware usage)
• Traceability of all data quality issues
• Hadoop becomes the clean-data tool
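A cleansing step of this kind might look roughly like the sketch below: records from different sources are brought to one format, and every rejected record leaves a traceable issue entry. The field names, the date layouts in `DATE_FORMATS`, and the `cleanse` function are assumptions made for illustration; the slides do not describe the actual workflow, and on a cluster this logic would run inside a distributed job rather than a single loop.

```python
from datetime import datetime

# Hypothetical date layouts used by the upstream sources.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(raw):
    """Return the date in ISO format, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(records):
    """Split records into cleaned rows and a log of data quality issues."""
    clean, issues = [], []
    for line_no, record in enumerate(records, start=1):
        problems = []
        name = record.get("name", "").strip()
        if not name:
            problems.append("missing name")
        date = normalize_date(record.get("date", ""))
        if date is None:
            problems.append(f"unparseable date: {record.get('date')!r}")
        if problems:
            # Keep the record number with each issue so it stays traceable.
            issues.extend((line_no, problem) for problem in problems)
        else:
            clean.append({"name": name.title(), "date": date})
    return clean, issues


records = [
    {"name": " alice ", "date": "2014-05-31"},
    {"name": "bob", "date": "31.05.2014"},
    {"name": "", "date": "05/31/2014"},
    {"name": "carol", "date": "last week"},
]
clean, issues = cleanse(records)
print(clean)
print(issues)
```

Keeping the issue log separate from the cleaned output is one simple way to get the traceability listed under the benefits.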
Case Study 2 
Know Your Customer PoC 
BUSINESS CHALLENGE
• Knowing the actual customer reaction to products is essential for business growth, but it is difficult to get valuable insights. Social media is the place where customers really share their opinions.
SOLUTION
Hadoop-based analysis tool that provides the ability to:
• Find events in the client's streams and identify the needed reaction
• Propose a product to a client based on their interests
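As a very rough illustration of the two abilities above, the sketch below scans a client's posts for keywords and maps each hit to an event, a reaction, and a product proposal. Everything here is hypothetical: the `RULES` table, the `detect_events` function, and the sample posts are invented for the example, and the PoC itself may well have used a different technique.

```python
# Hypothetical rules: keyword -> (event name, suggested reaction, product).
RULES = {
    "mortgage": ("home_purchase", "offer a consultation", "home loan"),
    "vacation": ("travel", "send a travel offer", "travel insurance"),
    "newborn": ("family_growth", "send congratulations", "savings account"),
}


def detect_events(client_id, posts):
    """Return one event record for every rule keyword found in a post."""
    events = []
    for post in posts:
        words = set(post.lower().split())
        for keyword, (event, reaction, product) in RULES.items():
            if keyword in words:
                events.append(
                    {
                        "client": client_id,
                        "event": event,
                        "reaction": reaction,
                        "product": product,
                        "source": post,
                    }
                )
    return events


posts = [
    "Finally signed the mortgage papers today",
    "Lunch was great",
]
events = detect_events("client-42", posts)
print(events)
```

Each post is examined on its own, so the same function could serve as the map step of a job that processes the streams of many clients in parallel.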
Case Study 3 
Enterprise ETL & Hadoop Integration 
Goals: 
• MapReduce ETL job development without coding
• Build, re-use, and check impact analysis with enhanced metadata capabilities
• A Windows-based graphical development environment
• Comprehensive built-in transformations
• A library of Use Case Accelerators to fast-track Hadoop productivity
Summary
Big Data:
• Cutting edge of DI technologies
• State-of-the-art design approaches
• More than simple development: it is something of an art, the art of data management
THANK YOU

FOSS Sea 2014: DataWarehouse & BigData, Vladimir Slobodianiuk (Luxoft)
