Hadoop for
Data Science @BT
Big Data is transforming the economics of data processing 
In 1980, storing and querying a year’s
worth of Twitter data (if it existed)
would have required a spend of 75%
of the Apollo program's budget.
In 2014 it costs approximately the
same as a small used car (£6k).
First operational Hadoop cluster at Tinsley Park
[Diagram labels: Rest of BT, Openreach]
• 2 x 1.2 PByte Hadoop cluster
• Logical separation between Openreach and rest of BT
• Based on learning from Research Hadoop cluster
© British Telecommunications plc
Research cluster usage
[Chart: Research cluster activity]
What are we doing with Hadoop
Data science to understand our networks and physical assets
• Fleet (vehicle maintenance / predictive parts ordering)
• Buildings (energy consumption, security)
• Networks (inventory, faults…)
'ETL'
• Ingest, transformation…
• Data quality
Storage
• Low cost, computational file store
Predictive Modelling
in Hadoop
From model development to production
[Diagram: model lifecycle. Stages: source data, build model, evaluate model, modify model, production implementation, in production (execute/monitor)]
In‐Hadoop model scoring
(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
Example of building a predictive model
Example adapted from A Handbook of Statistical Analyses using R, 2nd ed.
• Goal: predict body fat content from common anthropometric measurements
• Reference measurement made using Dual Energy X‐ray Absorptiometry (DXA)
– This is very accurate
– Isn’t practical for general use due to high costs and methodological effort
• Therefore useful to have a simple method to predict DXA measurement of body fat
age DEXfat waistcirc hipcirc elbowbreadth kneebreadth
57 41.68 100.0 112.0 7.1 9.4
65 43.29 99.5 116.5 6.5 8.9
59 35.41 96.0 108.5 6.2 8.9
58 22.79 72.0 96.5 6.1 9.2
60 36.42 89.5 100.5 7.1 10.0
… … … … … …
Build a predictive model in R
library(rpart)
library(partykit)
# get the bodyfat data from the mboost package
# (recent releases ship this dataset in TH.data; use package="TH.data" if this fails)
data("bodyfat", package="mboost")
# build the model: a regression tree for DEXfat on five measurements
bodyfat_rpart = rpart(DEXfat ~ age + waistcirc + hipcirc +
                      elbowbreadth + kneebreadth, data=bodyfat,
                      control = rpart.control(minsplit=10))
# plot initial regression tree
plot(as.party(bodyfat_rpart), tp_args=list(id=FALSE))
# save the model
model = bodyfat_rpart
save(model, file = './model.Rdata')
# save the test data: drop the response and the unused anthro* columns
drops=c("DEXfat", "anthro3a", "anthro3b", "anthro3c", "anthro4")
bodyfat.testdata = bodyfat[,!(names(bodyfat) %in% drops)]
# no header or row names, so each line holds only the five model inputs
write.table(bodyfat.testdata, "./testdata.csv", sep=",", row.names=FALSE, col.names=FALSE)
5‐parameter regression tree
In‐Hadoop model scoring
(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
Model execution in Hive
• Use Hive streaming
– TRANSFORM() allows Hive to pipe data through a user provided script
– We will pipe data through ‘scorer.R’, the wrapper code around our regression tree model  
hive>
ADD FILE ./model.Rdata;
ADD FILE ./scorer.R;
INSERT OVERWRITE TABLE default.modelscore
SELECT TRANSFORM(
  t.age,
  t.waistcirc,
  t.hipcirc,
  t.elbowbreadth,
  t.kneebreadth
) USING 'scorer.R' AS age, score
FROM
  default.modelinput t;
Callouts:
• ADD FILE: make the model file and scoring script available to Hive
• INSERT OVERWRITE TABLE default.modelscore: the results of the query will go into this table in Hive
• TRANSFORM(…): model parameters are passed to scorer.R
• USING 'scorer.R': Hive will stream the source data through this script
• AS age, score: schema of the results
• FROM default.modelinput: the source data comes from this table in Hive
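The slides name 'scorer.R' but do not show its contents. The following is a minimal sketch of what such a wrapper might look like, not the script used in the talk. It assumes Hive's default streaming format (one row per line, tab-separated columns, `\N` for NULL) and that model.Rdata holds an object called `model`, as saved on the earlier slide. The function names `score_lines` and `main` and the chunk size are illustrative.

```r
#!/usr/bin/env Rscript
# scorer.R (sketch): score rows that Hive pipes in through TRANSFORM.
# Assumed input: tab-separated columns in the order listed in the query.
library(rpart)  # supplies the predict() method for rpart models

feature_names <- c("age", "waistcirc", "hipcirc", "elbowbreadth", "kneebreadth")

# Parse a chunk of input lines, score them, and return "age<TAB>score"
# lines matching the `AS age, score` schema in the Hive query.
score_lines <- function(lines, model) {
  if (length(lines) == 0) return(character(0))
  rows <- read.table(text = lines, sep = "\t", header = FALSE,
                     col.names = feature_names, colClasses = "numeric",
                     na.strings = "\\N")  # Hive writes NULL as \N
  scores <- predict(model, newdata = rows)
  paste(rows$age, scores, sep = "\t")
}

# Read stdin in chunks so memory stays bounded on large inputs.
main <- function(model_file = "model.Rdata", chunk_size = 10000L) {
  load(model_file)  # assumed to create an object called `model`
  input <- file("stdin", open = "r")
  on.exit(close(input))
  repeat {
    lines <- readLines(input, n = chunk_size)
    if (length(lines) == 0) break
    writeLines(score_lines(lines, model))
  }
}

# A deployed copy of this script would end by calling main():
# main()
```

The model is loaded once per task rather than once per record, and parsing is kept separate from I/O so that `score_lines` can be checked locally on a few lines of testdata before the script is shipped to the cluster.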
Model execution in Hive
Research cluster: HP BL460 blades, 2 master nodes, 6 worker nodes
[Chart: time to score data (s) against number of records (millions); annotated data sizes 183 MByte and 8.9 GByte]
Half a billion records scored in under 4 minutes
Persisting Models
in Hadoop
Persisting Models in Hadoop
Serialize R model, store in DB, fetch at run time via RJDBC
• Specify model(s) to fetch with arguments to wrapper script
• Deserialize models once ‐ minimal overhead
• Model(s) applied to all input records
Serialize R models, store and stream via Hive
• Pre‐join input records with model; in the limit, a different model per record
• Stream records/models via Hive to wrapper script
• Cache models in wrapper script
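The "stream and cache" option can be sketched as follows. This is an illustration under assumptions, not the implementation from the talk: a later slide mentions models held as JSON, whereas this sketch swaps in base R `serialize()` with hex encoding so that it runs without extra packages. The names `encode_model`, `decode_model`, `get_model` and `model_cache` are hypothetical.

```r
# Encode any R object as a single-line hex string. The result contains no
# tabs or newlines, so it can travel in a Hive string column next to each record.
encode_model <- function(model) {
  paste(as.character(serialize(model, connection = NULL)), collapse = "")
}

# Reverse of encode_model(): hex string -> raw bytes -> R object.
decode_model <- function(hex) {
  starts <- seq(1L, nchar(hex), by = 2L)
  bytes <- as.raw(strtoi(substring(hex, starts, starts + 1L), base = 16L))
  unserialize(bytes)
}

# Cache of deserialized models, keyed by model id.
model_cache <- new.env()

# Return the model for `model_id`, deserializing only the first time the id
# is seen; later records with the same id reuse the cached object.
get_model <- function(model_id, hex) {
  key <- as.character(model_id)
  if (!exists(key, envir = model_cache, inherits = FALSE)) {
    assign(key, decode_model(hex), envir = model_cache)
  }
  get(key, envir = model_cache, inherits = FALSE)
}
```

In a wrapper script, each streamed row would carry a model id and the encoded model; calling `get_model(id, hex)` per row pays the deserialization cost once per distinct model rather than once per record.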
Persisting and streaming models via Hive
Results…ongoing
Difficult to schedule
Testing…
~10 minutes to score:
10 million records
1 cached model – regression tree, 5 parameters, approx 12kBytes as JSON
Extracting maximum value from Big Data
Plumbing
Design patterns: data ingest, incremental processing, file formats, compression, new compute engines…
A place to get trustable insight
Network/service observatory, data provenance & governance, collaborative analysis…
Enabling data science ‘at scale’
In‐Hadoop modelling, in‐life A‐B trials…
Bits and pieces….
R Markdown
• Capture analytic workflow in a reproducible way
• Publish documents to interested community
R packages
• 'Hive' – simple wrapper around RJDBC to facilitate access to Hive
• TODO: packages to support model creation and persistence
Observatory
Data lifecycle: tools to support discovery, provenance, comprehension, usage
Analysis: capture analytical workflow: reproduce, test, validate…
Observatory: a place where we can observe and understand the health of our systems
Optimise: optimise storage of data assets
Models: PMML
Questions?

Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming