Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Hadoop forHadoop for
D S i @BTData Science @BT

Big Data is transforming the economics of data processing
In 1980 storing and querying a year’s
worth of twitter data (if it existed)
would have required a spend of 75%
of the Apollo program.
In 2014 it costs approximately theIn 2014 it costs approximately the
same as a small used car (£6k).

First operational Hadoop cluster at Tinsley Park
Rest of BT Openreach
• 2 x 1.2 PByte Hadoop
cluster
• Logical separation
between Openreach
and rest of BT
• Based on learning
from Research
Hadoop cluster
© British Telecommunications plc
3
Research
cluster
activity

Research cluster usage
4

What are we doing with Hadoop
Data science to understand our networks and physical assets
• Fleet (vehicle maintenance / predictive parts ordering)
• Buildings (energy consumption, security)
• Networks (inventory, faults…)
'ETL'
• Ingest, transformation…
li• Data quality
StStorage
• Low cost, computational file store
5

Predictive ModellingPredictive Modelling
in Hadoop
6

From model development to production
Source data
build model
101011101010101001
000001110101111101
111110100101110000
100011111001000011
010101010000101011
111010101101010101
010101010110101010
evaluate model
production
implementation
In productionmodify model p
execute/monitor
7

In‐Hadoop model scoring
(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production A Zolotovitski Y Keselman)(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
8

Example of building a predictive model
Example adapted from A Handbook of Statistical Analyses using R, 2nd ed.
• Goal ‐ Predict body fat content from common anthropometric measurementsGoal Predict body fat content from common anthropometric measurements
• Reference measurement made using Dual Energy X‐ray Absorptiometry (DXA)
– This is very accurate– This is very accurate
– Isn’t practical for general use due to high costs and methodological effort
• Therefore useful to have a simple method to predict DXA measurement of body fatTherefore useful to have a simple method to predict DXA measurement of body fat
age DEXfat waistcirc hipcirc elbowbreadth kneebreadth
57 41.68 100.0 112.0 7.1 9.4
65 43.29 99.5 116.5 6.5 8.9
59 35.41 96.0 108.5 6.2 8.9
58 22.79 72.0 96.5 6.1 9.2
60 36.42 89.5 100.5 7.1 10.0
9
… … … … … …

Build a predictive model in R
library(rpart)
lib ( t kit)library(partykit)
# get the bodyfat data from the mboost package
data("bodyfat", package="mboost")
# build the model
bodyfat_rpart = rpart(DEXfat ~ age + waistcirc + hipcirc +
elbowbreadth + kneebreadth, data=bodyfat,
control = rpart.control(minsplit=10))
# plot initial regression tree
plot(as party(bodyfat rpart) tp args=list(id=FALSE))plot(as.party(bodyfat_rpart), tp_args=list(id=FALSE))
# save the model
model = bodyfat_rpart
save(model, file = './model.Rdata')
# save the test data
drops=c("DEXfat", "anthro3a", "anthro3b", "anthro3c", "anthro4")
bodyfat.testdata = bodyfat[,!(names(bodyfat) %in% drops)]
write.table(bodyfat.testdata, "./testdata.csv", sep=",", row.names=F, col.names=F)
10

5‐parameter regression tree
11

In‐Hadoop model scoring
(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production A Zolotovitski Y Keselman)(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
12

Model execution in Hive
• Use Hive streaming
– TRANSFORM() allows Hive to pipe data through a user provided script
– We will pipe data through ‘scorer.R’, the wrapper code around our regression tree model
hive>
ADD FILE ./model.Rdata
ADD FILE / R
Make the model file
and scoring script ADD FILE ./scorer.R
INSERT OVERWRITE TABLE default.modelscore
SELECT TRANSFORM(
t.age,
t.waistcirc,
The results of the
query will go into this
table in Hive
and scoring script
available to Hive
,
t.hipcirc,
t.elbowbreadth,
t.kneebreadth
) USING'scorer.R' AS age, score
FROM
default modelinput t;
Model parameters
are passed to scorer.R
Schema of the results
default.modelinput t;
The source data Hive will stream the
14
comes from this table
in Hive
source data through
this script

Model execution in Hive
Research cluster
HP BL460 blades
2 master nodes
Ti t
2 master nodes
6 worker nodes
Time to
score
data (s)
Half a billion
records scored inrecords scored in
under 4 minutes
Number of records (millions)
183 Mbyte 8.9 GByte
15

Persisting ModelsPersisting Models
in Hadoop
16

Persisting Models in Hadoop
Serialize R model, store in DB, fetch at run time via RJDBC
• Specify model(s) to fetch with arguments to wrapper script
• Deserialize models once ‐ minimal overhead
• Model(s) applied to all input records
Serialize R models, store and stream via Hive
• Pre‐join input records with model in the limit – a different model per recordPre join input records with model in the limit a different model per record
• Stream records/models via Hive to wrapper script
• Cache models in wrapper scriptpp p
17

Persisting and stream models via Hive
18

Results…ongoing
Difficult to
scheduleschedule
Testing…
~10 minutes to score:
10 million records
1 cached model – regression tree, 5 parameters, approx 12kBytes as JSON
21

Extracting maximum value from Big Data
Plumbing
D i tt d t i tDesign patterns: data ingest,
incremental processing, file
formats, compression, new
compute engines…p g
A l t t t t bl i i htA place to get trustable insight
Network/service observatory,
data provenance & governance,
collaborative analysis…y
Enabling data science ‘at scale’
In‐Hadoop modelling, in‐life A‐B
trials
22
trials…

Bits and pieces….
R Markdown
• Capture analytic workflow in a reproducible way
• Publish documents to interested community
R packagesR packages
• 'Hive' – simple wrapper around RJDBC to facilitate access to Hive
• TODO: packages to support model creation and persistenceTODO: packages to support model creation and persistence
23

Observatory
Data lifecycle: Analysis:Data lifecycle:
Tools to support
discovery,
provenance,
comprehension
Analysis:
Capture analytical
workflow –
reproduce, test,
validatecomprehension,
usage
validate…
Observatory:
a place where we
can observe and
understand the
health of our
systems
Optimise: Models:
24
Opt se
Optimise storage of data assets
Models:
PMML

Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

More Related Content

What's hot (20)

Similar to Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming (20)

More from huguk (20)

Recently uploaded (20)

Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming