H2O – The Open Source Math Engine
Big Data Science
with H2O in R
4/23/13
H2O –
Open Source Math
& Machine Learning
for Big Data
Anqi Fu, August 2013
Universe is sparse. Life is messy.
Data is sparse & messy.
- Lao Tzu
Introduction to Big Data
• There are about as many bits of information in our digital
universe as there are stars in our actual universe.
• The process to decode the human genome took 10 years.
It can now be done in a week.
• Big data means more than “lots of data”
H2O – The Open Source Math Engine
Better Predictions
Same Interface
Installation
1. Install and run H2O
• Command line: java -Xmx2g -jar h2o.jar
• Pull up http://localhost:54321 in browser
2. Install the R package
• install.packages(c("RCurl", "rjson", "bitops"))
• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL,
type = "source")
(replace Path/To/Package with the actual location of the package file)
3. In R console, type library(h2o)
• demo(package = "h2o")
• demo(h2o.glm)
Always have H2O running first!
Basic R Script
1. Tell R where H2O is running:
localH2O = new("H2OClient", ip="127.0.0.1", port=54321)
2. Check connection:
h2o.checkClient(localH2O)
3. Pass H2OClient as parameter to import:
h2o.importFile(localH2O, path="Path/To/Data", …)
Overview of Objects
• H2OClient: ip=character, port=numeric
• H2OParsedData: h2o=H2OClient, key=character
• H2OGLMModel: key=character, data=H2OParsedData,
model=list(coefficients, deviance, aic, etc)
Example: myModel@model$coefficients
[Diagram: H2OParsedData objects in R refer to data held in H2O by key, e.g. key="prostate.hex" and key="airlines.hex"]
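The object layout above can be tried without a running H2O instance. What follows is a minimal sketch, assuming stand-in class names (ClientSketch, ParsedDataSketch, GLMModelSketch) that mirror the slots listed on the slide. These names are hypothetical and are not the classes shipped in the h2o package, and the coefficient values are arbitrary placeholders rather than fitted results.

```r
library(methods)

# Stand-in classes mirroring the slots listed above (hypothetical names,
# chosen so they cannot clash with the real h2o package classes).
setClass("ClientSketch", representation(ip = "character", port = "numeric"))
setClass("ParsedDataSketch", representation(h2o = "ClientSketch", key = "character"))
setClass("GLMModelSketch",
         representation(key = "character", data = "ParsedDataSketch", model = "list"))

client <- new("ClientSketch", ip = "127.0.0.1", port = 54321)

# The R object holds only the key; the data itself would stay in H2O.
prostate <- new("ParsedDataSketch", h2o = client, key = "prostate.hex")

# Placeholder numbers, not output from a real fit.
myModel <- new("GLMModelSketch", key = "glm_sketch_key", data = prostate,
               model = list(coefficients = c(Intercept = -1, PSA = 0.5)))

# Slots are read with @, list elements with $.
print(myModel@model$coefficients)
print(myModel@data@key)
```

The point of the sketch is the access pattern: the R-side object is small and acts as a handle, so passing it around does not copy the data set.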
Overview of Methods
Standard R | H2O
read.csv, read.table, etc. | h2o.importFile, h2o.importURL
summary | summary (limited to data only)
glm, glmnet | h2o.glm(y, x, data, family, nfolds, alpha, lambda)
kmeans | h2o.kmeans(data, centers, cols, iter.max)
randomForest, cforest | h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)
Demo 1: Basic GLM in H2O through R
Demo 1: Prostate Cancer Data
• Prostate cancer data set from Ohio State University
Comprehensive Cancer Center
• N = 380 patients, ages ranging from 43-79
• Goal: Predict presence of tumor from baseline exam of
patient (age, race, PSA, total Gleason score, etc.)
Prostate Cancer
Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Prostate Cancer Logistic Regression Fit
Family: Binomial, Link: Logit
Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Goal: Estimate the probability that CAPSULE = 1
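The logistic fit described above can be reproduced in plain R on made-up data. This is a sketch under stated assumptions: the data are simulated (not the Ohio State data set), the true coefficients -1 and 0.03 are arbitrary, and base R's glm stands in for h2o.glm.

```r
set.seed(1)

# Synthetic stand-in data: 380 rows, PSA drawn uniformly, outcome simulated
# from a known logistic model. This is NOT the Ohio State data set.
n <- 380
PSA <- runif(n, min = 0, max = 100)
CAPSULE <- rbinom(n, size = 1, prob = plogis(-1 + 0.03 * PSA))
toy <- data.frame(CAPSULE = CAPSULE, PSA = PSA)

# Logistic regression: binomial family, logit link (the default for binomial).
fit <- glm(CAPSULE ~ PSA, family = binomial, data = toy)

# Estimated P(CAPSULE = 1) at a few PSA values.
newdata <- data.frame(PSA = c(5, 20, 80))
prob <- predict(fit, newdata = newdata, type = "response")

# The same probabilities computed by hand from the inverse logit.
b <- coef(fit)
prob_by_hand <- plogis(b[["(Intercept)"]] + b[["PSA"]] * newdata$PSA)

print(cbind(newdata, prob = prob))
```

The by-hand line shows what the logit link means: the fitted probability is the inverse logit of the linear predictor, intercept plus slope times PSA.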
GLM Parameters
• y = response variable
• x = predictor variables (vector)
• family = binomial (default link = logit)
• data = H2OParsedData object
• nfolds = number of cross-validation folds
• lambda = weight on the penalty term
• alpha = elastic net mixing parameter
  • alpha = 0 is ridge penalty (L2 norm)
  • alpha = 1 is lasso penalty (L1 norm)
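To make the roles of lambda and alpha concrete, here is a small sketch of an elastic net penalty. It assumes the glmnet-style form lambda * (alpha * L1 + (1 - alpha)/2 * squared L2); the exact scaling used inside H2O is not stated on the slide and may differ. The function name enet_penalty is hypothetical.

```r
# Elastic net penalty in the glmnet-style form (an assumed convention):
#   lambda * ( alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2) )
enet_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  l1 <- sum(abs(beta))  # lasso part
  l2 <- sum(beta^2)     # ridge part (squared L2 norm)
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(1, -2, 0)
lasso <- enet_penalty(beta, lambda = 0.5, alpha = 1)    # pure L1
ridge <- enet_penalty(beta, lambda = 0.5, alpha = 0)    # pure L2
mixed <- enet_penalty(beta, lambda = 0.5, alpha = 0.5)  # blend of both

print(c(lasso = lasso, ridge = ridge, mixed = mixed))
```

With alpha between 0 and 1 the penalty blends the two, which is why alpha is called the mixing parameter while lambda sets the overall strength.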
Under the Hood: Hacking R for H2O
Under the Hood
[Diagram labels: Data Scientist, Analyst, etc.; REST API; Import; Parse; Data (JSON); H2O]
GLM Code Snippet
• Create an object to represent model
setClass("H2OGLMModel", representation(key="character",
data="H2OParsedData", model="list"))
(callouts: "H2OGLMModel" is the class name; key, data and model are its slots)
• Declare a new generic function for the algorithm
setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10,
alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
(callouts: nfolds, alpha and lambda are parameters; 10, 0.5 and 1.0e-5 are their initial values)
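The class and generic declarations above follow the standard S4 pattern, which can be exercised on its own. The sketch below uses hypothetical names (ToyData, toyFit) and returns a plain list instead of talking to H2O; it only illustrates how a generic with default arguments dispatches on the classes of x, y and data.

```r
library(methods)

# Toy class standing in for a parsed data handle (hypothetical name).
setClass("ToyData", representation(key = "character"))

# Generic with default values, in the same shape as the slide's h2o.glm.
setGeneric("toyFit", function(x, y, data, nfolds = 10, alpha = 0.5) {
  standardGeneric("toyFit")
})

# Method chosen when x and y are character and data is a ToyData object.
setMethod("toyFit",
          signature(x = "character", y = "character", data = "ToyData"),
          function(x, y, data, nfolds = 10, alpha = 0.5) {
            list(key = data@key, y = y, x = paste(x, collapse = ","),
                 nfolds = nfolds, alpha = alpha)
          })

toy <- new("ToyData", key = "prostate.hex")
res <- toyFit(x = c("AGE", "PSA"), y = "CAPSULE", data = toy)
str(res)

# A numeric x has no matching method, so dispatch fails with an error.
no_method <- tryCatch(toyFit(x = 1, y = "CAPSULE", data = toy),
                      error = function(e) "no applicable method")
print(no_method)
```

Dispatching on the data argument's class is what lets the same function name behave differently for an H2O handle than for an ordinary data frame.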
GLM Code Snippet
setMethod("h2o.glm", signature(x="character", y="character",
data="H2OParsedData", …), function(x, y, data, …) {
• Send parameters to GLM.json page → GLM job started
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key,
y = y, x = paste(x, sep="", collapse=","), …)
• Keep polling and wait until job completed
while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) {
Sys.sleep(1) }
• Query Inspect.json page with GLM model key to get results
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT,
key=res$destination_key)
http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
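The send, poll, inspect sequence above can be sketched without a server. Everything here is an assumption made for illustration: build_url and make_mock_poll are hypothetical helpers, the query-string layout is a guess at how named parameters might be encoded, and the mock job simply reports -1 on its third poll to match the `!= -1` test in the slide's loop.

```r
# Build a request URL from named parameters (assumed query-string layout).
build_url <- function(ip, port, page, ...) {
  args <- list(...)
  query <- paste(names(args), vapply(args, as.character, ""),
                 sep = "=", collapse = "&")
  paste0("http://", ip, ":", port, "/", page, "?", query)
}

x <- c("AGE", "PSA")
url <- build_url("127.0.0.1", 54321, "GLM.json",
                 key = "prostate.hex", y = "CAPSULE",
                 x = paste(x, collapse = ","))  # predictors joined by commas
print(url)

# Mock job: returns a progress fraction until the job is done, then -1.
make_mock_poll <- function(polls_until_done) {
  calls <- 0
  list(
    poll = function() {
      calls <<- calls + 1
      if (calls >= polls_until_done) -1 else calls / polls_until_done
    },
    count = function() calls
  )
}

job <- make_mock_poll(3)
while (job$poll() != -1) {
  Sys.sleep(0.01)  # the slide sleeps 1 second between polls
}
print(job$count())
```

Polling with a sleep keeps the R session simple: the client blocks until the job finishes, then issues one more request to fetch the results.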
Demo 2: Data Munging and Remote H2O
Demo 2: Airlines Data
• Airlines data set 1987-2013 from RITA (25%)
• Goal: Predict if flight’s arrival will be delayed
• Examine slices of data directly
head(airlines.hex, n = 10); tail(airlines.hex)
summary(airlines.hex$DepTime)
• Take a subset of data to play with in R
airlines.small = as.data.frame(airlines.hex[1:1000,])
glm(IsArrDelayed ~ Dest + Origin, family = binomial, data =
airlines.small)
http://www.transtats.bts.gov/Fields.asp?Table_ID=236
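The local half of this workflow (inspect, subset, fit in R) can be tried on an ordinary data frame. The sketch below is an illustration only: airlines.local is a synthetic stand-in with invented values and just four columns, not the RITA data, and the delay flag is random, so the fitted coefficients carry no meaning.

```r
set.seed(42)

# Synthetic stand-in for airlines.hex (invented values, a few columns only).
n <- 5000
airports <- c("SFO", "ORD", "JFK", "ATL")
airlines.local <- data.frame(
  Origin = factor(sample(airports, n, replace = TRUE)),
  Dest = factor(sample(airports, n, replace = TRUE)),
  DepTime = sample(0:23, n, replace = TRUE) * 100 + sample(0:59, n, replace = TRUE),
  IsArrDelayed = rbinom(n, size = 1, prob = 0.4)
)

# Examine slices of the data.
print(head(airlines.local, n = 10))
print(tail(airlines.local))
print(summary(airlines.local$DepTime))

# Take the first 1000 rows and fit a logistic regression locally.
airlines.small <- airlines.local[1:1000, ]
fit <- glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
print(coef(fit))
```

Fitting on a small slice is a quick way to check the formula and factor handling before running the full model on the cluster.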
Connecting to H2O Remotely
• Your slip of paper contains IP/port of your assigned cluster
• Point R to remote H2O client
remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
• All data operations occur on cluster
h2o.importFile(remoteH2O, path =
"Path/On/Remote/Server/To/Data", …)
• Objects/methods operate just like before!
Roadmap
• Long-term Goal: Full H2O/R Integration
• Subset col by name/index: df[,c(1,2)]; df[,"name"]
• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
• Filter rows: df[df$cName < 5,]
• Combine data frames by row/col: rbind, cbind
• Apply functions: tapply, sapply, lapply
• Support for R libraries (plyr, ggplot2, etc)
• More Algorithms: GBM, PCA, Neural Networks
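For reference, this is how the data frame operations named in the roadmap behave on an ordinary in-memory data frame in base R. The data frame df and its columns are made up for the example; the sketch shows the target semantics only and says nothing about how H2O would implement them.

```r
df <- data.frame(id = 1:6,
                 cName = c(2, 7, 4, 9, 1, 5),
                 grp = c("a", "b", "a", "b", "a", "b"))

# Subset columns by index or by name.
by_index <- df[, c(1, 2)]
by_name <- df[, "cName"]

# Remove columns with negative indices; add a derived column by index.
dropped <- df[, -c(1, 2), drop = FALSE]
df[, 4] <- df[, 2] + 1  # the new column is named V4 automatically

# Filter rows.
small <- df[df$cName < 5, ]

# Combine data frames by row and by column.
stacked <- rbind(small, small)
widened <- cbind(df, flag = df$cName < 5)

# Apply functions by group and by column.
grp_means <- tapply(df$cName, df$grp, mean)
col_max <- sapply(df[, c("id", "cName")], max)

print(small)
print(grp_means)
print(col_max)
```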
Questions and
Suggestions?

Editor's Notes

  • #7 http://docs.0xdata.com/quickstart/quickstart_R.html Packages → Install package(s) → Select CRAN mirror (US CA1) → Search for RCurl, rjson and bitops
  • #9 Pull up R and demo this in the console, making sure everyone can follow along
  • #10 H2OParsedData: Each data set/calculation is associated with a unique hex key; the object acts like a “pointer”. Model: coefficients, deviance, aic, df.residual, etc.
  • #17 As the penalty factor increases, lasso gives sparser results (more coefficients exactly zero), while ridge shrinks all coefficients (though not necessarily to zero)