Deep Learning with H2O's R Package
Arno Candel
Viraj Parmar
October 2014
Contents

1 Introduction
  1.1 Installation
  1.2 Support
  1.3 Deep learning overview
2 H2O's Deep Learning architecture
  2.1 Summary of features
  2.2 Training protocol
    2.2.1 Initialization
    2.2.2 Activation and loss functions
    2.2.3 Parallel distributed network training
    2.2.4 Specifying the number of training samples per iteration
  2.3 Regularization
  2.4 Advanced optimization
    2.4.1 Momentum training
    2.4.2 Rate annealing
    2.4.3 Adaptive learning
  2.5 Loading data
    2.5.1 Standardization
  2.6 Additional parameters
3 Use case: MNIST digit classification
  3.1 Data ingest
  3.2 Performing a trial run
  3.3 Web interface
    3.3.1 Variable importances
    3.3.2 Java model
  3.4 Grid search for model comparison
  3.5 Checkpoint model
  3.6 Achieving world-record performance
4 Deep autoencoders
  4.1 Nonlinear dimensionality reduction
  4.2 Use case: anomaly detection
5 Appendix A: Complete parameter list
6 Appendix B: References

1 Introduction
Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning has cracked the code for training stability and generalization and scales on big data. It is the algorithm of choice for the highest predictive accuracy. H2O is the world's fastest open-source in-memory platform for machine learning and predictive analytics on big data.
This documentation presents the Deep Learning framework in H2O, as experienced through the H2O R interface. Further documentation on H2O's system and algorithms can be found at the H2O.ai website at http://docs.h2o.ai (especially the R User documentation), and fully featured tutorials are available at http://learn.h2o.ai. The datasets, R code and instructions for this document can be found in the H2O GitHub repository at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
This introductory section provides instructions on getting H2O started from R, followed by a brief overview of deep learning.
1.1 Installation
To install H2O, follow the Download link on H2O's website at http://h2o.ai/. For multi-node operation, download the H2O zip file and deploy H2O on your cluster, following the instructions in the Full Documentation. For single-node operation, follow the instructions in the Install in R tab. Open your R console and run the following to install and start H2O directly from R:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, download, install and initialize the H2O package for R (replacing the ****
# with the latest version number obtained from the H2O download page).
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/master/****/R", getOption("repos"))))
library(h2o)
Initialize H2O with
h2o_server = h2o.init()
With this command, the H2O R module will start an instance of H2O automatically at localhost:54321. Alternatively, to connect to an existing H2O cluster node (other than localhost at port 54321), you must explicitly state the IP address and port number in the h2o.init() call. An example is given below, but do not paste it directly; specify the IP address and port number appropriate to your environment.
h2o_cluster = h2o.init(ip = "192.555.1.123", port = 12345, startH2O = FALSE)
An automatic demo is available to see h2o.deeplearning at work. Run the following command to observe an example binary classification model built through H2O's Deep Learning.
demo(h2o.deeplearning)
1.2 Support
Users of the H2O package may submit general enquiries and bug reports to the H2O.ai support
address. Alternatively, specific bugs or issues may be filed to the H2O.ai JIRA.
1.3 Deep learning overview
First we present a brief overview of deep neural networks for supervised learning tasks. There
are several theoretical frameworks for deep learning, and here we summarize the feedforward
architecture used by H2O.
The basic unit in the model is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination α = Σ_{i=1}^{n} w_i x_i + b of input signals is aggregated, and then an output signal f(α) is transmitted by the connected neuron. The function f represents the nonlinear activation function used throughout the network, and the bias b accounts for the neuron's activation threshold.
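As a minimal illustration of this computation (plain R, not part of H2O's API; the numbers are made up), a single tanh neuron can be evaluated as follows:

# A single neuron: weighted combination of the inputs plus a bias term,
# passed through a nonlinear activation function (here tanh).
neuron_output <- function(x, w, b, f = tanh) {
  alpha <- sum(w * x) + b   # alpha = sum_i w_i * x_i + b
  f(alpha)                  # output signal f(alpha)
}

# Example with arbitrary values
neuron_output(x = c(0.5, -1.2, 0.3), w = c(0.4, 0.1, -0.7), b = 0.2)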
Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units:
beginning with an input layer to match the feature space; followed by multiple layers of nonlinearity; and terminating with a linear regression or classification layer to match the output
space. The inputs and outputs of the model's units follow the basic logic of the single neuron
described above. Bias units are included in each non-output layer of the network. The weights
linking neurons and biases with other neurons fully determine the output of the entire network,
and learning occurs when these weights are adapted to minimize the error on labeled training
data. More specifically, for each training example j the objective is to minimize a loss function
L(W, B | j).
Here W is the collection {W_i}_{1:N-1}, where W_i denotes the weight matrix connecting layers i and i + 1 for a network of N layers; similarly B is the collection {b_i}_{1:N-1}, where b_i denotes the column vector of biases for layer i + 1.
This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically
involving multiple levels of nonlinearity. Such models are able to learn useful representations of
raw data, and have exhibited high performance on complex data such as images, speech, and
text (Bengio, 2009).
2 H2O's Deep Learning architecture

As described above, H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.
2.1 Summary of features
2.2 Training protocol
The training protocol described below follows many of the ideas and advances in the recent deep
learning literature.
2.2.1 Initialization

2.2.2 Activation and loss functions

In the introduction we introduced the nonlinear activation function f, for which the choices are summarized in Table 1. Note here that x_i and w_i denote the firing neuron's input values and their weights, respectively; α denotes the weighted combination α = Σ_i w_i x_i + b.
Function             Formula                                                  Range
Tanh                 f(α) = (e^α − e^(−α)) / (e^α + e^(−α))                   f(·) ∈ [−1, 1]
Rectified Linear     f(α) = max(0, α)                                         f(·) ∈ ℝ₊
Maxout               f(α) = max(w_i x_i + b), rescaled so that max f(·) ≤ 1   f(·) ∈ (−∞, 1]

Table 1: Activation functions
The tanh function is a rescaled and shifted logistic function and its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has
demonstrated high performance on image recognition tasks, and is a more biologically accurate
model of neuron activations (LeCun et al, 1998). Maxout activation works particularly well with
dropout, a regularization method discussed later in this vignette (Goodfellow et al, 2013). It
is difficult to determine a best activation function to use; each may outperform the others in
separate scenarios, but grid search models (also described later) can help to compare activation
functions and other parameters. The default activation function is the Rectifier. Each of these
activation functions can be operated with dropout regularization (see below).
The following choices for the loss function L(W, B | j) are summarized in Table 2. The system default enforces the table's typical-use rule based on whether regression or classification is being performed. Note here that t^(j) and o^(j) are the predicted (target) output and actual output, respectively, for training example j; further, let y denote the output units and O the output layer.

Function              Formula                                                                              Typical use
Mean Squared Error    L(W, B | j) = (1/2) ‖t^(j) − o^(j)‖₂²                                                Regression
Cross Entropy         L(W, B | j) = −Σ_{y∈O} [ ln(o_y^(j)) · t_y^(j) + ln(1 − o_y^(j)) · (1 − t_y^(j)) ]   Classification

Table 2: Loss functions

2.2.3 Parallel distributed network training
The procedure to minimize the loss function L(W, B | j) is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient ∇L(W, B | j) computed via backpropagation (LeCun et al, 1998). The constant α indicates the learning rate, which controls the step sizes during gradient descent.
Standard stochastic gradient descent

    Initialize W, B
    Iterate until convergence criterion reached:
        Get training example j
        Update all weights w_jk ∈ W, biases b_jk ∈ B:
            w_jk := w_jk − α ∂L(W, B | j)/∂w_jk
            b_jk := b_jk − α ∂L(W, B | j)/∂b_jk
Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme from Niu et al, 2011. Hogwild! follows a shared-memory model in which multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates ∇L(W, B | j) asynchronously. In a multi-node system this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters W, B are obtained by averaging. A rough summary is given below.
Parallel distributed and multi-threaded training with SGD in H2O Deep Learning

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters W_n, B_n after seeing example i. The Avg_n notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.
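The boxed procedure itself did not survive in this copy, so the following R-style sketch only illustrates the idea described above (node-local SGD passes followed by parameter averaging). It is a simplification, not H2O's actual implementation, and all names in it are hypothetical:

# Sketch of one distributed training iteration: each node updates a copy of the
# global parameters on its local data shard (asynchronously via Hogwild! in the
# real system), and the per-node results are averaged (the Avg_n step).
train_iteration <- function(global_params, shards, sgd_update) {
  local_params <- lapply(shards, function(shard) {
    params <- global_params               # start from the current global model
    for (j in sample(nrow(shard))) {      # visit local training examples
      params <- sgd_update(params, shard[j, ])
    }
    params
  })
  # Avg_n: average the node-local parameter vectors to form the global model
  Reduce(`+`, local_params) / length(local_params)
}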
2.2.4 Specifying the number of training samples per iteration
H2O Deep Learning is scalable and can take advantage of a large cluster of compute nodes. There are three modes in which to operate. The default behavior is to let every node train on the entire (replicated) dataset, but to automatically shuffle (and/or subsample) the training examples locally for each iteration. For datasets that don't fit into each node's memory (also depending on the heap memory specified by the -Xmx option), it might not be possible to replicate the data, and each compute node can be instructed to train only with local data. An experimental single-node mode is available for the case where slow final convergence is observed due to the presence of too many nodes, but we've never seen this become necessary.
The number of training examples (globally) presented to the distributed SGD worker nodes between model averaging is controlled by the important parameter train_samples_per_iteration. One special value is -1, which results in all nodes processing all their local training data per iteration. Note that if replicate_training_data is enabled (true by default), this will result in training N epochs per iteration on N nodes; otherwise, one epoch will be trained per iteration. Another special value is 0, which always results in one epoch per iteration, independent of the number of compute nodes. In general, any user-given positive number is permissible for this parameter. For large datasets, it might make sense to specify a fraction of the dataset. For example, if the training data contains 10 million rows, and we specify train_samples_per_iteration as 100,000 when running on 4 nodes, then each node will process 25,000 examples per iteration, and it will take 40 such distributed iterations to process one epoch. If the value is set too high, it might take too long between synchronizations and model convergence can be slow. If the value is set too low, network communication overhead will dominate the runtime and computational performance will suffer. The special value of -2 (the default) enables auto-tuning of this parameter, based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. Note that this parameter can affect the convergence rate during training.
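To make the worked example above concrete, the parameter can be set explicitly in the R call. The frame name, layer sizes, and epoch count below are hypothetical; only train_samples_per_iteration reflects the example in the text:

# Process 100,000 globally sampled rows between model-averaging steps
# (about 1/100 of a 10-million-row training frame per distributed iteration).
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    hidden = c(200, 200), epochs = 10,
    train_samples_per_iteration = 100000)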
2.3 Regularization
2.4 Advanced optimization
H2O features both manual and automatic versions of advanced optimization. The manual mode includes momentum training and rate annealing, while the automatic mode features an adaptive learning rate.
2.4.1 Momentum training
Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector v is defined to modify the updates as follows, with θ representing the parameters W, B; μ representing the momentum coefficient; and α denoting the learning rate:

v_{t+1} = μ v_t − α ∇L(θ_t)
θ_{t+1} = θ_t + v_{t+1}
Using the momentum parameter can aid in avoiding local minima and the associated instability
(Sutskever et al, 2014). Too much momentum can lead to instabilities, which is why the momentum is best ramped up slowly.
A recommended improvement when using momentum updates is the Nesterov accelerated gradient method, under which the updates are further modified such that

v_{t+1} = μ v_t − α ∇L(θ_t + μ v_t)
θ_{t+1} = θ_t + v_{t+1}
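For reference, here is a hedged sketch of how the manual (non-adaptive) mode described in this section might be configured from R; the parameter names are those listed in Appendix A, while the frame name and values are illustrative only:

# Disable the adaptive learning rate; ramp momentum from 0.5 to 0.99 over
# 1e6 training samples and use the Nesterov accelerated gradient.
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-6,
    momentum_start = 0.5, momentum_ramp = 1e6, momentum_stable = 0.99,
    nesterov_accelerated_gradient = TRUE)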
2.4.2 Rate annealing
Throughout training, as the model approaches a minimum, the chance of oscillation or of skipping over the optimum creates the need for a slower learning rate. Instead of specifying a constant learning rate α, learning rate annealing gradually reduces the learning rate α_t to "freeze" into local minima in the optimization landscape (Zeiler, 2012).

For H2O, the annealing rate is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 10^−6 means that it takes 10^6 training samples to halve the learning rate).
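One common formulation consistent with this description (an illustration, not necessarily the exact expression used internally) is

α_t = α / (1 + rate_annealing · t),

where t is the number of training samples processed; with rate_annealing = 10^−6, the learning rate is halved after t = 10^6 samples.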
2.4.3 Adaptive learning
The implemented adaptive learning rate algorithm, ADADELTA (Zeiler, 2012), automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specification of only two parameters, ρ and ε, simplifies hyperparameter search. In some cases, a manually controlled (non-adaptive) learning rate and momentum specification can lead to better results, but requires a hyperparameter search over up to 7 parameters. If the model is built on a topology with many local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. In general, however, we find the adaptive learning rate to produce the best results, and this option is kept as the default.
The first of the two hyperparameters for adaptive learning is ρ. It is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. The second hyperparameter, ε, is similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Typical values are between 10^−10 and 10^−4.
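A hedged sketch of configuring the adaptive mode from R (the parameter names adaptive_rate, rho, and epsilon appear in Appendix A; the frame name and values here are illustrative only):

# Adaptive learning rate (ADADELTA): only rho and epsilon need to be chosen.
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    adaptive_rate = TRUE, rho = 0.99, epsilon = 1e-8)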
2.5 Loading data
Loading a dataset in R for use with H2O is slightly different from the usual methodology, as we
must convert our datasets into H2OParsedData objects. For an example, we use a toy weather
dataset included in the H2O GitHub repository for the H2O Deep Learning documentation at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
First load the data to your current working directory in your R Console (do this henceforth for
dataset downloads), and then run the following command.
weather.hex = h2o.uploadFile(h2o_server, path = "weather.csv", header = TRUE,
    sep = ",", key = "weather.hex")
To see a quick summary of the data, run the following command.
summary(weather.hex)
2.5.1 Standardization
Along with categorical encoding, H2O preprocesses the data to be standardized for compatibility with the activation functions. Recall Table 1's summary of each activation function's target space. Since the activation function does not generally map into the full spectrum of real numbers ℝ, we first standardize our data to be drawn from N(0, 1). Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space.
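As a minimal illustration of the idea (H2O performs this internally; the snippet is not part of its API):

# Standardize a numeric vector so that it is (approximately) drawn from N(0, 1).
standardize <- function(x) (x - mean(x)) / sd(x)
standardize(c(10, 12, 15, 11))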
2.6 Additional parameters
This section has reviewed some background on the various parameter configurations in H2O's Deep Learning architecture. H2O Deep Learning models may seem daunting, since there are dozens of possible parameter arguments when creating models. However, most parameters do not need to be tuned or experimented with; the default settings are safe and recommended. Those parameters for which experimentation is possible and perhaps necessary have mostly been discussed here, but there are a couple more which deserve mention.
There are no defaults for the hidden layer sizes and number of layers, or for epochs. Practice building deep learning models with different network topologies and different datasets will lead to intuition for these parameters, but two general rules of thumb apply. First, choose larger network sizes, as they can perform higher-level feature extraction, and techniques like dropout may train only subsets of the network at once. Second, use more epochs for greater predictive accuracy, but only when you can afford the computational cost. Many example tests can be found in the H2O GitHub repository for pointers on specific values and results for these (and other) parameters.
For a full list of H2O Deep Learning model parameters and default values, see Appendix A.
3 Use case: MNIST digit classification

3.1 Data ingest
The MNIST database is a famous academic dataset used to benchmark classification performance. The data consists of 60,000 training images and 10,000 test images, each a standardized 28x28-pixel greyscale image of a single handwritten digit. You can download the datasets from the H2O GitHub repository for the H2O Deep Learning documentation at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
Remember to save these .csv files to your working directory. Following the weather data example, we begin by loading these datasets into R as H2OParsedData objects.
train_images.hex = h2o.uploadFile(h2o_server, path = "mnist_train.csv",
    header = FALSE, sep = ",", key = "train_images.hex")
test_images.hex = h2o.uploadFile(h2o_server, path = "mnist_test.csv",
    header = FALSE, sep = ",", key = "test_images.hex")
3.2 Performing a trial run
The trial run below is illustrative of the relative simplicity that underlies most H2O Deep Learning model parameter configurations, thanks to the defaults. We use the first 28^2 = 784 values of each row to represent the full image, and the final value to denote the digit class. As mentioned before, Rectified Linear activation is popular for image processing and has performed well on the MNIST database previously, and dropout has been known to enhance performance on this dataset as well, so we train our model accordingly; a sketch of such a call is shown below.
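The training command itself is missing from this copy, so the call below is a hedged reconstruction: the hidden layer sizes, dropout ratio, and epoch count are assumptions chosen to match the description above, not values taken from the original text.

# Train a Rectifier-with-dropout network on the MNIST data; columns 1-784 hold
# the pixel values and column 785 holds the digit class.
mnist_model = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    activation = "RectifierWithDropout", hidden = c(1024, 1024, 2048),
    input_dropout_ratio = 0.2, validation = test_images.hex, epochs = 10)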
We can extract the parameters of our model, examine the scoring process, and make predictions
on new data.
#View the specified parameters of your deep learning model
mnist_model@model$params
#Examine the performance of the trained model
mnist_model
The latter command returns the trained model's training and validation errors. The training error value is based on the parameter score_training_samples, which specifies the number of randomly sampled training points to be used for scoring; the default uses 10,000 points. The validation error is based on the parameter score_validation_samples, which controls the same value on the validation set and is set by default to be the entire validation set. In general, choosing more sampled points leads to a better idea of the model's performance on your dataset; setting either of these parameters to 0 automatically uses the entire corresponding dataset for scoring. Either way, you can control the minimum and maximum time spent on scoring with the score_interval and score_duty_cycle parameters.
These scoring parameters also affect the final model when the parameter override_with_best_model is turned on. This override sets the final model after training to be the model which achieved the lowest validation error during training, based on the sampled points used for scoring. Since the validation set is automatically set to be the training data if no other dataset is specified, either the score_training_samples or the score_validation_samples parameter will control the error computation during training and, in turn, the chosen best model; an illustrative configuration is sketched below.
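As a hedged illustration of these scoring controls (the argument values are arbitrary examples, not recommendations from the original text):

# Score on 5,000 sampled training rows and the full validation set, at most
# every 10 seconds and spending at most 5% of wall-clock time on scoring;
# keep the best model seen (by validation error) as the final model.
mnist_model_scored = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    validation = test_images.hex, score_training_samples = 5000,
    score_validation_samples = 0, score_interval = 10, score_duty_cycle = 0.05,
    override_with_best_model = TRUE)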
Once we have a satisfactory model, the h2o.predict() command can be used to compute
and store predictions on new data, which can then be used for further tasks in the interactive
data science process.
#Perform classification on the test set
prediction = h2o.predict(mnist_model, newdata=test_images.hex)
#Copy predictions from H2O to R
pred = as.data.frame(prediction)
3.3 Web interface
H2O R users have access to a slick web interface to mirror the model building process in R.
After loading data or training a model in R, point your browser to your IP address and port
number (e.g., localhost:12345) to launch the web interface. From here you can click on Admin
> Jobs to view your specific model details. You can also click on Data > View All to view
and keep track of your datasets in current use.
3.3.1 Variable importances
One useful feature is the variable importances option, which can be enabled with the additional argument variable_importances=TRUE. This feature allows us to view the absolute and relative predictive strength of each feature in the prediction task. From R, you can access these strengths with the command mnist_model@model$varimp. You can also view a visualization of the variable importances in the web interface.
3.3.2 Java model
Another important feature of the web interface is the Java (POJO) model, accessible from the
Java model button in the top right of a model summary page. This button allows access
to Java code which, when called from a main method in a Java program, builds the model.
Instructions for downloading and running this Java code are available from the web interface,
and example production scoring code is available as well.
3.4 Grid search for model comparison
H2O supports grid search capabilities for model tuning by allowing users to tweak certain parameters and observe changes in model behavior. This is done by specifying sets of values for parameter arguments. Below is an example of a grid search:
#Create a set of network topologies
hidden_layers = list(c(200,200), c(100,300,100), c(500,500,500))
mnist_model_grid = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    activation = "RectifierWithDropout", hidden = hidden_layers,
    validation = test_images.hex, epochs = 1, l1 = c(1e-5,1e-7),
    input_dropout_ratio = 0.2)
Here we specified three different network topologies and two different ℓ1-norm weights. This
grid search model effectively trains six different models, over the possible combinations of these
parameters. Of course, sets of other parameters can be specified for a larger space of models.
This allows for more subtle insights in the model tuning and selection process, as we inspect and
compare our trained models after the grid search process is complete. To decide how and when
to choose different parameter configurations in a grid search, see Appendix A for parameter
descriptions and possible values.
#print out all prediction errors and run times of the models
mnist_model_grid
mnist_model_grid@model
#print out a *short* summary of each of the models (indexed by parameter)
mnist_model_grid@sumtable
#print out *full* summary of each of the models
all_params = lapply(mnist_model_grid@model, function(x) { x@model$params })
all_params
#access a particular parameter across all models
l1_params = lapply(mnist_model_grid@model, function(x) { x@model$params$l1 })
l1_params
3.5 Checkpoint model
Checkpoint model keys can be used to start off where you left off, if you feel that you want to
further train a particular model with more iterations, more data, different data, and so forth. If
we felt that our initial model should be trained further, we can use it (or its key) as a checkpoint
argument in a new model. In the command below, mnist model grid@model[[1]] indicates
the highest performance model from the grid search that we wish to train further. Note that
the training and validation datasets and the response column etc. have to match for checkpoint
restarts.
mnist_checkpoint_model = h2o.deeplearning(x=1:784, y=785, data=train_images.hex,
checkpoint=mnist_model_grid@model[[1]], validation = test_images.hex, epochs=9)
Checkpoint models are also applicable for the case when we wish to reload existing models that
were saved to disk in a previous session. For example, we can save and later load the best model
from the grid search by running the following commands.
#Specify a model and the file path where it is to be saved
h2o.saveModel(object = mnist_model_grid@model[[1]], name = "/tmp/mymodel", force
= TRUE)
#Alternatively, save the model key in some directory (here we use /tmp)
#h2o.saveModel(object = mnist_model_grid@model[[1]], dir = "/tmp", force = TRUE)
Later (e.g., after restarting H2O) we can load the saved model by indicating the host and saved
model file path. This assumes the saved model was saved with a compatible H2O version (no
changes to the H2O model implementation).
best_mnist_grid.load = h2o.loadModel(h2o_server, "/tmp/mymodel")
#Continue training the loaded model
best_mnist_grid.continue = h2o.deeplearning(x=1:784, y=785, data=train_images.hex,
checkpoint=best_mnist_grid.load, validation = test_images.hex, epochs=1)
Additionally, you can also use the command
model = h2o.getModel(h2o_server, key)
to retrieve a model from its H2O key. This command is useful, for example, if you have created
an H2O model using the web interface and wish to proceed with the modeling process in R.
3.6 Achieving world-record performance
Without distortions, convolutions, or other advanced image processing techniques, the best-ever published test set error for the MNIST dataset is 0.83%, by Microsoft. After training for 2,000 epochs (about 4 hours) on 4 compute nodes, we obtain a 0.87% test set error; after training for 7,600 epochs (about 9 hours) on 10 nodes, we obtain a 0.83% test set error, which matches the current world record, notably achieved using a distributed configuration and with a simple one-liner from R. Details can be found in our hands-on tutorial. Test set errors of around 1% are typically achieved within 1 hour when running on 1 node. The parallel scalability of H2O for the MNIST dataset on 1 to 63 compute nodes is shown in the figure below.
[Figure: parallel scalability of H2O Deep Learning for MNIST on 1 to 63 compute nodes.]
4 Deep autoencoders

4.1 Nonlinear dimensionality reduction
So far we have discussed purely supervised deep learning tasks. However, deep learning can also be used for unsupervised feature learning or, more specifically, nonlinear dimensionality reduction (Hinton et al, 2006). Consider a three-layer neural network with one hidden layer. If we treat our input data as labeled with the same input values, then the network is forced to learn the identity via a nonlinear, reduced representation of the original data. Such an algorithm is called a deep autoencoder; these models have been used extensively for unsupervised, layer-wise pretraining of supervised deep learning tasks, but here we consider the autoencoder's application to discovering anomalies in data.
4.2 Use case: anomaly detection
Consider the deep autoencoder model described above. Given enough training data resembling
some underlying pattern, the network will train itself to easily learn the identity when confronted
with that pattern. However, if some anomalous test point not matching the learned pattern
arrives, the autoencoder will likely have a high error in reconstructing this data, which indicates
it is anomalous data.
We use this framework to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which
heartbeats are outliers. The training data (20 good heartbeats) and the test data (training data with 3 bad heartbeats appended for simplicity) can be downloaded from the H2O
GitHub repository for the H2O Deep Learning documentation at http://bit.ly/1yywZzi. Each
row represents a single heartbeat. The autoencoder is trained as follows:
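The training code did not survive extraction here, so the following is a hedged reconstruction: the file names, hidden layer sizes, epoch count, and the h2o.anomaly() call for scoring reconstruction error are assumptions and should be checked against the downloadable demo.

# Load the ECG frames (file names assumed) and train a deep autoencoder.
train_ecg.hex = h2o.uploadFile(h2o_server, path = "ecg_train.csv",
    header = FALSE, sep = ",", key = "train_ecg.hex")
test_ecg.hex = h2o.uploadFile(h2o_server, path = "ecg_test.csv",
    header = FALSE, sep = ",", key = "test_ecg.hex")

# Autoencoder: learn the identity mapping through a small nonlinear bottleneck.
anomaly_model = h2o.deeplearning(x = 1:ncol(train_ecg.hex), data = train_ecg.hex,
    autoencoder = TRUE, hidden = c(50, 20, 50), activation = "Tanh", epochs = 100)

# Per-row reconstruction error on the test set; the 3 appended "bad" heartbeats
# should stand out with the largest errors.
recon_error = as.data.frame(h2o.anomaly(test_ecg.hex, anomaly_model))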
5 Appendix A: Complete parameter list
seed: The random seed controls sampling and initialization. Reproducible results are only
expected with single-threaded operation (i.e. when running on one node, turning off load
balancing and providing a small dataset that fits in one chunk). In general, the multithreaded asynchronous updates to the model parameters will result in (intentional) race
conditions and non-reproducible results. Note that deterministic sampling and initialization might still lead to some weak sense of determinism in the model. Default is a random
real number.
adaptive_rate: The default enables this feature for adaptive learning rate. See section 2.4.3 for more details.
rho: The first of two hyperparameters for adaptive learning rate (when it is enabled). This
parameter is similar to momentum and relates to the memory of prior weight updates.
Typical values are between 0.9 and 0.999. Default value is 0.95. See section 2.4.3 for more
details.
epsilon: The second of two hyperparameters for adaptive learning rate (when it is enabled). This parameter is similar to learning rate annealing during initial training and
momentum at later stages where it allows forward progress. Typical values are between
1e-10 and 1e-4. This parameter is only active if adaptive learning rate is enabled. Default
is 1e-6. See section 2.4.3 for more details.
rate: The learning rate, α. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.
rate_annealing: Default value is 1e-6 (when adaptive learning is disabled). See section 2.4.2 for more details.

rate_decay: Default is 1.0 (when adaptive learning is disabled). The learning rate decay parameter controls the change of learning rate across layers.
momentum_start: The momentum_start parameter controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section 2.4.1 for more details.
momentum_ramp: The momentum_ramp parameter controls the amount of training for which momentum increases, assuming momentum_stable is larger than momentum_start. It applies when adaptive learning is disabled. The ramp is measured in the number of training samples. Default is 1e6. See section 2.4.1 for more details.
momentum_stable: The momentum_stable parameter controls the final momentum value reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same beyond that point. Default is 0. See section 2.4.1 for more details.
nesterov_accelerated_gradient: The default is true (when adaptive learning is disabled). See section 2.4.1 for more details.
input_dropout_ratio: The default is 0. See section 2.3 for more details.
hidden_dropout_ratio: The default is 0. See section 2.3 for more details.

l1: The default is 0. See section 2.3 for more details.

l2: The default is 0. See section 2.3 for more details.
max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default leaves this maximum unbounded.
initial_weight_distribution: The distribution from which initial weights are to be drawn. The default is the uniform adaptive option. Other options are Uniform and Normal distributions. See section 2.2.1 for more details.
initial_weight_scale: The scale of the distribution function for the Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. The default is 1.0. See section 2.2.1 for more details.
loss: The default is automatic based on the particular learning problem. See section 2.2.2
for more details.
score_interval: The minimum time (in seconds) to elapse between model scoring. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. Default is 5.
score_training_samples: The number of training dataset points to be used for scoring. Will be randomly sampled. Use 0 to select the entire training dataset. Default is 10000.
score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is set and score_validation_sampling is set to stratify). Use 0 to select the entire validation dataset (this is also the default).
score_duty_cycle: Maximum fraction of wall clock time spent on model scoring on training and validation samples, and on diagnostics such as computation of feature importances (i.e., not on training). Default is 0.1.
classification_stop: The stopping criterion in terms of classification error (1 - accuracy) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 0.
regression_stop: The stopping criterion in terms of regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 1e-6.
quiet_mode: Enable quiet mode for less output to standard output. Default is false.
max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix to be printed. This option is meant to avoid printing extremely large confusion matrices. Default is 20.

max_hit_ratio_k: The maximum number (top K) of predictions to use for hit-ratio computation (for multi-class only, 0 to disable). Default is 10.

balance_classes: For imbalanced data, balance training data class counts via over/under-sampling. This can result in improved predictive accuracy. Default is false.
class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies when balance_classes is enabled. If not specified, the ratios will be automatically computed to obtain class balance during training.
max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.0.
score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.
diagnostics: Gather diagnostics for hidden layers, such as mean and RMS values of
learning rate, momentum, weights and biases. Default is true.
variable_importances: Whether to compute variable importances for input features. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false.
fast_mode: Enable fast mode (minor approximation in back-propagation); should not affect results significantly. Default is true.

ignore_const_cols: Ignore constant training columns (no information can be gained anyway). Default is true.
force_load_balance: Increase training speed on small datasets by splitting the data into many chunks to allow utilization of all cores. Default is true.

replicate_training_data: Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.
single_node_mode: Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.
shuffle_training_data: Enable shuffling of training data (on each node). This option is recommended if the training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger); otherwise it is disabled by default.
max_categorical_features: Maximum number of categorical features, enforced via hashing (experimental).

reproducible: Force reproducibility on small data (will be slow; only uses 1 thread).
6 Appendix B: References
H2O website
H2O documentation
H2O GitHub repository
H2O support
H2O JIRA
Bengio, 2009
LeCun et al, 1998
Goodfellow et al, 2013
Niu et al, 2011
Hinton et al., 2012
Sutskever et al, 2014
Zeiler, 2012
H2O GitHub repository for the H2O Deep Learning documentation
MNIST database
Hinton et al, 2006