Deep Learning with H2O's R Package
Arno Candel
Viraj Parmar
October 2014
Contents

1 Introduction
  1.1 Installation
  1.2 Support
  1.3 Deep learning overview
2 H2O's Deep Learning architecture
  2.1 Summary of features
  2.2 Training protocol
    2.2.1 Initialization
    2.2.2 Activation and loss functions
    2.2.3 Parallel distributed network training
    2.2.4 Specifying the number of training samples per iteration
  2.3 Regularization
  2.4 Advanced optimization
    2.4.1 Momentum training
    2.4.2 Rate annealing
    2.4.3 Adaptive learning
  2.5 Loading data
    2.5.1 Standardization
  2.6 Additional parameters
3 Use case: MNIST digit classification
  3.1 Data ingest
  3.2 Performing a trial run
  3.3 Web interface
    3.3.1 Variable importances
    3.3.2 Java model
  3.4 Grid search for model comparison
  3.5 Checkpoint model
  3.6 Achieving world-record performance
4 Deep autoencoders
  4.1 Nonlinear dimensionality reduction
  4.2 Use case: anomaly detection
5 Appendix A: Complete parameter list
6 Appendix B: References

1 Introduction
Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning has cracked the code for training stability and generalization and scales on big data. It is the algorithm of choice for the highest predictive accuracy. H2O is the world's fastest open-source in-memory platform for machine learning and predictive analytics on big data.
This documentation presents the Deep Learning framework in H2O, as experienced through the H2O R interface. Further documentation on H2O's system and algorithms can be found at the H2O.ai website at http://docs.h2o.ai (especially the R User documentation), and fully featured tutorials are available at http://learn.h2o.ai. The datasets, R code and instructions for this document can be found in the H2O GitHub repository at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
This introductory section provides instructions on getting H2O started from R, followed by a brief overview of deep learning.
1.1 Installation
To install H2O, follow the Download link on H2O's website at http://h2o.ai/. For multi-node operation, download the H2O zip file and deploy H2O on your cluster, following the instructions in the Full Documentation. For single-node operation, follow the instructions in the Install in R tab. Open your R console and run the following to install and start H2O directly from R:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, download, install and initialize the H2O package for R (replacing the ****
# with the latest version number obtained from the H2O download page).
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/master/****/R", getOption("repos"))))
library(h2o)
Initialize H2O with
h2o_server = h2o.init()
With this command, the H2O R module will start an instance of H2O automatically at localhost:54321. Alternatively, to connect to an existing H2O cluster node (other than localhost at port 54321), you must explicitly state the IP address and port number in the h2o.init() call. An example is given below, but do not paste it directly; specify the IP address and port number appropriate to your environment.
h2o_cluster = h2o.init(ip = "192.555.1.123", port = 12345, startH2O = FALSE)
An automatic demo is available to see h2o.deeplearning at work. Run the following command to observe an example binary classification model built through H2O's Deep Learning.
demo(h2o.deeplearning)
1.2 Support
Users of the H2O package may submit general enquiries and bug reports to the H2O.ai support
address. Alternatively, specific bugs or issues may be filed to the H2O.ai JIRA.
1.3 Deep learning overview
First we present a brief overview of deep neural networks for supervised learning tasks. There
are several theoretical frameworks for deep learning, and here we summarize the feedforward
architecture used by H2O.
The basic unit in the model is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination α = Σ_{i=1}^{n} w_i x_i + b of input signals is aggregated, and then an output signal f(α) is transmitted by the connected neuron. The function f represents the nonlinear activation function used throughout the network, and the bias b accounts for the neuron's activation threshold.
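As a minimal illustration of this computation (plain R, not part of H2O's API; the numbers are made up), a single tanh neuron can be evaluated as follows:

# A single neuron: weighted combination of the inputs plus a bias term,
# passed through a nonlinear activation function (here tanh).
neuron_output <- function(x, w, b, f = tanh) {
  alpha <- sum(w * x) + b   # alpha = sum_i w_i * x_i + b
  f(alpha)                  # output signal f(alpha)
}

# Example with arbitrary values
neuron_output(x = c(0.5, -1.2, 0.3), w = c(0.4, 0.1, -0.7), b = 0.2)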
Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units:
beginning with an input layer to match the feature space; followed by multiple layers of nonlinearity; and terminating with a linear regression or classification layer to match the output
space. The inputs and outputs of the model's units follow the basic logic of the single neuron
described above. Bias units are included in each non-output layer of the network. The weights
linking neurons and biases with other neurons fully determine the output of the entire network,
and learning occurs when these weights are adapted to minimize the error on labeled training
data. More specifically, for each training example j the objective is to minimize a loss function
L(W, B | j).
Here W is the collection {W_i}_{1:N-1}, where W_i denotes the weight matrix connecting layers i and i + 1 for a network of N layers; similarly B is the collection {b_i}_{1:N-1}, where b_i denotes the column vector of biases for layer i + 1.
This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically
involving multiple levels of nonlinearity. Such models are able to learn useful representations of
raw data, and have exhibited high performance on complex data such as images, speech, and
text (Bengio, 2009).
2 H2O's Deep Learning architecture

As described above, H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.
2.1 Summary of features
2.2 Training protocol
The training protocol described below follows many of the ideas and advances in the recent deep
learning literature.
2.2.1 Initialization

2.2.2 Activation and loss functions

In the introduction we introduced the nonlinear activation function f, for which the choices are summarized in Table 1. Note here that x_i and w_i denote the firing neuron's input values and their weights, respectively; α denotes the weighted combination α = Σ_i w_i x_i + b.
Function             Formula                                                  Range
Tanh                 f(α) = (e^α − e^(−α)) / (e^α + e^(−α))                   f(·) ∈ [−1, 1]
Rectified Linear     f(α) = max(0, α)                                         f(·) ∈ ℝ₊
Maxout               f(α) = max(w_i x_i + b), rescaled so that max f(·) ≤ 1   f(·) ∈ (−∞, 1]

Table 1: Activation functions
The tanh function is a rescaled and shifted logistic function and its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has
demonstrated high performance on image recognition tasks, and is a more biologically accurate
model of neuron activations (LeCun et al, 1998). Maxout activation works particularly well with
dropout, a regularization method discussed later in this vignette (Goodfellow et al, 2013). It
is difficult to determine a best activation function to use; each may outperform the others in
separate scenarios, but grid search models (also described later) can help to compare activation
functions and other parameters. The default activation function is the Rectifier. Each of these
activation functions can be operated with dropout regularization (see below).
The following choices for the loss function L(W, B | j) are summarized in Table 2. The system default enforces the table's typical-use rule based on whether regression or classification is being performed. Note here that t^(j) and o^(j) are the predicted (target) output and actual output, respectively, for training example j; further, let y denote the output units and O the output layer.

Function              Formula                                                                              Typical use
Mean Squared Error    L(W, B | j) = (1/2) ‖t^(j) − o^(j)‖₂²                                                Regression
Cross Entropy         L(W, B | j) = −Σ_{y∈O} [ ln(o_y^(j)) · t_y^(j) + ln(1 − o_y^(j)) · (1 − t_y^(j)) ]   Classification

Table 2: Loss functions

2.2.3 Parallel distributed network training
The procedure to minimize the loss function L(W, B | j) is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient ∇L(W, B | j) computed via backpropagation (LeCun et al, 1998). The constant α indicates the learning rate, which controls the step sizes during gradient descent.
Standard stochastic gradient descent

    Initialize W, B
    Iterate until convergence criterion reached:
        Get training example j
        Update all weights w_jk ∈ W, biases b_jk ∈ B:
            w_jk := w_jk − α ∂L(W, B | j)/∂w_jk
            b_jk := b_jk − α ∂L(W, B | j)/∂b_jk
Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme from Niu et al, 2011. Hogwild! follows a shared-memory model in which multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates ∇L(W, B | j) asynchronously. In a multi-node system this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters W, B are obtained by averaging. A rough summary is given below.
Parallel distributed and multi-threaded training with SGD in H2O Deep Learning

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters W_n, B_n after seeing example i. The Avg_n notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.
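The boxed procedure itself did not survive in this copy, so the following R-style sketch only illustrates the idea described above (node-local SGD passes followed by parameter averaging). It is a simplification, not H2O's actual implementation, and all names in it are hypothetical:

# Sketch of one distributed training iteration: each node updates a copy of the
# global parameters on its local data shard (asynchronously via Hogwild! in the
# real system), and the per-node results are averaged (the Avg_n step).
train_iteration <- function(global_params, shards, sgd_update) {
  local_params <- lapply(shards, function(shard) {
    params <- global_params               # start from the current global model
    for (j in sample(nrow(shard))) {      # visit local training examples
      params <- sgd_update(params, shard[j, ])
    }
    params
  })
  # Avg_n: average the node-local parameter vectors to form the global model
  Reduce(`+`, local_params) / length(local_params)
}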
2.2.4 Specifying the number of training samples per iteration
H2O Deep Learning is scalable and can take advantage of a large cluster of compute nodes. There are three modes in which to operate. The default behavior is to let every node train on the entire (replicated) dataset, but to automatically shuffle (and/or subsample) the training examples locally for each iteration. For datasets that don't fit into each node's memory (also depending on the heap memory specified by the -Xmx option), it might not be possible to replicate the data, and each compute node can be instructed to train only with local data. An experimental single-node mode is available for the case where slow final convergence is observed due to the presence of too many nodes, but we've never seen this become necessary.
The number of training examples (globally) presented to the distributed SGD worker nodes between model averaging is controlled by the important parameter train_samples_per_iteration. One special value is -1, which results in all nodes processing all their local training data per iteration. Note that if replicate_training_data is enabled (true by default), this will result in training N epochs per iteration on N nodes; otherwise, one epoch will be trained per iteration. Another special value is 0, which always results in one epoch per iteration, independent of the number of compute nodes. In general, any user-given positive number is permissible for this parameter. For large datasets, it might make sense to specify a fraction of the dataset. For example, if the training data contains 10 million rows, and we specify train_samples_per_iteration as 100,000 when running on 4 nodes, then each node will process 25,000 examples per iteration, and it will take 40 such distributed iterations to process one epoch. If the value is set too high, it might take too long between synchronizations and model convergence can be slow. If the value is set too low, network communication overhead will dominate the runtime and computational performance will suffer. The special value of -2 (the default) enables auto-tuning of this parameter, based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. Note that this parameter can affect the convergence rate during training.
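To make the worked example above concrete, the parameter can be set explicitly in the R call. The frame name, layer sizes, and epoch count below are hypothetical; only train_samples_per_iteration reflects the example in the text:

# Process 100,000 globally sampled rows between model-averaging steps
# (about 1/100 of a 10-million-row training frame per distributed iteration).
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    hidden = c(200, 200), epochs = 10,
    train_samples_per_iteration = 100000)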
2.3 Regularization
2.4 Advanced optimization
H2O features both manual and automatic versions of advanced optimization. The manual mode includes momentum training and rate annealing, while the automatic mode features an adaptive learning rate.
2.4.1 Momentum training
Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector v is defined to modify the updates as follows, with θ representing the parameters W, B; μ representing the momentum coefficient; and α denoting the learning rate:

v_{t+1} = μ v_t − α ∇L(θ_t)
θ_{t+1} = θ_t + v_{t+1}
Using the momentum parameter can aid in avoiding local minima and the associated instability
(Sutskever et al, 2014). Too much momentum can lead to instabilities, which is why the momentum is best ramped up slowly.
A recommended improvement when using momentum updates is the Nesterov accelerated gradient method, under which the updates are further modified such that

v_{t+1} = μ v_t − α ∇L(θ_t + μ v_t)
θ_{t+1} = θ_t + v_{t+1}
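For reference, here is a hedged sketch of how the manual (non-adaptive) mode described in this section might be configured from R; the parameter names are those listed in Appendix A, while the frame name and values are illustrative only:

# Disable the adaptive learning rate; ramp momentum from 0.5 to 0.99 over
# 1e6 training samples and use the Nesterov accelerated gradient.
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-6,
    momentum_start = 0.5, momentum_ramp = 1e6, momentum_stable = 0.99,
    nesterov_accelerated_gradient = TRUE)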
2.4.2 Rate annealing
Throughout training, as the model approaches a minimum, the chance of oscillation or of skipping over the optimum creates the need for a slower learning rate. Instead of specifying a constant learning rate α, learning rate annealing gradually reduces the learning rate α_t to "freeze" into local minima in the optimization landscape (Zeiler, 2012).

For H2O, the annealing rate is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 10^−6 means that it takes 10^6 training samples to halve the learning rate).
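One common formulation consistent with this description (an illustration, not necessarily the exact expression used internally) is

α_t = α / (1 + rate_annealing · t),

where t is the number of training samples processed; with rate_annealing = 10^−6, the learning rate is halved after t = 10^6 samples.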
2.4.3 Adaptive learning
The implemented adaptive learning rate algorithm, ADADELTA (Zeiler, 2012), automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specification of only two parameters, ρ and ε, simplifies hyperparameter search. In some cases, a manually controlled (non-adaptive) learning rate and momentum specification can lead to better results, but requires a hyperparameter search over up to 7 parameters. If the model is built on a topology with many local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. In general, however, we find the adaptive learning rate to produce the best results, and this option is kept as the default.
The first of the two hyperparameters for adaptive learning is ρ. It is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. The second hyperparameter, ε, is similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Typical values are between 10^−10 and 10^−4.
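A hedged sketch of configuring the adaptive mode from R (the parameter names adaptive_rate, rho, and epsilon appear in Appendix A; the frame name and values here are illustrative only):

# Adaptive learning rate (ADADELTA): only rho and epsilon need to be chosen.
model = h2o.deeplearning(x = 1:784, y = 785, data = train.hex,
    adaptive_rate = TRUE, rho = 0.99, epsilon = 1e-8)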
2.5 Loading data
Loading a dataset in R for use with H2O is slightly different from the usual methodology, as we
must convert our datasets into H2OParsedData objects. For an example, we use a toy weather
dataset included in the H2O GitHub repository for the H2O Deep Learning documentation at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
First load the data to your current working directory in your R Console (do this henceforth for
dataset downloads), and then run the following command.
weather.hex = h2o.uploadFile(h2o_server, path = "weather.csv", header = TRUE,
    sep = ",", key = "weather.hex")
To see a quick summary of the data, run the following command.
summary(weather.hex)
2.5.1 Standardization
Along with categorical encoding, H2O preprocesses the data to be standardized for compatibility with the activation functions. Recall Table 1's summary of each activation function's target space. Since the activation function does not generally map into the full spectrum of real numbers ℝ, we first standardize our data to be drawn from N(0, 1). Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space.
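As a minimal illustration of the idea (H2O performs this internally; the snippet is not part of its API):

# Standardize a numeric vector so that it is (approximately) drawn from N(0, 1).
standardize <- function(x) (x - mean(x)) / sd(x)
standardize(c(10, 12, 15, 11))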
2.6 Additional parameters
This section has reviewed some background on the various parameter configurations in H2O's Deep Learning architecture. H2O Deep Learning models may seem daunting, since there are dozens of possible parameter arguments when creating models. However, most parameters do not need to be tuned or experimented with; the default settings are safe and recommended. Those parameters for which experimentation is possible and perhaps necessary have mostly been discussed here, but there are a couple more which deserve mention.
There are no defaults for the hidden layer sizes and number of layers, or for epochs. Practice building deep learning models with different network topologies and different datasets will lead to intuition for these parameters, but two general rules of thumb apply. First, choose larger network sizes, as they can perform higher-level feature extraction, and techniques like dropout may train only subsets of the network at once. Second, use more epochs for greater predictive accuracy, but only when you can afford the computational cost. Many example tests can be found in the H2O GitHub repository for pointers on specific values and results for these (and other) parameters.
For a full list of H2O Deep Learning model parameters and default values, see Appendix A.
3 Use case: MNIST digit classification

3.1 Data ingest
The MNIST database is a famous academic dataset used to benchmark classification performance. The data consists of 60,000 training images and 10,000 test images, each a standardized 28x28-pixel greyscale image of a single handwritten digit. You can download the datasets from the H2O GitHub repository for the H2O Deep Learning documentation at
https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/.
Remember to save these .csv files to your working directory. Following the weather data example, we begin by loading these datasets into R as H2OParsedData objects.
train_images.hex = h2o.uploadFile(h2o_server, path = "mnist_train.csv",
    header = FALSE, sep = ",", key = "train_images.hex")
test_images.hex = h2o.uploadFile(h2o_server, path = "mnist_test.csv",
    header = FALSE, sep = ",", key = "test_images.hex")
3.2 Performing a trial run
The trial run below is illustrative of the relative simplicity that underlies most H2O Deep Learning model parameter configurations, thanks to the defaults. We use the first 28^2 = 784 values of each row to represent the full image, and the final value to denote the digit class. As mentioned before, Rectified Linear activation is popular for image processing and has performed well on the MNIST database previously, and dropout has been known to enhance performance on this dataset as well, so we train our model accordingly; a sketch of such a call is shown below.
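The training command itself is missing from this copy, so the call below is a hedged reconstruction: the hidden layer sizes, dropout ratio, and epoch count are assumptions chosen to match the description above, not values taken from the original text.

# Train a Rectifier-with-dropout network on the MNIST data; columns 1-784 hold
# the pixel values and column 785 holds the digit class.
mnist_model = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    activation = "RectifierWithDropout", hidden = c(1024, 1024, 2048),
    input_dropout_ratio = 0.2, validation = test_images.hex, epochs = 10)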
We can extract the parameters of our model, examine the scoring process, and make predictions
on new data.
#View the specified parameters of your deep learning model
mnist_model@model$params
#Examine the performance of the trained model
mnist_model
The latter command returns the trained model's training and validation errors. The training error value is based on the parameter score_training_samples, which specifies the number of randomly sampled training points to be used for scoring; the default uses 10,000 points. The validation error is based on the parameter score_validation_samples, which controls the same value on the validation set and is set by default to be the entire validation set. In general, choosing more sampled points leads to a better idea of the model's performance on your dataset; setting either of these parameters to 0 automatically uses the entire corresponding dataset for scoring. Either way, you can control the minimum and maximum time spent on scoring with the score_interval and score_duty_cycle parameters.
These scoring parameters also affect the final model when the parameter override_with_best_model is turned on. This override sets the final model after training to be the model which achieved the lowest validation error during training, based on the sampled points used for scoring. Since the validation set is automatically set to be the training data if no other dataset is specified, either the score_training_samples or the score_validation_samples parameter will control the error computation during training and, in turn, the chosen best model; an illustrative configuration is sketched below.
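As a hedged illustration of these scoring controls (the argument values are arbitrary examples, not recommendations from the original text):

# Score on 5,000 sampled training rows and the full validation set, at most
# every 10 seconds and spending at most 5% of wall-clock time on scoring;
# keep the best model seen (by validation error) as the final model.
mnist_model_scored = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    validation = test_images.hex, score_training_samples = 5000,
    score_validation_samples = 0, score_interval = 10, score_duty_cycle = 0.05,
    override_with_best_model = TRUE)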
Once we have a satisfactory model, the h2o.predict() command can be used to compute
and store predictions on new data, which can then be used for further tasks in the interactive
data science process.
#Perform classification on the test set
prediction = h2o.predict(mnist_model, newdata=test_images.hex)
#Copy predictions from H2O to R
pred = as.data.frame(prediction)
3.3 Web interface
H2O R users have access to a slick web interface to mirror the model building process in R.
After loading data or training a model in R, point your browser to your IP address and port
number (e.g., localhost:12345) to launch the web interface. From here you can click on Admin
> Jobs to view your specific model details. You can also click on Data > View All to view
and keep track of your datasets in current use.
3.3.1 Variable importances
One useful feature is the variable importances option, which can be enabled with the additional argument variable_importances=TRUE. This feature allows us to view the absolute and relative predictive strength of each feature in the prediction task. From R, you can access these strengths with the command mnist_model@model$varimp. You can also view a visualization of the variable importances in the web interface.
3.3.2 Java model
Another important feature of the web interface is the Java (POJO) model, accessible from the
Java model button in the top right of a model summary page. This button allows access
to Java code which, when called from a main method in a Java program, builds the model.
Instructions for downloading and running this Java code are available from the web interface,
and example production scoring code is available as well.
3.4 Grid search for model comparison
H2O supports grid search capabilities for model tuning by allowing users to tweak certain parameters and observe changes in model behavior. This is done by specifying sets of values for parameter arguments. Below is an example of a grid search:
#Create a set of network topologies
hidden_layers = list(c(200,200), c(100,300,100), c(500,500,500))
mnist_model_grid = h2o.deeplearning(x = 1:784, y = 785, data = train_images.hex,
    activation = "RectifierWithDropout", hidden = hidden_layers,
    validation = test_images.hex, epochs = 1, l1 = c(1e-5,1e-7),
    input_dropout_ratio = 0.2)
Here we specified three different network topologies and two different ℓ1-norm weights. This
grid search model effectively trains six different models, over the possible combinations of these
parameters. Of course, sets of other parameters can be specified for a larger space of models.
This allows for more subtle insights in the model tuning and selection process, as we inspect and
compare our trained models after the grid search process is complete. To decide how and when
to choose different parameter configurations in a grid search, see Appendix A for parameter
descriptions and possible values.
#print out all prediction errors and run times of the models
mnist_model_grid
mnist_model_grid@model
#print out a *short* summary of each of the models (indexed by parameter)
mnist_model_grid@sumtable
#print out *full* summary of each of the models
all_params = lapply(mnist_model_grid@model, function(x) { x@model$params })
all_params
#access a particular parameter across all models
l1_params = lapply(mnist_model_grid@model, function(x) { x@model$params$l1 })
l1_params
3.5 Checkpoint model
Checkpoint model keys can be used to start off where you left off, if you feel that you want to
further train a particular model with more iterations, more data, different data, and so forth. If
we felt that our initial model should be trained further, we can use it (or its key) as a checkpoint
argument in a new model. In the command below, mnist model grid@model[[1]] indicates
the highest performance model from the grid search that we wish to train further. Note that
the training and validation datasets and the response column etc. have to match for checkpoint
restarts.
mnist_checkpoint_model = h2o.deeplearning(x=1:784, y=785, data=train_images.hex,
checkpoint=mnist_model_grid@model[[1]], validation = test_images.hex, epochs=9)
Checkpoint models are also applicable for the case when we wish to reload existing models that
were saved to disk in a previous session. For example, we can save and later load the best model
from the grid search by running the following commands.
#Specify a model and the file path where it is to be saved
h2o.saveModel(object = mnist_model_grid@model[[1]], name = "/tmp/mymodel", force
= TRUE)
#Alternatively, save the model key in some directory (here we use /tmp)
#h2o.saveModel(object = mnist_model_grid@model[[1]], dir = "/tmp", force = TRUE)
Later (e.g., after restarting H2O) we can load the saved model by indicating the host and saved
model file path. This assumes the saved model was saved with a compatible H2O version (no
changes to the H2O model implementation).
best_mnist_grid.load = h2o.loadModel(h2o_server, "/tmp/mymodel")
#Continue training the loaded model
best_mnist_grid.continue = h2o.deeplearning(x=1:784, y=785, data=train_images.hex,
checkpoint=best_mnist_grid.load, validation = test_images.hex, epochs=1)
Additionally, you can also use the command
model = h2o.getModel(h2o_server, key)
to retrieve a model from its H2O key. This command is useful, for example, if you have created
an H2O model using the web interface and wish to proceed with the modeling process in R.
3.6 Achieving world-record performance
Without distortions, convolutions, or other advanced image processing techniques, the best-ever published test set error for the MNIST dataset is 0.83%, by Microsoft. After training for 2,000 epochs (about 4 hours) on 4 compute nodes, we obtain a 0.87% test set error; after training for 7,600 epochs (about 9 hours) on 10 nodes, we obtain a 0.83% test set error, which matches the current world record, notably achieved using a distributed configuration and with a simple one-liner from R. Details can be found in our hands-on tutorial. Test set errors of around 1% are typically achieved within 1 hour when running on 1 node. The parallel scalability of H2O for the MNIST dataset on 1 to 63 compute nodes is shown in the figure below.
[Figure: parallel scalability of H2O Deep Learning for MNIST on 1 to 63 compute nodes.]
4 Deep autoencoders

4.1 Nonlinear dimensionality reduction
So far we have discussed purely supervised deep learning tasks. However, deep learning can also be used for unsupervised feature learning or, more specifically, nonlinear dimensionality reduction (Hinton et al, 2006). Consider a three-layer neural network with one hidden layer. If we treat our input data as labeled with the same input values, then the network is forced to learn the identity via a nonlinear, reduced representation of the original data. Such an algorithm is called a deep autoencoder; these models have been used extensively for unsupervised, layer-wise pretraining of supervised deep learning tasks, but here we consider the autoencoder's application to discovering anomalies in data.
4.2 Use case: anomaly detection
Consider the deep autoencoder model described above. Given enough training data resembling
some underlying pattern, the network will train itself to easily learn the identity when confronted
with that pattern. However, if some anomalous test point not matching the learned pattern
arrives, the autoencoder will likely have a high error in reconstructing this data, which indicates
it is anomalous data.
We use this framework to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which
heartbeats are outliers. The training data (20 good heartbeats) and the test data (training data with 3 bad heartbeats appended for simplicity) can be downloaded from the H2O
GitHub repository for the H2O Deep Learning documentation at http://bit.ly/1yywZzi. Each
row represents a single heartbeat. The autoencoder is trained as follows:
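The training code did not survive extraction here, so the following is a hedged reconstruction: the file names, hidden layer sizes, epoch count, and the h2o.anomaly() call for scoring reconstruction error are assumptions and should be checked against the downloadable demo.

# Load the ECG frames (file names assumed) and train a deep autoencoder.
train_ecg.hex = h2o.uploadFile(h2o_server, path = "ecg_train.csv",
    header = FALSE, sep = ",", key = "train_ecg.hex")
test_ecg.hex = h2o.uploadFile(h2o_server, path = "ecg_test.csv",
    header = FALSE, sep = ",", key = "test_ecg.hex")

# Autoencoder: learn the identity mapping through a small nonlinear bottleneck.
anomaly_model = h2o.deeplearning(x = 1:ncol(train_ecg.hex), data = train_ecg.hex,
    autoencoder = TRUE, hidden = c(50, 20, 50), activation = "Tanh", epochs = 100)

# Per-row reconstruction error on the test set; the 3 appended "bad" heartbeats
# should stand out with the largest errors.
recon_error = as.data.frame(h2o.anomaly(test_ecg.hex, anomaly_model))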
5 Appendix A: Complete parameter list
seed: The random seed controls sampling and initialization. Reproducible results are only
expected with single-threaded operation (i.e. when running on one node, turning off load
balancing and providing a small dataset that fits in one chunk). In general, the multithreaded asynchronous updates to the model parameters will result in (intentional) race
conditions and non-reproducible results. Note that deterministic sampling and initialization might still lead to some weak sense of determinism in the model. Default is a random
real number.
adaptive_rate: The default enables this feature for adaptive learning rate. See section 2.4.3 for more details.
rho: The first of two hyperparameters for adaptive learning rate (when it is enabled). This
parameter is similar to momentum and relates to the memory of prior weight updates.
Typical values are between 0.9 and 0.999. Default value is 0.95. See section 2.4.3 for more
details.
epsilon: The second of two hyperparameters for adaptive learning rate (when it is enabled). This parameter is similar to learning rate annealing during initial training and
momentum at later stages where it allows forward progress. Typical values are between
1e-10 and 1e-4. This parameter is only active if adaptive learning rate is enabled. Default
is 1e-6. See section 2.4.3 for more details.
rate: The learning rate, α. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.
rate_annealing: Default value is 1e-6 (when adaptive learning is disabled). See section 2.4.2 for more details.

rate_decay: Default is 1.0 (when adaptive learning is disabled). The learning rate decay parameter controls the change of learning rate across layers.
momentum_start: The momentum_start parameter controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section 2.4.1 for more details.
momentum_ramp: The momentum_ramp parameter controls the amount of training for which momentum increases, assuming momentum_stable is larger than momentum_start. It applies when adaptive learning is disabled. The ramp is measured in the number of training samples. Default is 1e6. See section 2.4.1 for more details.
momentum_stable: The momentum_stable parameter controls the final momentum value reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same beyond that point. Default is 0. See section 2.4.1 for more details.
nesterov_accelerated_gradient: The default is true (when adaptive learning is disabled). See section 2.4.1 for more details.
input_dropout_ratio: The default is 0. See section 2.3 for more details.
hidden_dropout_ratio: The default is 0. See section 2.3 for more details.

l1: The default is 0. See section 2.3 for more details.

l2: The default is 0. See section 2.3 for more details.
max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default leaves this maximum unbounded.
initial_weight_distribution: The distribution from which initial weights are to be drawn. The default is the uniform adaptive option. Other options are Uniform and Normal distributions. See section 2.2.1 for more details.
initial_weight_scale: The scale of the distribution function for the Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. The default is 1.0. See section 2.2.1 for more details.
loss: The default is automatic based on the particular learning problem. See section 2.2.2
for more details.
score_interval: The minimum time (in seconds) to elapse between model scoring. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. Default is 5.
score_training_samples: The number of training dataset points to be used for scoring. Will be randomly sampled. Use 0 to select the entire training dataset. Default is 10000.
score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is set and score_validation_sampling is set to stratify). Use 0 to select the entire validation dataset (this is also the default).
score_duty_cycle: Maximum fraction of wall clock time spent on model scoring on training and validation samples, and on diagnostics such as computation of feature importances (i.e., not on training). Default is 0.1.
classification_stop: The stopping criterion in terms of classification error (1 - accuracy) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 0.
regression_stop: The stopping criterion in terms of regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 1e-6.
quiet_mode: Enable quiet mode for less output to standard output. Default is false.
max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix to be printed. This option is meant to avoid printing extremely large confusion matrices. Default is 20.

max_hit_ratio_k: The maximum number (top K) of predictions to use for hit-ratio computation (for multi-class only, 0 to disable). Default is 10.

balance_classes: For imbalanced data, balance training data class counts via over/under-sampling. This can result in improved predictive accuracy. Default is false.
class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies when balance_classes is enabled. If not specified, the ratios will be automatically computed to obtain class balance during training.
max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.0.
score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.
diagnostics: Gather diagnostics for hidden layers, such as mean and RMS values of
learning rate, momentum, weights and biases. Default is true.
variable_importances: Whether to compute variable importances for input features. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false.
fast_mode: Enable fast mode (minor approximation in back-propagation); should not affect results significantly. Default is true.

ignore_const_cols: Ignore constant training columns (no information can be gained anyway). Default is true.
force_load_balance: Increase training speed on small datasets by splitting the data into many chunks to allow utilization of all cores. Default is true.

replicate_training_data: Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.
single_node_mode: Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.
shuffle_training_data: Enable shuffling of training data (on each node). This option is recommended if the training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger); otherwise it is disabled by default.
max_categorical_features: Maximum number of categorical features, enforced via hashing (experimental).

reproducible: Force reproducibility on small data (will be slow; only uses 1 thread).
6 Appendix B: References
H2O website
H2O documentation
H2O GitHub repository
H2O support
H2O JIRA
Bengio, 2009
LeCun et al, 1998
Goodfellow et al, 2013
Niu et al, 2011
Hinton et al., 2012
Sutskever et al, 2014
Zeiler, 2012
H2O GitHub repository for the H2O Deep Learning documentation
MNIST database
Hinton et al, 2006