Create a data frame analytics job
Added in 7.3.0
This API creates a data frame analytics job that performs an analysis on the
source indices and stores the outcome in a destination index.
By default, the query used in the source configuration is {"match_all": {}}
.
If the destination index does not exist, it is created automatically when you start the job.
If you supply only a subset of the regression or classification parameters, hyperparameter optimization occurs. It determines a value for each of the undefined parameters.
Path parameters
-
id
string Required Identifier for the data frame analytics job. This identifier can contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters.
Body
Required
-
allow_lazy_start
boolean Specifies whether this job can start when there is insufficient machine learning node capacity for it to be immediately assigned to a node. If set to
false
and a machine learning node with capacity to run the job cannot be immediately found, the API returns an error. If set totrue
, the API does not return an error; the job waits in thestarting
state until sufficient machine learning node capacity is available. This behavior is also affected by the cluster-widexpack.ml.max_lazy_ml_nodes
setting. -
analysis
object Required Hide analysis attributes Show analysis attributes object
-
classification
object Hide classification attributes Show classification attributes object
-
alpha
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.
-
dependent_variable
string Required Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (
integer
,short
,long
,byte
), categorical (ip
orkeyword
), orboolean
. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric. -
downsample_factor
number Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.
-
early_stopping_enabled
boolean Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.
-
eta
number Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.
-
eta_growth_rate_per_tree
number Advanced configuration option. Specifies the rate at which
eta
increases for each new tree that is added to the forest. For example, a rate of 1.05 increaseseta
by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2. -
feature_bag_fraction
number Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.
-
feature_processors
array[object] Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple
feature_processors
entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.Hide feature_processors attributes Show feature_processors attributes object
-
frequency_encoding
object Hide frequency_encoding attributes Show frequency_encoding attributes object
-
feature_name
string Required -
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
frequency_map
object Required The resulting frequency map for the field value. If the field value is missing from the frequency_map, the resulting value is 0.
-
-
multi_encoding
object Hide multi_encoding attribute Show multi_encoding attribute object
-
processors
array[number] Required The ordered array of custom processors to execute. Must be more than 1.
-
-
n_gram_encoding
object Hide n_gram_encoding attributes Show n_gram_encoding attributes object
-
feature_prefix
string The feature name prefix. Defaults to ngram__.
-
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
length
number Specifies the length of the n-gram substring. Defaults to 50. Must be greater than 0.
-
n_grams
array[number] Required Specifies which n-grams to gather. It’s an array of integer values where the minimum value is 1, and a maximum value is 5.
-
start
number Specifies the zero-indexed start of the n-gram substring. Negative values are allowed for encoding n-grams of string suffixes. Defaults to 0.
-
custom
boolean
-
-
one_hot_encoding
object -
target_mean_encoding
object Hide target_mean_encoding attributes Show target_mean_encoding attributes object
-
default_value
number Required The default value if field value is not found in the target_map.
-
feature_name
string Required -
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_map
object Required The field value to target mean transition map.
-
-
-
gamma
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
lambda
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.
-
max_trees
number Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.
-
Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.
-
prediction_field_name
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
randomize_seed
number Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as
source
andanalyzed_fields
are the same). -
soft_tree_depth_limit
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the
soft_tree_depth_tolerance
to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0. -
soft_tree_depth_tolerance
number Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds
soft_tree_depth_limit
. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01. training_percent
string | number -
class_assignment_objective
string -
num_top_classes
number Defines the number of categories for which the predicted probabilities are reported. It must be non-negative or -1. If it is -1 or greater than the total number of categories, probabilities are reported for all categories; if you have a large number of categories, there could be a significant effect on the size of your destination index. NOTE: To use the AUC ROC evaluation method,
num_top_classes
must be set to -1 or a value greater than or equal to the total number of categories.
-
-
outlier_detection
object Hide outlier_detection attributes Show outlier_detection attributes object
-
compute_feature_influence
boolean Specifies whether the feature influence calculation is enabled.
-
feature_influence_threshold
number The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1.
-
method
string The method that outlier detection uses. Available methods are
lof
,ldof
,distance_kth_nn
,distance_knn
, andensemble
. The default value is ensemble, which means that outlier detection uses an ensemble of different methods and normalises and combines their individual outlier scores to obtain the overall outlier score. -
n_neighbors
number Defines the value for how many nearest neighbors each method of outlier detection uses to calculate its outlier score. When the value is not set, different values are used for different ensemble members. This default behavior helps improve the diversity in the ensemble; only override it if you are confident that the value you choose is appropriate for the data set.
-
outlier_fraction
number The proportion of the data set that is assumed to be outlying prior to outlier detection. For example, 0.05 means it is assumed that 5% of values are real outliers and 95% are inliers.
-
standardization_enabled
boolean If true, the following operation is performed on the columns before computing outlier scores:
(x_i - mean(x_i)) / sd(x_i)
.
-
-
regression
object Hide regression attributes Show regression attributes object
-
alpha
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.
-
dependent_variable
string Required Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (
integer
,short
,long
,byte
), categorical (ip
orkeyword
), orboolean
. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric. -
downsample_factor
number Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.
-
early_stopping_enabled
boolean Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.
-
eta
number Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.
-
eta_growth_rate_per_tree
number Advanced configuration option. Specifies the rate at which
eta
increases for each new tree that is added to the forest. For example, a rate of 1.05 increaseseta
by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2. -
feature_bag_fraction
number Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.
-
feature_processors
array[object] Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple
feature_processors
entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.Hide feature_processors attributes Show feature_processors attributes object
-
frequency_encoding
object Hide frequency_encoding attributes Show frequency_encoding attributes object
-
feature_name
string Required -
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
frequency_map
object Required The resulting frequency map for the field value. If the field value is missing from the frequency_map, the resulting value is 0.
-
-
multi_encoding
object Hide multi_encoding attribute Show multi_encoding attribute object
-
processors
array[number] Required The ordered array of custom processors to execute. Must be more than 1.
-
-
n_gram_encoding
object Hide n_gram_encoding attributes Show n_gram_encoding attributes object
-
feature_prefix
string The feature name prefix. Defaults to ngram__.
-
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
length
number Specifies the length of the n-gram substring. Defaults to 50. Must be greater than 0.
-
n_grams
array[number] Required Specifies which n-grams to gather. It’s an array of integer values where the minimum value is 1, and a maximum value is 5.
-
start
number Specifies the zero-indexed start of the n-gram substring. Negative values are allowed for encoding n-grams of string suffixes. Defaults to 0.
-
custom
boolean
-
-
one_hot_encoding
object -
target_mean_encoding
object Hide target_mean_encoding attributes Show target_mean_encoding attributes object
-
default_value
number Required The default value if field value is not found in the target_map.
-
feature_name
string Required -
field
string Required Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_map
object Required The field value to target mean transition map.
-
-
-
gamma
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
lambda
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.
-
max_trees
number Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.
-
Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.
-
prediction_field_name
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
randomize_seed
number Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as
source
andanalyzed_fields
are the same). -
soft_tree_depth_limit
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the
soft_tree_depth_tolerance
to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0. -
soft_tree_depth_tolerance
number Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds
soft_tree_depth_limit
. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01. training_percent
string | number -
loss_function
string The loss function used during regression. Available options are
mse
(mean squared error),msle
(mean squared logarithmic error),huber
(Pseudo-Huber loss). -
loss_function_parameter
number A positive number that is used as a parameter to the
loss_function
.
-
-
-
analyzed_fields
object Hide analyzed_fields attributes Show analyzed_fields attributes object
-
includes
array[string] An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.
-
excludes
array[string] An array of strings that defines the fields that will be included in the analysis.
-
-
description
string A description of the job.
-
dest
object Required Hide dest attributes Show dest attributes object
-
index
string Required -
results_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
-
max_num_threads
number The maximum number of threads to be used by the analysis. Using more threads may decrease the time necessary to complete the analysis at the cost of using more CPU. Note that the process may use additional threads for operational functionality other than the analysis itself.
-
_meta
object Hide _meta attribute Show _meta attribute object
-
*
object Additional properties
-
-
model_memory_limit
string The approximate maximum amount of memory resources that are permitted for analytical processing. If your
elasticsearch.yml
file contains anxpack.ml.max_model_memory_limit
setting, an error occurs when you try to create data frame analytics jobs that havemodel_memory_limit
values greater than that setting. -
source
object Required Hide source attributes Show source attributes object
-
index
string | array[string] Required -
runtime_mappings
object Hide runtime_mappings attribute Show runtime_mappings attribute object
-
*
object Additional properties Hide * attributes Show * attributes object
-
fields
object For type
composite
-
fetch_fields
array[object] For type
lookup
-
format
string A custom format for
date
type runtime fields. -
input_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_index
string -
script
object Hide script attributes Show script attributes object
source
string | object One of: Hide attributes Show attributes
-
aggregations
object Defines the aggregations that are run as part of the search request.
-
collapse
object -
explain
boolean If
true
, the request returns detailed information about score computation as part of a hit. -
ext
object Configuration of search extensions defined by Elasticsearch plugins.
-
from
number The starting document offset, which must be non-negative. By default, you cannot page through more than 10,000 hits using the
from
andsize
parameters. To page through more hits, use thesearch_after
parameter. -
track_total_hits
boolean | number Number of hits matching the query to count accurately. If true, the exact number of hits is returned at the cost of some performance. If false, the response does not include the total number of hits matching the query. Defaults to 10,000 hits.
-
indices_boost
array[object] Boost the
_score
of documents from specified indices. The boost value is the factor by which scores are multiplied. A boost value greater than1.0
increases the score. A boost value between0
and1.0
decreases the score. -
docvalue_fields
array[object] An array of wildcard (
*
) field patterns. The request returns doc values for field names matching these patterns in thehits.fields
property of the response. -
min_score
number The minimum
_score
for matching documents. Documents with a lower_score
are not included in search results or results collected by aggregations. -
post_filter
object An Elasticsearch Query DSL (Domain Specific Language) object that defines a query.
-
profile
boolean Set to
true
to return detailed timing information about the execution of individual components in a search request. NOTE: This is a debugging tool and adds significant overhead to search execution. -
query
object An Elasticsearch Query DSL (Domain Specific Language) object that defines a query.
-
retriever
object -
script_fields
object Retrieve a script evaluation (based on different fields) for each hit.
-
search_after
array[number | string | boolean | null] A field value.
-
size
number The number of hits to return, which must not be negative. By default, you cannot page through more than 10,000 hits using the
from
andsize
parameters. To page through more hits, use thesearch_after
property. -
slice
object -
fields
array[object] An array of wildcard (
*
) field patterns. The request returns values for field names matching these patterns in thehits.fields
property of the response. -
suggest
object -
terminate_after
number The maximum number of documents to collect for each shard. If a query reaches this limit, Elasticsearch terminates the query early. Elasticsearch collects documents before sorting.
IMPORTANT: Use with caution. Elasticsearch applies this property to each shard handling the request. When possible, let Elasticsearch perform early termination automatically. Avoid specifying this property for requests that target data streams with backing indices across multiple data tiers.
If set to
0
(default), the query does not terminate early. -
timeout
string The period of time to wait for a response from each shard. If no response is received before the timeout expires, the request fails and returns an error. Defaults to no timeout.
-
track_scores
boolean If
true
, calculate and return document scores, even if the scores are not used for sorting. -
version
boolean If
true
, the request returns the document version as part of a hit. -
seq_no_primary_term
boolean If
true
, the request returns sequence number and primary term of the last modification of each hit. -
stored_fields
string | array[string] -
pit
object -
runtime_mappings
object -
stats
array[string] The stats groups to associate with the search. Each group maintains a statistics aggregation for its associated searches. You can retrieve these stats using the indices stats API.
-
-
id
string -
params
object Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
Hide params attribute Show params attribute object
-
*
object Additional properties
-
-
options
object Hide options attribute Show options attribute object
-
*
string Additional properties
-
-
type
string Required Values are
boolean
,composite
,date
,double
,geo_point
,geo_shape
,ip
,keyword
,long
, orlookup
.
-
-
-
_source
object Hide _source attributes Show _source attributes object
-
includes
array[string] An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.
-
excludes
array[string] An array of strings that defines the fields that will be included in the analysis.
-
-
query
object The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value:
{"match_all": {}}
.Query DSL
-
-
headers
object -
version
string
Responses
-
200 application/json
Hide response attributes Show response attributes object
-
authorization
object Hide authorization attributes Show authorization attributes object
-
api_key
object -
roles
array[string] If a user ID was used for the most recent update to the job, its roles at the time of the update are listed in the response.
-
service_account
string If a service account was used for the most recent update to the job, the account name is listed in the response.
-
-
allow_lazy_start
boolean Required -
analysis
object Required Hide analysis attributes Show analysis attributes object
-
classification
object Hide classification attributes Show classification attributes object
-
alpha
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.
-
dependent_variable
string Required Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (
integer
,short
,long
,byte
), categorical (ip
orkeyword
), orboolean
. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric. -
downsample_factor
number Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.
-
early_stopping_enabled
boolean Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.
-
eta
number Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.
-
eta_growth_rate_per_tree
number Advanced configuration option. Specifies the rate at which
eta
increases for each new tree that is added to the forest. For example, a rate of 1.05 increaseseta
by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2. -
feature_bag_fraction
number Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.
-
feature_processors
array[object] Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple
feature_processors
entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.Hide feature_processors attributes Show feature_processors attributes object
-
frequency_encoding
object -
multi_encoding
object -
n_gram_encoding
object -
one_hot_encoding
object -
target_mean_encoding
object
-
-
gamma
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
lambda
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.
-
max_trees
number Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.
-
Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.
-
prediction_field_name
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
randomize_seed
number Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as
source
andanalyzed_fields
are the same). -
soft_tree_depth_limit
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the
soft_tree_depth_tolerance
to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0. -
soft_tree_depth_tolerance
number Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds
soft_tree_depth_limit
. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01. training_percent
string | number -
class_assignment_objective
string -
num_top_classes
number Defines the number of categories for which the predicted probabilities are reported. It must be non-negative or -1. If it is -1 or greater than the total number of categories, probabilities are reported for all categories; if you have a large number of categories, there could be a significant effect on the size of your destination index. NOTE: To use the AUC ROC evaluation method,
num_top_classes
must be set to -1 or a value greater than or equal to the total number of categories.
-
-
outlier_detection
object Hide outlier_detection attributes Show outlier_detection attributes object
-
compute_feature_influence
boolean Specifies whether the feature influence calculation is enabled.
-
feature_influence_threshold
number The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1.
-
method
string The method that outlier detection uses. Available methods are
lof
,ldof
,distance_kth_nn
,distance_knn
, andensemble
. The default value is ensemble, which means that outlier detection uses an ensemble of different methods and normalises and combines their individual outlier scores to obtain the overall outlier score. -
n_neighbors
number Defines the value for how many nearest neighbors each method of outlier detection uses to calculate its outlier score. When the value is not set, different values are used for different ensemble members. This default behavior helps improve the diversity in the ensemble; only override it if you are confident that the value you choose is appropriate for the data set.
-
outlier_fraction
number The proportion of the data set that is assumed to be outlying prior to outlier detection. For example, 0.05 means it is assumed that 5% of values are real outliers and 95% are inliers.
-
standardization_enabled
boolean If true, the following operation is performed on the columns before computing outlier scores:
(x_i - mean(x_i)) / sd(x_i)
.
-
-
regression
object Hide regression attributes Show regression attributes object
-
alpha
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.
-
dependent_variable
string Required Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (
integer
,short
,long
,byte
), categorical (ip
orkeyword
), orboolean
. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric. -
downsample_factor
number Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.
-
early_stopping_enabled
boolean Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.
-
eta
number Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.
-
eta_growth_rate_per_tree
number Advanced configuration option. Specifies the rate at which
eta
increases for each new tree that is added to the forest. For example, a rate of 1.05 increaseseta
by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2. -
feature_bag_fraction
number Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.
-
feature_processors
array[object] Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple
feature_processors
entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.Hide feature_processors attributes Show feature_processors attributes object
-
frequency_encoding
object -
multi_encoding
object -
n_gram_encoding
object -
one_hot_encoding
object -
target_mean_encoding
object
-
-
gamma
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
lambda
number Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.
-
Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.
-
max_trees
number Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.
-
Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.
-
prediction_field_name
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
randomize_seed
number Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as
source
andanalyzed_fields
are the same). -
soft_tree_depth_limit
number Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the
soft_tree_depth_tolerance
to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0. -
soft_tree_depth_tolerance
number Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds
soft_tree_depth_limit
. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01. training_percent
string | number -
loss_function
string The loss function used during regression. Available options are
mse
(mean squared error),msle
(mean squared logarithmic error),huber
(Pseudo-Huber loss). -
loss_function_parameter
number A positive number that is used as a parameter to the
loss_function
.
-
-
-
analyzed_fields
object Hide analyzed_fields attributes Show analyzed_fields attributes object
-
includes
array[string] An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.
-
excludes
array[string] An array of strings that defines the fields that will be included in the analysis.
-
-
create_time
number Time unit for milliseconds
-
description
string -
dest
object Required Hide dest attributes Show dest attributes object
-
index
string Required -
results_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
-
id
string Required -
max_num_threads
number Required -
_meta
object Hide _meta attribute Show _meta attribute object
-
*
object Additional properties
-
-
model_memory_limit
string Required -
source
object Required Hide source attributes Show source attributes object
-
index
string | array[string] Required -
runtime_mappings
object Hide runtime_mappings attribute Show runtime_mappings attribute object
-
*
object Additional properties Hide * attributes Show * attributes object
-
fields
object For type
composite
-
fetch_fields
array[object] For type
lookup
-
format
string A custom format for
date
type runtime fields. -
input_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_field
string Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.
-
target_index
string -
script
object Hide script attributes Show script attributes object
source
string | object -
id
string -
params
object Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
Hide params attribute Show params attribute object
-
*
object Additional properties
-
-
options
object Hide options attribute Show options attribute object
-
*
string Additional properties
-
-
type
string Required Values are
boolean
,composite
,date
,double
,geo_point
,geo_shape
,ip
,keyword
,long
, orlookup
.
-
-
-
_source
object Hide _source attributes Show _source attributes object
-
includes
array[string] An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.
-
excludes
array[string] An array of strings that defines the fields that will be included in the analysis.
-
-
query
object The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value:
{"match_all": {}}
.Query DSL
-
-
version
string Required
-
curl \
--request PUT 'http://api.example.com/_ml/data_frame/analytics/{id}' \
--header "Authorization: $API_KEY" \
--header "Content-Type: application/json" \
--data '{"allow_lazy_start":true,"analysis":{"classification":{"alpha":42.0,"dependent_variable":"string","downsample_factor":42.0,"early_stopping_enabled":true,"eta":42.0,"eta_growth_rate_per_tree":42.0,"feature_bag_fraction":42.0,"feature_processors":[{"frequency_encoding":{"feature_name":"string","field":"string","frequency_map":{}},"multi_encoding":{"processors":[42.0]},"n_gram_encoding":{"feature_prefix":"string","field":"string","length":42.0,"n_grams":[42.0],"start":42.0,"custom":true},"one_hot_encoding":{"field":"string","hot_map":"string"},"target_mean_encoding":{"default_value":42.0,"feature_name":"string","field":"string","target_map":{}}}],"gamma":42.0,"lambda":42.0,"max_optimization_rounds_per_hyperparameter":42.0,"max_trees":42.0,"num_top_feature_importance_values":42.0,"prediction_field_name":"string","randomize_seed":42.0,"soft_tree_depth_limit":42.0,"soft_tree_depth_tolerance":42.0,"":"string","class_assignment_objective":"string","num_top_classes":42.0},"outlier_detection":{"compute_feature_influence":true,"feature_influence_threshold":42.0,"method":"string","n_neighbors":42.0,"outlier_fraction":42.0,"standardization_enabled":true},"regression":{"alpha":42.0,"dependent_variable":"string","downsample_factor":42.0,"early_stopping_enabled":true,"eta":42.0,"eta_growth_rate_per_tree":42.0,"feature_bag_fraction":42.0,"feature_processors":[{"frequency_encoding":{"feature_name":"string","field":"string","frequency_map":{}},"multi_encoding":{"processors":[42.0]},"n_gram_encoding":{"feature_prefix":"string","field":"string","length":42.0,"n_grams":[42.0],"start":42.0,"custom":true},"one_hot_encoding":{"field":"string","hot_map":"string"},"target_mean_encoding":{"default_value":42.0,"feature_name":"string","field":"string","target_map":{}}}],"gamma":42.0,"lambda":42.0,"max_optimization_rounds_per_hyperparameter":42.0,"max_trees":42.0,"num_top_feature_importance_values":42.0,"prediction_field_name":"string","randomize_seed":42.0,"soft_tree_depth_limit":42.0,"soft_tree_depth_tolerance":42.0,"":"string","loss_function":"string","loss_function_parameter":42.0}},"analyzed_fields":{"includes":["string"],"excludes":["string"]},"description":"string","dest":{"index":"string","results_field":"string"},"max_num_threads":42.0,"_meta":{"additionalProperty1":{},"additionalProperty2":{}},"model_memory_limit":"string","source":{"index":"string","runtime_mappings":{"additionalProperty1":{"fields":{"additionalProperty1":{"type":"boolean"},"additionalProperty2":{"type":"boolean"}},"fetch_fields":[{"field":"string","format":"string"}],"format":"string","input_field":"string","target_field":"string","target_index":"string","script":{"":"painless","id":"string","params":{"additionalProperty1":{},"additionalProperty2":{}},"options":{"additionalProperty1":"string","additionalProperty2":"string"}},"type":"boolean"},"additionalProperty2":{"fields":{"additionalProperty1":{"type":"boolean"},"additionalProperty2":{"type":"boolean"}},"fetch_fields":[{"field":"string","format":"string"}],"format":"string","input_field":"string","target_field":"string","target_index":"string","script":{"":"painless","id":"string","params":{"additionalProperty1":{},"additionalProperty2":{}},"options":{"additionalProperty1":"string","additionalProperty2":"string"}},"type":"boolean"}},"_source":{"includes":["string"],"excludes":["string"]},"query":{}},"headers":{},"version":"string"}'