
Standard analyzer

The standard analyzer is the built-in default analyzer used for general-purpose full-text search in OpenSearch. It is designed to provide consistent, language-agnostic text processing by efficiently breaking down text into searchable terms.

The standard analyzer performs the following operations:

  • Tokenization: Uses the standard tokenizer, which splits text into words based on Unicode text segmentation rules, handling spaces, punctuation, and common delimiters.
  • Lowercasing: Applies the lowercase token filter to convert all tokens to lowercase, ensuring consistent matching regardless of input case.

This combination makes the standard analyzer ideal for indexing a wide range of natural language content without needing language-specific customizations.
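
To see this behavior, you can pass sample text to the _analyze API using the standard analyzer directly:

POST /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Fox!"
}

The response contains the lowercase tokens the, quick, brown, and fox; the exclamation point is discarded by the tokenizer.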

Example: Creating an index with the standard analyzer

You can assign the standard analyzer to a text field when creating an index:

PUT /my_standard_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
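
To verify how the field analyzes text, specify the field in an _analyze request:

GET /my_standard_index/_analyze
{
  "field": "my_field",
  "text": "Searching for a Red Apple"
}

Because my_field is mapped with the standard analyzer, the response contains the lowercase tokens searching, for, a, red, and apple.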

Parameters

The standard analyzer supports the following optional parameters.

  • max_token_length (integer; default 255): The maximum token length. If a token exceeds this length, it is split at max_token_length intervals.
  • stopwords (string or list of strings; default none): A list of custom stopwords or a predefined stopword set for a language to remove during analysis. For example, _english_.
  • stopwords_path (string; default none): The path to a file containing stopwords to be used during analysis.

Use only one of the stopwords and stopwords_path parameters. If you specify both, no error is returned, but only the stopwords parameter is applied.
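
For example, the following request (the index and analyzer names are placeholders) configures the standard analyzer with the predefined _english_ stopword set instead of an explicit list:

PUT /my_english_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}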

Example: Analyzer with parameters

The following example creates an animals index and configures the max_token_length and stopwords parameters:

PUT /animals
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_manual_stopwords_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": [
            "the", "is", "and", "but", "an", "a", "it"
          ]
        }
      }
    }
  }
}

Use the following _analyze API request to see how the my_manual_stopwords_analyzer processes text:

POST /animals/_analyze
{
  "analyzer": "my_manual_stopwords_analyzer",
  "text": "The Turtle is Large but it is Slow"
}

The returned tokens:

  • Have been split on spaces.
  • Have been lowercased.
  • Have had stopwords removed.

{
  "tokens": [
    {
      "token": "turtle",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "large",
      "start_offset": 14,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "slow",
      "start_offset": 30,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
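
You can also observe the max_token_length setting by analyzing a word longer than 10 characters with the same analyzer:

POST /animals/_analyze
{
  "analyzer": "my_manual_stopwords_analyzer",
  "text": "hippopotamus"
}

Because hippopotamus exceeds the 10-character limit, it is split at max_token_length intervals, producing the tokens hippopotam and us.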