    Namespace Lucene.Net.Analysis.Ja

    Kuromoji is a morphological analyzer for Japanese text.

    This module provides support for Japanese text analysis, including features such as part-of-speech tagging, lemmatization, and compound word analysis.

    For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis namespace documentation.

    Classes

    GraphvizFormatter

    Outputs the DOT (Graphviz) string for the Viterbi lattice.

    JapaneseAnalyzer

    Analyzer for Japanese that uses morphological analysis.

    JapaneseBaseFormFilter

    Replaces term text with the IBaseFormAttribute.

    This acts as a lemmatizer for verbs and adjectives. To prevent terms from being stemmed, place an instance of Lucene.Net.Analysis.Miscellaneous.SetKeywordMarkerFilter, or a custom Lucene.Net.Analysis.TokenFilter that sets the Lucene.Net.Analysis.TokenAttributes.IKeywordAttribute, before this Lucene.Net.Analysis.TokenStream.
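
    The interaction between the base form and the keyword flag can be illustrated with a small language-neutral sketch. It is not the Lucene.Net implementation: `SketchToken` and `base_form_filter` are hypothetical names, and the sketch assumes that a token with no inflection carries no base form.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class SketchToken:
    """Hypothetical stand-in for a token and its attributes (not a Lucene.Net type)."""
    term: str
    base_form: Optional[str] = None  # assumption: None when the token is not inflected
    is_keyword: bool = False         # plays the role of IKeywordAttribute


def base_form_filter(tokens: Iterable[SketchToken]) -> Iterator[SketchToken]:
    """Replace the term text with the base form unless the token is a keyword."""
    for token in tokens:
        if not token.is_keyword and token.base_form is not None:
            token.term = token.base_form
        yield token


tokens = [
    SketchToken("食べ", base_form="食べる"),                   # inflected verb: lemmatized
    SketchToken("食べ", base_form="食べる", is_keyword=True),  # protected by the keyword flag
    SketchToken("猫"),                                         # no base form: left unchanged
]
terms = [t.term for t in base_form_filter(tokens)]
```

    In this sketch only the first token changes; the keyword-marked token keeps its surface form, which is the effect the keyword marker filter is meant to achieve.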

    JapaneseBaseFormFilterFactory

    Factory for JapaneseBaseFormFilter.

    <fieldType name="text_ja" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
      </analyzer>
    </fieldType>

    JapaneseIterationMarkCharFilter

    Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
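
    As a rough illustration of what expanding an iteration mark means, the sketch below replaces each mark with the character it repeats. It is a simplified model rather than the Lucene.Net algorithm: it only looks one character back, does not check that the repeated character is really kanji or kana, and does not handle runs of consecutive marks. The function name and its `normalize_kanji` / `normalize_kana` parameters are hypothetical, chosen to echo the factory attributes.

```python
import unicodedata

KANJI_MARK = "\u3005"                      # 々
KANA_MARKS = {"\u309D", "\u30FD"}          # ゝ ヽ (repeat the previous kana)
VOICED_KANA_MARKS = {"\u309E", "\u30FE"}   # ゞ ヾ (repeat the previous kana, voiced)
COMBINING_VOICED_MARK = "\u3099"           # combining dakuten


def expand_iteration_marks(text, normalize_kanji=True, normalize_kana=True):
    """Simplified sketch: expand each iteration mark using the preceding character."""
    out = []
    for ch in text:
        prev = out[-1] if out else None
        if prev is None:
            out.append(ch)  # nothing to repeat at the start of the text
        elif ch == KANJI_MARK and normalize_kanji:
            out.append(prev)
        elif ch in KANA_MARKS and normalize_kana:
            out.append(prev)
        elif ch in VOICED_KANA_MARKS and normalize_kana:
            # Compose the previous kana with a dakuten, e.g. す becomes ず.
            out.append(unicodedata.normalize("NFC", prev + COMBINING_VOICED_MARK))
        else:
            out.append(ch)
    return "".join(out)
```

    For example, `expand_iteration_marks("時々")` yields `"時時"`, and passing `normalize_kanji=False` leaves the kanji mark in place.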

    JapaneseIterationMarkCharFilterFactory

    Factory for JapaneseIterationMarkCharFilter.

    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" normalizeKanji="true" normalizeKana="true"/>
        <tokenizer class="solr.JapaneseTokenizerFactory"/>
      </analyzer>
    </fieldType>

    JapaneseKatakanaStemFilter

    A Lucene.Net.Analysis.TokenFilter that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only katakana words longer than a minimum length are stemmed (default is four).
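
    The stemming rule described above can be sketched in a few lines. This is an illustration, not the library code: the function names are hypothetical, the katakana check is approximated by the Unicode block U+30A0 to U+30FF, and the sketch assumes that a word whose length equals the minimum is still stemmed (verify the exact boundary against the implementation).

```python
PROLONGED_SOUND_MARK = "\u30FC"  # ー


def is_katakana(word):
    """Approximate check: every character lies in the Katakana block U+30A0..U+30FF."""
    return bool(word) and all("\u30A0" <= ch <= "\u30FF" for ch in word)


def stem_katakana(word, minimum_length=4):
    """Drop a trailing prolonged sound mark from sufficiently long katakana words."""
    # Assumption: words of exactly minimum_length characters qualify for stemming.
    if (
        len(word) >= minimum_length
        and is_katakana(word)
        and word.endswith(PROLONGED_SOUND_MARK)
    ):
        return word[:-1]
    return word
```

    Under these assumptions `stem_katakana("コンピューター")` returns `"コンピュータ"`, while the three-character `"コピー"` is left unchanged because it is shorter than the default minimum.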

    JapaneseKatakanaStemFilterFactory

    Factory for JapaneseKatakanaStemFilter.

    <fieldType name="text_ja" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory"
                minimumLength="4"/>
      </analyzer>
    </fieldType>

    JapanesePartOfSpeechStopFilter

    Removes tokens that match a set of part-of-speech tags.
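
    Conceptually, the filter drops every token whose part-of-speech tag appears in a configured stop set. The sketch below shows that idea with plain tuples. The function name, the `(term, tag)` token shape, and the exact-match comparison are assumptions made for illustration, and the tags shown are only examples of the tag style.

```python
def pos_stop_filter(tokens, stop_tags):
    """Yield only the (term, tag) pairs whose tag is not in the stop set."""
    stop = frozenset(stop_tags)
    for term, tag in tokens:
        # Assumption: a token is removed only when its full tag matches exactly.
        if tag not in stop:
            yield term, tag


tokens = [
    ("猫", "名詞-一般"),
    ("が", "助詞-格助詞-一般"),
    ("鳴く", "動詞-自立"),
]
kept = list(pos_stop_filter(tokens, {"助詞-格助詞-一般"}))
```

    Here the particle が is removed and the noun and verb pass through unchanged.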

    JapanesePartOfSpeechStopFilterFactory

    Factory for JapanesePartOfSpeechStopFilter.

    <fieldType name="text_ja" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
                tags="stopTags.txt"
                enablePositionIncrements="true"/>
      </analyzer>
    </fieldType>

    JapaneseReadingFormFilter

    A Lucene.Net.Analysis.TokenFilter that replaces the term attribute with the reading of a token in either katakana or romaji form. The default reading form is katakana.

    JapaneseReadingFormFilterFactory

    Factory for JapaneseReadingFormFilter.

    <fieldType name="text_ja" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory"/>
        <filter class="solr.JapaneseReadingFormFilterFactory"
                useRomaji="false"/>
      </analyzer>
    </fieldType>

    JapaneseTokenizer

    Tokenizer for Japanese that uses morphological analysis.

    JapaneseTokenizerFactory

    Factory for JapaneseTokenizer.

    <fieldType name="text_ja" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory"
          mode="NORMAL"
          userDictionary="user.txt"
          userDictionaryEncoding="UTF-8"
          discardPunctuation="true"
        />
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
      </analyzer>
    </fieldType>

    Token

    Analyzed token with morphological data from its dictionary.

    Enums

    JapaneseTokenizerMode

    Tokenization mode: this determines how the tokenizer handles compound and unknown words.

    JapaneseTokenizerType

    Token type reflecting the original source of this token.

    Copyright © 2022 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.