
    Namespace Lucene.Net.Analysis.Cn.Smart

    Analyzer for Simplified Chinese, which indexes words.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

    • StandardAnalyzer: Index unigrams (individual Chinese characters) as tokens.

    • CJKAnalyzer (in the Lucene.Net.Analysis.Cjk namespace of Lucene.Net.Analysis.Common): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.

    • SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.

    Example phrase: "我是中国人"

    1. StandardAnalyzer: 我-是-中-国-人

    2. CJKAnalyzer: 我是-是中-中国-国人

    3. SmartChineseAnalyzer: 我-是-中国-人
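    The first two schemes can be reproduced with plain string slicing; the sketch below is an illustration of the token shapes, not the analyzers' actual code. SmartChineseAnalyzer's boundaries come from its dictionary and statistical model, so its output is shown only as the documented expected segmentation.

    ```python
    def unigrams(text):
        """StandardAnalyzer-style: one token per Chinese character."""
        return list(text)

    def bigrams(text):
        """CJKAnalyzer-style: overlapping pairs of adjacent characters."""
        return [text[i:i + 2] for i in range(len(text) - 1)]

    phrase = "我是中国人"
    print("-".join(unigrams(phrase)))  # 我-是-中-国-人
    print("-".join(bigrams(phrase)))   # 我是-是中-中国-国人
    # SmartChineseAnalyzer (dictionary/HMM-based) yields: 我-是-中国-人
    ```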

    Classes

    AnalyzerProfile

    Manages analysis data configuration for SmartChineseAnalyzer

    SmartChineseAnalyzer has a built-in dictionary and stopword list out of the box.

    NOTE: To use a dictionary other than the built-in one, put the "bigramdict.dct" and "coredict.dct" files in a subdirectory of your application named "smartcn-data". This subdirectory can be placed in any directory up to and including the root directory (if OS permissions allow). To place the files in an alternate location, set an environment variable named "smartcn.data.dir" to the directory containing the "bigramdict.dct" and "coredict.dct" files.

    The default "bigramdict.dct" and "coredict.dct" files can be found at: https://issues.apache.org/jira/browse/LUCENE-1629.
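    The lookup described above can be sketched as follows. This is an illustrative Python rendering of the search order (environment variable first, then a "smartcn-data" subdirectory walked up toward the root), not the library's actual implementation:

    ```python
    import os

    DATA_FILES = ("bigramdict.dct", "coredict.dct")

    def has_data(d):
        """True if the directory holds both dictionary files."""
        return all(os.path.isfile(os.path.join(d, f)) for f in DATA_FILES)

    def find_analysis_data(start_dir):
        """Locate the smartcn dictionary directory.

        1. If the 'smartcn.data.dir' environment variable is set and
           points at a directory with both files, use it.
        2. Otherwise walk from start_dir up to the filesystem root,
           checking each level for a 'smartcn-data' subdirectory that
           holds both dictionary files.
        """
        env_dir = os.environ.get("smartcn.data.dir")
        if env_dir and has_data(env_dir):
            return env_dir
        d = os.path.abspath(start_dir)
        while True:
            candidate = os.path.join(d, "smartcn-data")
            if has_data(candidate):
                return candidate
            parent = os.path.dirname(d)
            if parent == d:          # reached the root; give up
                return None
            d = parent
    ```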

    Note

    This API is experimental and might change in incompatible ways in the next release.

    HMMChineseTokenizer

    Tokenizer for Chinese or mixed Chinese-English text.

    The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

    HMMChineseTokenizerFactory

    Factory for HMMChineseTokenizer

    Note: this class will currently emit tokens for punctuation. So you should either add a Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter afterward to remove these (with concatenate off), or use the SmartChinese stoplist with a StopFilterFactory via:

    words="org/apache/lucene/analysis/cn/smart/stopwords.txt"
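    What the stop-filter option does can be sketched in a few lines; this is a conceptual illustration only, and the punctuation entries below are assumed for demonstration (the real list ships as stopwords.txt inside the smartcn assembly):

    ```python
    def stop_filter(tokens, stopwords):
        """Drop tokens found in the stopword set, as a StopFilter would."""
        return [t for t in tokens if t not in stopwords]

    # Illustrative stop set containing punctuation tokens.
    stops = {"，", "。", "！", "、"}
    print(stop_filter(["我", "，", "是", "中国", "人", "。"], stops))
    # ['我', '是', '中国', '人']
    ```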

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SentenceTokenizer

    Tokenizes input text into sentences.

    The output tokens can then be broken into words with WordTokenFilter.
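    Sentence splitting can be approximated by breaking on sentence-ending punctuation; the toy sketch below only covers the common terminators, whereas the actual tokenizer handles more cases (mixed punctuation, whitespace handling):

    ```python
    import re

    # Match a run of non-terminator characters followed by an optional
    # Chinese or Western sentence terminator.
    SENTENCE_END = re.compile(r"[^。！？!?.]*[。！？!?.]?")

    def split_sentences(text):
        """Split text into sentence tokens, keeping each terminator
        attached to its sentence; drops the trailing empty match."""
        return [s for s in SENTENCE_END.findall(text) if s]

    print(split_sentences("我是中国人。你是谁？"))
    # ['我是中国人。', '你是谁？']
    ```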

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SmartChineseAnalyzer

    SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

    Segmentation is based upon the Hidden Markov Model. A large training corpus was used to calculate Chinese word frequency probability.
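    The idea can be illustrated with a toy dictionary-driven segmenter that picks the most probable split by dynamic programming. The real analyzer uses a full Hidden Markov Model with bigram transition data trained on a large corpus; the words and frequencies below are invented purely for illustration:

    ```python
    import math

    # Toy unigram frequencies (invented for this sketch); the real model
    # also uses bigram transition probabilities between adjacent words.
    FREQ = {"我": 50, "是": 40, "中": 5, "国": 5, "人": 20, "中国": 30, "中国人": 2}
    TOTAL = sum(FREQ.values())

    def segment(text):
        """Best segmentation by maximizing the sum of log word probabilities."""
        n = len(text)
        best = [(-math.inf, 0)] * (n + 1)   # (score, split point) per prefix
        best[0] = (0.0, 0)
        for i in range(1, n + 1):
            for j in range(max(0, i - 4), i):   # candidate words up to 4 chars
                w = text[j:i]
                if w in FREQ and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(FREQ[w] / TOTAL)
                    if score > best[i][0]:
                        best[i] = (score, j)
        # Backtrack from the end to recover the chosen words.
        words, i = [], n
        while i > 0:
            j = best[i][1]
            words.append(text[j:i])
            i = j
        return words[::-1]

    print(segment("我是中国人"))  # ['我', '是', '中国', '人']
    ```

    Note how the frequent word 中国 beats both the character-by-character split 中-国 and the rarer whole word 中国人, matching the segmentation shown in the example phrase above.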

    This analyzer requires a dictionary to provide statistical data. SmartChineseAnalyzer includes a dictionary out of the box.

    The included dictionary data is from ICTCLAS1.0. Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License!

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SmartChineseSentenceTokenizerFactory

    Factory for the SmartChineseAnalyzer SentenceTokenizer

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SmartChineseWordTokenFilterFactory

    Factory for the SmartChineseAnalyzer WordTokenFilter

    Note: this class will currently emit tokens for punctuation. So you should either add a Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter afterward to remove these (with concatenate off), or use the SmartChinese stoplist with a Lucene.Net.Analysis.Core.StopFilterFactory via:

    words="org/apache/lucene/analysis/cn/smart/stopwords.txt"

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Utility

    SmartChineseAnalyzer utility constants and methods

    Note

    This API is experimental and might change in incompatible ways in the next release.

    WordTokenFilter

    A Lucene.Net.Analysis.TokenFilter that breaks sentences into words.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Enums

    CharType

    Internal SmartChineseAnalyzer character type constants.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    WordType

    Internal SmartChineseAnalyzer token type constants

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Copyright © 2022 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.