Using the IK Chinese Analyzer in Elasticsearch

Article link: https://blog.csdn.net/nanguo123/article/details/109493787

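This article explains how to install the IK Analysis plugin in Elasticsearch and covers dictionary configuration, mounting, and hot updates in detail. It walks through an example configuration in a k8s environment and analyzes how dictionary updates work.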


Preface

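Elasticsearch does not segment Chinese text by default: it treats Chinese input as a sequence of individual characters, so search results rarely meet typical business requirements. Elasticsearch supports plugins, however, and installing a Chinese analysis plugin addresses this.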

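A Chinese analysis plugin ships with a dictionary of common Chinese words. Given the input “中国”, for example, ES no longer treats it as two tokens, “中” and “国”, but searches for it as a single word.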

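Different business scenarios involve different search keywords. Take the film title “我和我的祖国”: users expect it to be handled as one term, but by default the analyzer splits it into “我”, “和”, “祖国”, and so on. Cases like this call for a custom dictionary. The analyzer treats each entry in the custom dictionary as a single word and does not split it further.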

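Using the widely used IK Analysis plugin as the example, this article covers installing and configuring it in a k8s environment, defining a custom dictionary, and hot-updating that dictionary. It closes with a source-level look at how IK's dictionary hot update works.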

Installing IK Analysis

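Private cloud environments often have no access to the public internet, so the matching version of the IK plugin can be downloaded in advance and installed while building the ES image. A reference Dockerfile: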

FROM elasticsearch:7.9.0
COPY ./elasticsearch-analysis-ik-7.9.0.zip /home/
RUN sh -c '/bin/echo -e "y" | bin/elasticsearch-plugin install  file:/home/elasticsearch-analysis-ik-7.9.0.zip'

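The "sh -c" command makes the shell execute a string as one complete command. elasticsearch-plugin asks for a manual “y” confirmation while installing a plugin, so the RUN instruction in the Dockerfile above pipes “y” into it.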

Dictionary Configuration

The dictionary configuration file, IKAnalyzer.cfg.xml, is usually located at {conf}/analysis-ik/config/IKAnalyzer.cfg.xml or {plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml, and looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Local extension dictionaries can be configured here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Local extension stopword dictionaries can be configured here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Remote extension dictionaries can be configured here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Remote extension stopword dictionaries can be configured here -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>

Mounting the Configuration File

In a k8s environment, keeping configuration files outside the Pod is a best practice, and a ConfigMap is one way to do it.

kind: ConfigMap
apiVersion: v1
metadata:
  name: ik-dictionary-config
  namespace: elasticsearch
data:
  IKAnalyzer.cfg.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
      <comment>IK Analyzer extension configuration</comment>
      <!-- Local extension dictionaries can be configured here -->
      <!-- <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry> -->
      <!-- Local extension stopword dictionaries can be configured here -->
      <!-- <entry key="ext_stopwords">custom/ext_stopword.dic</entry> -->
      <!-- Remote extension dictionaries can be configured here -->
      <entry key="remote_ext_dict">http://ik-dict-nginx/dict/movie.txt</entry>
      <!-- Remote extension stopword dictionaries can be configured here -->
      <!-- <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry> -->
    </properties>

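With the configuration file stored in a ConfigMap, k8s mounts it into the specified directory automatically when the pod starts. This separates the configuration from the application and makes configuration changes easier.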

spec:
        volumes: 
          - name: dict-config
            configMap: 
              name: ik-dictionary-config
        containers:
        - name: elasticsearch
          volumeMounts:
            - name: dict-config
              mountPath: /usr/share/elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml
              subPath: IKAnalyzer.cfg.xml

subPath must be specified in the pod's YAML so that the ConfigMap entry is mounted as a single file under the target directory.

Hot-Updating the IK Dictionary

The IK plugin supports hot-updating its dictionaries through the following entries in the IK configuration file mentioned above:

	<!-- Remote extension dictionaries can be configured here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Remote extension stopword dictionaries can be configured here -->
	<entry key="remote_ext_stopwords">location</entry>

Here location is a URL, such as http://yoursite.com/getCustomDict. The request only needs to satisfy the following two conditions for the hot update to work.

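The HTTP response must carry two headers, Last-Modified and ETag. Both are strings, and whenever either one changes, the plugin fetches the word list again and updates its dictionary.
The response body must contain one word per line, with \n as the line separator.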

Once both conditions are met, the dictionary is hot-updated without restarting the ES instance.
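The change check described above can be sketched in a few lines of Java. This is only an illustration of the rule "reload when either header differs from the last seen value", not the plugin's actual code; the class name RemoteDictChangeDetector and its update method are hypothetical.

```java
import java.util.Objects;

/** Hypothetical helper: remembers the headers last seen for one remote dictionary URL. */
final class RemoteDictChangeDetector {
    private String lastModified; // last seen Last-Modified value, null before the first poll
    private String etag;         // last seen ETag value, null before the first poll

    /**
     * Records the headers of the latest response and returns true when
     * either value differs from the one recorded on the previous poll.
     */
    synchronized boolean update(String newLastModified, String newEtag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(etag, newEtag);
        lastModified = newLastModified;
        etag = newEtag;
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictChangeDetector detector = new RemoteDictChangeDetector();
        // First poll: nothing recorded yet, so the dictionary must be fetched.
        System.out.println(detector.update("Mon, 02 Nov 2020 08:00:00 GMT", "\"etag-1\""));
        // Same headers again: no reload needed.
        System.out.println(detector.update("Mon, 02 Nov 2020 08:00:00 GMT", "\"etag-1\""));
        // The file was replaced on the server: both headers differ.
        System.out.println(detector.update("Tue, 03 Nov 2020 09:30:00 GMT", "\"etag-2\""));
    }
}
```

The header values in main are made-up samples. Only equality matters here, so the sketch never parses the date or the ETag.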

Setting Up a Dictionary Server

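Given the hot-update requirements above, Nginx can serve as the dictionary server. When a client requests the dictionary file, Nginx automatically returns the corresponding Last-Modified and ETag headers, with no manual intervention needed.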

In k8s, first create a cloud disk to store the dictionary file. NFS is used as the example here.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfspv  ## PV name
spec:
  capacity:
    storage: 5Gi 
  accessModes:
    - ReadWriteMany
  mountOptions:
    - port=10003  ## servicePort field from the API
    - nfsvers=4
    - minorversion=0
    - rsize=1048576
    - wsize=1048576
    - hard
    - timeo=600
    - retrans=2
  nfs:
    path: /
    server: 10.172.*.* ## ipAddress field from the API
  persistentVolumeReclaimPolicy: Recycle

Create the corresponding PVC to bind to the PV:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
     - ReadWriteMany
  resources:
     requests:
       storage: 5Gi

The Nginx configuration file is likewise externalized through a ConfigMap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: ik-nginx-config
  namespace: elasticsearch
data:
  nginx.conf: |
    worker_processes  2;
    events {}
    http {
      charset utf-8;
      server {
        listen 80;
        server_name _;
        location ^~ /dict/ {
          charset utf-8;
          root /mnt/cfs/;
        }
        location / {
          root /usr/share/nginx/html;
        }
      }
    }

Updating the Dictionary

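The dictionary file lives on the cloud disk, and the PV's access mode is ReadWriteMany, which allows multiple pods to read and write it at the same time. Building on this, a separate tool can be developed to extract relevant terms from the business system and update the dictionary file on the disk.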
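A minimal sketch of such an update tool is shown below, assuming the words have already been extracted from the business system. The class name DictionaryPublisher and its methods are hypothetical. It writes one word per line in UTF-8 to a temporary sibling file and then renames that file over the target, so that a reader is less likely to see a half-written dictionary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical tool: publishes a word list as an IK remote dictionary file. */
final class DictionaryPublisher {

    /** Writes the trimmed, de-duplicated words (one per line, "\n" separated) to target. */
    static void publish(Path target, Collection<String> words) {
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (String word : words) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                unique.add(trimmed);
            }
        }
        String body = String.join("\n", unique) + "\n";
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            Files.write(tmp, body.getBytes(StandardCharsets.UTF_8));
            // Rename within the same directory; REPLACE_EXISTING is ignored when
            // ATOMIC_MOVE is given, and replacing an existing target is platform-specific.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the dictionary back, one word per list element. */
    static List<String> read(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: DictionaryPublisher <target-file> <word>...");
            return;
        }
        publish(Paths.get(args[0]), Arrays.asList(args).subList(1, args.length));
    }
}
```

Because Nginx derives Last-Modified and ETag from the file's metadata, replacing the file should be enough for the next poll to notice the change. How the rename behaves on a particular NFS setup is worth verifying before relying on it.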

Common Reasons the Dictionary Does Not Take Effect

Make sure the extension dictionary text file is UTF-8 encoded.
Make sure the dictionary server has no compression, such as gzip, enabled.
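Both checks can be approximated offline by testing whether the bytes of the dictionary (or of a saved response body) decode strictly as UTF-8. The sketch below is one way to do that with the JDK's CharsetDecoder. The class name Utf8Check is hypothetical. A gzip-compressed body also fails this check, since the second byte of the gzip magic number is not valid at that position in UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: strict UTF-8 validation for dictionary files. */
final class Utf8Check {

    /** Returns true only if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": "
                    + (isValidUtf8(bytes) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```

A file saved as GBK, for instance, typically fails here because its byte pairs do not form valid UTF-8 sequences.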

Source Code Analysis of the Dictionary Update Mechanism

The dictionary management class is Dictionary, in the package org.wltea.analyzer.dic.

The Dictionary class is a singleton.

public class Dictionary {

	/*
	 * Dictionary singleton instance
	 */
	private static Dictionary singleton;

	private DictSegment _MainDict;

	private DictSegment _QuantifierDict;

	private DictSegment _StopWords;

	/**
	 * Configuration object
	 */
	private Configuration configuration;

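The name and location of the dictionary configuration file cannot be customized, but the file may live in either the ES config directory or the plugin directory.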

this.conf_dir = cfg.getEnvironment().configFile().resolve(AnalysisIkPlugin.PLUGIN_NAME);
Path configFile = conf_dir.resolve(FILE_NAME);

InputStream input = null;
try {
   // First look under the Elasticsearch config directory
	logger.info("try load config from {}", configFile);
	input = new FileInputStream(configFile.toFile());
} catch (FileNotFoundException e) {
    // If the file is not there, fall back to the config directory inside the plugin directory
	conf_dir = cfg.getConfigInPluginDir();
	configFile = conf_dir.resolve(FILE_NAME);
	try {
		logger.info("try load config from {}", configFile);
		input = new FileInputStream(configFile.toFile());
	} catch (FileNotFoundException ex) {
		// We should report origin exception
		logger.error("ik-analyzer", e);
	}
}

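The dictionaries are initialized through a static method of the Dictionary class. Without it, loading would begin only when the Dictionary class is first actually used, which would lengthen the first analysis operation. This method provides a way to initialize the dictionaries during the application's loading phase.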

public static synchronized void initial(Configuration cfg) {
		if (singleton == null) {
			synchronized (Dictionary.class) {
				if (singleton == null) {

					singleton = new Dictionary(cfg);
					singleton.loadMainDict();
					singleton.loadSurnameDict();
					singleton.loadQuantifierDict();
					singleton.loadSuffixDict();
					singleton.loadPrepDict();
					singleton.loadStopWordDict();

					if(cfg.isEnableRemoteDict()){
						// Set up the monitoring tasks
						for (String location : singleton.getRemoteExtDictionarys()) {
							// 10 is the initial delay (adjustable) and 60 is the interval, both in seconds
							pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
						}
						for (String location : singleton.getRemoteExtStopWordDictionarys()) {
							pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
						}
					}

				}
			}
		}
	}

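As the code above shows, a Monitor task is scheduled for each remote dictionary location. After an initial 10-second delay, each task checks its dictionary for updates every 60 seconds.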
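The scheduling call used above is the standard ScheduledExecutorService.scheduleAtFixedRate from the JDK. The sketch below shows the same pattern in isolation, with millisecond values so that it finishes quickly. The class PollingSketch and its pollFixedRate method are hypothetical, and stopping after a fixed number of runs is a convenience of the sketch, not something the plugin does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of fixed-rate polling with an initial delay. */
final class PollingSketch {

    /**
     * Runs task after initialDelayMs and then every periodMs until it has run
     * at least `runs` times; returns the number of runs observed.
     */
    static int pollFixedRate(Runnable task, long initialDelayMs, long periodMs, int runs) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger count = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        pool.scheduleAtFixedRate(() -> {
            task.run();
            count.incrementAndGet();
            done.countDown();
        }, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
        try {
            // Wait for the requested number of runs, with a generous safety margin.
            done.await(initialDelayMs + periodMs * runs + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return count.get();
    }

    public static void main(String[] args) {
        int observed = pollFixedRate(
                () -> System.out.println("checking remote dictionary headers"),
                100, 200, 3);
        System.out.println("runs observed: " + observed);
    }
}
```

One property of scheduleAtFixedRate matters for a long-running monitor: if a run throws an exception, the executor suppresses all later runs of that task, so the task body needs to catch its own errors.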

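The dictionary is then downloaded from the remote server. According to the source, the response is decoded as UTF-8 unless the Content-Type header declares another charset, and there is no decompression logic, so compressed responses such as gzip are not supported.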

private static List<String> getRemoteWordsUnprivileged(String location) {

		List<String> buffer = new ArrayList<String>();
		RequestConfig rc = RequestConfig.custom().setConnectionRequestTimeout(10 * 1000).setConnectTimeout(10 * 1000)
				.setSocketTimeout(60 * 1000).build();
		CloseableHttpClient httpclient = HttpClients.createDefault();
		CloseableHttpResponse response;
		BufferedReader in;
		HttpGet get = new HttpGet(location);
		get.setConfig(rc);
		try {
			response = httpclient.execute(get);
			if (response.getStatusLine().getStatusCode() == 200) {

				String charset = "UTF-8";
				// Determine the charset; defaults to UTF-8
				HttpEntity entity = response.getEntity();
				if(entity!=null){
					Header contentType = entity.getContentType();
					if(contentType!=null&&contentType.getValue()!=null){
						String typeValue = contentType.getValue();
						if(typeValue!=null&&typeValue.contains("charset=")){
							charset = typeValue.substring(typeValue.lastIndexOf("=") + 1);
						}
					}

					if (entity.getContentLength() > 0 || entity.isChunked()) {
						in = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
						String line;
						while ((line = in.readLine()) != null) {
							buffer.add(line);
						}
						in.close();
						response.close();
						return buffer;
					}
			}
			}
			response.close();
		} catch (IllegalStateException | IOException e) {
			logger.error("getRemoteWords {} error", e, location);
		}
		return buffer;
	}
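The charset handling in the method above can be isolated into a small standalone function to see what it does with different Content-Type values. The sketch below mirrors the same string logic: default to UTF-8, otherwise take everything after the last "=". The class name ContentTypeCharset and the method charsetOf are hypothetical.

```java
/** Hypothetical helper mirroring the charset detection used when fetching a remote dictionary. */
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or "UTF-8" when none is given. */
    static String charsetOf(String contentType) {
        String charset = "UTF-8";
        if (contentType != null && contentType.contains("charset=")) {
            // Everything after the last '=' is taken verbatim as the charset name.
            charset = contentType.substring(contentType.lastIndexOf("=") + 1);
        }
        return charset;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=GBK"));
        System.out.println(charsetOf("text/plain"));
        System.out.println(charsetOf(null));
    }
}
```

Because the value is taken verbatim, a header with trailing parameters or quotes around the charset name would produce a string that InputStreamReader may reject, which is one more reason to keep the dictionary server's Content-Type simple.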

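When the dictionary is updated, a new Dictionary object is created and used to reload the complete dictionary files, which reduces the impact on the dictionary currently in use. Once loading finishes, the main dictionary and the stopword dictionary in the original object are replaced with the newly loaded ones. The dictionary is therefore always reloaded in full, and incremental updates are not supported.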

void reLoadMainDict() {
		logger.info("start to reload ik dict.");
		// Load the dictionaries into a new instance to reduce the impact on the dictionary currently in use
		Dictionary tmpDict = new Dictionary(configuration);
		tmpDict.configuration = getSingleton().configuration;
		tmpDict.loadMainDict();
		tmpDict.loadStopWordDict();
		_MainDict = tmpDict._MainDict;
		_StopWords = tmpDict._StopWords;
		logger.info("reload ik dict finished.");
	}
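The "load into a fresh object, then swap the reference" pattern used by reLoadMainDict can be illustrated with a much simpler data structure. The sketch below uses a plain word set in place of IK's DictSegment. The class SwappableWordSet is hypothetical, and marking the field volatile is a choice made for this sketch, not something taken from the plugin's source.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of full reload followed by a single reference swap. */
final class SwappableWordSet {
    // Readers always see either the old set or the new one, never a partially built set.
    private volatile Set<String> words = Collections.emptySet();

    boolean contains(String word) {
        return words.contains(word);
    }

    /** Builds a complete new set from the fresh word list, then swaps it in. */
    void reload(Collection<String> freshWords) {
        Set<String> tmp = new HashSet<>(freshWords); // the slow part; readers are unaffected
        words = Collections.unmodifiableSet(tmp);    // the swap is a single assignment
    }

    public static void main(String[] args) {
        SwappableWordSet dict = new SwappableWordSet();
        dict.reload(Collections.singletonList("first-version-word"));
        System.out.println(dict.contains("first-version-word"));
        dict.reload(Collections.singletonList("second-version-word"));
        // Words missing from the new list are gone: the reload is full, not incremental.
        System.out.println(dict.contains("first-version-word"));
    }
}
```

The design trades memory for simplicity: two complete copies exist briefly during a reload, but no locking is needed on the read path.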