Full-text search over the content of PDF, Word, TXT and other text files with ES

Original

Last modified 2023-11-09 22:48:08

License: CC 4.0 BY-SA

Tags: #full-text-search #elasticsearch

First published 2023-11-09 22:46:59

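This article records how to use Elasticsearch 8.6.2 for full-text search, specifically searching and highlighting the content of PDF, Word, TXT and other text files. The built-in ingest attachment plugin extracts the file content, and RestHighLevelClient handles document upload and highlighted queries.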

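Requirement:
Use ES to search the content of uploaded files and display the matches with highlighting.
I previously worked in the IoT industry. I learned about ES years ago but never actually used it, so this article records how a beginner got the job done with ES.
The introduction, installation, and environment setup of Elasticsearch are not covered here; plenty of material is available online.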
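Thoughts:
Keyword search over text requires uploading the text to Elasticsearch, and any file format should be supported. Plain text files should be easy to handle, but what about files that contain both images and text?
The ES text extraction plugin can handle this for us.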
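Environment:
The environment already existed, so the ES version was fixed: Elasticsearch 8.6.2. According to the official website this is a very recent release (which means that problems are harder to diagnose and solutions harder to find).
Check the ES version information: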

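ES uses the ingest attachment plugin to extract the text from a file, and the file must first be converted to Base64. The official documentation has the details: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/attachment.html
The ES 8.6.2 release used here already bundles the plugin, so there is nothing to download or install separately. On older versions, install the attachment plugin by running the following in the installation directory:
./bin/elasticsearch-plugin install ingest-attachment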
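
The Base64 conversion mentioned above can be sketched with the JDK alone. This is a minimal illustration, not the article's own code: the class name `AttachmentEncoder` and its methods are hypothetical, and it assumes the file is small enough to read into memory in one go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper (not from the original article): turns a file into the
// Base64 string that the attachment processor reads from its source field.
final class AttachmentEncoder {

    // Encode raw bytes with the standard (non-URL-safe) Base64 alphabet.
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Read the whole file into memory and encode it. For very large files a
    // streaming approach may be preferable.
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary text file so the example runs on its own.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.writeString(sample, "hello");
        System.out.println(encodeFile(sample)); // prints aGVsbG8=
        Files.delete(sample);
    }
}
```

The same call works for PDF or Word files, since the encoder only sees bytes; detecting the format and extracting the text is left to ES.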

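Create the index (the ik_smart analyzer comes from the IK analysis plugin, which has to be installed separately):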

PUT /file2
{
  "mappings": {
    "properties": {
      "deptId": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "summary": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart",
            "index_options": "offsets"
          }
        }
      }
    }
  }
}

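The attachment field holds the text content extracted from the file. The extraction itself is done by an ingest pipeline: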

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "remove_binary": false,
        "indexed_chars": -1
      }
    }
  ]
}

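"field": "content" names the document field that holds the Base64-encoded file.
"remove_binary": false keeps the Base64 file content in the stored document; true discards it.
"indexed_chars": -1 removes the limit on the number of characters extracted from the file; if unset, the default is 100000.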
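
To show how a document reaches this pipeline, here is a minimal sketch that builds (but does not send) the equivalent of `PUT /file2/_doc/1?pipeline=attachment` using only the JDK's `java.net.http` types. It is not the article's RestHighLevelClient code. The class `PipelineIndexRequest`, its methods, the document id `1`, and the address `http://localhost:9200` are assumptions for illustration; the index, pipeline, and field names come from the definitions above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch: assembles an index request that routes the document
// through the "attachment" ingest pipeline defined above.
final class PipelineIndexRequest {

    // Minimal JSON string escaping for the illustrative fields below.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // The Base64 payload goes into "content", the field the pipeline reads.
    static String body(String title, String summary, byte[] fileBytes) {
        String base64 = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"title\":" + quote(title)
                + ",\"summary\":" + quote(summary)
                + ",\"content\":" + quote(base64) + "}";
    }

    // The pipeline is selected with the "pipeline" query parameter.
    static HttpRequest build(String baseUrl, String id, String json) {
        URI uri = URI.create(baseUrl + "/file2/_doc/" + id + "?pipeline=attachment");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        String json = body("demo", "a short note", "hello".getBytes(StandardCharsets.UTF_8));
        HttpRequest request = build("http://localhost:9200", "1", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```

Sending the request would need an `HttpClient` and a reachable cluster, plus credentials and HTTPS if security is enabled. After indexing, the extracted text should appear under `attachment.content`, the field mapped in the index above.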
Because highlighting is needed, I chose RestHighLevelClient, which requires the following dependency:

		<dependency>
			<groupId>org.elasticsearch.client</groupId>
			<artifactId>elasticsearch-rest-high-level-client</artifactId>
			<version>7.17.4</version>
		</dependency>

Create the RestHighLevelClient object:

RestHighLevelClient restClient = new RestHighLevelClient(
        RestClient.builder(new HttpHost(elasticsearchServerIp, elasticsearchServerPort, "http")));

Upload the document content:

    @Async
    public void addOrUpdateNew(String fileUrl, String title, String summary) {
        try {
            // File title
            fileEntity.setTitle(title);
            // File summary
            fileEntity.setSummary(summary);
            // Determine the file type
            String fileType = getFileTypeByDefaultTika(fileUrl);
            if (fileType != null) {
                // Only process files that are not videos, images, or zip archives
                if (!fileType.contains("video") && !fileType.contains("image") && !"application/zip".equals(fileType)) {
                    byte[] bytes