How to build a full-text index of PDF documents with Haystack

码上的生活

License: CC 4.0 BY-SA

Tags: django, full-text search, PDF indexing

Article link: https://blog.csdn.net/sherlockzoom/article/details/51910110

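Haystack makes it quick to add full-text search to a Django project. Suppose one of the models in models.py has a file upload field (here we assume the uploaded files are PDFs) and you want the content of those PDF files to be covered by the full-text index as well. How do you do that?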

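If you are not yet familiar with Haystack, or do not know how to quickly set up full-text search for Django, read up on that first. Everything below assumes you can already set up a basic Haystack full-text index.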

So how exactly is it done?

The official documentation has a page called Rich Content Extraction, which contains this passage:

For some projects it is desirable to index text content which is stored in structured files such as PDFs, Microsoft Office documents, images, etc. Currently only Solr’s ExtractingRequestHandler is directly supported by Haystack but the approach below could be used with any backend which supports this feature.

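In short, only the Solr backend currently handles PDF indexing directly, but other backends can support it too as long as the same approach is implemented for them.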

Take `Haystack`'s `whoosh` backend as an example.

The idea: if we can read the text out of the PDF, the PDF is not much different from any other field, so it can be full-text indexed too.

The Python package used here to read PDFs is textract.

When we set up the index earlier, we created a search_indexes.py file whose content should look roughly like this:

from haystack import indexes

from .models import BaseFile  # assumed location of the BaseFile model


class BaseFileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')

    def get_model(self):
        return BaseFile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

The change is simple: just add a prepare(self, obj) method and read the PDF file there.

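# <your_app>/search_indexes.py
from django.template import loader
from haystack import indexes

from . import search_utils  # helper module shown below, assumed to live in the same app
from .models import BaseFile  # assumed location of the BaseFile model


class BaseFileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')

    def get_model(self):
        return BaseFile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

    def prepare(self, obj):
        data = super(BaseFileIndex, self).prepare(obj)
        # Extract the text of the uploaded PDF (obj.content is the file upload field)
        extracted_data = search_utils.parse_to_string(obj.content.path)
        # Re-render the document template so the extracted text is part of the indexed field
        t = loader.select_template(('search/indexes/himadatabase/basefile_text.txt', ))
        data['text'] = t.render({'object': obj, 'extracted': extracted_data})
        return data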
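
The template that prepare() renders is not shown in this article. As a rough sketch only, it could contain just the title and the extracted text, using the two context keys passed to t.render() above. The template body and the helper name write_index_template below are assumptions made for illustration; in practice you would simply create the file by hand inside one of your template directories.

```python
from pathlib import Path

# Hypothetical template body: the title field plus the text pulled out of the PDF.
# "object" and "extracted" are the context keys passed to t.render() in prepare().
TEMPLATE_BODY = "{{ object.title }}\n{{ extracted }}\n"

# Template name exactly as it is passed to select_template() in prepare().
TEMPLATE_NAME = "search/indexes/himadatabase/basefile_text.txt"


def write_index_template(templates_dir):
    """Create the template file under templates_dir and return its path."""
    target = Path(templates_dir) / TEMPLATE_NAME
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(TEMPLATE_BODY, encoding="utf-8")
    return target
```

Here templates_dir stands for any directory that Django's template loaders search, for example an app's templates folder.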

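# search_utils.py
# coding=utf-8
import textract


# Read a PDF into a string
def parse_to_string(filename):
    # textract.process returns bytes, so decode before passing the text to the template
    return textract.process(filename).decode('utf-8')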
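
When the index is rebuilt, one missing or unreadable file would otherwise stop the whole run. A more defensive variant of the helper could look like the sketch below. The name safe_parse_to_string and the injected extractor argument are assumptions for illustration, not part of textract or Haystack; passing the extractor in also makes the function easy to try out without a real PDF.

```python
import logging
import os

logger = logging.getLogger(__name__)


def safe_parse_to_string(filename, extractor):
    """Return the text of filename, or '' when it cannot be extracted.

    extractor is any callable that takes a path and returns bytes or str;
    in this article's setup that would be textract.process.
    """
    if not filename or not os.path.isfile(filename):
        logger.warning("File to index is missing: %s", filename)
        return ""
    try:
        raw = extractor(filename)
    except Exception:  # extraction tools can fail in many different ways
        logger.exception("Could not extract text from %s", filename)
        return ""
    if isinstance(raw, bytes):
        # Replace undecodable bytes instead of failing the whole document
        return raw.decode("utf-8", errors="replace")
    return raw
```

In prepare() this would be called as safe_parse_to_string(obj.content.path, textract.process), so a record whose file is unreadable is still indexed by its title.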