A Practical Guide to Document Analysis with Amazon Textract Using the Python SDK - CSDN Blog

Article link: https://blog.csdn.net/gitblog_00860/article/details/148439824

aws-doc-sdk-examples: Welcome to the AWS Code Examples Repository. This repo contains code examples used in the AWS documentation, AWS SDK Developer Guides, and more. For more information, see the Readme.md file below. Project repository: https://gitcode.com/gh_mirrors/aw/aws-doc-sdk-examples

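Technical Background

Amazon Textract is an AWS machine learning service that automatically extracts text, forms, and tables from scanned documents, PDFs, and images. Unlike traditional OCR, which only recognizes characters, Textract also interprets document structure: it preserves the row and column relationships in tables and the association between keys and values in forms.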

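Environment Setup

Before working with Textract from Python, make sure that:

The AWS CLI is configured and a credentials file is set up
Python 3.6 or later is installed (recent boto3 releases require a newer interpreter, so check the boto3 documentation for the minimum supported version)
The required dependencies are installed: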
```
pip install boto3 python-dotenv
```

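Using a virtual environment to isolate the project's dependencies is recommended; create and activate it before running pip install: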

python -m venv .venv
source .venv/bin/activate

Core API Operations

Basic Text Extraction

DetectDocumentText is the most basic API and is suited to extracting text from simple documents:

import boto3

def extract_text(document_bytes):
    client = boto3.client('textract')
    response = client.detect_document_text(
        Document={'Bytes': document_bytes}
    )
    return '\n'.join(
        item['Text'] for item in response['Blocks'] 
        if item['BlockType'] == 'LINE'
    )
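
The response from DetectDocumentText is a flat list of blocks (PAGE, LINE, WORD), each with a Confidence score between 0 and 100. The sketch below shows the same LINE filtering on a hand-written sample response, with an added confidence threshold. The function name `lines_above_confidence`, the `min_confidence` default, and the sample data are illustrative assumptions, not part of the original guide or real Textract output.

```python
def lines_above_confidence(response, min_confidence=90.0):
    """Return the text of LINE blocks whose Confidence meets the threshold."""
    return [
        block['Text']
        for block in response['Blocks']
        if block['BlockType'] == 'LINE'
        and block.get('Confidence', 0.0) >= min_confidence
    ]


# Hand-written sample shaped like a DetectDocumentText response (assumption:
# real responses carry more fields, such as Geometry and Relationships).
sample_response = {
    'Blocks': [
        {'BlockType': 'PAGE', 'Id': 'p1'},
        {'BlockType': 'LINE', 'Id': 'l1', 'Text': 'Invoice 2024-001', 'Confidence': 99.1},
        {'BlockType': 'WORD', 'Id': 'w1', 'Text': 'Invoice', 'Confidence': 99.3},
        {'BlockType': 'LINE', 'Id': 'l2', 'Text': 'T0tal due', 'Confidence': 61.4},
    ]
}

print(lines_above_confidence(sample_response))  # ['Invoice 2024-001']
```

Lines that fall below the threshold are often worth routing to manual review rather than discarding, but that choice depends on the application.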

Advanced Document Analysis

AnalyzeDocument parses more complex document structure, such as tables and forms:

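import boto3

def analyze_document(document_bytes, feature_types=('TABLES', 'FORMS')):
    # A tuple default avoids sharing one mutable list across calls
    client = boto3.client('textract')
    response = client.analyze_document(
        Document={'Bytes': document_bytes},
        FeatureTypes=list(feature_types)
    )

    # Blocks reference each other by Id, so index them for relationship lookups
    blocks = response['Blocks']
    blocks_by_id = {b['Id']: b for b in blocks}
    tables = [b for b in blocks if b['BlockType'] == 'TABLE']
    forms = [b for b in blocks if b['BlockType'] == 'KEY_VALUE_SET']

    # process_tables and process_forms are not defined in this article; they
    # stand for reader-supplied helpers that resolve cells and key-value pairs
    return {
        'tables': process_tables(tables, blocks_by_id),
        'forms': process_forms(forms, blocks_by_id)
    }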
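
The snippet above hands the TABLE and KEY_VALUE_SET blocks to `process_tables` and `process_forms`, which the article does not define. A minimal sketch of what such helpers might look like follows. They need to follow Relationships Ids to CELL and WORD blocks, so this version takes a second argument, `blocks_by_id`, a dict mapping each block's Id to the block. The helper bodies, the `child_text` function, and the sample blocks are assumptions for illustration; merged cells and SELECTION_ELEMENT (checkbox) blocks are ignored.

```python
def child_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed in a block's CHILD relationship."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks_by_id[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)


def process_forms(forms, blocks_by_id):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    result = {}
    for block in forms:
        if 'KEY' not in block.get('EntityTypes', []):
            continue  # VALUE blocks are reached through their KEY block
        values = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                values.extend(child_text(blocks_by_id[i], blocks_by_id) for i in rel['Ids'])
        result[child_text(block, blocks_by_id)] = ' '.join(values)
    return result


def process_tables(tables, blocks_by_id):
    """Turn each TABLE block into a list of rows, each a list of cell strings."""
    result = []
    for table in tables:
        cells = [
            blocks_by_id[i]
            for rel in table.get('Relationships', []) if rel['Type'] == 'CHILD'
            for i in rel['Ids']
            if blocks_by_id[i]['BlockType'] == 'CELL'
        ]
        n_rows = max((c['RowIndex'] for c in cells), default=0)
        n_cols = max((c['ColumnIndex'] for c in cells), default=0)
        grid = [[''] * n_cols for _ in range(n_rows)]
        for cell in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses
            grid[cell['RowIndex'] - 1][cell['ColumnIndex'] - 1] = child_text(cell, blocks_by_id)
        result.append(grid)
    return result


# Hand-written sample blocks (not real Textract output)
sample_blocks = [
    {'Id': 'k1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'VALUE', 'Ids': ['v1']}, {'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'v1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2', 'w3']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Name:'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Ana'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'Silva'},
    {'Id': 't1', 'BlockType': 'TABLE',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['c1', 'c2']}]},
    {'Id': 'c1', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
    {'Id': 'c2', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w5']}]},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': 'Item'},
    {'Id': 'w5', 'BlockType': 'WORD', 'Text': 'Price'},
]
blocks_by_id = {b['Id']: b for b in sample_blocks}
forms = [b for b in sample_blocks if b['BlockType'] == 'KEY_VALUE_SET']
tables = [b for b in sample_blocks if b['BlockType'] == 'TABLE']

print(process_forms(forms, blocks_by_id))    # {'Name:': 'Ana Silva'}
print(process_tables(tables, blocks_by_id))  # [[['Item', 'Price']]]
```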

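Asynchronous Processing for Large Documents

For large or multi-page documents, use the asynchronous API, which reads the document from S3 and avoids request timeouts: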

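import time

import boto3

def start_async_analysis(s3_bucket, s3_key):
    client = boto3.client('textract')
    response = client.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3_bucket,
                'Name': s3_key
            }
        },
        FeatureTypes=['TABLES', 'FORMS']
    )
    return response['JobId']

def get_async_result(job_id, poll_seconds=5):
    client = boto3.client('textract')
    # Poll until the job leaves the IN_PROGRESS state
    response = client.get_document_analysis(JobId=job_id)
    while response['JobStatus'] == 'IN_PROGRESS':
        time.sleep(poll_seconds)
        response = client.get_document_analysis(JobId=job_id)
    if response['JobStatus'] == 'FAILED':
        raise RuntimeError(response.get('StatusMessage', 'Textract job failed'))
    # Only the first page of results is returned; follow NextToken for the rest
    return response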
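
GetDocumentAnalysis returns results in pages: when more blocks remain, the response carries a NextToken that must be passed to the next call. The sketch below collects every block of a finished job. The function name `collect_all_blocks` and the `FakeTextractClient` stand-in are assumptions introduced so the example runs without AWS access; in real use the client would be `boto3.client('textract')`.

```python
def collect_all_blocks(client, job_id):
    """Follow NextToken until every page of a finished job has been read."""
    blocks = []
    next_token = None
    while True:
        kwargs = {'JobId': job_id}
        if next_token:
            kwargs['NextToken'] = next_token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get('Blocks', []))
        next_token = page.get('NextToken')
        if not next_token:
            return blocks


class FakeTextractClient:
    """Stand-in for boto3.client('textract') that serves two canned pages."""

    def __init__(self):
        self.pages = {
            None: {'JobStatus': 'SUCCEEDED', 'Blocks': [{'Id': 'a'}], 'NextToken': 't1'},
            't1': {'JobStatus': 'SUCCEEDED', 'Blocks': [{'Id': 'b'}]},
        }

    def get_document_analysis(self, JobId, NextToken=None):
        return self.pages[NextToken]


all_blocks = collect_all_blocks(FakeTextractClient(), 'job-123')
print([b['Id'] for b in all_blocks])  # ['a', 'b']
```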

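Practical Scenarios

Scenario 1: A Browser-Based Document Analysis Application

Key technical points for building an interactive document analysis tool:

Use React or Vue on the front end to display the analysis results
Use Flask or Django for the back-end API that handles file uploads
Use WebSocket to push the status of asynchronous jobs
Visualize the hierarchy of tables and forms

Key implementation code: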

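import uuid
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3_client, textract_client = boto3.client('s3'), boto3.client('textract')
UPLOAD_BUCKET = 'your-upload-bucket'  # placeholder name, not from the original

@app.route('/analyze', methods=['POST'])
def analyze_file():
    file = request.files['file']
    # start_document_analysis only accepts an S3 location, so upload first
    key = f'uploads/{uuid.uuid4()}'
    s3_client.upload_fileobj(file.stream, UPLOAD_BUCKET, key)
    job_id = textract_client.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': UPLOAD_BUCKET, 'Name': key}},
        FeatureTypes=['TABLES', 'FORMS']
    )['JobId']
    return jsonify({'jobId': job_id})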
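
The status push mentioned in the list above could be driven by a small poller that reports each status change through a callback. This is a sketch under stated assumptions: `watch_job`, its parameters, and the `notify` callback are hypothetical names, and the callback is where an application might emit a WebSocket message. The fake client exists only so the example runs without AWS access.

```python
import time


def watch_job(client, job_id, notify, poll_seconds=5.0, max_polls=120):
    """Poll a Textract analysis job and call notify(payload) on each status change."""
    last_status = None
    for _ in range(max_polls):
        # MaxResults=1 keeps the response small; only JobStatus is needed here
        status = client.get_document_analysis(JobId=job_id, MaxResults=1)['JobStatus']
        if status != last_status:
            notify({'jobId': job_id, 'status': status})
            last_status = status
        if status != 'IN_PROGRESS':
            return status
        time.sleep(poll_seconds)
    return last_status


class FakeTextractClient:
    """Stand-in for boto3.client('textract') that replays canned statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_document_analysis(self, JobId, MaxResults=None):
        return {'JobStatus': next(self._statuses)}


events = []
final = watch_job(
    FakeTextractClient(['IN_PROGRESS', 'IN_PROGRESS', 'SUCCEEDED']),
    'job-123',
    notify=events.append,  # a real app might send this payload over a WebSocket
    poll_seconds=0,
)
print(final, [e['status'] for e in events])  # SUCCEEDED ['IN_PROGRESS', 'SUCCEEDED']
```

Polling is the simplest option. Textract can also publish job completion to an SNS topic through the NotificationChannel parameter of StartDocumentAnalysis, which avoids repeated status calls.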

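Scenario 2: Entity Recognition with Comprehend

A technical path for further analysis once the document text has been extracted:

First extract the document's text content with Textract
Use Comprehend to detect entities in the text (people, places, dates, and so on)
Build a mapping between each entity and its position in the original document
Implement entity highlighting

Example code: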

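import boto3

def detect_entities(text):
    comprehend = boto3.client('comprehend')
    # LanguageCode='en' assumes the extracted text is English
    entities = comprehend.detect_entities(
        Text=text,
        LanguageCode='en'
    )['Entities']
    # Highest-confidence entities first
    return sorted(entities, key=lambda x: x['Score'], reverse=True)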
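
The mapping and highlighting steps listed above are not shown in the original. One possible approach, sketched below, joins the Textract LINE texts with newlines, records each line's character span, and uses the BeginOffset and EndOffset that Comprehend returns for each entity to find the overlapping lines and their bounding boxes. The names `build_text_index` and `locate_entities` and the sample data are assumptions; the sketch also assumes Comprehend's offsets line up with Python string indices for the text sent.

```python
def build_text_index(line_blocks):
    """Join LINE texts with newlines and remember each line's character span."""
    parts, spans, offset = [], [], 0
    for block in line_blocks:
        text = block['Text']
        spans.append((offset, offset + len(text), block))
        parts.append(text)
        offset += len(text) + 1  # +1 for the '\n' separator
    return '\n'.join(parts), spans


def locate_entities(entities, spans):
    """Attach the bounding boxes of the LINE blocks overlapping each entity."""
    located = []
    for entity in entities:
        begin, end = entity['BeginOffset'], entity['EndOffset']
        hits = [b for start, stop, b in spans if start < end and begin < stop]
        located.append({
            'text': entity['Text'],
            'type': entity['Type'],
            'boxes': [b['Geometry']['BoundingBox'] for b in hits],
        })
    return located


# Hand-written sample data (not real Textract or Comprehend output)
lines = [
    {'BlockType': 'LINE', 'Text': 'Ana Silva',
     'Geometry': {'BoundingBox': {'Left': 0.1, 'Top': 0.1, 'Width': 0.3, 'Height': 0.05}}},
    {'BlockType': 'LINE', 'Text': 'Lisbon, 2024',
     'Geometry': {'BoundingBox': {'Left': 0.1, 'Top': 0.2, 'Width': 0.4, 'Height': 0.05}}},
]
entities = [
    {'Text': 'Ana Silva', 'Type': 'PERSON', 'BeginOffset': 0, 'EndOffset': 9, 'Score': 0.99},
    {'Text': 'Lisbon', 'Type': 'LOCATION', 'BeginOffset': 10, 'EndOffset': 16, 'Score': 0.97},
]

text, spans = build_text_index(lines)
for item in locate_entities(entities, spans):
    print(item['type'], item['text'], item['boxes'])
```

BoundingBox values are ratios of the page width and height, so a front end can scale them to the rendered page to draw highlights. This version highlights whole lines; indexing WORD blocks instead would give tighter boxes.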