Implementing Invoice Recognition with PHP

Article link: https://blog.csdn.net/francy6/article/details/148208037

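Invoice recognition typically involves image processing, optical character recognition (OCR), and data extraction. PHP can do this by calling a third-party API or by working with locally installed tools. The sections below describe concrete approaches.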

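Local recognition with Tesseract OCR

Tesseract is an open-source OCR engine that supports many languages. PHP can invoke the Tesseract command-line tool through exec or another shell-execution function to recognize the text on an invoice.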

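<?php
// Requires the Tesseract OCR binary to be installed and on the PATH
$imagePath  = 'invoice.jpg';
$outputBase = 'output'; // Tesseract appends ".txt" to this base name

// Run Tesseract; escape the arguments so unusual file names cannot break the command
$command = sprintf('tesseract %s %s -l eng+chi_sim',
    escapeshellarg($imagePath), escapeshellarg($outputBase));
exec($command, $outputLines, $exitCode);
if ($exitCode !== 0) {
    exit("Tesseract failed with exit code $exitCode\n");
}

// Read the recognized text
$text = file_get_contents($outputBase . '.txt');
echo $text;
?>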

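Tips for improving recognition:

Preprocess the image with a tool such as OpenCV to improve its quality.
Specify the language parameter (for example, -l eng+chi_sim for English plus Simplified Chinese).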
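
The language tip above can be folded into a small command builder. This is a minimal sketch rather than a definitive implementation: the function name buildTesseractCommand and its parameters are hypothetical, and it assumes the tesseract binary is on the PATH with the eng and chi_sim language data installed. -l and --psm are standard Tesseract command-line options.

```php
<?php
declare(strict_types=1);

/**
 * Build a shell-safe Tesseract command line (hypothetical helper).
 *
 * @param string   $imagePath  Image to recognize.
 * @param string   $outputBase Output base name; Tesseract appends ".txt".
 * @param string[] $languages  Tesseract language codes, joined with "+".
 * @param int|null $psm        Optional page segmentation mode.
 */
function buildTesseractCommand(
    string $imagePath,
    string $outputBase,
    array $languages = ['eng', 'chi_sim'],
    ?int $psm = null
): string {
    if ($languages === []) {
        throw new InvalidArgumentException('At least one language is required.');
    }
    foreach ($languages as $lang) {
        // Assumption: only simple codes such as "eng" or "chi_sim" are accepted here.
        if (!is_string($lang) || !preg_match('/^[A-Za-z0-9_]+$/', $lang)) {
            throw new InvalidArgumentException('Invalid language code.');
        }
    }

    $parts = [
        'tesseract',
        escapeshellarg($imagePath),
        escapeshellarg($outputBase),
        '-l',
        implode('+', $languages),
    ];
    if ($psm !== null) {
        $parts[] = '--psm';
        $parts[] = (string) $psm;
    }

    return implode(' ', $parts);
}

// Usage: mode 6 tells Tesseract to assume a single uniform block of text.
echo buildTesseractCommand('invoice.jpg', 'output', ['eng', 'chi_sim'], 6), PHP_EOL;
```

Validating the language codes before joining them keeps user-supplied values from reaching the shell unescaped. Whether a particular page segmentation mode helps depends on the invoice layout, so it is left optional.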

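Calling the Baidu OCR API for higher accuracy

Third-party OCR APIs such as Baidu AI can provide more accurate invoice recognition. The following example calls Baidu's general text recognition endpoint.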

<?php
$imagePath = 'invoice.jpg';
$apiKey = 'your API Key';
$secretKey = 'your Secret Key';

// Obtain an access_token
$authUrl = 'https://aip.baidubce.com/oauth/2.0/token?' . http_build_query([
    'grant_type'    => 'client_credentials',
    'client_id'     => $apiKey,
    'client_secret' => $secretKey,
]);
$authResponse = json_decode(file_get_contents($authUrl), true);
if (empty($authResponse['access_token'])) {
    exit("Failed to obtain access_token\n");
}
$accessToken = $authResponse['access_token'];

// Call the OCR endpoint
$ocrUrl = 'https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token=' . urlencode($accessToken);
$imageData = base64_encode(file_get_contents($imagePath));
// Base64 output contains "+", "/" and "=", so it must be URL-encoded in the form body
$postData = http_build_query(['image' => $imageData]);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $ocrUrl);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if ($response === false) {
    exit('cURL error: ' . curl_error($ch) . "\n");
}
curl_close($ch);

// Parse the result; fall back to the whole response if the API returned an error
$result = json_decode($response, true);
print_r($result['words_result'] ?? $result);
?>

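Key points:

You need to register on the Baidu AI Open Platform to obtain an API Key and a Secret Key.
Custom-template recognition is supported (for example, a dedicated template for VAT invoices).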
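
Before any fields can be extracted, the decoded response has to be turned into plain text. The sketch below shows one way to do that, assuming the response layout of the general text recognition endpoint: a words_result array whose items each carry a words string, and error_code / error_msg keys on failure. The function name wordsResultToText is hypothetical, and the sample response is made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Join the recognized lines of a decoded OCR response into one string.
 * Hypothetical helper; throws if the response describes an API error.
 */
function wordsResultToText(array $response): string
{
    if (isset($response['error_code'])) {
        $message = $response['error_msg'] ?? 'unknown error';
        throw new RuntimeException("OCR API error {$response['error_code']}: $message");
    }

    $lines = [];
    foreach ($response['words_result'] ?? [] as $item) {
        // Skip entries that do not have the assumed shape.
        if (is_array($item) && isset($item['words']) && is_string($item['words'])) {
            $lines[] = trim($item['words']);
        }
    }

    return implode("\n", $lines);
}

// Usage with a made-up response in the assumed layout.
$sample = [
    'words_result_num' => 3,
    'words_result' => [
        ['words' => '发票代码：144031900111'],
        ['words' => '发票号码：12345678'],
        ['words' => '金额：￥500.00'],
    ],
];
echo wordsResultToText($sample), PHP_EOL;
```

Raising an exception on error_code keeps an expired token or a quota problem from being mistaken for an invoice with no text.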

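Extracting key invoice fields

After recognition, the key fields (such as the invoice code, amount, and date) have to be extracted from the text with regular expressions or rules.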

<?php
$text = "发票代码：144031900111  发票号码：12345678  金额：￥500.00";

// Extract the invoice code (the /u modifier treats pattern and subject as UTF-8)
preg_match('/发票代码：(\d+)/u', $text, $codeMatches);
$invoiceCode = $codeMatches[1] ?? '';

// Extract the amount: digits with an optional decimal part
preg_match('/金额：￥(\d+(?:\.\d+)?)/u', $text, $amountMatches);
$amount = $amountMatches[1] ?? '';

echo "Invoice code: $invoiceCode, amount: $amount";
?>

Possible enhancements:

Use NLP techniques to handle unstructured text.
Use dictionary matching to improve field-extraction accuracy.
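
Dictionary matching can be sketched as a table that maps each target field to the labels it may appear under, plus a pattern for its value. Everything in this example is an illustrative assumption: the function name extractInvoiceFields, the field names, the label aliases, and the value patterns are not taken from any invoice specification and would need tuning against real OCR output.

```php
<?php
declare(strict_types=1);

/**
 * Extract fields by trying each label alias in order (hypothetical helper).
 *
 * @param array<string, array{labels: string[], value: string}> $dictionary
 * @return array<string, ?string> Field name => matched value, or null if not found.
 */
function extractInvoiceFields(string $text, array $dictionary): array
{
    $fields = [];
    foreach ($dictionary as $field => $spec) {
        $fields[$field] = null;
        foreach ($spec['labels'] as $label) {
            // Label, optional ASCII or full-width colon, optional currency sign, value.
            $pattern = '/' . preg_quote($label, '/')
                . '\s*[:：]?\s*[￥¥]?\s*(' . $spec['value'] . ')/u';
            if (preg_match($pattern, $text, $m) === 1) {
                $fields[$field] = $m[1];
                break; // the first alias that matches wins
            }
        }
    }
    return $fields;
}

// Illustrative dictionary: aliases and value patterns are assumptions.
$dictionary = [
    'invoice_code'   => ['labels' => ['发票代码'], 'value' => '\d{10,12}'],
    'invoice_number' => ['labels' => ['发票号码', '发票号'], 'value' => '\d{8,20}'],
    'amount'         => ['labels' => ['价税合计', '合计金额', '金额'], 'value' => '\d+(?:\.\d{1,2})?'],
    'date'           => ['labels' => ['开票日期'], 'value' => '\d{4}年\d{1,2}月\d{1,2}日'],
];

$text = "发票代码：144031900111  发票号码：12345678  金额：￥500.00";
print_r(extractInvoiceFields($text, $dictionary));
```

Listing the more specific labels first matters because the first matching alias wins. Fields that are not found come back as null, so the caller can tell a missing value from an empty one.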

Local deep-learning approach (requires external tooling)

PHP can call a Python script to run a deep-learning model such as PaddleOCR for invoice recognition.

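<?php
$imagePath = 'invoice.jpg';
$scriptPath = 'ocr_script.py';

// Call the Python script; escape both paths before passing them to the shell
$command = sprintf('python %s --image %s',
    escapeshellarg($scriptPath), escapeshellarg($imagePath));
$output = shell_exec($command);

// Parse the JSON output
$result = json_decode((string) $output, true);
if ($result === null) {
    exit("The OCR script did not return valid JSON\n");
}
print_r($result);
?>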

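Python script example (ocr_script.py):

import argparse
import json
from paddleocr import PaddleOCR

# Read the image path from the --image option passed by the PHP code
parser = argparse.ArgumentParser()
parser.add_argument("--image", required=True)
args = parser.parse_args()

# Assumes the PaddleOCR 2.x API; show_log=False keeps log lines out of stdout
ocr = PaddleOCR(lang="ch", show_log=False)
result = ocr.ocr(args.image)
print(json.dumps(result, ensure_ascii=False))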
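
On the PHP side, the decoded JSON still has to be reduced to text lines before fields can be extracted. The sketch below assumes the nested layout that PaddleOCR 2.x is generally described as returning: one list per image, each entry holding a bounding box followed by a [text, score] pair. That layout is an assumption and may differ between versions. The function name flattenOcrResult and the 0.5 score threshold are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Flatten a PaddleOCR-style result into a list of text lines (hypothetical helper).
 * Assumed shape: [ page, ... ], where page = [ [box, [text, score]], ... ].
 *
 * @return string[] Recognized lines whose score is at least $minScore.
 */
function flattenOcrResult(?array $result, float $minScore = 0.5): array
{
    $lines = [];
    foreach ($result ?? [] as $page) {
        if (!is_array($page)) {
            continue; // a page may be null when nothing was detected
        }
        foreach ($page as $entry) {
            $pair = is_array($entry) ? ($entry[1] ?? null) : null;
            if (!is_array($pair) || count($pair) < 2) {
                continue; // entry does not have the assumed shape
            }
            [$text, $score] = array_values($pair);
            if (is_string($text) && is_numeric($score) && (float) $score >= $minScore) {
                $lines[] = $text;
            }
        }
    }
    return $lines;
}

// Usage with made-up data in the assumed shape.
$sample = [[
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ['发票代码：144031900111', 0.98]],
    [[[10, 50], [200, 50], [200, 80], [10, 80]], ['模糊文字', 0.31]],
]];
print_r(flattenOcrResult($sample));
```

Dropping low-confidence lines is a trade-off: it removes noise from stamps and watermarks but can also discard faint digits, so the threshold is worth checking against real scans.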

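Summary

Lightweight needs: Tesseract OCR is suitable for quick local deployment.
High-accuracy scenarios: the Baidu and Tencent OCR APIs give better results.
Complex field extraction: combine regular expressions with a rules engine to turn the recognized text into structured data.
Deep-learning approach: extend PHP's capabilities by calling external scripts.

The code examples cover the path from basic to advanced implementations; choose the approach that fits your actual requirements.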