-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Fix: handle missing w:styleId in DOCX to prevent KeyError #1417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@MonkeyCode-AI review it |
MonkeyCode-AI
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已添加对 PR 的评审意见。
packages/markitdown/src/markitdown/converters/_docx_converter.py
Outdated
Show resolved
Hide resolved
|
@microsoft-github-policy-service agree |
|
⏳ MonkeyCode-AI 正在分析,请稍等片刻... |
💬 代码审查评论: 本次改动通过捕获缺失 📊 统计
🟡 警告 (3)使用 `str(e)` 判断 KeyError 键值不稳健
|
❌ 代码审查需要修改: 本次改动通过捕获 KeyError 试图兜底缺失 w:styleId,但引入了对依赖参数兼容性和可重试流的潜在崩溃风险,需补齐兼容与测试后再合并。 📊 统计
🔴 严重 (1)新增参数 ignore_empty_styles 可能导致依赖不兼容
|
- Avoids retrying consumed streams by buffering DOCX input - Removes dependency-specific parameters - Adds logging and safe fallback for malformed DOCX files
|
I’ve updated the implementation to address the raised concerns: Removed dependency-specific parameters to avoid Mammoth version incompatibility Buffered the preprocessed DOCX into memory to ensure safe retry behavior and avoid consumed streams Replaced fragile exception matching with a safe, explicit fallback path Added logging to make fallback behavior observable instead of silent |
Issue
Issue #1413
Some DOCX files may have headings, table cells, or footnotes without a styleId (w:styleId).
The original converter crashes with:
KeyError: 'w:styleId'
Fix
Wrapped Mammoth conversion in try/except to catch missing w:styleId.
If missing, fallback safely and continue conversion.