A Complete Guide to Working with Word Documents in Java: Reading doc and docx Files

Article link: https://blog.csdn.net/ztt123654/article/details/149594171


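Working with Word Documents in Java: Reading doc and docx

In office automation and document processing, Java developers often have to handle Word documents in both .doc and .docx formats. Whether the goal is data extraction, document conversion, or content analysis, it pays to know how to read these files efficiently. This article covers the main options for reading both formats in Java, the core APIs involved, and some practical advice, so that document-processing requirements can be implemented quickly.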

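1. Word Document Formats and Java Ecosystem Support

1.1 Differences Between doc and docx

doc: a binary format (OLE compound document), the default for Office 97-2003. Characteristics:
- Complex structure that is hard to parse directly
- Not an ISO/ECMA standard (Microsoft publishes the binary specification, but third-party support is uneven), so compatibility is weaker
- No support for modern document features (such as advanced layout or SVG graphics)
docx: the XML-based OOXML format (in practice a ZIP archive), the default since Office 2007. Advantages:
- Open standard (ECMA-376 / ISO/IEC 29500)
- Modular design (document content, styles, and media are stored in separate parts)
- Usually smaller files, thanks to ZIP compression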

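Example:

# List the contents of a docx archive
unzip -l document.docx

The output typically shows word/document.xml (the main content part), the _rels relationship files, [Content_Types].xml, and so on.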
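
The same inspection can be done from Java, since a docx is an ordinary ZIP archive. The following is a minimal sketch using only JDK classes; the class name DocxEntryLister and the default path document.docx are illustrative assumptions, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxEntryLister {
    /** Returns the entry names of a ZIP stream in the order they are stored. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "document.docx" is a placeholder; pass a real path as the first argument
        String path = args.length > 0 ? args[0] : "document.docx";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            listEntries(in).forEach(System.out::println);
        }
    }
}
```

Running it against a real docx should list the same parts that unzip -l shows.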

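1.2 Comparison of Common Java Libraries

Library	Supported formats	Pros	Cons
Apache POI	doc/docx	Maintained by the Apache Software Foundation, broad feature set	Complex API, high memory consumption
docx4j	docx	Designed around OOXML, supports templating	No support for the legacy doc format
Jacob (COM)	doc (and other formats Word can open)	Drives a locally installed Microsoft Word through COM, so results match Word itself	Windows only, requires Word to be installed

Recommendations:

Both formats must be handled → Apache POI
docx-only modern applications → docx4j
Legacy systems on Windows → Jacob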

2. Reading docx Files (Apache POI)

2.1 Basic Environment Setup

Maven configuration (5.2.3 is used here; prefer the latest stable release):

<dependency>  
  <groupId>org.apache.poi</groupId>  
  <artifactId>poi-ooxml</artifactId>  
  <version>5.2.3</version>  
</dependency>

Gradle configuration:

implementation 'org.apache.poi:poi-ooxml:5.2.3'

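2.2 Core Object Model

XWPFDocument: the container for the whole document. It provides:

getParagraphs()  // all body paragraphs
getTables()      // all body tables
getFootnotes()   // footnotes

XWPFParagraph: a paragraph. It exposes:
- Text content (getText())
- Paragraph alignment (getAlignment())
- The list of runs (getRuns())

XWPFRun: a run of text that shares one set of formatting. It exposes:

getFontSize()  // font size in points; -1 when the run does not set one
isBold()       // whether the run is bold
getColor()     // color as a hex RGB string, or null if unset

XWPFTable: the core class for tables. Cells are reached through rows:

getRows()                  // all rows
getRow(row).getCell(col)   // a specific cell (XWPFTable has no getCell(row, col))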
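
To see how these objects fit together without needing a sample file, the sketch below builds a tiny docx in memory and reads it back. It assumes poi-ooxml is on the classpath; the class name XwpfModelDemo and the sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;

final class XwpfModelDemo {
    /** Builds a small .docx in memory: one bold paragraph and a 2x2 table. */
    static byte[] buildSample() throws IOException {
        try (XWPFDocument doc = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            XWPFRun run = doc.createParagraph().createRun();
            run.setText("Quarterly report");
            run.setBold(true);

            XWPFTable table = doc.createTable(2, 2);
            table.getRow(0).getCell(0).setText("Region");
            table.getRow(0).getCell(1).setText("Sales");
            table.getRow(1).getCell(0).setText("North");
            table.getRow(1).getCell(1).setText("42");

            doc.write(out);
            return out.toByteArray();
        }
    }

    /** Reads one cell: the table gives a row, and the row gives a cell. */
    static String cellText(byte[] docx, int row, int col) throws IOException {
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            return doc.getTables().get(0).getRow(row).getCell(col).getText();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = buildSample();
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            // Document -> paragraphs -> runs
            for (XWPFParagraph p : doc.getParagraphs()) {
                for (XWPFRun r : p.getRuns()) {
                    System.out.println(r.text() + " bold=" + r.isBold());
                }
            }
        }
        // Document -> tables -> rows -> cells
        System.out.println(cellText(docx, 1, 1));
    }
}
```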

2.3 Complete Code Example

import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;

public class DocxReader {
    public static void main(String[] args) throws Exception {
        // 1. Load the document; try-with-resources closes it even if reading fails
        try (FileInputStream in = new FileInputStream("report.docx");
             XWPFDocument doc = new XWPFDocument(in)) {

            // 2. Paragraphs
            System.out.println("==== Paragraphs ====");
            doc.getParagraphs().forEach(p -> {
                System.out.println(p.getText());

                // Formatting of each run (font name may be null, size is -1 if unset)
                p.getRuns().forEach(run -> {
                    System.out.printf("[font:%s size:%d]%n",
                        run.getFontName(), run.getFontSize());
                });
            });

            // 3. Tables
            System.out.println("==== Tables ====");
            for (XWPFTable table : doc.getTables()) {
                for (XWPFTableRow row : table.getRows()) {
                    for (XWPFTableCell cell : row.getTableCells()) {
                        System.out.print(cell.getText() + " | ");
                    }
                    System.out.println();
                }
            }
        } // 4. Resources are released here
    }
}

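3. Handling Legacy doc Files (POI HWPF Module)

3.1 Additional Dependency

The HWPF classes live in the separate poi-scratchpad module, which must be added as well: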

<dependency>  
  <groupId>org.apache.poi</groupId>  
  <artifactId>poi-scratchpad</artifactId>  
  <version>5.2.3</version>  
</dependency>

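3.2 Key APIs

HWPFDocument: the entry point for a document. It provides:

getRange()              // the range covering the document content
getSummaryInformation() // metadata (author, title, etc.)

Range: the core content container. It provides:

numParagraphs()      // total number of paragraphs
getParagraph(index)  // the paragraph at the given index

CharacterRun: a formatted piece of text. It exposes:
```
isBold()
isItalic()
getFontSize()  // in half-points, so 24 means 12 pt
```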

3.3 Reading Example

import java.io.FileInputStream;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;

public class DocReader {
    public static void main(String[] args) throws Exception {
        // 1. Load the document; try-with-resources releases it when done
        try (FileInputStream in = new FileInputStream("legacy.doc");
             HWPFDocument doc = new HWPFDocument(in)) {

            // 2. Get the range covering the whole document
            Range range = doc.getRange();

            // 3. Read paragraphs (text() keeps the trailing paragraph mark)
            System.out.println("==== Text ====");
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph para = range.getParagraph(i);
                System.out.println(para.text());
            }

            // 4. Read tables (HWPF exposes them through TableIterator)
            TableIterator it = new TableIterator(range);
            while (it.hasNext()) {
                Table table = it.next();
                for (int r = 0; r < table.numRows(); r++) {
                    TableRow row = table.getRow(r);
                    for (int c = 0; c < row.numCells(); c++) {
                        TableCell cell = row.getCell(c);
                        // trim() also drops the cell-end control character
                        System.out.print(cell.text().trim() + "\t");
                    }
                    System.out.println();
                }
            }
        } // 5. Resources are released here
    }
}
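
Text returned by HWPF is raw: as far as I know it still contains Word's control characters, such as the paragraph mark (carriage return), the table cell mark (U+0007), and the manual line break (U+000B). A small cleanup helper can make the output easier to work with. This is a sketch using only the JDK; HwpfTextCleaner is a hypothetical name, and field markers are deliberately left untouched.

```java
final class HwpfTextCleaner {
    /** Removes paragraph and cell marks and turns manual line breaks into newlines. */
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r' || c == '\u0007') {
                continue;               // paragraph mark or table cell/row mark
            }
            if (c == '\u000B') {
                sb.append('\n');        // manual line break (Shift+Enter in Word)
                continue;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Strings shaped like what Paragraph.text() and TableCell.text() return
        System.out.println(clean("First paragraph\r"));
        System.out.println(clean("Cell value\u0007"));
    }
}
```

In the reading example this would be applied as clean(para.text()) or clean(cell.text()).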

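4. Advanced Techniques

4.1 Extracting Style Information

Reading run formatting from a docx (XWPFRun wraps POI's low-level OOXML model):

// Requires: import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr;
XWPFParagraph paragraph = doc.getParagraphArray(0);
for (XWPFRun run : paragraph.getRuns()) {
    CTRPr pr = run.getCTR().getRPr();   // low-level <w:rPr>, null if the run has no properties
    if (pr != null) {
        // isBold() also honours <w:b w:val="false"/>, unlike a bare "is <w:b> present" check
        System.out.println("Bold: " + run.isBold());
        System.out.println("Font: " + run.getFontName());
    }
}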

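4.2 Optimizing for Large Files

Apache POI has no event-style reader for docx (its event APIs target spreadsheets), so very large documents are better handled by streaming the XML of the main document part with StAX instead of building an XWPFDocument:

import java.io.File;
import java.io.InputStream;
import javax.xml.stream.*;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.xwpf.usermodel.XWPFRelation;

public class LargeDocxReader {
    public static void main(String[] args) throws Exception {
        // Open read-only; the XWPF object model is never built
        try (OPCPackage pkg = OPCPackage.open(new File("huge.docx"), PackageAccess.READ)) {
            PackagePart main = pkg.getPartsByContentType(
                XWPFRelation.DOCUMENT.getContentType()).get(0);
            try (InputStream in = main.getInputStream()) {
                XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
                StringBuilder text = new StringBuilder();
                boolean inText = false;
                while (xml.hasNext()) {
                    int event = xml.next();
                    boolean isElement = event == XMLStreamConstants.START_ELEMENT
                        || event == XMLStreamConstants.END_ELEMENT;
                    if (isElement && "t".equals(xml.getLocalName())) {
                        inText = event == XMLStreamConstants.START_ELEMENT; // inside <w:t>
                    } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                        text.append(xml.getText());
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "p".equals(xml.getLocalName())) {
                        System.out.println(text);   // one paragraph finished
                        text.setLength(0);
                    }
                }
            }
        }
    }
}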
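
Streaming the main document part can also be done with JDK classes alone, which is handy for a quick text dump where no POI features are needed. The sketch below is one possible shape, not an official API: DocxParagraphStreamer is a hypothetical name, it assumes the main part is stored at word/document.xml (the usual location, though the package format does not require it), and it ignores nested structures such as text boxes.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DocxParagraphStreamer {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams the document XML and hands each paragraph's text to the callback. */
    static void streamParagraphs(InputStream documentXml, Consumer<String> onParagraph)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted documents should not be able to pull in DTDs or external entities
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader xml = factory.createXMLStreamReader(documentXml);
        StringBuilder text = new StringBuilder();
        boolean inText = false;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT && isW(xml, "t")) {
                inText = true;
            } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                text.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT && isW(xml, "t")) {
                inText = false;
            } else if (event == XMLStreamConstants.END_ELEMENT && isW(xml, "p")) {
                onParagraph.accept(text.toString());
                text.setLength(0);      // only one paragraph is held in memory at a time
            }
        }
        xml.close();
    }

    private static boolean isW(XMLStreamReader xml, String localName) {
        return W_NS.equals(xml.getNamespaceURI()) && localName.equals(xml.getLocalName());
    }

    public static void main(String[] args) throws Exception {
        // "huge.docx" is a placeholder path
        try (ZipFile zip = new ZipFile(args.length > 0 ? args[0] : "huge.docx")) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalStateException("word/document.xml not found");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                streamParagraphs(in, System.out::println);
            }
        }
    }
}
```

Passing a callback rather than returning a list keeps memory use flat no matter how many paragraphs the document contains.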

4.3 Handling Mixed Formats

Detecting the format and dispatching accordingly (by file extension here, which assumes the extension can be trusted):

public void processDocument(File file) throws Exception {
    String name = file.getName().toLowerCase();
    
    if (name.endsWith(".docx")) {
        try (XWPFDocument doc = new XWPFDocument(new FileInputStream(file))) {
            // docx handling logic
        }
    } else if (name.endsWith(".doc")) {
        try (HWPFDocument doc = new HWPFDocument(new FileInputStream(file))) {
            // doc handling logic
        }
    } else {
        throw new IllegalArgumentException("Unsupported format");
    }
}
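
File extensions are easy to get wrong (a renamed .doc, for instance), so it can be safer to look at the first bytes of the file: OLE2 containers and ZIP archives start with well-known signatures. The sketch below shows the idea with JDK classes only; WordFormatSniffer and its Format enum are illustrative names. POI's own FileMagic.valueOf() performs a similar check and is probably the better choice when POI is already on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class WordFormatSniffer {
    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (legacy .doc, but also .xls, .ppt, encrypted OOXML)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // Local file header of a ZIP archive (.docx, but also any other ZIP)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. */
    static Format sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) break;           // file shorter than the signature
            read += n;
        }
        return sniff(Arrays.copyOf(head, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // "unknown.bin" is a placeholder path
        String path = args.length > 0 ? args[0] : "unknown.bin";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(sniff(in));
        }
    }
}
```

The signature only identifies the container: OLE2 would route to the HWPFDocument branch and ZIP to the XWPFDocument branch, and each library still has to confirm that the content really is a Word document.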

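5. Common Problems and Solutions

5.1 Typical Exceptions

Exception	Cause	Solution
OLE2NotOfficeXmlFileException	The file is an OLE2 container (a .doc, or a password-protected .docx) but was opened with the OOXML classes	Check the real format from the file header with `FileMagic.valueOf()` rather than `Files.probeContentType()`, which often just maps the file extension
EncryptedDocumentException	The document is encrypted	For .doc, call `Biff8EncryptionKey.setCurrentUserPassword("pass")` before opening and reset it to null afterwards; for .docx, decrypt through `POIFSFileSystem`, `EncryptionInfo`, and `Decryptor`
OutOfMemoryError	Not enough heap	Increase the JVM heap or switch to streaming XML parsing (SAX/StAX)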

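5.2 Troubleshooting Encoding Issues

POI returns text as already-decoded Java strings, so re-decoding bytes with different charsets does not help. Garbled Chinese usually comes from the output side, so write with an explicit charset:

// Requires: java.io.PrintStream, java.io.Writer, java.nio.file.*, java.nio.charset.StandardCharsets
// Console output with an explicit encoding (the terminal must also use UTF-8)
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(paragraph.getText());

// Same idea when saving the extracted text; "out.txt" is a placeholder path
try (Writer writer = Files.newBufferedWriter(Paths.get("out.txt"), StandardCharsets.UTF_8)) {
    writer.write(paragraph.getText());
}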
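
One pitfall when guessing encodings: new String(bytes, charset) never throws on malformed input, it silently substitutes replacement characters, so wrapping it in try/catch cannot tell a right guess from a wrong one. When raw bytes of unknown encoding really do need decoding (for example a plain-text export from another tool), a strict CharsetDecoder reports the mismatch. A minimal sketch with JDK classes; StrictDecoder is a hypothetical helper name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class StrictDecoder {
    /** Decodes the bytes, or returns empty if they are not valid in the given charset. */
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // The two characters of "Chinese" (zhong wen), written as escapes to stay ASCII-safe
        byte[] bytes = "\u4e2d\u6587".getBytes(gbk);

        // Try the stricter candidates first; ISO-8859-1 accepts every byte sequence,
        // so it is useless as a detector and is left out.
        for (Charset candidate : new Charset[] { StandardCharsets.UTF_8, gbk }) {
            Optional<String> decoded = tryDecode(bytes, candidate);
            System.out.println(candidate + " valid: " + decoded.isPresent());
        }
    }
}
```

A successful strict decode only means the bytes are valid in that charset, not that the guess is right, so the candidate order still matters.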

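6. Summary

Technology selection matrix:

Scenario	Recommended approach
Support for both formats	Apache POI
High-performance docx processing	docx4j + StAX parsing
Maintaining legacy systems	POI-HWPF
Performance tips:
1. For very large files (the rule of thumb used here is above 50 MB), stream the XML instead of loading the full object model
2. Avoid opening the same document repeatedly; parse it once and reuse the XWPFDocument/HWPFDocument instance
3. Close file streams promptly (try-with-resources)
Further applications:
- Generating dynamic reports together with FreeMarker
- Analyzing document content with Apache Tika
- Integrating Alibaba Cloud OCR to handle scanned documents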