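This post uses pdf-craft to convert a PDF file to Markdown, converting the same file under two configurations: CPU and GPU. Two Python 3.10 environments, environment 1 and environment 2, are created with Anaconda; the CPU configuration runs in environment 1 and the GPU configuration runs in environment 2.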
Package versions in environment 1:
onnx 1.13.0
onnxruntime 1.21.0
pdf-craft 0.0.15
torch 2.6.0
torchvision 0.21.0
Package versions in environment 2:
onnx 1.16.0
onnxruntime 1.20.0
pdf-craft 0.0.12
torch 2.6.0+cu126
torchaudio 2.6.0+cu126
torchvision 0.21.0
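To confirm which versions are actually installed in the active environment, a quick check along these lines can help. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is made up for illustration, and packages that are not installed are reported as None.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the lists above, plus onnxruntime-gpu.
    names = ["onnx", "onnxruntime", "onnxruntime-gpu",
             "pdf-craft", "torch", "torchvision"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in each environment makes it easy to compare the two setups side by side.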
Install pdf-craft:
pip install pdf-craft
Install the GPU build of onnxruntime (for version compatibility, see the "NVIDIA - CUDA | onnxruntime" page):
pip install onnxruntime-gpu==1.20.0
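After installing, it may be worth checking what each runtime reports about GPU support. The sketch below uses torch.cuda.is_available() and onnxruntime.get_available_providers(); the function name gpu_runtime_report is hypothetical, and a missing library is reported as None rather than raising. Whether pdf-craft actually selects the CUDA provider at run time is not something this check can confirm.

```python
def gpu_runtime_report():
    """Collect what torch and onnxruntime report about GPU support."""
    report = {}
    try:
        import torch
        report["torch_cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda_available"] = None  # torch is not installed
    try:
        import onnxruntime as ort
        # With a CUDA-enabled build, "CUDAExecutionProvider" appears in this list.
        report["onnxruntime_providers"] = ort.get_available_providers()
    except ImportError:
        report["onnxruntime_providers"] = None  # onnxruntime is not installed
    return report


if __name__ == "__main__":
    for key, value in gpu_runtime_report().items():
        print(f"{key}: {value}")
```

If the provider list contains only "CPUExecutionProvider", any stage that runs through onnxruntime would stay on the CPU regardless of the device passed to the extractor.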
Code for the CPU configuration:
from pdf_craft import PDFPageExtractor, MarkDownWriter
import time
from functools import wraps

def timeit(func):
    """Print how long the wrapped function takes, in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        elapsed_ms = (end_time - start_time) * 1000  # convert to milliseconds
        print(f"Function '{func.__name__}' executed in {elapsed_ms:.2f} ms")
        return result
    return wrapper

# Create the PDF extractor
extractor = PDFPageExtractor(
    device="cpu",  # to use the GPU, change this to device="cuda:0"
    model_dir_path=r"C:\ExtractionModel"  # where models are downloaded and stored (raw string keeps the backslash literal)
)
markdown_path = r'QT中的元对象系统-cpu.md'
pdf_path = r'QT中的元对象系统.pdf'

@timeit
def ParseMD(pdf_path, markdown_path):
    # Create the Markdown writer and write out each extracted block
    with MarkDownWriter(markdown_path, "images", "utf-8") as md:
        for block in extractor.extract(pdf=pdf_path):
            md.write(block)

ParseMD(pdf_path, markdown_path)
Code for the GPU configuration:
from pdf_craft import PDFPageExtractor, MarkDownWriter
import time
from functools import wraps

def timeit(func):
    """Print how long the wrapped function takes, in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        elapsed_ms = (end_time - start_time) * 1000  # convert to milliseconds
        print(f"Function '{func.__name__}' executed in {elapsed_ms:.2f} ms")
        return result
    return wrapper

# Create the PDF extractor
extractor = PDFPageExtractor(
    device="cuda:0",  # run on the first GPU; use device="cpu" for the CPU configuration
    model_dir_path=r"C:\ExtractionModel"  # where models are downloaded and stored (raw string keeps the backslash literal)
)
markdown_path = r'QT中的元对象系统_gpu.md'
pdf_path = r'QT中的元对象系统.pdf'

@timeit
def ParseMD(pdf_path, markdown_path):
    # Create the Markdown writer and write out each extracted block
    with MarkDownWriter(markdown_path, "images", "utf-8") as md:
        for block in extractor.extract(pdf=pdf_path):
            md.write(block)

# Small helper for checking the timer itself; it is not called in this script
@timeit
def example_function(n):
    time.sleep(n / 1000)  # n is in milliseconds; convert to seconds
    return "done"

ParseMD(pdf_path, markdown_path)
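A single timed call can also include one-off costs such as lazy model loading or GPU initialization, so it can be useful to time several runs and look at the first one separately. The helper below is a minimal sketch of that idea; time_runs is a hypothetical name, and the time.sleep workload is only a stand-in for a call such as ParseMD(pdf_path, markdown_path).

```python
import time


def time_runs(func, *args, repeats=3, **kwargs):
    """Call func `repeats` times and return each run's elapsed time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append((time.perf_counter() - start) * 1000)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with the real conversion call to measure it.
    timings = time_runs(time.sleep, 0.01, repeats=3)
    print(f"first run (may include warm-up): {timings[0]:.2f} ms")
    print("later runs:", ", ".join(f"{t:.2f} ms" for t in timings[1:]))
```

Comparing the later runs across the two configurations gives a fairer picture than comparing a single cold run of each.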
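On its first run, pdf-craft downloads the parameter files of the models it depends on to C:\ExtractionModel; this path can be changed as needed. The files can also be downloaded directly from the top of this article.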
Log output from the CPU run:
0: 1024x736 2 titles, 6 plain texts, 8609.1ms
Speed: 24.0ms preprocess, 8609.1ms inference, 27.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 12 plain texts, 6749.0ms
Speed: 16.0ms preprocess, 6749.0ms inference, 5.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 1 title, 9 plain texts, 8267.8ms
Speed: 20.0ms preprocess, 8267.8ms inference, 5.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 1 title, 15 plain texts, 1 figure, 10182.0ms
Speed: 26.0ms preprocess, 10182.0ms inference, 6.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 4 plain texts, 7552.4ms
Speed: 23.0ms preprocess, 7552.4ms inference, 7.0ms postprocess per image at shape (1, 3, 1024, 736)
Function 'ParseMD' executed in 383678.58 ms
Log output from the GPU run:
0: 1024x736 2 titles, 6 plain texts, 91.7ms
Speed: 8.0ms preprocess, 91.7ms inference, 55.3ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 12 plain texts, 15.9ms
Speed: 4.0ms preprocess, 15.9ms inference, 2.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 1 title, 9 plain texts, 15.9ms
Speed: 4.0ms preprocess, 15.9ms inference, 3.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 1 title, 15 plain texts, 1 figure, 16.9ms
Speed: 4.0ms preprocess, 16.9ms inference, 1.0ms postprocess per image at shape (1, 3, 1024, 736)
0: 1024x736 4 plain texts, 14.9ms
Speed: 5.0ms preprocess, 14.9ms inference, 2.0ms postprocess per image at shape (1, 3, 1024, 736)
Function 'ParseMD' executed in 471143.47 ms
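Oddly, the GPU run was slower overall: 471143.47 ms versus 383678.58 ms on CPU, even though the logged inference time per image fell from roughly 7 to 10 s to roughly 15 to 90 ms. Note that the two environments also differ in pdf-craft version (0.0.15 versus 0.0.12), so this is not a strictly like-for-like comparison.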
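One way to narrow this down is to add up the inference times that the logs do report and compare them with the total. The sketch below parses the "Speed:" lines copied from the two logs above; total_inference_ms is a made-up helper name, and the calculation assumes the excerpts shown are representative of the full logs.

```python
import re

# Matches the inference figure in lines like
# "Speed: 24.0ms preprocess, 8609.1ms inference, 27.0ms postprocess ..."
INFERENCE_RE = re.compile(r"([\d.]+)ms inference")


def total_inference_ms(log_text):
    """Sum the 'inference' times found in the 'Speed:' lines of a log."""
    return sum(float(value) for value in INFERENCE_RE.findall(log_text))


cpu_log = """\
Speed: 24.0ms preprocess, 8609.1ms inference, 27.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 16.0ms preprocess, 6749.0ms inference, 5.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 20.0ms preprocess, 8267.8ms inference, 5.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 26.0ms preprocess, 10182.0ms inference, 6.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 23.0ms preprocess, 7552.4ms inference, 7.0ms postprocess per image at shape (1, 3, 1024, 736)
"""

gpu_log = """\
Speed: 8.0ms preprocess, 91.7ms inference, 55.3ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 4.0ms preprocess, 15.9ms inference, 2.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 4.0ms preprocess, 15.9ms inference, 3.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 4.0ms preprocess, 16.9ms inference, 1.0ms postprocess per image at shape (1, 3, 1024, 736)
Speed: 5.0ms preprocess, 14.9ms inference, 2.0ms postprocess per image at shape (1, 3, 1024, 736)
"""

# Totals printed by the timeit decorator in each run.
for label, log, total_ms in [("cpu", cpu_log, 383678.58),
                             ("gpu", gpu_log, 471143.47)]:
    inference = total_inference_ms(log)
    share = inference / total_ms
    print(f"{label}: {inference:.1f} ms logged inference, {share:.2%} of the total")
```

For the lines shown, the logged inference adds up to about 41 s on CPU and about 0.16 s on GPU, so in both runs most of the wall-clock time is spent outside the stage these log lines measure. The logs do not show which stages account for the remainder, so that is where further profiling would need to look.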