LayoutLMv3 in Practice: OCR Alone Is Not Enough, What You Need Is Key-Value Extraction (KIE)

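A plain BERT model only sees a one-dimensional text sequence. It has no idea whether the words "invoice code" sit in the top-left or the bottom-right corner of the page. The core trick of LayoutLM is multimodal input: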

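Text: the words recognized by OCR.
Layout: the bounding box OCR reports for each word, (x0, y0, x1, y1).
Image: the document image itself. LayoutLMv2 already added visual features through a CNN backbone; v3 replaces that with linear embeddings of image patches.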

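It fuses the three, so it can learn patterns such as: "content that usually sits in the top-right corner, is set in a larger font, and consists of digits is most likely the invoice number".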

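1. Task definition: the BIO tagging scheme

In production, KIE is usually cast as a token classification (sequence labeling) problem. You define a label set, typically in BIO format:

B-TOTAL: beginning of the total amount (Begin)
I-TOTAL: inside the total amount (Inside)
O: irrelevant token (Outside)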

Example:
OCR result: ["合", "计", ":", "￥", "100", ".", "00"] ("合计" means "total")
Labels: [O, O, O, B-TOTAL, I-TOTAL, I-TOTAL, I-TOTAL]

Once the model has made its predictions, you concatenate the tokens tagged TOTAL (here "￥100.00") and that is your result.
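To fine-tune on this format, annotated field spans first have to be turned into per-token BIO tags. The following is a minimal sketch, assuming annotations arrive as (field, start, end) token-index spans with an exclusive end. Both that span format and the helper name `spans_to_bio` are hypothetical and not part of any library.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (field, start, end) token spans into a list of BIO tags.

    `end` is exclusive. The span format is an assumption for this sketch;
    adapt it to whatever your annotation tool exports.
    """
    tags = ["O"] * num_tokens
    for field, start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(field, start, end)}")
        tags[start] = f"B-{field}"          # first token of the field
        for i in range(start + 1, end):
            tags[i] = f"I-{field}"          # remaining tokens of the field
    return tags


tokens = ["合", "计", ":", "￥", "100", ".", "00"]
tags = spans_to_bio(len(tokens), [("TOTAL", 3, 7)])
print(list(zip(tokens, tags)))

# Reading the value back: join every token that is not tagged O.
total_text = "".join(t for t, tag in zip(tokens, tags) if tag != "O")
print(total_text)
```

Overlapping spans are not detected here; a later span simply overwrites an earlier one, so validate annotations upstream if that can happen in your data.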

2. Environment setup and data flow

LayoutLMv3 does not work on its own; it needs an OCR engine (such as PaddleOCR or Tesseract) in front of it.

Pipeline: Image -> OCR Engine -> (Words, Bboxes) -> LayoutLMv3 -> Entities (JSON)
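Many OCR engines report each text region as a four-point polygon rather than the axis-aligned (x0, y0, x1, y1) box that LayoutLM expects. The sketch below covers the "(Words, Bboxes)" step of the pipeline, assuming a hypothetical OCR output of (text, quad) pairs where each quad is four [x, y] points. The real structure depends on your engine and its version, so treat the input shape and both helper names as illustrative.

```python
def quad_to_bbox(quad):
    """Return the axis-aligned box [x0, y0, x1, y1] enclosing a 4-point polygon."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [min(xs), min(ys), max(xs), max(ys)]


def ocr_to_words_and_boxes(ocr_items):
    """Split (text, quad) pairs into parallel word and box lists.

    Items whose text is empty after stripping whitespace are skipped.
    """
    words, boxes = [], []
    for text, quad in ocr_items:
        text = text.strip()
        if not text:
            continue
        words.append(text)
        boxes.append(quad_to_bbox(quad))
    return words, boxes


# Hypothetical OCR output: a slightly rotated "No:", a blank region, a number.
ocr_items = [
    ("No:", [[60, 11], [80, 10], [80, 20], [60, 21]]),
    ("  ", [[0, 0], [1, 0], [1, 1], [0, 1]]),
    ("12345", [[90, 10], [150, 10], [150, 20], [90, 20]]),
]
words, boxes = ocr_to_words_and_boxes(ocr_items)
print(words)
print(boxes)
```

Taking the min and max of the corner points keeps rotated or skewed regions fully inside the resulting box, at the cost of a slightly looser fit.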

Install the dependencies:

Bash

pip install transformers torch pillow

3. Core code: from OCR output to LayoutLM input

This is where most people trip up. LayoutLM requires bounding boxes to be normalized to integers in the 0-1000 range.

Python

from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
import torch

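# 1. Normalization (required!)
def normalize_bbox(bbox, width, height):
    """Scale a pixel box [x0, y0, x1, y1] to integers in 0-1000."""
    def scale(value, size):
        # Clamp so boxes that spill past the image edge stay in range
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(bbox[0], width),
        scale(bbox[1], height),
        scale(bbox[2], width),
        scale(bbox[3], height),
    ]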

# 2. Simulated OCR result (in a real project this comes from PaddleOCR)
image_path = "invoice.jpg"
image = Image.open(image_path).convert("RGB")
width, height = image.size

# Suppose OCR returned the following words:
words = ["Invoice", "No:", "12345", "Date:", "2023-01-01"]
# and their pixel coordinates [x0, y0, x1, y1]
boxes = [
    [10, 10, 50, 20],   # Invoice
    [60, 10, 80, 20],   # No:
    [90, 10, 150, 20],  # 12345 (the value we want to extract)
    [10, 30, 50, 40],   # Date:
    [60, 30, 150, 40]   # 2023-01-01
]

# Normalize the coordinates
normalized_boxes = [normalize_bbox(box, width, height) for box in boxes]

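# 3. Load the model and processor
# Microsoft publishes pretrained checkpoints, but you need to fine-tune one
# before it can recognize the fields of your own documents.
# Here we assume you already have a fine-tuned model. The ID below stands in
# for a FUNSD-style demo checkpoint: check that it exists on the Hub, or
# replace it with the path to your own checkpoint.
model_id = "microsoft/layoutlmv3-base-finetuned-funsd"
# apply_ocr=False: we supply our own words and boxes, so the processor must
# not run its built-in Tesseract OCR (the default is True, which raises an
# error when words and boxes are also passed).
processor = LayoutLMv3Processor.from_pretrained(model_id, apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(model_id)

# 4. Build the input (encoding)
# The processor handles sub-word tokenization, special tokens and the
# attention mask; truncation=True cuts input that exceeds the model's
# maximum sequence length.
encoding = processor(
    image,
    words,
    boxes=normalized_boxes,
    truncation=True,
    return_tensors="pt"
)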

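# 5. Inference
with torch.no_grad():
    outputs = model(**encoding)

# 6. Read the predictions
# The logits hold one row per sub-word token (plus special tokens), not per word.
logits = outputs.logits
predictions = logits.argmax(-1).squeeze(0).tolist()

# Map token predictions back to words: each word takes the label of its first
# sub-word token. word_ids() is only available with a fast tokenizer.
labels = ["O"] * len(words)
seen = set()
for token_idx, word_idx in enumerate(encoding.word_ids()):
    if word_idx is None or word_idx in seen:
        continue  # special token, or a later sub-word of a word already labeled
    seen.add(word_idx)
    labels[word_idx] = model.config.id2label[predictions[token_idx]]

# Print the results
for word, label in zip(words, labels):
    print(f"{word}: {label}")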

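Example output (illustrative; the actual labels depend on the label set your checkpoint was fine-tuned with, and a model trained on FUNSD only emits HEADER, QUESTION and ANSWER tags):

Plaintext

Invoice: O
No:: O
12345: B-HEADER  <-- recognized!
Date:: O
2023-01-01: B-DATE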

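4. Post-processing: tokens to JSON (The Dirty Work)

The model outputs a flat list of labels, and you have to write the logic that aggregates them back into JSON. This is the dirtiest part of the engineering work, because it means merging pieces that were split apart during tokenization.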

Python

def extract_entities(words, labels):
    """Merge word-level BIO labels into a {field: text} dict."""
    entities = {}
    current_entity_key = None
    current_entity_value = []

    def flush():
        # Save the entity collected so far, if any.
        # Note: a field that occurs twice overwrites the earlier value.
        if current_entity_key:
            entities[current_entity_key] = " ".join(current_entity_value)

    for word, label in zip(words, labels):
        # O (Outside) closes whatever entity is open
        if label == "O":
            flush()
            current_entity_key = None
            current_entity_value = []
            continue

        # Parse the label, e.g. B-TOTAL -> key=TOTAL
        entity_key = label.split("-")[-1]
        is_begin = label.startswith("B-")

        if not is_begin and entity_key == current_entity_key:
            # I- tag that continues the open entity
            current_entity_value.append(word)
        else:
            # B- tag, or an I- tag with no matching open entity:
            # save the old entity, then start a new one
            flush()
            current_entity_key = entity_key
            current_entity_value = [word]

    # Save the last entity
    flush()
    return entities

# Final JSON-style result
result_json = extract_entities(words, labels)
print(result_json)
# With the example labels above: {'HEADER': '12345', 'DATE': '2023-01-01'}
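
A flat dict keeps only one value per field and drops the position of each value. For repeated fields such as line items, or for highlighting results on the page, a list of entities with a merged bounding box may be more useful. This is a sketch under the assumption that the words, labels and boxes are parallel word-level lists; `group_entities` and the `demo_*` data are hypothetical.

```python
def group_entities(words, labels, boxes):
    """Group word-level BIO labels into a list of entity dicts.

    Each entity carries the field name, the joined text, and the union of
    its word boxes as [x0, y0, x1, y1]. Repeated fields are all kept.
    """
    entities = []
    current = None
    for word, label, box in zip(words, labels, boxes):
        if label == "O":
            current = None
            continue
        field = label.split("-")[-1]
        continues = (
            current is not None
            and not label.startswith("B-")
            and current["field"] == field
        )
        if continues:
            # Extend the open entity and grow its box to cover this word.
            current["text"] += " " + word
            b = current["bbox"]
            current["bbox"] = [min(b[0], box[0]), min(b[1], box[1]),
                               max(b[2], box[2]), max(b[3], box[3])]
        else:
            current = {"field": field, "text": word, "bbox": list(box)}
            entities.append(current)
    return entities


demo_words = ["Total", "100", "USD", "Total", "250", "USD"]
demo_labels = ["O", "B-TOTAL", "I-TOTAL", "O", "B-TOTAL", "I-TOTAL"]
demo_boxes = [[10, 10, 50, 20], [60, 10, 90, 20], [95, 10, 120, 20],
              [10, 30, 50, 40], [60, 30, 90, 40], [95, 30, 120, 40]]
demo_entities = group_entities(demo_words, demo_labels, demo_boxes)
for entity in demo_entities:
    print(entity)
```

Both TOTAL values survive here, whereas a dict keyed by field name would keep only the second one.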

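5. When should you use LayoutLM?

LayoutLM is powerful, but it is no silver bullet.

Scenario A: common invoices and ID documents (e.g. VAT invoices, national ID cards)
- Don't use LayoutLM. These layouts are fixed; call a Baidu or Tencent API, or use one of the task-specific models that ship with PaddleOCR.
Scenario B: customs declarations, international logistics documents, non-standard contracts
- LayoutLM is the right tool here. These documents come in every layout imaginable and carry a lot of fields (dozens of key-value pairs). Maintaining regexes for them will wear you out.
Scenario C: plain text (e.g. a novel)
- Don't use it. BERT is enough; LayoutLM's coordinate information is just noise here.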

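Summary

LayoutLMv3 is a milestone in IDP (Intelligent Document Processing). It turns a post-processing stage that used to be a pile of hand-written rules into a trainable deep learning model that generalizes.

As an engineer, your focus shifts from writing regular expressions to cleaning annotated data. Both are tiring, but with the latter a new layout means retraining the model rather than rewriting code.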

About the author

zhangmu
