Data Preparation Stage

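The all2txt module converts unstructured or semi-structured files to txt with tools such as pdf2txt and epub2txt. It handles the reading-order problems caused by single-column and double-column layouts and by figures and tables interleaved with the text, which would otherwise leave the extracted text disjointed.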

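The parsed elements are classified into types including "Table", "FigureCaption", "NarrativeText" (body text), "ListItem" (reference entries), "Title" (section headings), "Address" (email addresses), "PageBreak", "Header", "Footer", "UncategorizedText" (the vertical arXiv identifier), "Image", and "Formula". The tool scripts can save the result in two forms: the full text, or the text grouped by element type.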
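
The two save forms can be illustrated with a minimal sketch. It assumes the parsed elements are available as (type, text) pairs in reading order; that representation and the function names keep_full_text and group_by_type are hypothetical and are not taken from pdf2txt.py.

```python
from collections import defaultdict

# Hypothetical parsed elements as (type, text) pairs in reading order.
# The real data structures used by pdf2txt.py are not shown in this README.
elements = [
    ("FigureCaption", "Fig. 1: The overall architecture of LayoutParser..."),
    ("ListItem", "[1] Abadi, M., Agarwal, A., Barham, P., Brevdo..."),
    ("FigureCaption", "Fig. 2: The relationship between the three types of..."),
]


def keep_full_text(elements):
    """Form 1: join every element's text in reading order, one per line."""
    return "\n".join(text for _, text in elements)


def group_by_type(elements):
    """Form 2: collect element texts into a dict keyed by element type."""
    grouped = defaultdict(list)
    for kind, text in elements:
        grouped[kind].append(text)
    return dict(grouped)


if __name__ == "__main__":
    print(keep_full_text(elements))
    print(group_by_type(elements))
```

Grouping preserves the reading order inside each type, so the per-type lists can still be read top to bottom.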

The following uses pdf2txt as an example (epub2txt works the same way):

1. Keep the full text (default)

python pdf2txt.py -i "input_path" -o "output_file"

Result:

Fig. 1: The overall architecture of LayoutParser...
Fig. 2: The relationship between the three types of...
Fig. 3: Layout detection and OCR results visualization...
[1] Abadi, M., Agarwal, A., Barham, P., Brevdo...
[2] Alberti, M., Pondenkandath, V., Würsch...
[3] Antonacopoulos, A., Bridson, D., Papadopoulos...

2. Save by element type

python pdf2txt.py -i "input_path" -o "output_file" --process_all

Result:

{
    "FigureCaption":[
        "Fig. 1: The overall architecture of LayoutParser...",
        "Fig. 2: The relationship between the three types of...",
        "Fig. 3: Layout detection and OCR results visualization..."
    ],
    "ListItem":[
        "[1] Abadi, M., Agarwal, A., Barham, P., Brevdo...",
        "[2] Alberti, M., Pondenkandath, V., Würsch...",
        "[3] Antonacopoulos, A., Bridson, D., Papadopoulos..."
    ]
}

Based on the element type, users can choose which kinds of data to extract.
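
One way to pick out specific types afterwards is sketched below. It assumes the file written with --process_all is JSON with the shape shown in the result above (the README shows the structure but does not state the file format), and the helper names select_types and load_and_select are hypothetical, not part of the tool.

```python
import json


def select_types(parsed, wanted):
    """Return only the requested element types from the per-type dict.

    Types that are absent from the parsed result are skipped silently.
    """
    return {kind: parsed[kind] for kind in wanted if kind in parsed}


def load_and_select(path, wanted):
    """Load a per-type result file (assumed to be JSON) and filter it."""
    with open(path, encoding="utf-8") as f:
        return select_types(json.load(f), wanted)


if __name__ == "__main__":
    # Same shape as the per-type result shown in this README.
    parsed = {
        "FigureCaption": ["Fig. 1: The overall architecture of LayoutParser..."],
        "ListItem": ["[1] Abadi, M., Agarwal, A., Barham, P., Brevdo..."],
    }
    # Keep only the reference entries.
    print(select_types(parsed, ["ListItem"]))
```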
