Detected titles have no text field, and title detection is inaccurate #177

Open
ZzYAmbition opened this issue Nov 21, 2024 · 1 comment


@ZzYAmbition

This is part of the JSON generated from parsing a PDF:
image
The text field is missing.
The models used are as follows:
image

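Title detection is inaccurate

Is there a way to recognize titles that wrap across lines, and titles that run together with the body text?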

image
image
These are the files used for recognition:
中医药单用_联合抗生素治疗社区获得性肺炎临床实践指南_李得民.pdf
桂枝茯苓胶囊临床应用指南(2021年)_《中成药治疗优势病种临床应用指南》标准化项目组.pdf

@wufan-tb
Collaborator

wufan-tb commented Nov 25, 2024

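Titles and text are detected independently and are not saved together (for example, if layout has 10 classes, you can think of text as an 11th class). In the post-processing stage, the text boxes are compared against the layout boxes, so that the text inside any layout box that contains text gets extracted. You can try setting merge2markdown to True in the config file and see how that looks. If the document layout is fairly complex, you can also try MinerU.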
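To make the post-processing idea concrete, here is a minimal sketch of matching independently detected text boxes to layout boxes by overlap. It is not the project's actual code: the function names `overlap_ratio` and `attach_text`, the dict keys `category`, `bbox`, and `text`, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def attach_text(layout_boxes: List[Dict], text_boxes: List[Dict],
                threshold: float = 0.5) -> List[Dict]:
    """Give each layout box a `text` field built from the text boxes it covers.

    Each text box goes to the layout box that covers the largest share of it,
    provided that share reaches `threshold` (an assumed value).
    """
    results = [dict(box, lines=[]) for box in layout_boxes]
    for tb in text_boxes:
        best_idx, best_ratio = None, threshold
        for idx, lb in enumerate(layout_boxes):
            ratio = overlap_ratio(tb["bbox"], lb["bbox"])
            if ratio >= best_ratio:
                best_idx, best_ratio = idx, ratio
        if best_idx is not None:
            results[best_idx]["lines"].append(tb)
    for res in results:
        # Rough reading order: top to bottom, then left to right.
        lines = sorted(res.pop("lines"),
                       key=lambda t: (t["bbox"][1], t["bbox"][0]))
        res["text"] = " ".join(t["text"] for t in lines)
    return results
```

Under this scheme, a title that wraps onto two lines ends up as one `text` string, as long as the layout detector draws a single title box around both lines. A layout box that no text box overlaps simply gets an empty string.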
