Table recognition is very slow, more than ten times slower than with the table model disabled #926

Closed
charliedream1 opened this issue Nov 11, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@charliedream1

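Description of the bug

Table recognition is very slow, more than ten times slower than running without the table model. Are there any configuration settings I should pay attention to? I have not installed the paddle-gpu build: table recognition did not seem to need paddle, so I skipped it. Is that the cause, or are there other settings to check? Each table takes 200-400 (presumably seconds) to recognize, so a 10-page PDF still has not finished converting after 20 minutes.

How to reproduce the bug

Enable and disable table recognition and compare the run times.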

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

@charliedream1 charliedream1 added the bug Something isn't working label Nov 11, 2024
@myhloli
Collaborator

myhloli commented Nov 11, 2024

We will release 0.9.3 this week. It integrates rapid table for table recognition: a single table takes 1-2 s, so it is both faster and more accurate.

@myhloli myhloli closed this as completed Nov 11, 2024
@charliedream1
Author

charliedream1 commented Nov 11, 2024 via email

@charliedream1
Author

charliedream1 commented Nov 11, 2024 via email

@myhloli
Collaborator

myhloli commented Nov 11, 2024

I ran some tests: every approach parses complex tables poorly, so for now we can only prioritize getting simple tables parsed reliably.

@charliedream1
Author

Complex tables seem hard to represent in Markdown. Would representing them in HTML work better?

@myhloli
Collaborator

myhloli commented Nov 12, 2024

> Complex tables seem hard to represent in Markdown. Would representing them in HTML work better?

Tables are already represented in HTML.

@chaoStart

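I originally installed magic-pdf with pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple, and the version was 0.17.0. Later I installed magic-doc and found that magic-pdf --version, which had worked before, now raised an error. So I uninstalled magic-pdf and reinstalled it, and the version became 0.10.6. Parsing PDF files is now much slower than before, and table recognition is slower still. In magic-pdf.json I set the table model parameter to tablemaster.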

@chaoStart

Is there a good way to fix this?

@chaoStart

Parsing a single file that contains tables took 2 h. I'm at my wits' end.

@myhloli
Collaborator

myhloli commented Dec 27, 2024

@chaoStart Uninstall both paddlepaddle and paddlepaddle-gpu, then reinstall paddlepaddle-gpu. If it is still slow, change the table configuration to rapid_table.
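
For readers who want to script the configuration change suggested above, here is a minimal sketch. It assumes the table settings sit under a "table-config" object with a "model" key inside magic-pdf.json; that key layout, the helper name set_table_model, and the file location in the usage note are assumptions, not something confirmed in this thread, so check them against your own config file first.

```python
import json
from pathlib import Path


def set_table_model(config_path, model="rapid_table"):
    """Set the table model in a magic-pdf.json-style file and return the updated config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # ASSUMPTION: table settings live under "table-config" with a "model" key.
    # Every other key in the file is left untouched.
    table_cfg = config.setdefault("table-config", {})
    table_cfg["model"] = model

    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return config
```

A call might look like set_table_model(Path.home() / "magic-pdf.json"), assuming the config file is in your home directory. Back up the file before running it.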

@boboyunz

boboyunz commented Jan 5, 2025

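I tested rapid_table, tablemaster, and struct_eqtable separately. For complex tables, tablemaster currently gives the best results, but it is indeed very slow: recognizing a single table takes 200-600 seconds. paddlepaddle is currently running on the CPU, even though MinerU has CUDA enabled. Could someone advise how to speed up tablemaster?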
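
One way to confirm the symptom described here (Paddle on the CPU while the rest of the pipeline uses CUDA) is to ask Paddle which build is installed and which device it selects. The sketch below is a diagnostic only, not a fix; the helper name paddle_report is made up for illustration. It uses importlib.metadata from the standard library plus paddle.device.is_compiled_with_cuda() and paddle.device.get_device(), and it still returns a report when Paddle is not installed.

```python
from importlib import metadata


def paddle_report():
    """Collect the installed Paddle wheels and, if importable, the device Paddle selects."""
    report = {"packages": {}, "compiled_with_cuda": None, "device": None}

    # Having both wheels installed at once is the situation the maintainer
    # suggests cleaning up, so list each one explicitly.
    for name in ("paddlepaddle", "paddlepaddle-gpu"):
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = None

    try:
        import paddle
    except ImportError:
        # Paddle is not importable; return what is known about the wheels.
        return report

    report["compiled_with_cuda"] = paddle.device.is_compiled_with_cuda()
    report["device"] = paddle.device.get_device()  # e.g. "cpu" or "gpu:0"
    return report


if __name__ == "__main__":
    print(paddle_report())
```

If compiled_with_cuda is False or device is "cpu", the CPU-only wheel is probably the one being imported, which is consistent with the reinstall advice earlier in the thread.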


4 participants