feat: add Docling reader and node parser #16406
Signed-off-by: Panos Vagenas <[email protected]>
Thanks @logan-markewich for the quick reaction! 🙏 Regarding the unit tests, I guess it's still some silly path resolution glitch, or could it be that something is filtering out non-Python / test data files (e.g. JSON) on the fly? 🤔 Besides, Docling supports Python 3.10 and upwards, and I see that e.g. the coverage GH Action is hard-coded to 3.9. Perhaps you have some idea how to handle this?
@logan-markewich Or otherwise can you guide how to best address the two points below? I would namely also like to include the doc-related changes for:
However I experienced some discrepancies when building docs locally:
@vagenas we take care of docs manually; docs are manually published when a release of The prepare_for_build script takes care of everything, but I generally only run it before a release (hence it adding stuff unrelated to docling). This usually happens every 3ish days.
Description
Docling extracts PDF documents to a rich representation (incl. layout, tables etc.), which it can export to Markdown or JSON.
As outlined in the Docling Technical Report, Docling is based on two models developed by IBM Research, namely a DocLayNet-based layout analysis model and the TableFormer table recognition model.
This PR adds Docling support to LlamaIndex by introducing a reader (llama_index.readers.docling.DoclingReader), which can export to Markdown and JSON, and a node parser (llama_index.node_parser.docling.DoclingNodeParser), which can parse the above-mentioned JSON format to LlamaIndex nodes.
By using these extensions, LlamaIndex users will be able to leverage Docling's conversion quality as well as the rich metadata it can extract, as showcased in the example notebook of this PR.
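For orientation, here is a minimal sketch of how the two extensions might be combined. The import paths and class names are taken from the description above; the `export_type` argument, the `ExportType.JSON` value, and the helper name `load_docling_nodes` are assumptions for illustration and may not match this PR revision exactly. The imports sit inside the function so the snippet can be loaded even where the packages (which need Python 3.10+) are not installed.

```python
def load_docling_nodes(file_path):
    """Convert one document with Docling and split it into LlamaIndex nodes.

    Hypothetical helper, not part of the PR itself.
    """
    # Import paths as named in the PR description.
    from llama_index.readers.docling import DoclingReader
    from llama_index.node_parser.docling import DoclingNodeParser

    # Assumption: the reader selects its output format via `export_type`;
    # JSON is the format the node parser is described as consuming.
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    node_parser = DoclingNodeParser()

    # `load_data` and `get_nodes_from_documents` are the standard
    # LlamaIndex reader / node parser entry points.
    documents = reader.load_data(file_path=file_path)
    return node_parser.get_nodes_from_documents(documents)
```

Calling `load_docling_nodes("some.pdf")` (path hypothetical) would then return nodes that could be passed to an index such as `VectorStoreIndex(nodes=...)`.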
Dependencies
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Suggested Checklist:
make format; make lint to appease the lint gods