feat: add Docling reader and node parser #16406

vagenas · 2024-10-07T15:53:47Z

Description

Docling converts PDF documents into a rich representation (including layout, tables, etc.), which it can export to Markdown or JSON.
As outlined in the Docling Technical Report, Docling is based on two models developed by IBM Research, namely a DocLayNet-based layout analysis model and the TableFormer table recognition model.

This PR adds Docling support to LlamaIndex by introducing:

a Docling Reader (llama_index.readers.docling.DoclingReader), which can export to Markdown and JSON, and
a Docling Node Parser (llama_index.node_parser.docling.DoclingNodeParser), which can parse the above-mentioned JSON format to LlamaIndex nodes.

By using these extensions, LlamaIndex users will be able to leverage Docling's conversion quality as well as the rich metadata it can extract, as showcased in the example notebook of this PR.
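As a rough illustration of how the two pieces fit together, here is a minimal sketch, assuming both integration packages from this PR are installed. The helper name `pdf_to_nodes` and its `source` parameter are made up for this example and are not part of the PR; the export type, `load_data`, and `get_nodes_from_documents` calls follow the usage described above, but check the package README for the authoritative signatures.

```python
from typing import Any, List


def pdf_to_nodes(source: str) -> List[Any]:
    """Hypothetical helper: convert one PDF into LlamaIndex nodes via Docling's JSON export."""
    # Imported lazily so this module still loads where the two
    # integration packages are not installed.
    from llama_index.node_parser.docling import DoclingNodeParser
    from llama_index.readers.docling import DoclingReader

    # The node parser works on the JSON export, not the Markdown one.
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    documents = reader.load_data(file_path=source)

    # Turn the JSON documents into LlamaIndex nodes.
    node_parser = DoclingNodeParser()
    return node_parser.get_nodes_from_documents(documents=documents)
```

Calling `pdf_to_nodes("example.pdf")` (a placeholder path) should return a list of nodes that can be passed on to an index; using the reader alone with its default settings would yield Markdown documents instead.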

Dependencies

docling: Docling PDF conversion
docling-core: document data model and core transformations (e.g. chunking)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Yes
No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Added new unit/integration tests
Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
- ⚠️ added docstrings & an example notebook, but docs deployment within MkDocs remains to be clarified
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

Signed-off-by: Panos Vagenas <[email protected]>

review-notebook-app · 2024-10-07T15:53:53Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

vagenas · 2024-10-07T19:32:42Z

Thanks @logan-markewich for the quick reaction! 🙏

Regarding the unit tests, I guess it's still some silly path resolution glitch - or could it be there is something filtering out non-Python / test data files (e.g. JSON) on the fly? 🤔
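For context, one common way to rule out path resolution glitches of this kind is to resolve test data relative to the test module rather than the current working directory. The sketch below is only an illustration of that idea, not the fix that was applied in this PR; the helper name `resolve_data_file` and the `data` folder layout are assumptions.

```python
from pathlib import Path


def resolve_data_file(test_module_file: str, name: str) -> Path:
    """Return the absolute path of a data file stored next to a test module.

    Assumes a hypothetical layout where test data lives in a "data"
    folder that sits beside the test module.
    """
    # Anchor on the test module's own location instead of the working
    # directory, so the result does not depend on where pytest is invoked.
    test_dir = Path(test_module_file).resolve().parent
    return test_dir / "data" / name
```

Inside a test module this would typically be called as `resolve_data_file(__file__, "sample.json")`, where `sample.json` is a placeholder file name.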

Besides, Docling supports Python 3.10 and upwards & I see that e.g. the coverage GH Action is hard-coded to 3.9. Perhaps you have some idea how to handle this?

vagenas · 2024-10-07T22:37:41Z

@logan-markewich
Do you folks take care of updating the docs separately?

Or otherwise can you guide how to best address the two points below?

I would namely also like to include the doc-related changes for:

docs/mkdocs.yml
docs/docs/api_reference/node_parser/docling.md
docs/docs/api_reference/readers/docling.md
docs/docs/modules_guides/loading/connector/modules.md

However I experienced some discrepancies when building docs locally:

some .md files were causing the build to fail (e.g. docs/docs/api_reference/tools/oracleai.md), so I had to remove them locally for it to work, and
calling python docs/prepare_for_build.py to refresh docs/mkdocs.yml also adds other entries, not related to Docling.

logan-markewich · 2024-10-07T22:58:22Z

@vagenas we take care of docs manually, docs are manually published when a release of llama-index-core / llama-index is made (this is what /stable of the docs points to)

The prepare_for_build script takes care of everything, but I generally only run it before a release (hence it adding stuff unrelated to docling)

This usually happens every 3ish days

feat: add Docling reader and node parser

15a16e7

Signed-off-by: Panos Vagenas <[email protected]>

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Oct 7, 2024

logan-markewich self-assigned this Oct 7, 2024

logan-markewich and others added 2 commits October 7, 2024 11:40

BUILD files

11b1e1d

fix test file paths, switch to monkeypatch

c3d0012

Signed-off-by: Panos Vagenas <[email protected]>

logan-markewich added 5 commits October 7, 2024 14:50

Remove lock files

a4af005

make tests pass

2767299

remove data dir

52fd5e3

fix node parser tests

3879b3e

Merge branch 'main' into add-docling

ec1fc89

logan-markewich added 2 commits October 7, 2024 17:00

fix coverage check

f607ee3

one more try

87cfd50

logan-markewich approved these changes Oct 8, 2024

dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 8, 2024

logan-markewich merged commit 0b19dea into run-llama:main Oct 8, 2024
11 checks passed

logan-markewich pushed a commit that referenced this pull request Oct 8, 2024

feat: add Docling reader and node parser (#16406)

b0564af
