Add default document loader and parser for RAG #624

AgentGenie · 2025-01-22T19:20:24Z

Why are these changes needed?

Add document loader and parser (Docling) for RAG.
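
As a rough illustration of what a Docling-based parsing step might look like (a minimal sketch under stated assumptions, not the code added in this PR): the helper names `is_url` and `parse_to_markdown` are hypothetical, while `DocumentConverter`, `convert`, and `export_to_markdown` come from Docling's public API. The Docling import is deferred so the module still loads when the optional package is not installed.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when `source` looks like an http(s) URL rather than a local path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def parse_to_markdown(source: str) -> str:
    """Convert a local file or URL to Markdown with Docling (hypothetical helper)."""
    # Fail early on a missing local file, before touching the optional dependency.
    if not is_url(source) and not Path(source).exists():
        raise FileNotFoundError(f"No such document: {source}")

    # Deferred import: Docling is assumed to be an optional dependency.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()
```

The resulting Markdown string could then be chunked and indexed by whatever RAG pipeline consumes it; that part is outside this sketch.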

Related issue number

#438

Checks

I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/docs/contributor-guide/documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

marklysze · 2025-01-22T19:54:29Z

Thanks @AgentGenie!

A couple of things:

Can you change the function names from docline to docling?
Tests are failing because of missing packages. @davorrunje, is there anything in particular needed here when using new packages?

On the topic of Selenium, it may also be worth looking at the Crawl4AI package to scrape the page when loading a URL. They say it "Creates smart, concise Markdown optimized for RAG and fine-tuning applications."

marklysze · 2025-01-22T20:50:48Z

Thanks for updating the parser name. Can you also add an extra to pyproject.toml that lists all the additional packages required?
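
One way the code could react when those optional packages are absent is sketched below. This is only an assumption about how such a guard might look, not what the PR implements: `missing_packages` and `require_extra` are hypothetical helpers, and the extra name passed in is whatever the project ends up choosing. Only `importlib.util.find_spec` from the standard library is used.

```python
from importlib.util import find_spec


def missing_packages(required: list[str]) -> list[str]:
    """Return the top-level module names in `required` that cannot be imported."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in required if find_spec(name) is None]


def require_extra(extra: str, required: list[str]) -> None:
    """Raise ImportError naming the extra to install when any module is missing."""
    missing = missing_packages(required)
    if missing:
        raise ImportError(
            f"Missing optional packages {missing}; install the '{extra}' extra."
        )
```

A loader module could call `require_extra` once at import time so that a missing dependency produces a message pointing at the extra rather than a bare `ModuleNotFoundError`.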

davorrunje · 2025-01-23T13:21:58Z

> Thanks for updating the parser name, can you also add an extra and packages to the pyproject.toml for all the additional packages required.

I fixed the packaging and related stuff, will push my changes soon.

marklysze

Thanks @AgentGenie and @davorrunje

marklysze · 2025-01-23T21:22:51Z

Gemini test failure unrelated.

AgentGenie added 2 commits January 22, 2025 11:14

Add document loader with HTML rendering

6ac9502

Add Docling parser util for RAG

0702d27

AgentGenie requested a deployment to openai1 January 22, 2025 19:20 — with GitHub Actions Waiting

AgentGenie requested a review from marklysze January 22, 2025 19:20

Update parser name

47a035e

AgentGenie requested a deployment to openai1 January 22, 2025 20:38 — with GitHub Actions Waiting

Merge remote-tracking branch 'origin/main' into document_parser

462e1e2

refactoring

3af7735

davorrunje temporarily deployed to openai1 January 23, 2025 13:56 — with GitHub Actions Inactive

davorrunje had a problem deploying to openai1 January 23, 2025 13:56 — with GitHub Actions Failure

davorrunje temporarily deployed to openai1 January 23, 2025 13:56 — with GitHub Actions Inactive

marklysze approved these changes Jan 23, 2025

marklysze merged commit e2e6607 into main Jan 23, 2025
236 of 237 checks passed

marklysze deleted the document_parser branch January 23, 2025 21:22

davorrunje assigned marklysze Jan 25, 2025
