
Document metadata, document IDs, delete from index and tests #36

Merged: 21 commits merged into AnswerDotAI:main from custom-metadata on Jan 24, 2024

Conversation

@adharm (Collaborator) commented Jan 12, 2024

This got bigger than I expected, but here's where I landed after a lot of back and forth (and some non-trivial COVID brain):

This fixes #25: optional document metadata can now be attached on a per-document basis, and it will be returned with the search results. A few significant changes (see the usage sketch after this list):

  • .index() takes documents instead of collection
  • document_ids are required, as in Chroma. Since there's a one-to-many relationship between documents and the passages they're split into and then indexed, we need a way to tie passages back to the documents fed to RAGPretrainedModel in order to do CRUD on the index
  • metadata is optional, but there's a length-consistency check across documents, document_ids and document_metadatas
  • delete_from_index() is now possible at the document level, and add_to_index() was updated to match. It still needs more work, though.
  • Some basic tests were added to make sure metadata is consistently saved and served back, whether or not documents are split.
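
For concreteness, here's a rough sketch of what the new calls might look like. The argument names follow the bullets above, but the index name, document contents, IDs, and metadata values are made up, and the exact signatures may differ:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Two toy documents, each with a required ID and optional metadata.
documents = [
    "Hayao Miyazaki is a Japanese animator and film director.",
    "Studio Ghibli is a Japanese animation studio based in Tokyo.",
]
document_ids = ["miyazaki", "ghibli"]  # required, one per document
document_metadatas = [                 # optional; length must match documents
    {"source": "wikipedia", "topic": "people"},
    {"source": "wikipedia", "topic": "studios"},
]

RAG.index(
    index_name="my_index",
    documents=documents,               # previously named `collection`
    document_ids=document_ids,
    document_metadatas=document_metadatas,
)

# Document-level delete, tying back to the IDs passed at index time.
RAG.delete_from_index(index_name="my_index", document_ids=["ghibli"])
```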

I know this wasn't what was originally envisioned in #28, so open to suggestions!

@adharm marked this pull request as ready for review January 12, 2024 06:50
@bclavie (Collaborator) commented Jan 12, 2024

Hey!

First, thank you so much for this! This is the first community feature PR 🤗

I like the overall approach, thank you.

I will try and review more in-depth later to make sure I haven't missed anything!

After a first read, I have a few comments and one major change request: I think collection should remain just a data-free collection of strings (in the JSON, just a single list), at least for now/until a new major version, for two main reasons:

  1. Ensure compatibility with all the indexes already created; at the moment this change would break loading an existing index: self.collection = [x["content"] for x in self.collection_with_ids]
  2. Follow the upstream ColBERT convention, where a collection of passages is always assumed to be an indexed list, with the list index serving as the passage ID.

I think having document_ids to identify passages that are chunks of a longer document is a great approach, and Chroma definitely got it right here. For the time being, and to keep things as smooth as possible, I'd like it to be internal, however, and perhaps stored in an extra file for now. The most naive approach would be to put it in metadata.json, but since that would require len(metadata) == len(collection), it would also mean duplicating all the other metadata for every single chunk of the same document, which is a really bad idea storage-wise.

(Note: anything I refer to as p_id means the passage ID, i.e. the chunk ID/index in collection.json's list.)
Something like: collection.json (a list), passage_to_doc.json (a dict mapping p_id to doc_id), and metadata.json (a dict mapping doc_id to its metadata dict).
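
To make that concrete, here's a minimal sketch of what those three files might contain for two documents split into three passages. The file names and shapes are as proposed above, but the index directory path and example contents are made up, nothing final:

```python
import json
from pathlib import Path

# Hypothetical index directory; the real location depends on the index name.
index_dir = Path(".ragatouille/colbert/indexes/my_index")
index_dir.mkdir(parents=True, exist_ok=True)

# collection.json: a plain list of passage strings; the list index is the p_id,
# matching the upstream ColBERT convention.
collection = [
    "Passage 0: first chunk of document A.",
    "Passage 1: second chunk of document A.",
    "Passage 2: only chunk of document B.",
]

# passage_to_doc.json: maps each p_id to the doc_id it was chunked from.
passage_to_doc = {"0": "doc_a", "1": "doc_a", "2": "doc_b"}

# metadata.json: one metadata dict per doc_id, stored once per document
# instead of being duplicated for every chunk.
metadata = {
    "doc_a": {"source": "report.pdf", "year": 2023},
    "doc_b": {"source": "notes.md"},
}

for name, obj in [
    ("collection.json", collection),
    ("passage_to_doc.json", passage_to_doc),
    ("metadata.json", metadata),
]:
    (index_dir / name).write_text(json.dumps(obj, indent=2))
```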

I would also like the argument passed to index to remain named collection, even though it's technically a bit of a misnomer since it now accepts documents. There's a bit of an install base now, so I feel breaking argument-name changes should only be pushed when absolutely necessary, or land in major releases with deprecation warnings beforehand.


As for delete_from_index, I appreciate the feature! Everything CRUD is in a very rough/use-at-your-own-risk state right now, so no strong objection from me to it being just as rough as add_to_index; this is a growing feature! I'll try and play with it later.


Thinking out loud: having document_ids means we could also have an optional return_entire_document: bool = False flag, where, if set to True, the results returned are the full documents rather than just the most relevant chunks!


Also, I can't express enough appreciation for this PR having a built-in test! This is definitely hypocritical of me but it's going to be in the contributing guidelines as something strongly appreciated 😄

@bclavie self-requested a review January 12, 2024 12:54
@okhat (Collaborator) commented Jan 12, 2024

I have nothing to add to Ben's comments; I just wanted to thank you @anirudhdharmarajan for this amazing PR!

@0-hero commented Jan 12, 2024

Thanks for the implementation @anirudhdharmarajan. I made some changes to work with the llama_hub implementation and it works great. It really improves the utility of the library.

@adharm (Collaborator, Author) commented Jan 12, 2024

Thank you all for the kind feedback!

@bclavie totally understandable about not changing arg names, happy to revert the names and keep the pid<>docid mapping in a separate file so that collection stays the same.

Also fine with making the document_ids internal. I don't have strong feelings about it for this rev since the CRUD functionality is still so experimental and that's really the only place users would currently use them.

That return flag is a good idea! I can throw that in there. Having the metadata also opens up filtering capabilities, but I'll save that for another time haha.

@bclavie (Collaborator) commented Jan 12, 2024

> @bclavie totally understandable about not changing arg names, happy to revert the names and keep the pid<>docid mapping in a separate file so that collection stays the same.

This'd be great!

> Also fine with making the document_ids internal. I don't have strong feelings about it for this rev since the CRUD functionality is still so experimental and that's really the only place users would currently use them.

Yeah that's also where I'm coming from... I've got no issue with users being able to pass explicit document_ids and/or returning them with the results, but I think they should be wholly separate from the collection and only manifest through the mapping for now!

> That return flag is a good idea! I can throw that in there.

Feel free to if you feel up for it, but don't feel obligated, this PR is already more than doing its job 😄

- Made document ids independent of collection and saved as its own map file
- Added full document return flag
- Updated tests
@adharm marked this pull request as draft January 13, 2024 05:05
@adharm (Collaborator, Author) commented Jan 13, 2024

Moved to draft while I fix up the last remaining bits; it should be ready for review again around Sunday.

@bclavie (Collaborator) commented Jan 13, 2024

Sounds great, thanks a lot!

FYI: I've just merged the initial CI step, enforcing ruff import sorting (and formatting/linting, though this was already the case locally). It might cause some conflicts, but they should be relatively minor!

- Updated README and basic usage notebook
- Removed return_entire_source_document functionality because document splitting introduces overlaps
@adharm marked this pull request as ready for review January 16, 2024 00:05
@adharm (Collaborator, Author) commented Jan 16, 2024

@bclavie Tests and code are updated! I held off on the return_entire_document functionality because, with the way documents are split by default, you get passage overlap, so you can't just join all the passages back together without duplication. Deduplication adds computational overhead, and saving the original docs before preprocessing nearly doubles the storage needed, so I figured we'd want to talk about it more before implementing it.
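
To illustrate the overlap issue, here's a toy splitter sketch (the chunk size and overlap values are made up, not the library's actual defaults):

```python
# A toy character-level splitter with overlap, showing why naively joining
# chunks back together duplicates text at the boundaries.
def split_with_overlap(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    step = chunk_size - overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]

doc = "The quick brown fox jumps over the lazy dog."
chunks = split_with_overlap(doc)

rejoined = "".join(chunks)
print(len(doc), len(rejoined))  # the rejoined text is longer than the original
print(doc == rejoined)          # False: overlapping regions are duplicated
```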

@LarsAC commented Jan 16, 2024

On the return_entire_document feature: in most use cases I want users to see the passage that contains the specific information fed to the LLM as part of the answer. With a document ID I can always go back to the original source, and I would rather point a user to the PDF (etc.) than show extracted raw text.

@bclavie (Collaborator) commented Jan 16, 2024

Hey, this is brilliant! It looks good at first glance. I'll review properly in a bit (next few days most likely) and merge/revert with comments if I spot anything then. Thank you so much!

@nigh8w0lf commented

I was looking for the document ID feature to dedupe results after decomposing queries into sub-queries; thanks for adding this @anirudhdharmarajan!

@PrimoUomo89 (Contributor) commented Jan 21, 2024

Watching this one daily. Appreciate the work!

@adharm (Collaborator, Author) commented Jan 23, 2024

@bclavie gentle bump on this PR!

@bclavie (Collaborator) commented Jan 23, 2024

Hey @anirudhdharmarajan, definitely haven’t forgotten, sorry! I’ve spent the last week or so trying to recover from a bad infection, hoping to be able to get this in in the next few days (for real this time)!

@adharm (Collaborator, Author) commented Jan 23, 2024

Ahhh, take the time to recover fully, no rush @bclavie! I was in a similar spot a couple of weeks ago; it can be really miserable.

@bclavie (Collaborator) commented Jan 24, 2024

LGTM overall! There are some things that I think could be made a bit simpler in the future, but this is a first go I'm very happy with. Thanks @anirudhdharmarajan!

I've pushed a minor change to remove some code duplication & make sure process_corpus can still be used without document_ids. Let me know if you're happy with the changes!

@adharm (Collaborator, Author) commented Jan 24, 2024

Good catch on process_corpus!

Changes look solid to me, thanks for reviewing @bclavie

@bclavie merged commit a245fba into AnswerDotAI:main Jan 24, 2024
2 checks passed
@adharm deleted the custom-metadata branch January 24, 2024 16:51