
Document metadata, document IDs, delete from index and tests #36

Merged: 21 commits merged into AnswerDotAI:main from custom-metadata on Jan 24, 2024

Conversation

@adharm (Collaborator) commented Jan 12, 2024

This got bigger than I expected, but here's where I landed after a lot of back and forth (and some non-trivial COVID brain):

This fixes #25: optional document metadata can now be attached on a per-document basis, and it will be returned with the search results. A few significant changes (see the usage sketch after this list):

  • .index() takes documents instead of collection
  • document_ids are required, as in Chroma. Since there's a one-to-many relationship between documents and the passages they're split into and then indexed, we need a way to tie passages back to the documents fed to RAGPretrainedModel in order to do CRUD on the index
  • metadata is optional, but there's a length-consistency check across documents, document_ids and document_metadatas
  • delete_from_index() is now possible at the document level, and add_to_index() was updated to match. It still needs more work, though.
  • Some basic tests were added to make sure metadata is consistently saved and served back, whether or not documents are split.
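
For concreteness, here's a rough sketch of what the new calls might look like. The argument names follow the bullets above, but the index name, document contents, IDs, and metadata values are made up, and the exact signatures may differ:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Two toy documents, each with a required ID and optional metadata.
documents = [
    "Hayao Miyazaki is a Japanese animator and film director.",
    "Studio Ghibli is a Japanese animation studio based in Tokyo.",
]
document_ids = ["miyazaki", "ghibli"]  # required, one per document
document_metadatas = [                 # optional; length must match documents
    {"source": "wikipedia", "topic": "people"},
    {"source": "wikipedia", "topic": "studios"},
]

RAG.index(
    index_name="my_index",
    documents=documents,               # previously named `collection`
    document_ids=document_ids,
    document_metadatas=document_metadatas,
)

# Document-level delete, tying back to the IDs passed at index time.
RAG.delete_from_index(index_name="my_index", document_ids=["ghibli"])
```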

I know this wasn't what was originally envisioned in #28, so open to suggestions!

@adharm marked this pull request as ready for review January 12, 2024 06:50
@bclavie (Collaborator) commented Jan 12, 2024

Hey!

First, thank you so much for this! This is the first community feature PR 🤗

I like the overall approach, thank you.

I will try and review more in-depth later to make sure I haven't missed anything!

After a first read, I have a few comments and one major change request: I think collection should remain just a data-free collection of strings (in the JSON, just a single list), at least for now/until a new major version, for two main reasons:

  1. Ensure compatibility with all the indexes already created; at the moment this change would break loading an existing index: self.collection = [x["content"] for x in self.collection_with_ids]
  2. Follow the upstream ColBERT convention, where a collection of passages is always assumed to be an indexed list, with the list index serving as the passage ID.

I think having document_ids to identify passages that are chunks of a longer document is a great approach, and Chroma definitely got it right here. For the time being, and to keep things as smooth as possible, I'd like it to be internal, however, and perhaps stored in an extra file for now. The most naive approach would be to put it in metadata.json, but since that would require len(metadata) == len(collection), it would also mean duplicating all the other metadata for every single chunk of the same document, which is a really bad idea storage-wise.

(Note: anything I refer to as p_id means the passage ID, i.e. the chunk ID/index in collection.json's list.)
Something like: collection.json (a list), passage_to_doc.json (a dict mapping p_id to doc_id), and metadata.json (a dict mapping doc_id to its metadata dict).
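
To make that concrete, here's a minimal sketch of what those three files might contain for two documents split into three passages. The file names and shapes are as proposed above, but the index directory path and example contents are made up, nothing final:

```python
import json
from pathlib import Path

# Hypothetical index directory; the real location depends on the index name.
index_dir = Path(".ragatouille/colbert/indexes/my_index")
index_dir.mkdir(parents=True, exist_ok=True)

# collection.json: a plain list of passage strings; the list index is the p_id,
# matching the upstream ColBERT convention.
collection = [
    "Passage 0: first chunk of document A.",
    "Passage 1: second chunk of document A.",
    "Passage 2: only chunk of document B.",
]

# passage_to_doc.json: maps each p_id to the doc_id it was chunked from.
passage_to_doc = {"0": "doc_a", "1": "doc_a", "2": "doc_b"}

# metadata.json: one metadata dict per doc_id, stored once per document
# instead of being duplicated for every chunk.
metadata = {
    "doc_a": {"source": "report.pdf", "year": 2023},
    "doc_b": {"source": "notes.md"},
}

for name, obj in [
    ("collection.json", collection),
    ("passage_to_doc.json", passage_to_doc),
    ("metadata.json", metadata),
]:
    (index_dir / name).write_text(json.dumps(obj, indent=2))
```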

I would also like the argument passed to index to remain named collection, even though it's technically a bit of a misnomer since it now accepts documents. There's a bit of an install base now, so I feel breaking argument-name changes should only be pushed when absolutely necessary, or land in major releases with deprecation warnings beforehand.


As for delete_from_index, I appreciate the feature! Everything CRUD is in a very rough/use-at-your-own-risk state right now, so no strong objection from me to it being just as rough as add_to_index; this is a growing feature! I'll try and play with it later.


Thinking out loud: having document_ids means we could also have an optional return_entire_document: bool = False flag, where, if set to True, the results returned are the full documents rather than just the most relevant chunks!


Also, I can't express enough appreciation for this PR having a built-in test! This is definitely hypocritical of me but it's going to be in the contributing guidelines as something strongly appreciated 😄

@bclavie self-requested a review January 12, 2024 12:54
@okhat (Collaborator) commented Jan 12, 2024

I have nothing to add to Ben's comments; I just wanted to thank you @anirudhdharmarajan for this amazing PR!

@0-hero commented Jan 12, 2024

Thanks for the implementation @anirudhdharmarajan. I made some changes to work with the llama_hub implementation and it works great. It really improves the utility of the library.

@adharm (Collaborator, Author) commented Jan 12, 2024

Thank you all for the kind feedback!

@bclavie totally understandable about not changing arg names, happy to revert the names and keep the pid<>docid mapping in a separate file so that collection stays the same.

Also fine with making the document_ids internal. I don't have strong feelings about it for this rev since the CRUD functionality is still so experimental and that's really the only place users would currently use them.

That return flag is a good idea! I can throw that in there. Having the metadata also opens up filtering capabilities, but I'll save that for another time haha.

@bclavie (Collaborator) commented Jan 12, 2024

> @bclavie totally understandable about not changing arg names, happy to revert the names and keep the pid<>docid mapping in a separate file so that collection stays the same.

This'd be great!

> Also fine with making the document_ids internal. I don't have strong feelings about it for this rev since the CRUD functionality is still so experimental and that's really the only place users would currently use them.

Yeah that's also where I'm coming from... I've got no issue with users being able to pass explicit document_ids and/or returning them with the results, but I think they should be wholly separate from the collection and only manifest through the mapping for now!

> That return flag is a good idea! I can throw that in there.

Feel free to if you feel up for it, but don't feel obligated, this PR is already more than doing its job 😄

- Made document ids independent of collection and saved as its own map file
- Added full document return flag
- Updated tests
@adharm marked this pull request as draft January 13, 2024 05:05
@adharm (Collaborator, Author) commented Jan 13, 2024

Moved to draft while I fix up the last remaining bits; it should be ready for review again around Sunday.

@bclavie (Collaborator) commented Jan 13, 2024

Sounds great, thanks a lot!

FYI: I've just merged the initial CI step, enforcing ruff import sorting (and formatting/linting, though this was already the case locally). It might cause some conflicts, but they should be relatively minor!

- Updated README and basic usage notebook
- Removed return_entire_source_document functionality because document splitting introduces overlaps
@adharm marked this pull request as ready for review January 16, 2024 00:05
@adharm (Collaborator, Author) commented Jan 16, 2024

@bclavie Tests and code are updated! I held off on the return_entire_document functionality because, with the way documents are split by default, you get passage overlap, so you can't just join all the passages back together without duplication. Deduplication adds computational overhead, and saving the original docs before preprocessing nearly doubles the storage needed, so I figured we'd want to talk about it more before implementing it.
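
To illustrate the overlap issue, here's a toy splitter sketch (the chunk size and overlap values are made up, not the library's actual defaults):

```python
# A toy character-level splitter with overlap, showing why naively joining
# chunks back together duplicates text at the boundaries.
def split_with_overlap(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    step = chunk_size - overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]

doc = "The quick brown fox jumps over the lazy dog."
chunks = split_with_overlap(doc)

rejoined = "".join(chunks)
print(len(doc), len(rejoined))  # the rejoined text is longer than the original
print(doc == rejoined)          # False: overlapping regions are duplicated
```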

@LarsAC commented Jan 16, 2024

On the return_entire_document feature: in most use cases I want users to see the passage that contains the specific information fed to the LLM as part of the answer. With a document ID I can always go back to the original source, and I would rather point a user to the PDF (etc.) than show extracted raw text.

@bclavie (Collaborator) commented Jan 16, 2024

Hey, this is brilliant! It looks good at first glance. I'll review properly in a bit (next few days most likely) and merge/revert with comments if I spot anything then. Thank you so much!

@nigh8w0lf commented

I was looking for the document ID feature to dedupe results after decomposing queries into sub-queries; thanks for adding this @anirudhdharmarajan!

@PrimoUomo89 (Contributor) commented Jan 21, 2024

Watching this one daily. Appreciate the work!

@adharm (Collaborator, Author) commented Jan 23, 2024

@bclavie gentle bump on this PR!

@bclavie (Collaborator) commented Jan 23, 2024

Hey @anirudhdharmarajan, definitely haven’t forgotten, sorry! I’ve spent the last week or so trying to recover from a bad infection, hoping to be able to get this in in the next few days (for real this time)!

@adharm (Collaborator, Author) commented Jan 23, 2024

Ahhh, take the time to recover fully, no rush @bclavie! I was in a similar spot a couple of weeks ago; it can be really miserable.

@bclavie (Collaborator) commented Jan 24, 2024

LGTM overall! There are some things that I think could be made a bit simpler in the future, but this is a first go I'm very happy with. Thanks @anirudhdharmarajan!

I've pushed a minor change to remove some code duplication & make sure process_corpus can still be used without document_ids. Let me know if you're happy with the changes!

@adharm (Collaborator, Author) commented Jan 24, 2024

Good catch on process_corpus!

Changes look solid to me, thanks for reviewing @bclavie

@bclavie merged commit a245fba into AnswerDotAI:main Jan 24, 2024
2 checks passed
@adharm deleted the custom-metadata branch January 24, 2024 16:51