Document metadata, document IDs, delete from index and tests #36
except for add_to_index and delete_from_index
possible documents in the collection without metadata
Hey! First, thank you so much for this! This is the first community feature PR 🤗 I like the overall approach, thank you. I will try and review more in-depth later to make sure I haven't missed anything! After the first read, I'd have a few comments and one major comment/request change: I think
I think having (note: anything I refer to as I would also like the argument passed to index to remain named As for Thinking out loud: having document_ids means we could also have an optional Also, I can't express enough appreciation for this PR having a built-in test! This is definitely hypocritical of me but it's going to be in the contributing guidelines as something strongly appreciated 😄 |
I have nothing to add to Ben, just wanted to thank you @anirudhdharmarajan for this amazing PR!
Thanks for the implementation @anirudhdharmarajan. Made some changes to work with the llama_hub implementation and it works great. Really improves the utility of the library.
Thank you all for the kind feedback! @bclavie totally understandable about not changing arg names, happy to revert the names and keep the pid<>docid mapping in a separate file so that collection stays the same. Also fine with making the document_ids internal. I don't have strong feelings about it for this rev since the CRUD functionality is still so experimental and that's really the only place users would currently use them. That return flag is a good idea! I can throw that in there. Having the metadata also opens up filtering capabilities, but I'll save that for another time haha.
This'd be great!
Yeah that's also where I'm coming from... I've got no issue with users being able to pass explicit document_ids and/or returning them with the results, but I think they should be wholly separate from the collection and only manifest through the mapping for now!
Feel free to if you feel up for it, but don't feel obligated, this PR is already more than doing its job 😄
- Made document ids independent of collection and saved as its own map file
- Added full document return flag
- Updated tests
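The mapping described in these commits can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's actual code: the fixed-width splitter, the function names (`split_document`, `build_collection`, `save_map`, `load_map`) and the JSON file format are all hypothetical. The point is that the passage list (the collection) stays a plain list of strings, while the passage id to document id relationship lives in its own file.

```python
import json
from pathlib import Path


def split_document(text: str, chunk_size: int = 20) -> list[str]:
    """Naive fixed-width splitter, a stand-in for the real document splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_collection(documents, document_ids, chunk_size=20):
    """Return (collection, pid_docid_map).

    collection is a flat list of passages; pid_docid_map maps each
    passage's position in that list back to the document it came from.
    """
    if len(documents) != len(document_ids):
        raise ValueError("documents and document_ids must have the same length")
    collection = []
    pid_docid_map = {}
    for doc_id, text in zip(document_ids, documents):
        for passage in split_document(text, chunk_size):
            pid_docid_map[len(collection)] = doc_id
            collection.append(passage)
    return collection, pid_docid_map


def save_map(pid_docid_map, path: Path) -> None:
    # JSON object keys must be strings, so passage ids are stringified on write.
    path.write_text(json.dumps({str(pid): d for pid, d in pid_docid_map.items()}))


def load_map(path: Path) -> dict:
    # Convert the keys back to integers so lookups by passage id work again.
    return {int(pid): d for pid, d in json.loads(path.read_text()).items()}
```

Keeping the map separate means the collection file itself does not change shape, which is the property requested earlier in the thread.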
Moved to draft while I fix up the last remaining bits, it'll be around Sunday when it'll be ready for review again.
Sounds great, thanks a lot! FYI - I've just merged the initial CI step and enforcing ruff import sorting (&formatting/linting, but this was already the case locally). Might cause some conflicts but they should be relatively minor!
- Updated README and basic usage notebook
- Removed return_entire_source_document functionality because document splitting introduces overlaps
@bclavie Tests and code are updated! I held off on the |
and added TODOs to move tests
On the return_entire_document feature: in most use cases I want users to see the passage that contains the specific information fed to the LLM as part of the answer. With a document id I can always go back to the original source, and I would rather point a user to the PDF (etc.) than show extracted raw text.
Hey, this is brilliant! It looks good at first glance. I'll review properly in a bit (next few days most likely) and merge/revert with comments if I spot anything then. Thank you so much!
Was looking for the document ID feature to dedupe results after decomposing a query into sub-queries, thanks for adding this @anirudhdharmarajan
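The dedupe use case mentioned here could look something like the sketch below. It assumes each search hit is a dict carrying `document_id` and `score` keys; that shape, and the function name `dedupe_by_document`, are assumptions made for illustration rather than a confirmed API.

```python
def dedupe_by_document(result_lists):
    """Merge per-sub-query result lists, keeping the best-scoring hit per document.

    result_lists: one list of hits per sub-query. Each hit is assumed to be a
    dict with at least "document_id" and "score" keys.
    """
    best = {}
    for results in result_lists:
        for hit in results:
            doc_id = hit["document_id"]
            # Keep only the highest-scoring passage seen so far for this document.
            if doc_id not in best or hit["score"] > best[doc_id]["score"]:
                best[doc_id] = hit
    # Highest score first, one entry per source document.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Scores from different sub-queries are not necessarily comparable, so ranking the merged list by raw score is a simplification; the deduplication itself only relies on the document id.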
Watching this one daily. Appreciate the work!
@bclavie gentle bump on this PR!
Hey @anirudhdharmarajan, definitely haven’t forgotten, sorry! I’ve spent the last week or so trying to recover from a bad infection, hoping to be able to get this in in the next few days (for real this time)!
Ahhh, take the time to recover fully, no rush @bclavie! I was in a similar spot a couple of weeks ago, it can be really miserable.
lgtm overall! There are some things that I think could be made a bit simpler in the future but this is a first go I'm very happy with, thanks @anirudhdharmarajan. I've pushed a minor change to remove some code duplication & make sure process_corpus can still be used without document_ids. Let me know if you're happy with the changes!
Good catch on process_corpus! Changes look solid to me, thanks for reviewing @bclavie
This got bigger than I expected, but here's where I landed after a lot of back and forth (and some non-trivial COVID brain):
This fixes #25, where one can add optional document metadata on a per document basis, and it'll be returned in the search results. A few significant changes:
- .index() takes documents instead of collection
- document_ids are required, like chroma. Since there's a 1 to many relationship between documents and the passages they're split into and then indexed, we need a way to tie back documents fed to RAGPretrainedModel to do CRUD on the index
- documents, document_ids and document_metadatas
- delete_from_index() is now possible at the document level, and add_to_index() was updated to match. It needs more work though.

I know this wasn't what was originally envisioned in #28, so open to suggestions!
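To make the document-level delete concrete, here is a minimal sketch of how a passage id to document id mapping lets one document id remove all of its passages. It works on plain Python structures only; the function name `delete_documents` and the renumbering of the surviving passages are assumptions for illustration, and a real index would also have to drop the corresponding embeddings.

```python
def delete_documents(collection, pid_docid_map, document_ids):
    """Drop every passage belonging to the given documents.

    collection: flat list of passages.
    pid_docid_map: dict mapping passage position -> document id.
    Returns a new (collection, pid_docid_map) pair with passages renumbered.
    """
    to_delete = set(document_ids)
    new_collection = []
    new_map = {}
    for pid, passage in enumerate(collection):
        doc_id = pid_docid_map[pid]
        if doc_id in to_delete:
            continue  # one document id removes all of its passages
        new_map[len(new_collection)] = doc_id
        new_collection.append(passage)
    return new_collection, new_map
```

For example, deleting document "a" from passages `["a1", "a2", "b1"]` mapped to `{0: "a", 1: "a", 2: "b"}` leaves `["b1"]` with the map `{0: "b"}`.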