Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138
Conversation
Anyway, the return value could be wrong in the case where both document_splitter_fn and preprocessing_fn are None.
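A minimal sketch of the intended contract, with the function shape assumed rather than taken from the actual RAGatouille code: when both callables are None, the call should be a no-op that still returns the documents unchanged.

```python
# Hedged sketch -- name and signature are assumptions, not the real helper.
def preprocess(documents, document_splitter_fn=None, preprocessing_fn=None):
    if document_splitter_fn is not None:
        documents = document_splitter_fn(documents)
    if preprocessing_fn is not None:
        documents = preprocessing_fn(documents)
    # With both callables None, the untouched input must still come back.
    return documents

assert preprocess(["doc A", "doc B"]) == ["doc A", "doc B"]  # no-op case
```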
This test seems to succeed in CI but constantly fails on my machine due to an identical score for two passages. "Specific works"... would be in the 4th retrieved passage:

```
{'content': 'She met with Suzuki ... May 2006.', 'score': 24.421875, 'rank': 3, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 45}
{'content': 'Specific works that ... children.', 'score': 24.421875, 'rank': 4, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 70}
```
…e code: avoid a few duplicate lines by adding a new method, to make future changes easier. With even more work it might be possible to make .index() call .add_to_index() directly, or the other way around. Note: it feels like the 'split_documents' argument should be removed; it would be easier to just check whether 'document_splitter_fn' is None, which is how the 'preprocessing_fn' argument is already used. 'split_documents' is now left only on the public methods. A sketch of the resulting shape follows.
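A hedged sketch of the deduplication; index() and add_to_index() are real method names from the PR, but the helper name, signatures, and bodies are assumptions:

```python
class RAGPretrainedModel:
    """Illustrative only -- not the actual RAGatouille implementation."""

    def _process_corpus(self, documents, document_splitter_fn, preprocessing_fn):
        # One shared implementation instead of lines duplicated across the
        # public entry points; None simply means "skip that step".
        if document_splitter_fn is not None:
            documents = document_splitter_fn(documents)
        if preprocessing_fn is not None:
            documents = preprocessing_fn(documents)
        return documents

    def index(self, documents, document_splitter_fn=None,
              preprocessing_fn=None, split_documents=True):
        if not split_documents:  # flag kept only on the public methods
            document_splitter_fn = None
        documents = self._process_corpus(
            documents, document_splitter_fn, preprocessing_fn
        )
        ...  # build the index from `documents`

    def add_to_index(self, documents, document_splitter_fn=None,
                     preprocessing_fn=None, split_documents=True):
        if not split_documents:
            document_splitter_fn = None
        documents = self._process_corpus(
            documents, document_splitter_fn, preprocessing_fn
        )
        ...  # append `documents` to an existing index
```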
Add a test for the main formats. The test will currently fail; fixes come in later commits.
_process_raw_labeled_pairs() expects items in the format (query, passage, label). Thus we cannot differentiate triplets from labeled pairs by length alone; one possible heuristic is sketched below.
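A hedged sketch of such a heuristic, assuming labels are numeric or boolean (an illustration, not the code this PR adds):

```python
def looks_like_labeled_pair(item):
    # (query, passage, label): the third element is a numeric/boolean label.
    # (query, positive, negative): the third element is passage text.
    # Both have length 3, so len(item) alone cannot tell them apart.
    return len(item) == 3 and isinstance(item[2], (int, float, bool))

assert looks_like_labeled_pair(("q", "relevant passage", 1))
assert not looks_like_labeled_pair(("q", "positive passage", "negative passage"))
```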
Queries with no positives or no negatives would result in an out-of-bounds access; a guard like the one sketched below avoids it.
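A minimal sketch of the guard, with variable names assumed:

```python
for query, examples in examples_per_query.items():
    positives = [passage for passage, label in examples if label == 1]
    negatives = [passage for passage, label in examples if label == 0]
    if not positives or not negatives:
        # Indexing positives[0]/negatives[0] here would raise IndexError,
        # so skip queries that cannot form a complete training example.
        continue
```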
Fix prepare_training_data() when the raw_data contains dicts. This is the case e.g. when following a pattern similar to examples/02-basic_training.ipynb, where the data is first processed with CorpusProcessor; the shape involved is illustrated below.
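An illustration of the dict-shaped items; the exact keys are assumed from typical CorpusProcessor output, not a verbatim schema:

```python
# After CorpusProcessor, a passage may arrive as a dict rather than a string:
item = {"content": "Miyazaki co-founded Studio Ghibli...", "document_id": "doc-1"}

# The fix has to unwrap it before building training examples, e.g.:
passage = item["content"] if isinstance(item, dict) else item
```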
Use instance variables so that one can use multiple trainer objects in the same Python process. This was also needed to fix the test_prepare_training_data() test; otherwise only the first parametrized test will succeed. Instance variables feel more natural here than e.g. some pytest fixture change.
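A hedged before/after sketch of the change (class and attribute names are assumptions):

```python
class RAGTrainer:
    # Before: state defined at class level is shared by every instance,
    # so a second trainer in the same process sees the first one's data:
    #     collection = []

    def __init__(self):
        # After: per-instance state keeps each trainer independent, which is
        # also what lets every parametrized test start from a clean slate.
        self.collection = []
        self.queries = set()
```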
Also here the item can contain dicts when using a pattern similar to examples/02-basic_training.ipynb, where the data is first processed with CorpusProcessor.
Hey, this is hugely appreciated! I will properly review at a later point but just wanted to say thanks; the training part has definitely been in need of some love, especially to accommodate the document splitter (or rather, the splitter output needs some love).
Test training using examples/02-basic_training.ipynb as an example. An alternative would be to test the actual notebook, but that would probably bring in some additional dependencies.
Force-pushed from 10d7c38 to ec0bbbc.
Skip the test when CUDA is not available, which is the case in CI. I'm not quite sure what would be needed to make it work without CUDA. Currently, without CUDA the test fails with something like:

```
Process Process-1:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
  File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/training/training.py", line 55, in train
    colbert = torch.nn.parallel.DistributedDataParallel(colbert, device_ids=[config.rank],
  File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 712, in __init__
    self._log_and_throw(
  File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.
=========================== short test summary info ============================
FAILED tests/test_training.py::test_training - AssertionError: Timout in training
============= 1 failed, 67 passed, 4 skipped in 269.14s (0:04:29) ==============
Error: Process completed with exit code 1.
```
I didn't find any easy way to make test_training.py work without CUDA, so I added a conditional skip for now.
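The skip presumably looks something like the following sketch (the exact marker placement in the PR is an assumption):

```python
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="Training requires CUDA")
def test_training():
    ...
```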
```python
in all_results[1][2]["content"]
"Specific works that have influenced Miyazaki include Animal Farm (1945)"
in actual
or "She met with Suzuki" in actual
```
Nearly all looks good to me -- but just a question here: is there a specific reason to include `She met with Suzuki` here? The data processing looks like it still functions the same, so the initial test should still work unless I'm missing something.
The original test didn't work for me. I'm not sure what is different on my machine, but the 3rd and 4th results come in a different order. And, as the commit message says, they seem to have the same score.
This is strange... 🤔
The output I've been getting (the same locally on multiple machines and in CI) is:

```
{'content': "Specific works that have[...]",
 'score': 24.423484802246094,
 [...]},
{'content': 'She met with Suzuki in August 2005, [...]',
 'score': 24.417688369750977,
 [...]},
```
The scores are extremely similar, to the point where it's possibly due to hardware differences, but still kind of odd!
(when casting to half precision, they do both have the same score, and since that's what a lot of people do, I'd say the issue is fairly minor and can be blamed on different hardware configs)
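This checks out arithmetically: in fp16 the representable values around 24.4 are spaced 2^-6 = 0.015625 apart, so both fp32 scores round to the same number, which is exactly the tied 24.421875 seen in the failing test. A quick verification:

```python
import torch

a = torch.tensor(24.423484802246094)  # 'Specific works...' score (fp32)
b = torch.tensor(24.417688369750977)  # 'She met with Suzuki...' score (fp32)

# Near 24.4 the fp16 grid step is 2**-6 = 0.015625, so both scores
# round to the same representable value.
print(a.half().item(), b.half().item())  # 24.421875 24.421875
```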
Yes, this should fix #139.
Merging and releasing a
Multiple small fixes and some refactorings while trying to fix examples/02-basic_training.ipynb, which seems to have been broken for some time.
Disclaimer: I'm no expert on all the data formats usually used in RAG inference/training, so I may have gotten something wrong. Or perhaps the labeled pairs or the other patterns in example 2 are not really needed at all, since the bugs in prepare_training_data() feel somewhat fundamental... 😀