So, upon taking another look, I'm not sure this is a great idea, because the dataset is already a large composite dataset that overlaps a fair amount with the existing bagel sources (platypus includes airo 1.4.1, this includes airo 2.1, etc). I could go through it and remove the dupes or select piecemeal; will need to think about it.
@jondurbin Another good idea is to add the dataset below instead, which only has coding data and would not need to be de-duped, as I don't think any of bagel overlaps with it. It's just something else to consider.
I have created a pretty extensive dataset which is missing from bagel, considering this is supposed to have "everything".
The filtered version is here:
https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_Tiny
For the full unfiltered version use this one if you want to filter and dedupe it yourself:
https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_1.6m_Evol
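For anyone who wants to dedupe the unfiltered version against sources already in bagel, here is a minimal sketch of exact-match dedup on normalized text. It is not how bagel does it; the `instruction` field name and the `drop_overlap` helper are assumptions for illustration, so adjust them to the actual dataset schema.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_overlap(candidates, existing, key="instruction"):
    """Keep candidate rows whose `key` text is not in `existing`.

    `key` is a hypothetical field name. Duplicates inside `candidates`
    are dropped too, keeping the first occurrence.
    """
    seen = {fingerprint(row[key]) for row in existing}
    kept = []
    for row in candidates:
        fp = fingerprint(row[key])
        if fp not in seen:
            seen.add(fp)
            kept.append(row)
    return kept
```

This only catches prompts that are identical after normalization. Near-duplicates (reworded or evolved instructions) would need fuzzy matching on top of this.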