
Could we add OpenHermes 2.5 dataset? #2

Open
yiouyou opened this issue Jan 7, 2024 · 3 comments

Comments

@yiouyou

yiouyou commented Jan 7, 2024

Thanks!

@jondurbin
Owner

It seems the dataset is 404'ing on huggingface.

Looking at the original openhermes, however, it includes:

  • GPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium
  • WizardLM (v1, evol_instruct 70k), by WizardLM Team/nlpxucan
  • Airoboros GPT-4 (v1.0), by JonDurbin
  • Camel-AI's domain expert datasets, by the Camel-AI Team
  • CodeAlpaca, by Sahil2801
  • GPT4-LLM and Unnatural Instructions, by Microsoft

I've already included airoboros and code alpaca, but I can look into the others. Is there particular functionality you find lacking in the model, or do you just want broader coverage of datasets in general?

@vgoklani

Thank you for sharing, @jondurbin. I would like to build a better Mistral Instruct 0.2 model from the Mistral base, and I'm looking for high-quality datasets with good coverage. With regard to the previous question, I think having datasets with broad coverage is important. I'm also looking for good synthetic datasets.
I'm curious: how do you evaluate dataset quality? Do you have a specific methodology? Thanks!

@jondurbin
Owner

> I'm curious: how do you evaluate dataset quality? Do you have a specific methodology? Thanks!

I don't have the resources to deeply evaluate all of the items within each dataset, so I somewhat rely on the dataset creators/curators to know what they are doing, plus a bit of intuition on my part.

In airoboros I have a cull-instructions entrypoint that shrinks the instruction set via approximate KNN search, then filters out bad responses using GPT-4 as a judge, which is very useful.
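The culling idea described above can be sketched roughly as follows. This is a toy illustration, not the actual airoboros code: the bag-of-words `embed` stands in for real learned embeddings, and a production pipeline would use an approximate nearest-neighbor index (e.g. faiss) rather than brute-force cosine comparisons.

```python
# Toy sketch of near-duplicate instruction culling: embed each
# instruction, then greedily drop any instruction that is too similar
# to one already kept.
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts (stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cull(instructions, threshold=0.8):
    kept, vecs = [], []
    for text in instructions:
        v = embed(text)
        # Keep only if no already-kept instruction is too similar.
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(text)
            vecs.append(v)
    return kept

pool = [
    "Write a poem about the ocean.",
    "Write a poem about the ocean!",   # near-duplicate, gets culled
    "Explain how TCP handshakes work.",
]
print(cull(pool))
```

After deduplication, the surviving responses would then go through a second pass where a judge model (GPT-4 in the description above) scores them and low-quality ones are dropped.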

There are other tools as well, like distilabel, which are handy for annotation.

For the DPO datasets, however, I try to use only the highest-quality items, which are either human-annotated or GPT-4-annotated, and I tend to filter down to a subset of those. A bit of noise in the SFT phase isn't too much of a problem, but it can cause havoc in the DPO phase.
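The filtering described above could be sketched like this. The field names (`source`, `chosen_score`, `rejected_score`) and the margin threshold are illustrative assumptions, not the schema of any actual dataset:

```python
# Hypothetical sketch: keep only DPO preference pairs from trusted
# annotation sources and with a clear score margin between the chosen
# and rejected responses.
TRUSTED_SOURCES = {"human", "gpt-4"}

def filter_dpo_pairs(pairs, min_margin=2.0):
    kept = []
    for p in pairs:
        if p["source"] not in TRUSTED_SOURCES:
            continue  # noise tolerable in SFT can derail DPO
        if p["chosen_score"] - p["rejected_score"] < min_margin:
            continue  # ambiguous pairs give a weak preference signal
        kept.append(p)
    return kept

pairs = [
    {"source": "human",   "chosen_score": 9.0, "rejected_score": 3.0},
    {"source": "gpt-3.5", "chosen_score": 9.0, "rejected_score": 2.0},
    {"source": "gpt-4",   "chosen_score": 7.0, "rejected_score": 6.5},
]
print(len(filter_dpo_pairs(pairs)))  # only the first pair survives
```

The design choice being made is that DPO trains directly on the preference signal, so an ambiguous or mislabeled chosen/rejected pair teaches the model the wrong preference, whereas in SFT a noisy example is just one more target among many.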
