Add Alpaca Persian Dataset #3633

Open
wants to merge 3 commits into main

pourmand1376
Contributor

Hi,
Over the last two days, I have been working on translating Alpaca into Persian (Farsi), and this is the result. I have reviewed the translations and they are, in my opinion, pretty good.

Also, the dataset is still being translated on Kaggle and will be finished in a couple of days. I will update the datasets accordingly when the translation is complete.

I have added two datasets: one is instruction-based and the other is an Orca-style dataset. I knew how to add the first one, but I don't know how to add the Orca-style dataset to your datasets.

Thank you for your attention.

@stefangrotz
Contributor

stefangrotz commented Aug 4, 2023

Hey, great work. I always wanted to translate this dataset to German or Esperanto. The main problem here is that the license of Alpaca isn't usable for Open Source LLMs because ChatGPT does not allow its output to be used to train other models. Because of that, it cannot be used for Open Assistant or for any commercial project.

However, having this dataset is surely useful for training experimental systems and for science projects.

BTW, do you know about the Alpaca Data Cleaned project? It fixed a lot of the errors in the dataset, such as wrong calculations: https://github.com/gururise/AlpacaDataCleaned

@pourmand1376
Contributor Author

pourmand1376 commented Aug 4, 2023

Hi, thanks for your comment.

Yes, I have used the cleaned version.

Sadly, I didn't know about the license restrictions. The dataset itself (Alpaca) is published under Apache 2.0. I have also published my dataset under Apache 2.0.

Isn't that good enough?

@stefangrotz
Contributor

stefangrotz commented Aug 4, 2023

Unfortunately not; see https://github.com/gururise/AlpacaDataCleaned#license
This is one of the main reasons why OA started to build up a crowdsourced conversational dataset.

Maybe you can translate the English and Spanish parts of the Open Assistant dataset instead? Both are quite big.
https://huggingface.co/datasets/OpenAssistant/oasst1
