Questions about datas #3

Closed
lbourdois opened this issue Nov 10, 2022 · 1 comment

Comments

@lbourdois

Hi 😀

First of all, thank you for your very interesting work 🚀

I have two questions that I couldn't answer on my own (maybe I didn't search well enough), and I would appreciate your help.

  1. For a given task and a given language, I would like to know which prompt was used for finetuning. Take French summarization as an example: I searched for the prompts that were used for it, but I didn't find a list summarizing this information. PromptSource provides 2085 prompts in English, but nothing about translations into other languages. Does such a list exist? 🤔

  2. As a workaround for the previous point, I thought I could download the xP3mt dataset and read directly which prompts were used. The problem is that you can download all the data for a selected language, but you can't additionally filter on the task/(sub)dataset. Could this be added?
    Or even better, could you create individual multilingual datasets from the translations you have done? For example, you could upload an "mSamSum", the multilingual version of "SamSum", which is originally English-only. This would probably allow reuse in other works, especially monolingual ones. Taking French summarization as an example again, little data is currently available: Orangesum, XLSum and Wiki-lingua. Having easy access to translations of CNN Daily Mail, Gigaword, MultiNews, SamSum and XSum would make very interesting things possible 🤯
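
The filter requested in point 2 can be approximated on the user side. Below is a minimal sketch, assuming the data is stored as one file per (sub)dataset inside a per-language folder. The helper `select_files` and the file names in `example_paths` are hypothetical and only illustrate the idea; the real repository layout may differ.

```python
def select_files(paths, lang, dataset_keyword):
    """Keep paths under `<lang>/` whose file name mentions `dataset_keyword`."""
    selected = []
    for path in paths:
        # Split "fr/some_file.jsonl" into the language folder and the file name.
        folder, _, name = path.partition("/")
        if folder == lang and dataset_keyword.lower() in name.lower():
            selected.append(path)
    return selected


# Hypothetical file listing, for illustration only.
example_paths = [
    "fr/xp3_GEM_wiki_lingua_fr_train.jsonl",
    "fr/xp3_xquad_fr_train.jsonl",
    "en/xp3_GEM_wiki_lingua_en_train.jsonl",
]

# Keep only the French files that belong to the Wiki-lingua summarization data.
french_summarization = select_files(example_paths, "fr", "wiki_lingua")
print(french_summarization)
```

The resulting list could then be handed to whatever loader is in use, for instance through the `data_files` argument of `datasets.load_dataset`.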

@lbourdois

After opening all the datasets and referring to bigscience-workshop/promptsource#838, it turns out that you did not translate all the datasets from English to French, as I had understood. Instead, you added the French part of 8 multilingual datasets (available at https://huggingface.co/datasets/bigscience/xP3/viewer/fr/train) and translated the prompts into French for 3 of these 8 datasets.
So my questions are not relevant, my bad. I am closing the issue.
