Add xnli #134
base: master
Conversation
Hi, @gentaiscool 👋 Thanks for adding this!

I left a few change requests that boil down to promptsource not supporting non-English XNLI sets. To make sure this task works out of the box, let's remove all non-English tasks for now.
```python
class XNLIFr(XNLI):
    DATASET_NAME = "fr"


class XNLIEs(XNLI):
    DATASET_NAME = "es"


class XNLIDe(XNLI):
    DATASET_NAME = "de"


class XNLIEl(XNLI):
    DATASET_NAME = "el"


class XNLIBg(XNLI):
    DATASET_NAME = "bg"


class XNLIRu(XNLI):
    DATASET_NAME = "ru"


class XNLITr(XNLI):
    DATASET_NAME = "tr"


class XNLIAr(XNLI):
    DATASET_NAME = "ar"


class XNLIVi(XNLI):
    DATASET_NAME = "vi"


class XNLITh(XNLI):
    DATASET_NAME = "th"


class XNLIZh(XNLI):
    DATASET_NAME = "zh"


class XNLIHi(XNLI):
    DATASET_NAME = "hi"


class XNLISw(XNLI):
    DATASET_NAME = "sw"


class XNLIUr(XNLI):
    DATASET_NAME = "ur"
```
Remove these classes. Unfortunately, English is currently the only language with promptsource support on the eval-hackathon branch (see here).
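Applying that here would leave only the English subclass, which (assuming the naming pattern of the classes above; the actual name in the PR may differ) might look like:

```python
class XNLIEn(XNLI):
    DATASET_NAME = "en"
```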
```python
def training_docs(self):
    if self.has_training_docs():
        return self.dataset["train"]

def validation_docs(self):
    if self.has_validation_docs():
        return self.dataset["validation"]
```
Add a `test_docs` method since the test set is available in HuggingFace `datasets`.
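A minimal sketch of what that method could look like, mirroring the accessors above (method name and split key follow the pattern in the diff):

```python
def test_docs(self):
    # The XNLI test split is labeled and ships with the HuggingFace
    # dataset, so it can be exposed like the train/validation splits.
    if self.has_test_docs():
        return self.dataset["test"]
```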
```python
XNLIFr,
XNLIEs,
XNLIDe,
XNLIEl,
XNLIBg,
XNLIRu,
XNLITr,
XNLIAr,
XNLIVi,
XNLITh,
XNLIZh,
XNLIHi,
XNLISw,
XNLIUr,
```
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
```python
def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
        "GEM/wiki_lingua_ar"
    """
```
Change this key to a matching XNLI example, e.g. `"xnli_en"`.
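With that change, the docstring example might read:

```python
def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
        "xnli_en"
    """
```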
"xnli_fr": xnli.XNLIFr, | ||
"xnli_es": xnli.XNLIEs, | ||
"xnli_de": xnli.XNLIDe, | ||
"xnli_el": xnli.XNLIEl, | ||
"xnli_bg": xnli.XNLIBg, | ||
"xnli_ru": xnli.XNLIRu, | ||
"xnli_tr": xnli.XNLITr, | ||
"xnli_ar": xnli.XNLIAr, | ||
"xnli_vi": xnli.XNLIVi, | ||
"xnli_th": xnli.XNLITh, | ||
"xnli_zh": xnli.XNLIZh, | ||
"xnli_hi": xnli.XNLIHi, | ||
"xnli_sw": xnli.XNLISw, | ||
"xnli_ur": xnli.XNLIUr, |
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to perform evaluation on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend prompt-based evaluation to non-EN tasks. Am I right?
Hi @yongzx! That's one way to do it. You'd have to:

1. Fork promptsource (the eval-hackathon branch) and add templates for the non-English subsets you need.
2. Install your promptsource fork into the same Python environment as the eval harness.
Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the following in a Python interpreter:
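A sketch of the kind of check described, using promptsource's `DatasetTemplates` (the exact snippet from the original comment isn't preserved here):

```python
# List the promptsource templates registered for the XNLI French subset.
from promptsource.templates import DatasetTemplates

fr_templates = DatasetTemplates("xnli", "fr")
print(fr_templates.all_template_names)
```

If the fork was installed correctly, this prints the names of the French templates you added; an empty list usually means the environment is still picking up the upstream promptsource.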
Once you see the templates listed, you should be ready to evaluate as usual. Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!
@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?
I didn't use the eval harness, but https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py
Thanks @jon-tow and @Muennighoff!!
@jon-tow I actually did what you suggested. But strangely, evaluating with the harness gave unexpected accuracies. Will try with Niklas' repo.
Thanks for the updates, @yongzx! Did you obtain significantly different accuracies when using Niklas's repo?
I obtained the same accuracies with the BLOOM model, but with Niklas' repo I got better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with eval-harness yet.
Adding xnli to lm-evaluation-harness