Add xnli #134

Open

gentaiscool wants to merge 2 commits into master

Conversation

@gentaiscool commented Oct 7, 2022

Adding xnli to lm-evaluation-harness

@jon-tow (Collaborator) left a comment

Hi, @gentaiscool 👋 Thanks for adding this!

I left a few change requests that boil down to promptsource not supporting the non-English XNLI sets. To make sure this task works out of the box, let's remove all non-English tasks for now.

Comment on lines +49 to +89
class XNLIFr(XNLI):
    DATASET_NAME = "fr"

class XNLIEs(XNLI):
    DATASET_NAME = "es"

class XNLIDe(XNLI):
    DATASET_NAME = "de"

class XNLIEl(XNLI):
    DATASET_NAME = "el"

class XNLIBg(XNLI):
    DATASET_NAME = "bg"

class XNLIRu(XNLI):
    DATASET_NAME = "ru"

class XNLITr(XNLI):
    DATASET_NAME = "tr"

class XNLIAr(XNLI):
    DATASET_NAME = "ar"

class XNLIVi(XNLI):
    DATASET_NAME = "vi"

class XNLITh(XNLI):
    DATASET_NAME = "th"

class XNLIZh(XNLI):
    DATASET_NAME = "zh"

class XNLIHi(XNLI):
    DATASET_NAME = "hi"

class XNLISw(XNLI):
    DATASET_NAME = "sw"

class XNLIUr(XNLI):
    DATASET_NAME = "ur"

Remove these classes. Unfortunately, English is currently the only language with promptsource support on the eval-hackathon branch (see here).
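For reference, a minimal English-only sketch that follows the same pattern as the classes above (XNLIEn and the "en" config name are assumptions; the excerpt does not show an English subclass, though the HuggingFace xnli dataset does expose an "en" config):

# Keep only the English subset for now, since it is the only XNLI language
# with promptsource templates on the eval-hackathon branch.
# XNLIEn is an assumed class name; adjust to whatever the PR defines.
class XNLIEn(XNLI):
    DATASET_NAME = "en"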

Comment on lines +37 to +45
def training_docs(self):
    if self.has_training_docs():
        return self.dataset["train"]

def validation_docs(self):
    if self.has_validation_docs():
        return self.dataset["validation"]


Add a test_docs method since the test set is available in the HuggingFace datasets.
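A minimal sketch of such a method, following the pattern of training_docs/validation_docs above (it assumes the test split is exposed as self.dataset["test"] and that has_test_docs() returns True for this task):

def test_docs(self):
    if self.has_test_docs():
        return self.dataset["test"]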

Comment on lines +94 to +107
XNLIFr,
XNLIEs,
XNLIDe,
XNLIEl,
XNLIBg,
XNLIRu,
XNLITr,
XNLIAr,
XNLIVi,
XNLITh,
XNLIZh,
XNLIHi,
XNLISw,
XNLIUr

Remove these tasks (see comment above about lack of promptsource support for non-English tasks).

def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "GEM/wiki_lingua_ar"

Change this key to a matching XNLI example, e.g. "xnli_en".
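A sketch of the helper with the corrected key (illustrative only; the English-only return value and the XNLIEn class from the earlier sketch are assumptions based on the review comments above):

def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "xnli_en"
    """
    # English-only for now; the non-English subsets lack promptsource support.
    return {"xnli_en": XNLIEn}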

Comment on lines +117 to +130
"xnli_fr": xnli.XNLIFr,
"xnli_es": xnli.XNLIEs,
"xnli_de": xnli.XNLIDe,
"xnli_el": xnli.XNLIEl,
"xnli_bg": xnli.XNLIBg,
"xnli_ru": xnli.XNLIRu,
"xnli_tr": xnli.XNLITr,
"xnli_ar": xnli.XNLIAr,
"xnli_vi": xnli.XNLIVi,
"xnli_th": xnli.XNLITh,
"xnli_zh": xnli.XNLIZh,
"xnli_hi": xnli.XNLIHi,
"xnli_sw": xnli.XNLISw,
"xnli_ur": xnli.XNLIUr,

Remove these tasks (see comment above about lack of promptsource support for non-English tasks).

@yongzx commented Oct 7, 2022

I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to evaluate on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend prompt-based evaluation to non-EN tasks. Am I right?

@jon-tow (Collaborator) commented Oct 7, 2022

Hi @yongzx ! That's one way to do it. You'd have to:

  1. Fork this big-science/lm-evaluation-harness repo and set up the Python environment.

git clone https://github.com/{fork-name}/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"

  2. Add the XNLI changes in this PR.

  3. Fork promptsource and work from the eval-hackathon branch here. (To tighten things up, you can later make this a submodule of lm-eval. See this harness fork that uses custom templates for a custom task).

pip uninstall promptsource  # Remove the version installed by the harness setup.
git clone --single-branch --branch eval-hackathon https://github.com/{fork-name}/promptsource
pip install -e ./promptsource

  4. Dump your prompt templates for the non-English subsets into the promptsource xnli template dir (a quick sanity check is sketched below).
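A minimal sketch of that sanity check, verifying that the dumped templates load directly from promptsource (the French subset, the validation split, and the promptsource/templates/xnli/fr/templates.yaml layout are assumptions for illustration):

from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Load whatever templates promptsource finds for the xnli/fr subset.
prompts = DatasetTemplates("xnli", "fr")
print(prompts.all_template_names)  # should list the names you just added

# Render one template against a single example to eyeball the output.
example = load_dataset("xnli", "fr", split="validation")[0]
rendered = prompts[prompts.all_template_names[0]].apply(example)
print(rendered)  # [prompt_text, target_text]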

Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the following in a Python interpreter:

import lm_eval
print(lm_eval.list_templates("xnli_fr"))

Once you see the templates listed, you should be ready to evaluate as usual.

Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!

@StellaAthena (Collaborator)

@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?

@Muennighoff

> @jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?

I didn't use eval harness, but https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py

@yongzx commented Oct 12, 2022

Thanks @jon-tow and @Muennighoff!!

@yongzx commented Oct 12, 2022

@jon-tow I actually did what you suggested. For instance:

>>> import lm_eval
>>> print(lm_eval.list_templates("xnli_de"))
['GPT-3 style', 'MNLI crowdsource', 'always/sometimes/never', 'based on the previous passage', 'can we infer', 'claim true/false/inconclusive', 'consider always/sometimes/never', 'does it follow that', 'does this imply', 'guaranteed true', 'guaranteed/possible/impossible', 'justified in saying', 'must be true', 'should assume', 'take the following as truth']

@yongzx commented Oct 12, 2022

But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy (as good as random for the three-way NLI task).

Will try with Niklas' repo.

@jon-tow (Collaborator) commented Oct 13, 2022

Thanks for the updates, @yongzx ! Did you obtain significantly different accuracies when using Niklas's repo? Re:

> But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy (as good as random for the three-way NLI task).

@yongzx commented Oct 13, 2022

I obtained the same accuracies with the BLOOM model, but with Niklas' repo I got better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with the eval harness yet.
