Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Language Validation Test #257

Merged

Conversation

NIXBLACK11
Copy link
Contributor

This PR introduces a new validation slow test to ensure availability of language models. The primary objectives of this PR are:

Language Code Validation: A validation process for language codes defined in the language_list.py module. This verification will help us ensure that the language codes are correct and align with the available models.

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Oct 31, 2023
from laser_encoders.models import initialize_encoder


def validate_language_models_and_tokenize():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe do the pytest.mark.parametrize thing with this test, instead of looping over the language inside it?
This way, it would be easier to rerun it for a particular language, if e.g. we decide to fix a single language code

@NIXBLACK11 NIXBLACK11 marked this pull request as ready for review November 1, 2023 13:17
@heffernankevin
Copy link
Contributor

Out of curiosity, do all languages pass the test?

@NIXBLACK11
Copy link
Contributor Author

Out of curiosity, do all languages pass the test?

Not all the languages pass the test.

@NIXBLACK11
Copy link
Contributor Author

After completing the tests for all the languages the languages that gave error were:
these

The output for the LASER2 language list is here.
The output for the LASER3 language list is here.

@avidale
Copy link
Contributor

avidale commented Nov 6, 2023

@NIXBLACK11
Thank you for identifying the languages for which the model initialization fails!
The next steps, I think, should be the following:

  1. Fix the problem with the two LASER3 languages: Kashmiri and Kanuri.
    These languages are special, in the sense that LASER3 supports two scripts for them, which should be specified explicitly.
    If they are not specified, we should raise a meaningful error (probably, a ValueError) with the description that tells the user to choose the script (e.g. ). Currently, we are raising a TypeError: unhashable type: 'list' for them, which is totally obscure for the user.

  2. Update your language validation test: for these two languages, it should check that a ValueError is raised.

  3. Fix the problem with the LASER2 languages (please do it in this PR). I hope, there is only one problem with them (so it will also fix An error initializing English pipeline (on the MLH-dev branch) #259).

  4. Write "fast" tests (the ones that will be run by Github on each commit), covering the two-script case (e.g. with Kashmiri) and the case for LASER2 (I suggest using English, as the most popular language).

Are you OK with this plan?

@@ -71,7 +71,7 @@ def download(self, filename: str):
def get_language_code(self, language_list: dict, lang: str) -> str:
try:
lang_3_4 = language_list[lang]
if isinstance(lang_3_4, tuple):
if isinstance(lang_3_4, list):
options = ", ".join(f"'{opt}'" for opt in lang_3_4)
raise ValueError(
f"Language '{lang_3_4}' has multiple options: {options}. Please specify using --lang."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, I am getting the following error message:

ValueError: Language '['kas_Arab', 'kas_Deva']' has multiple options: 'kas_Arab', 'kas_Deva'. Please specify using --lang.

I don't like two things about it:

  1. I would expect the first part to look like Language 'kas' has multiple options: 'kas_Arab', 'kas_Deva'. , so please replace the first occurrence of lang_3_4 with lang in this error message.
  2. The part Please specify using --lang. looks obscure if we use the Python interface instead of CLI. So please rephrase it as Please specify using the 'lang' argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avidale, made the changes,

print(f"{lang} model validated successfully")


# This uses the mock downloader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't use the mock downloader? (L112)

for file_name in files:
file_path = os.path.join(self.model_dir, file_name)
if os.path.exists(file_path):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return early won't check the other two files? (laser2.spm and laser2.cvocab)

def download_laser3(self, lang):
lang = self.get_language_code(LASER3_LANGUAGE, lang)
file_path = os.path.join(self.model_dir, f"laser3-{lang}.v1.pt")
if os.path.exists(file_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply say:

return os.path.exists(file_path)

Copy link
Contributor Author

@NIXBLACK11 NIXBLACK11 Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heffernankevin It should return opposite of this, as it returns true when there is error and false if there is no error.

Copy link
Contributor

@heffernankevin heffernankevin Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it more readable, I would suggest instead maybe raising an error if the lang code doesn't exist and then check it using something like:

try:
  download_laser3(lang)
except:
  [...]

f"language name: {lang} not found in language list. Specify a supported language name"
)

def download_laser3(self, lang):
Copy link
Contributor

@heffernankevin heffernankevin Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC @avidale's suggestion for the mock downloader was just to check if the language codes exist? (and then have a real downloader for a couple of languages like you have in test_models_initialization.py?). Maybe I misunderstood this comment: #257 (comment).

For example, you could parameterise it with the LASER3 langs, but the func download_laser3 inside the mock downloader just checks if the language code exists instead of actually downloading it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(resolved in chat)

@heffernankevin heffernankevin merged commit b0131d9 into facebookresearch:MLH-dev Nov 14, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants