Adding Language Validation Test #257
laser_encoders/validate_models.py
from laser_encoders.models import initialize_encoder


def validate_language_models_and_tokenize():
Can we maybe use pytest.mark.parametrize for this test, instead of looping over the languages inside it?
This way, it would be easier to rerun it for a particular language, e.g. if we decide to fix a single language code.
Out of curiosity, do all languages pass the test?
Not all the languages pass the test.
@NIXBLACK11
Are you OK with this plan?
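The parametrize suggestion above could be sketched roughly as follows. The language table and the helper `resolve_codes` are hypothetical stand-ins for illustration, not the project's actual language list or encoder initialization.

```python
import pytest

# Hypothetical stand-in for the language table; the real one is defined in
# the project's language_list.py module and is much larger.
SAMPLE_LANGUAGES = {
    "kas": ["kas_Arab", "kas_Deva"],
    "wol": "wol_Latn",
    "zul": "zul_Latn",
}


def resolve_codes(lang):
    """Return every model code registered for a language, always as a list."""
    codes = SAMPLE_LANGUAGES[lang]
    return codes if isinstance(codes, list) else [codes]


# One test case is generated per language, so a single language can be
# selected on its own (for example with pytest's -k option) after a fix.
@pytest.mark.parametrize("lang", sorted(SAMPLE_LANGUAGES))
def test_language_has_model_code(lang):
    codes = resolve_codes(lang)
    assert codes, f"no model code registered for {lang}"
    assert all(isinstance(code, str) for code in codes)
```

With this layout, a failure is reported per language rather than stopping at the first broken one inside a loop.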
laser_encoders/download_models.py
@@ -71,7 +71,7 @@ def download(self, filename: str):
     def get_language_code(self, language_list: dict, lang: str) -> str:
         try:
             lang_3_4 = language_list[lang]
-            if isinstance(lang_3_4, tuple):
+            if isinstance(lang_3_4, list):
                 options = ", ".join(f"'{opt}'" for opt in lang_3_4)
                 raise ValueError(
                     f"Language '{lang_3_4}' has multiple options: {options}. Please specify using --lang."
Right now, I am getting the following error message:
ValueError: Language '['kas_Arab', 'kas_Deva']' has multiple options: 'kas_Arab', 'kas_Deva'. Please specify using --lang.
I don't like two things about it:
- I would expect the first part to look like "Language 'kas' has multiple options: 'kas_Arab', 'kas_Deva'.", so please replace the first occurrence of lang_3_4 with lang in this error message.
- The part "Please specify using --lang." looks obscure if we use the Python interface instead of the CLI, so please rephrase it as "Please specify using the 'lang' argument."
Thanks @avidale, made the changes.
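For reference, a minimal sketch of how the lookup might read with both requested wording changes applied. It is written as a standalone function rather than a method, and the `KeyError` handling is an assumption based on the "not found in language list" message quoted elsewhere in this review, so the actual code in the PR may differ.

```python
def get_language_code(language_list: dict, lang: str) -> str:
    """Resolve a language name to a single model code, or raise ValueError."""
    try:
        lang_3_4 = language_list[lang]
    except KeyError:
        # Assumed branch: the language name is not in the table at all.
        raise ValueError(
            f"language name: {lang} not found in language list. "
            "Specify a supported language name"
        )
    if isinstance(lang_3_4, list):
        # Several scripts share one language name, so the caller must choose.
        options = ", ".join(f"'{opt}'" for opt in lang_3_4)
        raise ValueError(
            f"Language '{lang}' has multiple options: {options}. "
            "Please specify using the 'lang' argument."
        )
    return lang_3_4
```

The message now names the language the caller passed in, and the hint applies to both the CLI and the Python interface.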
print(f"{lang} model validated successfully")


# This uses the mock downloader
This doesn't use the mock downloader? (L112)
laser_encoders/validate_models.py
for file_name in files:
    file_path = os.path.join(self.model_dir, file_name)
    if os.path.exists(file_path):
        return False
This will return early and won't check the other two files (laser2.spm and laser2.cvocab)?
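One way to avoid the early return is to collect every missing file before reporting, as in the sketch below. The helper name `missing_files` is hypothetical, and `laser2.pt` is an assumed name for the third file; only `laser2.spm` and `laser2.cvocab` are named in this thread.

```python
import os

# "laser2.spm" and "laser2.cvocab" come from the comment above;
# "laser2.pt" is an assumed name for the third file.
LASER2_FILES = ["laser2.pt", "laser2.spm", "laser2.cvocab"]


def missing_files(model_dir, files=LASER2_FILES):
    """Return every name in `files` that is absent from `model_dir`."""
    missing = []
    for file_name in files:
        file_path = os.path.join(model_dir, file_name)
        if not os.path.exists(file_path):
            missing.append(file_name)  # keep going instead of returning early
    return missing
```

An empty list then means all files are present, and a non-empty list says exactly which ones are not.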
laser_encoders/validate_models.py
def download_laser3(self, lang):
    lang = self.get_language_code(LASER3_LANGUAGE, lang)
    file_path = os.path.join(self.model_dir, f"laser3-{lang}.v1.pt")
    if os.path.exists(file_path):
you can simply say:
return os.path.exists(file_path)
@heffernankevin It should return the opposite of this, since the function returns True when there is an error and False when there is none.
To make it more readable, I would suggest raising an error if the lang code doesn't exist and then checking it using something like:
try:
    download_laser3(lang)
except:
    [...]
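A slightly fuller sketch of that raise-and-catch pattern follows. The `available` set, the choice of `ValueError`, and the helper `find_invalid` are assumptions made for illustration; only the `laser3-{lang}.v1.pt` file name pattern comes from the snippet under review.

```python
def download_laser3(lang, available):
    """Hypothetical stand-in that raises instead of returning an error flag."""
    if lang not in available:
        raise ValueError(f"No LASER3 model found for language code '{lang}'")
    return f"laser3-{lang}.v1.pt"


def find_invalid(langs, available):
    """Try every language and collect (lang, message) pairs for the failures."""
    failures = []
    for lang in langs:
        try:
            download_laser3(lang, available)
        except ValueError as err:  # catch the specific error, not a bare except
            failures.append((lang, str(err)))
    return failures
```

Raising removes the ambiguity about whether True means success or failure, which is what the boolean return value ran into above.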
        f"language name: {lang} not found in language list. Specify a supported language name"
    )


def download_laser3(self, lang):
IIUC @avidale's suggestion for the mock downloader was just to check if the language codes exist (and then have a real downloader for a couple of languages, like you have in test_models_initialization.py)? Maybe I misunderstood this comment: #257 (comment).
For example, you could parameterise it with the LASER3 langs, but the func download_laser3 inside the mock downloader just checks if the language code exists instead of actually downloading it.
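As a rough illustration of that idea, the mock could override only the download step so that nothing touches the network. Both classes below are hypothetical simplifications rather than the project's real downloader, and `known_files` is an assumed way of representing which model files exist.

```python
class BaseDownloader:
    """Hypothetical, simplified downloader (not the project's real class)."""

    def __init__(self, model_dir):
        self.model_dir = model_dir

    def download(self, filename):
        raise RuntimeError("real network download is out of scope for this sketch")

    def download_laser3(self, lang):
        filename = f"laser3-{lang}.v1.pt"
        self.download(filename)
        return filename


class MockDownloader(BaseDownloader):
    """Replaces the download step with a lookup in a set of known file names."""

    def __init__(self, model_dir, known_files):
        super().__init__(model_dir)
        self.known_files = set(known_files)

    def download(self, filename):
        # No network or disk access: only check that the name is known.
        if filename not in self.known_files:
            raise FileNotFoundError(f"{filename} is not a known model file")
```

Because `download_laser3` itself is inherited unchanged, the file name construction is still exercised for every language code.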
(resolved in chat)
This PR introduces a new slow validation test to ensure the availability of language models. The primary objectives of this PR are:
Language Code Validation: A validation process for language codes defined in the language_list.py module. This verification will help us ensure that the language codes are correct and align with the available models.