34 manual token labeling #40
Conversation
This is looking good. Here's what I think we want to do to finalize it:
Force-pushed from 89ced51 to d970d8f
Nice! The CSV needs a column with the token label, though.
@jettjaniak Could you elaborate on what you mean?
This is looking good - nice.
Force-pushed from 4f21d12 to 7a16cc8
tests/eval/test_token_labelling.py
Outdated
assert is_valid_structure(labelled_token_ids_dict) == True

@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
what does it do?
This decorator makes another test (the one it depends on) run first, so its results can be reused.
So here, we first create a dict with the token labels in one test and then use those results to test the subsequent functions of our library.
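For illustration, a minimal sketch of that pattern (hypothetical test and function names; it assumes the pytest-dependency plugin, which only enforces ordering and skipping - the labelled dict itself has to be shared via module-level state):

import pytest
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl

# Module-level cache: filled by the first test, reused by the dependent one.
labelled_token_ids_dict: dict[int, dict[str, bool]] = {}


@pytest.mark.dependency()
def test_label_tokens_from_tokenizer():
    global labelled_token_ids_dict
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical model name
    # hypothetical call; the real function name/signature may differ
    labelled_token_ids_dict = tl.label_tokens_from_tokenizer(tokenizer)
    assert len(labelled_token_ids_dict) > 0


@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
def test_downstream_function():
    # skipped automatically if the test above failed or did not run
    assert all(isinstance(k, int) for k in labelled_token_ids_dict)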
I see. I don't think this is good practice; I suggest you use pytest fixtures instead (built-in, no additional packages required): https://docs.pytest.org/en/6.2.x/fixture.html
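For comparison, a minimal fixture-based sketch of the same thing (hypothetical names; a session-scoped fixture runs once per test session and its return value is injected into every test that requests it):

import pytest
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl


@pytest.fixture(scope="session")
def labelled_token_ids_dict() -> dict[int, dict[str, bool]]:
    # computed once per session and cached by pytest
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical model name
    return tl.label_tokens_from_tokenizer(tokenizer)  # hypothetical signature


def test_structure(labelled_token_ids_dict):
    assert all(isinstance(k, int) for k in labelled_token_ids_dict)

With scope="session" the labelling work runs only once, even if many tests request the fixture.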
In general I disagree about good practice. In bigger projects with a large number of tests, you would not want to waste compute on tests that depend on other tests anyway.
Here we do not have a large test set, so if you prefer I can go with fixtures and run the same code from test_label_tokens_from_tokenizer a second time.
I don't see a use case for *.pkl anymore
Resolved two of your comments. Should I get rid of the *.pkl option entirely before merging? @jettjaniak
def is_valid_structure(obj: dict[int, dict[str, bool]]) -> bool:
    """
    Checks whether the obj fits the structure of `dict[int, dict[str, bool]]`. Returns True if it fits, False otherwise.
    """
    if not isinstance(obj, dict):
        print(f"Main structure is not dict! Instead is type {type(obj)}")
        return False
    for key, value in obj.items():
        if not isinstance(key, int) or not isinstance(value, dict):
            print(
                f"Main structure is dict, but its keys are either not int or its values are not dicts. Instead key is type {type(key)} and value is type {type(value)}"
            )
            return False
        for sub_key, sub_value in value.items():
            if not isinstance(sub_key, str) or not isinstance(sub_value, bool):
                print(
                    f"The structure dict[int, dict[X, Y]] is True, but either X is not str or Y is not bool. Instead X is type {type(sub_key)} and Y is type {type(sub_value)}"
                )
                return False
    return True
beartype is doing this automatically
Ha, nice. But weirdly enough I needed this to catch a bug, because beartype did not throw an error when the type was dict[int, dict[str, np.array[bool]]].
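For context, a rough sketch of the beartype behavior being discussed (hypothetical function; note that by default beartype spot-checks a sample of container items per call rather than exhaustively validating every nested entry, so deeply nested mismatches are not guaranteed to be caught):

from beartype import beartype


@beartype
def process_labels(labelled: dict[int, dict[str, bool]]) -> int:
    # hypothetical function; only the annotation matters for this sketch
    return len(labelled)


process_labels({0: {"is_noun": True}})  # fine

try:
    process_labels(["not", "a", "dict"])  # top-level type mismatch: beartype raises
except Exception as err:  # a BeartypeCallHintParamViolation
    print(type(err).__name__)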
@transcendingvictor your task is blocked by this and Josh won't be available until the weekend. Could you complete the items in the comment above and merge this? Hit me up on Discord if something is unclear.
tests/eval/test_token_labelling.py
Outdated
this file should be renamed to spacy_* as well
tests/eval/test_token_labelling.py
Outdated
from spacy.language import Language
from spacy.tokens import Doc
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl
How did you rename the files? If you right-click and rename in VSCode, it should also rename all references.
tests/eval/test_token_labelling.py
Outdated
@pytest.skip("These tests are slow")
@pytest.fixture
def dummy_doc() -> tuple[str, Doc, dict[str, bool]]:
this is a fixture, not a test, so you probably don't have to skip it
tests/eval/test_token_labelling.py
Outdated
@pytest.skip("These tests are slow")
I didn't realize how many there are and that they're all in a single file. I believe you can just do pytest.skip() at the top level, just after the imports, to skip the whole file. The reason is "tests are slow and we're not using this module currently".
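A small sketch of that suggestion (note that a bare pytest.skip() at module level needs allow_module_level=True, otherwise pytest raises a usage error; pytestmark = pytest.mark.skip(reason=...) is an equivalent alternative):

import pytest

pytest.skip(
    "tests are slow and we're not using this module currently",
    allow_module_level=True,
)

# everything below this call (imports, fixtures, tests) is skipped at collection time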
Thanks @transcendingvictor for doing this
Added a *.csv file that is created during the token labelling process.
While the pickle file (which still exists) has the structure of
dict[token_id, dict[label_name, label_value]]
the CSV now has the form