
34 manual token labeling #40

Merged
merged 18 commits into main from 34-manual-token-labeling
Mar 13, 2024

Conversation

joshuawe
Collaborator

Added a *.csv file that is created during the token labelling process.
While the pickle file (which still exists) has the structure dict[token_id, dict[label_name, label_value]], the CSV now has the form:

| token_id | Starts with space | Capitalized | Is Adjective | Is Adposition | ... |
| -------- | ----------------- | ----------- | ------------ | ------------- | --- |
| 0        | False             | True        | False        | False         | ... |
| 1        | False             | False       | True         | False         | ... |
| 2        | False             | False       | False        | False         | ... |
| ...      | ...               | ...         | ...          | ...           | ... |
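
(For reference, a minimal sketch of how a dict with this structure can be exported to such a CSV, assuming the export goes through pandas; the actual script and file path may differ:)

```python
import pandas as pd

# labelled_token_ids_dict: dict[token_id, dict[label_name, label_value]]
labelled_token_ids_dict = {
    0: {"Starts with space": False, "Capitalized": True},
    1: {"Starts with space": False, "Capitalized": False},
}

# Each token id becomes a row and each label name a column.
df = pd.DataFrame.from_dict(labelled_token_ids_dict, orient="index")
df.index.name = "token_id"
df.to_csv("token_labels.csv")  # illustrative path, not the repo's actual file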

@joshuawe joshuawe linked an issue Feb 18, 2024 that may be closed by this pull request
@joshuawe joshuawe requested a review from jaidhyani February 18, 2024 19:32
@jaidhyani
Collaborator

This is looking good. Here's what I think we want to do to finalize it:

  1. Factor out the core token-labelling functionality into token_labelling.py (basically lines 49-88). This probably looks like a function that takes a tokenizer as an argument and returns a labelled_token_ids_dict (or a similar data structure that fully specifies token labels); see the sketch after this list.

  2. Adapt the current script into an export_labels script with three arguments: (1) the model/tokenizer (as in the current script), (2) the output format (csv or pkl), and (3) the output path. The output path should be a mandatory argument without a default. This script will call the function defined in (1) and essentially just handle the file writing.

  3. Back in token_labelling.py, add a function import_token_labels which takes a path argument (to a CSV) and returns a labelled_token_ids_dict-like data structure specifying token labels.

  4. Move the generated data files to delphi/static/ (see the currently-a-draft PR Added static file folder #41)
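
A rough sketch of the two functions this plan implies (names and signatures are illustrative, not final; the CSV parsing assumes the True/False format shown above):

```python
import csv

from transformers import PreTrainedTokenizerBase


def label_tokens_from_tokenizer(
    tokenizer: PreTrainedTokenizerBase,
) -> dict[int, dict[str, bool]]:
    """Step 1: label every token in the tokenizer's vocabulary."""
    ...  # core labelling logic factored out of the current script


def import_token_labels(path: str) -> dict[int, dict[str, bool]]:
    """Step 3: read a labels CSV back into a labelled_token_ids_dict."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return {
            int(row["token_id"]): {
                label: value == "True"
                for label, value in row.items()
                if label != "token_id"
            }
            for row in reader
        }
```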

@joshuawe joshuawe force-pushed the 34-manual-token-labeling branch from 89ced51 to d970d8f Compare February 22, 2024 17:46
@jettjaniak
Contributor

Nice! The CSV needs a column with the token label, though.

@joshuawe
Collaborator Author

> Nice! The CSV needs a column with the token label, though.

@jettjaniak Could you elaborate on what you mean?
Would you like a header line that explains the meaning of each column?

@jaidhyani
Collaborator

This is looking good - nice.

@joshuawe joshuawe force-pushed the 34-manual-token-labeling branch from 4f21d12 to 7a16cc8 Compare February 28, 2024 08:25
@joshuawe joshuawe marked this pull request as ready for review February 29, 2024 21:27
assert is_valid_structure(labelled_token_ids_dict) == True


@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
Contributor

what does it do?

Collaborator Author

This decorator makes another test (the one it depends on) run first, and its results can then be reused.
So here, we first create a dict with the token labels in one test and use those results to test subsequent functions of our library.
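
For context, a minimal sketch of the pattern being described (it relies on the pytest-dependency plugin; the shared result itself has to live in module state, since the plugin only handles skipping dependents when the first test fails):

```python
import pytest

results = {}  # module state shared between tests


@pytest.mark.dependency()
def test_label_tokens_from_tokenizer():
    results["labels"] = {0: {"Capitalized": True}}  # stand-in for real labelling
    assert results["labels"]


@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
def test_structure():
    # Skipped automatically if the test above failed.
    assert isinstance(results["labels"], dict)
```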

Contributor

I see. I don't think this is good practice; I suggest you use pytest fixtures instead (they're built in and don't require additional packages): https://docs.pytest.org/en/6.2.x/fixture.html
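
For example, a module-scoped fixture would compute the labels once and share them across all tests in the file, which also addresses the wasted-compute concern (a sketch, assuming the label_tokens_from_tokenizer helper discussed above; the model name is illustrative):

```python
import pytest
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl


@pytest.fixture(scope="module")
def labelled_token_ids_dict():
    # Computed once per test module and reused by every test requesting it.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return tl.label_tokens_from_tokenizer(tokenizer)


def test_is_valid_structure(labelled_token_ids_dict):
    assert tl.is_valid_structure(labelled_token_ids_dict)
```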

Collaborator Author

In general, I disagree about good practice. In bigger projects with large numbers of tests, you would not want to waste compute on tests that depend on other tests anyway.

Here we do not have a large test suite, so if you prefer, I can go with fixtures and run the same code from test_label_tokens_from_tokenizer a second time.

@jettjaniak
Contributor

I don't see a use case for *.pkl anymore

@joshuawe
Collaborator Author

joshuawe commented Mar 1, 2024

Resolved two comments of yours.

Should I get rid of the *.pkl option entirely before merging? @jettjaniak

Comment on lines +122 to +141
def is_valid_structure(obj: dict[int, dict[str, bool]]) -> bool:
    """
    Checks whether the obj fits the structure of `dict[int, dict[str, bool]]`. Returns True, if it fits, False otherwise.
    """
    if not isinstance(obj, dict):
        print(f"Main structure is not dict! Instead is type {type(obj)}")
        return False
    for key, value in obj.items():
        if not isinstance(key, int) or not isinstance(value, dict):
            print(
                f"Main structure is dict, but its keys are either not int or its values are not dicts. Instead key is type {type(key)} and value is type {type(value)}"
            )
            return False
        for sub_key, sub_value in value.items():
            if not isinstance(sub_key, str) or not isinstance(sub_value, bool):
                print(
                    f"The structure dict[int, dict[X, Y]] is True, but either X is not str or Y is not bool. Instead X is type {type(sub_key)} and Y is type {type(sub_value)}"
                )
                return False
    return True
Contributor

beartype is doing this automatically
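
A sketch of what that looks like in practice (beartype validates arguments against the annotation at call time and raises instead of returning False):

```python
from beartype import beartype


@beartype
def check(labels: dict[int, dict[str, bool]]) -> None:
    pass


check({0: {"Capitalized": True}})   # passes
check({0: {"Capitalized": "yes"}})  # raises BeartypeCallHintParamViolation
```

Note that beartype keeps calls O(1) by sampling a single item per container rather than checking containers exhaustively, so it is fast but not guaranteed to catch every nested violation.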

Collaborator Author

Ha, nice. But weirdly enough, I needed this to catch a bug, because beartype did not throw an error when the type was dict[int, dict[str, np.array[bool]]].

@jettjaniak
Contributor

jettjaniak commented Mar 1, 2024

I think #47 and #51 should go with this PR
And please add the CSV to static when ready

@joshuawe
Collaborator Author

joshuawe commented Mar 5, 2024

  • get rid of pickling
  • add the token's string representation to the *.csv
  • rewrite the tests to not use the dependency plugin (slow version OR @pytest.skip)
  • add the token labelling CSV to static once it is created
  • add a comment / TODO / FIXME on the CSV export eating leading spaces (see the sketch below)
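
On the last item, one possible fix for the disappearing spaces is to quote every field on export, so leading spaces in token strings survive the round trip (a sketch, assuming the export goes through pandas):

```python
import csv

import pandas as pd

df = pd.DataFrame({"token_str": [" the", "the"]})
# csv.QUOTE_ALL wraps every field in quotes, preserving the
# leading space that distinguishes " the" from "the".
df.to_csv("token_labels.csv", quoting=csv.QUOTE_ALL)
```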

@jettjaniak
Contributor

@transcendingvictor your task is blocked by this, and Josh won't be available until the weekend. Could you complete the items in the comment above and merge this? Hit me up on Discord if something is unclear.

Contributor

this file should be renamed to spacy_* as well

from spacy.language import Language
from spacy.tokens import Doc
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl
Contributor

How did you rename the files? If you right-click and rename in VSCode, it should also rename all references.

Comment on lines 14 to 16
@pytest.skip("These tests are slow")
@pytest.fixture
def dummy_doc() -> tuple[str, Doc, dict[str, bool]]:
Contributor

this is a fixture, not a test, so you probably don't have to skip it


@pytest.skip("These tests are slow")
Contributor

I didn't realize how many there are, and that they're all in a single file. I believe you can just call pytest.skip() at the top level, just after the imports, to skip the whole file. The reason is "tests are slow and we're not using this module currently".
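
One caveat (a minimal sketch): when pytest.skip is called at module level rather than inside a test, it needs allow_module_level=True, otherwise pytest raises an error:

```python
import pytest

pytest.skip(
    "tests are slow and we're not using this module currently",
    allow_module_level=True,
)
```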

@jettjaniak jettjaniak merged commit 456958a into main Mar 13, 2024
1 check passed
@jettjaniak jettjaniak deleted the 34-manual-token-labeling branch March 13, 2024 17:09
@joshuawe
Collaborator Author

Thanks @transcendingvictor for doing this
