
34 manual token labeling #40

Merged
merged 18 commits into main from 34-manual-token-labeling
Mar 13, 2024

Conversation

joshuawe
Collaborator

Added a *.csv file that is created during the token labelling process.
While the pickle file (which still exists) has the structure dict[token_id, dict[label_name, label_value]], the CSV now has the form:

| token_id | Starts with space | Capitalized | Is Adjective | Is Adposition | ... |
| -------- | ----------------- | ----------- | ------------ | ------------- | --- |
| 0        | False             | True        | False        | False         | ... |
| 1        | False             | False       | True         | False         | ... |
| 2        | False             | False       | False        | False         | ... |
| ...      | ...               | ...         | ...          | ...           | ... |
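
(For reference, a minimal sketch of how a dict with this structure can be exported to such a CSV, assuming the export goes through pandas; the actual script and file path may differ:)

```python
import pandas as pd

# labelled_token_ids_dict: dict[token_id, dict[label_name, label_value]]
labelled_token_ids_dict = {
    0: {"Starts with space": False, "Capitalized": True},
    1: {"Starts with space": False, "Capitalized": False},
}

# Each token id becomes a row and each label name a column.
df = pd.DataFrame.from_dict(labelled_token_ids_dict, orient="index")
df.index.name = "token_id"
df.to_csv("token_labels.csv")  # illustrative path, not the repo's actual file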

@joshuawe joshuawe linked an issue Feb 18, 2024 that may be closed by this pull request
@joshuawe joshuawe requested a review from jaidhyani February 18, 2024 19:32
@jaidhyani
Collaborator

This is looking good. Here's what I think we want to do to finalize it:

  1. Factor out the core token-labelling functionality into token_labelling.py (basically lines 49-88). This probably looks like a function that takes a tokenizer as an argument and returns a labelled_token_ids_dict (or a similar data structure that fully specifies token labels); see the sketch after this list.

  2. Adapt the current script into an export_labels script with three arguments: (1) the model/tokenizer (as in the current script), (2) the output format (csv or pkl), and (3) the output path. The output path should be a mandatory argument without a default. This script will call the function defined in (1) and essentially just handle the file writing.

  3. Back in token_labelling.py, add a function import_token_labels which takes a path argument (to a CSV) and returns a labelled_token_ids_dict-like data structure specifying token labels.

  4. Move the generated data files to delphi/static/ (see the currently-a-draft PR Added static file folder #41)
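
A rough sketch of the two functions this plan implies (names and signatures are illustrative, not final; the CSV parsing assumes the True/False format shown above):

```python
import csv

from transformers import PreTrainedTokenizerBase


def label_tokens_from_tokenizer(
    tokenizer: PreTrainedTokenizerBase,
) -> dict[int, dict[str, bool]]:
    """Step 1: label every token in the tokenizer's vocabulary."""
    ...  # core labelling logic factored out of the current script


def import_token_labels(path: str) -> dict[int, dict[str, bool]]:
    """Step 3: read a labels CSV back into a labelled_token_ids_dict."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return {
            int(row["token_id"]): {
                label: value == "True"
                for label, value in row.items()
                if label != "token_id"
            }
            for row in reader
        }
```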

@joshuawe joshuawe force-pushed the 34-manual-token-labeling branch from 89ced51 to d970d8f Compare February 22, 2024 17:46
@jettjaniak
Contributor

Nice! The CSV needs a column with the token label, though.

@joshuawe
Collaborator Author

> Nice! The CSV needs a column with the token label, though.

@jettjaniak Could you elaborate on what you mean?
Would you like a header line that explains the meaning of each column?

@jaidhyani
Collaborator

This is looking good - nice.

@joshuawe joshuawe force-pushed the 34-manual-token-labeling branch from 4f21d12 to 7a16cc8 Compare February 28, 2024 08:25
@joshuawe joshuawe marked this pull request as ready for review February 29, 2024 21:27
assert is_valid_structure(labelled_token_ids_dict) == True


@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
Contributor

what does it do?

Collaborator Author

This decorator makes another test (the one it depends on) run first, and its results can then be reused.
So here, we first create a dict with the token labels in one test and use those results to test subsequent functions of our library.
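
For context, a minimal sketch of the pattern being described (it relies on the pytest-dependency plugin; the shared result itself has to live in module state, since the plugin only handles skipping dependents when the first test fails):

```python
import pytest

results = {}  # module state shared between tests


@pytest.mark.dependency()
def test_label_tokens_from_tokenizer():
    results["labels"] = {0: {"Capitalized": True}}  # stand-in for real labelling
    assert results["labels"]


@pytest.mark.dependency(depends=["test_label_tokens_from_tokenizer"])
def test_structure():
    # Skipped automatically if the test above failed.
    assert isinstance(results["labels"], dict)
```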

Contributor

I see. I don't think this is good practice; I suggest you use pytest fixtures instead (they're built in and don't require additional packages): https://docs.pytest.org/en/6.2.x/fixture.html
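
For example, a module-scoped fixture would compute the labels once and share them across all tests in the file, which also addresses the wasted-compute concern (a sketch, assuming the label_tokens_from_tokenizer helper discussed above; the model name is illustrative):

```python
import pytest
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl


@pytest.fixture(scope="module")
def labelled_token_ids_dict():
    # Computed once per test module and reused by every test requesting it.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return tl.label_tokens_from_tokenizer(tokenizer)


def test_is_valid_structure(labelled_token_ids_dict):
    assert tl.is_valid_structure(labelled_token_ids_dict)
```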

Collaborator Author

In general, I disagree about good practice. In bigger projects with large numbers of tests, you would not want to waste compute on tests that depend on other tests anyway.

Here we do not have a large test suite, so if you prefer, I can go with fixtures and run the same code from test_label_tokens_from_tokenizer a second time.

@jettjaniak
Contributor

I don't see a use case for *.pkl anymore

@joshuawe
Collaborator Author

joshuawe commented Mar 1, 2024

Resolved two comments of yours.

Should I get rid of the *.pkl option entirely before merging? @jettjaniak

Comment on lines +122 to +141
def is_valid_structure(obj: dict[int, dict[str, bool]]) -> bool:
    """
    Checks whether the obj fits the structure of `dict[int, dict[str, bool]]`. Returns True, if it fits, False otherwise.
    """
    if not isinstance(obj, dict):
        print(f"Main structure is not dict! Instead is type {type(obj)}")
        return False
    for key, value in obj.items():
        if not isinstance(key, int) or not isinstance(value, dict):
            print(
                f"Main structure is dict, but its keys are either not int or its values are not dicts. Instead key is type {type(key)} and value is type {type(value)}"
            )
            return False
        for sub_key, sub_value in value.items():
            if not isinstance(sub_key, str) or not isinstance(sub_value, bool):
                print(
                    f"The structure dict[int, dict[X, Y]] is True, but either X is not str or Y is not bool. Instead X is type {type(sub_key)} and Y is type {type(sub_value)}"
                )
                return False
    return True
Contributor

beartype is doing this automatically
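
A sketch of what that looks like in practice (beartype validates arguments against the annotation at call time and raises instead of returning False):

```python
from beartype import beartype


@beartype
def check(labels: dict[int, dict[str, bool]]) -> None:
    pass


check({0: {"Capitalized": True}})   # passes
check({0: {"Capitalized": "yes"}})  # raises BeartypeCallHintParamViolation
```

Note that beartype keeps calls O(1) by sampling a single item per container rather than checking containers exhaustively, so it is fast but not guaranteed to catch every nested violation.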

Collaborator Author

Ha, nice. But weirdly enough, I needed this to catch a bug, because beartype did not throw an error when the type was dict[int, dict[str, np.array[bool]]].

@jettjaniak
Contributor

jettjaniak commented Mar 1, 2024

I think #47 and #51 should go with this PR
And please add the CSV to static when ready

@joshuawe
Collaborator Author

joshuawe commented Mar 5, 2024

  • get rid of pickling
  • add the token's string representation to the *.csv
  • rewrite the tests to not use the dependency plugin (slow version OR @pytest.skip)
  • add the token labelling CSV to static once it is created
  • add a comment / TODO / FIXME on the CSV export eating leading spaces (see the sketch below)
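
On the last item, one possible fix for the disappearing spaces is to quote every field on export, so leading spaces in token strings survive the round trip (a sketch, assuming the export goes through pandas):

```python
import csv

import pandas as pd

df = pd.DataFrame({"token_str": [" the", "the"]})
# csv.QUOTE_ALL wraps every field in quotes, preserving the
# leading space that distinguishes " the" from "the".
df.to_csv("token_labels.csv", quoting=csv.QUOTE_ALL)
```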

@jettjaniak
Contributor

@transcendingvictor your task is blocked by this, and Josh won't be available until the weekend. Could you complete the items in the comment above and merge this? Hit me up on Discord if something is unclear.

Contributor

this file should be renamed to spacy_* as well

from spacy.language import Language
from spacy.tokens import Doc
from transformers import AutoTokenizer

import delphi.eval.token_labelling as tl
Contributor

How did you rename the files? If you right-click and rename in VSCode, it should also rename all references.

Comment on lines 14 to 16
@pytest.skip("These tests are slow")
@pytest.fixture
def dummy_doc() -> tuple[str, Doc, dict[str, bool]]:
Contributor

this is a fixture, not a test, so you probably don't have to skip it


@pytest.skip("These tests are slow")
Contributor

I didn't realize how many there are, and that they're all in a single file. I believe you can just call pytest.skip() at the top level, just after the imports, to skip the whole file. The reason is "tests are slow and we're not using this module currently".
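
One caveat (a minimal sketch): when pytest.skip is called at module level rather than inside a test, it needs allow_module_level=True, otherwise pytest raises an error:

```python
import pytest

pytest.skip(
    "tests are slow and we're not using this module currently",
    allow_module_level=True,
)
```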

@jettjaniak jettjaniak merged commit 456958a into main Mar 13, 2024
1 check passed
@jettjaniak jettjaniak deleted the 34-manual-token-labeling branch March 13, 2024 17:09
@joshuawe
Collaborator Author

Thanks @transcendingvictor for doing this
