
add token labelling #21

Merged: 29 commits from 12-categorize-tokens into main on Feb 17, 2024

Conversation

@joshuawe (Collaborator) commented Feb 2, 2024


No description provided.

@joshuawe joshuawe added the feature New feature or request label Feb 2, 2024
@joshuawe joshuawe linked an issue Feb 2, 2024 that may be closed by this pull request
@joshuawe (Collaborator, Author) commented Feb 2, 2024


Are the files that have been added stored in a good location?
I think not.

@menamerai (Collaborator) commented:

@joshuawe I think you need to add spacy to the requirements.txt

@jettjaniak (Contributor) left a comment:

Let's make a pickle with dict[int, dict[str, bool]] that labels every single token in the tokenizer.

src/delphi/eval/token_labelling.py (outdated review thread):
print(" ", label.ljust(10), key)


def label_single_token(token: Token) -> List:
Contributor review comment:

I would like to have a function that just takes a str token and labels it. Should we return this as a dict?

@joshuawe (Collaborator, Author) commented Feb 8, 2024


Have a look at the notebook, which shows the function calls as well as the creation of the pickle object. The pickle is basically a look-up table: a dict with token_id keys and token_labels values.

Note that token_labels is itself a dict, mapping label_name keys to bool values.

For example:

token_ids_labelled = {
  0: {'Starts with space': False, 'Capitalized': True, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False},
  1: {'Starts with space': False, 'Capitalized': True, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False},
  2: ...
...
}

The questions that remain are:

  1. Where to store the pickle.
  2. How to make it accessible. Should it always be loaded into memory when the module is imported?
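
For reference, a minimal sketch of saving and loading such a look-up table with pickle (the path and variable names here are illustrative, not the PR's final choices; the discussion below settles on storing the file in the eval directory):

```python
import pickle
from pathlib import Path

# Illustrative path; the discussion below settles on the eval directory.
PICKLE_PATH = Path("src/delphi/eval/labelled_token_ids_dict.pkl")

# Save the look-up table (token_id -> {label_name: bool}).
with PICKLE_PATH.open("wb") as f:
    pickle.dump(token_ids_labelled, f)

# Load it back lazily, e.g. inside a function, rather than at module import time.
with PICKLE_PATH.open("rb") as f:
    token_ids_labelled = pickle.load(f)
```
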

@joshuawe (Collaborator, Author) commented Feb 8, 2024


Another interesting thing: the vocab size of the tokenizer is far larger than 4000.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-1M")

>>> vocab_size = tokenizer.vocab_size
>>> print("The vocab size is:", vocab_size)
The vocab size is: 50247

So, am I using the correct tokenizer? And are there multiple tokenizers?

@jettjaniak (Contributor) commented:

The tokenizer is wrong; it's from the original TinyStories models. You need to pass a model name from any of the models in the delphi-suite Hugging Face org to use our small tokenizer, which was trained on the TinyStories dataset.
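
For illustration, loading the intended tokenizer could look like this (the repo name below is a placeholder for any model in the delphi-suite org; the exact repo names are not spelled out in this thread):

```python
from transformers import AutoTokenizer

# Placeholder repo name; substitute any model from the delphi-suite org on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/<model-name>")
print("The vocab size is:", tokenizer.vocab_size)  # expected to be on the order of 4000, per the discussion above
```
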

requirements.txt (outdated review threads, resolved)
Comment on lines 3 to 4
from . import eval

Contributor review comment:

why?

@joshuawe (Collaborator, Author) replied Feb 9, 2024:

I still require this so I can import everything under delphi. For WHATEVER reason, an empty __init__.py file will not allow me to import it (of course, I used pip install -e .).

Comment on lines +128 to +167
"""
Labels tokens in a sentence batchwise. Takes the context of the token into
account for dependency labels (e.g. subject, object, ...).

Parameters
----------
sentences : list
A batch/list of sentences, each being a list of tokens.
tokenized : bool, optional
Whether the sentences are already tokenized, by default True. If the sentences
are full strings and not lists of tokens, then set to False. If true then `sentences` must be list[list[str]].
verbose : bool, optional
Whether to print the tokens and their labels to the console, by default False.

Returns
-------
list[list[dict[str, bool]]]
Returns a list of sentences. Each sentence contains a list of the same
length as its tokens, where each entry provides the labels/categories
for the corresponding token. Sentence -> Token -> Labels
"""
Contributor review comment:

My preferred docstring template is

"""Short one-line description

Optional longer description
"""

If you want to list and describe all arguments, that's fine, but specifying their type and optionality is redundant; you can see all of that in the function definition.
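
Applied to the function documented above, that template might look roughly like this (the function name and signature are inferred from the docstring, not copied from the diff):

```python
def label_batch_sentences(
    sentences: list[list[str]] | list[str],
    tokenized: bool = True,
    verbose: bool = False,
) -> list[list[dict[str, bool]]]:
    """Label tokens in a batch of sentences, using sentence context.

    Context is needed for dependency labels (e.g. subject, object, ...).
    If tokenized is False, each sentence is a plain string instead of a
    list of tokens. Returns one dict of labels per token, per sentence.
    """
    ...
```
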

@jettjaniak (Contributor) commented:

you need to rebase on top of main

@jettjaniak (Contributor) commented:

The notebook is good as an explanation, but generating the labels shouldn't require running it. Let's add something in scripts/ for that.

@jettjaniak (Contributor) commented:

Is there any reason we don't have a function like label_token(token: str) -> dict[str, bool]? Combining random tokens into sentences seems opaque and bug-prone.
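
A sketch of what such a function could look like with spaCy (the pipeline name and everything else here are illustrative, not the PR's implementation; the label names mirror the example dict earlier in this thread):

```python
import spacy
from spacy.tokens import Doc

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def label_token(token: str) -> dict[str, bool]:
    """Label a single token string without sentence context (illustrative sketch)."""
    # Wrap the string in a one-word Doc so spaCy does not re-tokenize it;
    # an empty token is replaced by a space because spaCy rejects empty words.
    doc = nlp(Doc(nlp.vocab, words=[token or " "]))
    tok = doc[0]
    return {
        "Starts with space": token.startswith(" "),
        "Capitalized": token.lstrip(" ")[:1].isupper(),
        "Is Noun": tok.pos_ == "NOUN",
        "Is Pronoun": tok.pos_ == "PRON",
        "Is Adjective": tok.pos_ == "ADJ",
        "Is Verb": tok.pos_ == "VERB",
        "Is Adverb": tok.pos_ == "ADV",
        "Is Preposition": tok.pos_ == "ADP",
        "Is Conjunction": tok.pos_ in ("CCONJ", "SCONJ"),
        "Is Interjunction": tok.pos_ == "INTJ",
        "Is Named Entity": tok.ent_type_ != "",
    }
```
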

@jettjaniak (Contributor) commented:

For now, just store the pickle in the eval directory.

@jettjaniak (Contributor) commented Feb 9, 2024 via email


@joshuawe (Collaborator, Author) commented Feb 9, 2024


We should debug this. What is your exact setup? System, Python interpreter, how you make your venv, etc. Also, can you run import sys; print(sys.path)?

My setup:

  • Windows
  • conda for venv creation
  • pip install -e .
  • Python interpreter found by running where python; the first entry is:
    C:/Users/.../anaconda3/envs/delphi2/python.exe

Here are the other outputs:

>>> print(dir(delphi))
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'beartype_this_package', 'eval']

>>> import sys; print(sys.path)
['c:\\"path to \\delphi\\notebooks', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\python310.zip', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\DLLs', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2', '', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages', 'c:\\users\\...\\joshua\\research\\2024_01_tinyevals\\delphi\\src', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\Pythonwin']

@jettjaniak (Contributor) commented:

The only thing that comes to mind is conda being weird. You could just download and install Python for Windows and make the venv that way.

@joshuawe joshuawe assigned joshuawe and unassigned joshuawe Feb 12, 2024
@joshuawe (Collaborator, Author) commented Feb 13, 2024


  • added tests
  • added a script label_all_tokens.py that does some token labelling (a rough sketch follows below)
  • now two files are saved under src/delphi/eval/, using the tokenizer from a given Hugging Face language model repo:
    1. the tokens and their ids are stored in all_tokens_{huggingface_model_name}.txt
    2. the labelled tokens are stored in labelled_token_ids_dict_{huggingface_model_name}.pkl
  • had to update requirements in order to extract/run the delphi tokenizer

Small TODOs left for tomorrow:

  • Clean up the __init__.py files again
  • Implement the function that takes in a token as a str and checks for its labels in the token_label_dict
    @jettjaniak Should the tokenized dict always be loaded in memory, i.e. when the module is loaded, should it also load the pickle?
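
A rough sketch of what a scripts/label_all_tokens.py along these lines could look like (the argument names, output filename, and the label_token helper sketched earlier in this thread are illustrative, not the actual script):

```python
import argparse
import pickle

from transformers import AutoTokenizer


def main() -> None:
    parser = argparse.ArgumentParser(description="Label every token in a tokenizer's vocabulary.")
    parser.add_argument("--model-name", required=True, help="Hugging Face repo to load the tokenizer from")
    parser.add_argument("--output", default="labelled_token_ids_dict.pkl", help="where to write the pickle")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    labelled_token_ids = {}
    for token_id in range(tokenizer.vocab_size):
        token_str = tokenizer.decode([token_id])
        # label_token is the hypothetical single-token labelling helper sketched earlier.
        labelled_token_ids[token_id] = label_token(token_str)

    with open(args.output, "wb") as f:
        pickle.dump(labelled_token_ids, f)


if __name__ == "__main__":
    main()
```
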

@jettjaniak (Contributor) commented:

> the tokens and their ids are stored in all_tokens_{huggingface_model_name}.txt
> the labelled tokens are stored in labelled_token_ids_dict_{huggingface_model_name}.pkl

Can we drop the _{huggingface_model_name}? All delphi models use the same tokenizer.

> Implement the function that takes in a token as a str and checks for its labels in the token_label_dict
> @jettjaniak Should the tokenized dict always be loaded in memory, i.e. when the module is loaded, should it also load the pickle?

I assume this question only applies if you're implementing this function? I'd say don't implement it, at least for now. The pickle is enough for the demo. And in this case, you should never be loading the pickle in code included in this PR.

@jettjaniak (Contributor) commented:

The branch is out of date with main; you need to git rebase main.

@jettjaniak (Contributor) commented:

Were you using git merge? Something is wrong with the "Files changed" tab in this PR and with the commit history.

@joshuawe joshuawe force-pushed the 12-categorize-tokens branch from b851bea to c292da7 on February 16, 2024 18:24
@jettjaniak jettjaniak merged commit bd4a88b into main Feb 17, 2024
1 check passed
@jettjaniak jettjaniak deleted the 12-categorize-tokens branch February 17, 2024 02:56
siwei-li pushed a commit that referenced this pull request Feb 19, 2024
* add token labelling

* add explanation function

* add notebook

* test

* switch off dependency labels + add spacy to requirements

* small improvements

* improve notebook explanation

* fix errors

* add notebook

* test

* switch off dependency labels + add spacy to requirements

* small improvements

* improve notebook explanation

* fix errors

* complete UPOS tags for token labels

* add tests

* update requirements for delphi tokenizer

* added token label script

* add the files containing token information/labels

* small enhancements suggested for PR

* rebasing

* improve optional downloading of spacy language model

* bugfix: handle tokens empty string ''

* add argparse for label_all_tokens.py script

* add tokenized dicts

* update notebook

* undo __init__

* change spacy model from "trf" to "sm"

* bug fix
Labels: feature (New feature or request)
Successfully merging this pull request may close these issues: categorize tokens
4 participants