
add token labelling #21

Merged: 29 commits from 12-categorize-tokens into main on Feb 17, 2024

Conversation

@joshuawe (Collaborator) commented Feb 2, 2024


No description provided.

@joshuawe joshuawe added the feature New feature or request label Feb 2, 2024
@joshuawe joshuawe linked an issue Feb 2, 2024 that may be closed by this pull request
@joshuawe (Collaborator, Author) commented Feb 2, 2024


Are the files that have been added stored in a good location?
I think not.

@menamerai (Collaborator) commented:

@joshuawe I think you need to add spacy to the requirements.txt

@jettjaniak (Contributor) left a comment:

Let's make a pickle with dict[int, dict[str, bool]] that labels every single token in the tokenizer.

src/delphi/eval/token_labelling.py (outdated review thread):
print(" ", label.ljust(10), key)


def label_single_token(token: Token) -> List:
Contributor review comment:

I would like to have a function that just takes a str token and labels it. Should we return this as a dict?

@joshuawe (Collaborator, Author) commented Feb 8, 2024


Have a look at the notebook, which shows the function calls as well as the creation of the pickle object. The pickle is basically a look-up table: a dict with token_id keys and token_labels values.

Note that token_labels is itself a dict, mapping label_name keys to bool values.

For example:

token_ids_labelled = {
  0: {'Starts with space': False, 'Capitalized': True, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False},
  1: {'Starts with space': False, 'Capitalized': True, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False},
  2: ...
...
}

The questions that remain are:

  1. Where to store the pickle.
  2. How to make it accessible. Should it always be loaded into memory when the module is imported?
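
For reference, a minimal sketch of saving and loading such a look-up table with pickle (the path and variable names here are illustrative, not the PR's final choices; the discussion below settles on storing the file in the eval directory):

```python
import pickle
from pathlib import Path

# Illustrative path; the discussion below settles on the eval directory.
PICKLE_PATH = Path("src/delphi/eval/labelled_token_ids_dict.pkl")

# Save the look-up table (token_id -> {label_name: bool}).
with PICKLE_PATH.open("wb") as f:
    pickle.dump(token_ids_labelled, f)

# Load it back lazily, e.g. inside a function, rather than at module import time.
with PICKLE_PATH.open("rb") as f:
    token_ids_labelled = pickle.load(f)
```
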

@joshuawe (Collaborator, Author) commented Feb 8, 2024


Another interesting thing: the vocab size of the tokenizer is far larger than 4000.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-1M")

>>> vocab_size = tokenizer.vocab_size
>>> print("The vocab size is:", vocab_size)
The vocab size is: 50247

So, am I using the correct tokenizer? And are there multiple tokenizers?

@jettjaniak (Contributor) commented:

The tokenizer is wrong; it's from the original TinyStories models. You need to pass a model name from any of the models in the delphi-suite Hugging Face org to use our small tokenizer, which was trained on the TinyStories dataset.
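
For illustration, loading the intended tokenizer could look like this (the repo name below is a placeholder for any model in the delphi-suite org; the exact repo names are not spelled out in this thread):

```python
from transformers import AutoTokenizer

# Placeholder repo name; substitute any model from the delphi-suite org on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/<model-name>")
print("The vocab size is:", tokenizer.vocab_size)  # expected to be on the order of 4000, per the discussion above
```
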

requirements.txt (outdated review threads, resolved)
Comment on lines 3 to 4
from . import eval

Contributor review comment:

why?

@joshuawe (Collaborator, Author) replied Feb 9, 2024:

I still require this so I can import everything under delphi. For WHATEVER reason, an empty __init__.py file will not allow me to import it (of course, I used pip install -e .).

Comment on lines +128 to +167
"""
Labels tokens in a sentence batchwise. Takes the context of the token into
account for dependency labels (e.g. subject, object, ...).

Parameters
----------
sentences : list
A batch/list of sentences, each being a list of tokens.
tokenized : bool, optional
Whether the sentences are already tokenized, by default True. If the sentences
are full strings and not lists of tokens, then set to False. If true then `sentences` must be list[list[str]].
verbose : bool, optional
Whether to print the tokens and their labels to the console, by default False.

Returns
-------
list[list[dict[str, bool]]]
Returns a list of sentences. Each sentence contains a list of the same
length as its tokens, where each entry provides the labels/categories
for the corresponding token. Sentence -> Token -> Labels
"""
Contributor review comment:

My preferred docstring template is

"""Short one-line description

Optional longer description
"""

If you want to list and describe all arguments, that's fine, but specifying their type and optionality is redundant; you can see all of that in the function definition.
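
Applied to the function documented above, that template might look roughly like this (the function name and signature are inferred from the docstring, not copied from the diff):

```python
def label_batch_sentences(
    sentences: list[list[str]] | list[str],
    tokenized: bool = True,
    verbose: bool = False,
) -> list[list[dict[str, bool]]]:
    """Label tokens in a batch of sentences, using sentence context.

    Context is needed for dependency labels (e.g. subject, object, ...).
    If tokenized is False, each sentence is a plain string instead of a
    list of tokens. Returns one dict of labels per token, per sentence.
    """
    ...
```
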

@jettjaniak (Contributor) commented:

you need to rebase on top of main

@jettjaniak (Contributor) commented:

The notebook is good as an explanation, but generating the labels shouldn't require running it. Let's add something in scripts/ for that.

@jettjaniak (Contributor) commented:

Is there any reason we don't have a function like label_token(token: str) -> dict[str, bool]? Combining random tokens into sentences seems opaque and bug-prone.
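
A sketch of what such a function could look like with spaCy (the pipeline name and everything else here are illustrative, not the PR's implementation; the label names mirror the example dict earlier in this thread):

```python
import spacy
from spacy.tokens import Doc

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def label_token(token: str) -> dict[str, bool]:
    """Label a single token string without sentence context (illustrative sketch)."""
    # Wrap the string in a one-word Doc so spaCy does not re-tokenize it;
    # an empty token is replaced by a space because spaCy rejects empty words.
    doc = nlp(Doc(nlp.vocab, words=[token or " "]))
    tok = doc[0]
    return {
        "Starts with space": token.startswith(" "),
        "Capitalized": token.lstrip(" ")[:1].isupper(),
        "Is Noun": tok.pos_ == "NOUN",
        "Is Pronoun": tok.pos_ == "PRON",
        "Is Adjective": tok.pos_ == "ADJ",
        "Is Verb": tok.pos_ == "VERB",
        "Is Adverb": tok.pos_ == "ADV",
        "Is Preposition": tok.pos_ == "ADP",
        "Is Conjunction": tok.pos_ in ("CCONJ", "SCONJ"),
        "Is Interjunction": tok.pos_ == "INTJ",
        "Is Named Entity": tok.ent_type_ != "",
    }
```
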

@jettjaniak (Contributor) commented:

For now, just store the pickle in the eval directory.

@jettjaniak (Contributor) commented Feb 9, 2024 via email


@joshuawe (Collaborator, Author) commented Feb 9, 2024


We should debug this. What is your exact setup? System, Python interpreter, how you make your venv, etc. Also, can you run import sys; print(sys.path)?

My setup:

  • Windows
  • conda for venv creation
  • pip install -e .
  • Python interpreter found by running where python; the first entry is:
    C:/Users/.../anaconda3/envs/delphi2/python.exe

Here are the other outputs:

>>> print(dir(delphi))
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'beartype_this_package', 'eval']

>>> import sys; print(sys.path)
['c:\\"path to \\delphi\\notebooks', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\python310.zip', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\DLLs', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2', '', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages', 'c:\\users\\...\\joshua\\research\\2024_01_tinyevals\\delphi\\src', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\Pythonwin']

@jettjaniak (Contributor) commented:

The only thing that comes to mind is conda being weird. You could just download and install Python for Windows and make the venv that way.

@joshuawe joshuawe assigned joshuawe and unassigned joshuawe Feb 12, 2024
@joshuawe (Collaborator, Author) commented Feb 13, 2024


  • added tests
  • added a script label_all_tokens.py that does some token labelling (a rough sketch follows below)
  • now two files are saved under src/delphi/eval/, using the tokenizer from a given Hugging Face language model repo:
    1. the tokens and their ids are stored in all_tokens_{huggingface_model_name}.txt
    2. the labelled tokens are stored in labelled_token_ids_dict_{huggingface_model_name}.pkl
  • had to update requirements in order to extract/run the delphi tokenizer

Small TODOs left for tomorrow:

  • Clean up the __init__.py files again
  • Implement the function that takes in a token as a str and checks for its labels in the token_label_dict
    @jettjaniak Should the tokenized dict always be loaded in memory, i.e. when the module is loaded, should it also load the pickle?
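
A rough sketch of what a scripts/label_all_tokens.py along these lines could look like (the argument names, output filename, and the label_token helper sketched earlier in this thread are illustrative, not the actual script):

```python
import argparse
import pickle

from transformers import AutoTokenizer


def main() -> None:
    parser = argparse.ArgumentParser(description="Label every token in a tokenizer's vocabulary.")
    parser.add_argument("--model-name", required=True, help="Hugging Face repo to load the tokenizer from")
    parser.add_argument("--output", default="labelled_token_ids_dict.pkl", help="where to write the pickle")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    labelled_token_ids = {}
    for token_id in range(tokenizer.vocab_size):
        token_str = tokenizer.decode([token_id])
        # label_token is the hypothetical single-token labelling helper sketched earlier.
        labelled_token_ids[token_id] = label_token(token_str)

    with open(args.output, "wb") as f:
        pickle.dump(labelled_token_ids, f)


if __name__ == "__main__":
    main()
```
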

@jettjaniak (Contributor) commented:

> the tokens and their ids are stored in all_tokens_{huggingface_model_name}.txt
> the labelled tokens are stored in labelled_token_ids_dict_{huggingface_model_name}.pkl

Can we drop the _{huggingface_model_name}? All delphi models use the same tokenizer.

> Implement the function that takes in a token as a str and checks for its labels in the token_label_dict
> @jettjaniak Should the tokenized dict always be loaded in memory, i.e. when the module is loaded, should it also load the pickle?

I assume this question only applies if you're implementing this function? I'd say don't implement it, at least for now. The pickle is enough for the demo. And in this case, you should never be loading the pickle in code included in this PR.

@jettjaniak (Contributor) commented:

The branch is out of date with main; you need to git rebase main.

@jettjaniak (Contributor) commented:

Were you using git merge? Something is wrong with the "Files changed" tab in this PR and with the commit history.

@joshuawe joshuawe force-pushed the 12-categorize-tokens branch from b851bea to c292da7 on February 16, 2024 18:24
@jettjaniak jettjaniak merged commit bd4a88b into main Feb 17, 2024
1 check passed
@jettjaniak jettjaniak deleted the 12-categorize-tokens branch February 17, 2024 02:56
siwei-li pushed a commit that referenced this pull request Feb 19, 2024
* add token labelling

* add explanation function

* add notebook

* test

* switch off dependency labels + add spacy to requirements

* small improvements

* improve notebook explanation

* fix errors

* add notebook

* test

* switch off dependency labels + add spacy to requirements

* small improvements

* improve notebook explanation

* fix errors

* complete UPOS tags for token labels

* add tests

* update requirements for delphi tokenizer

* added token label script

* add the files containing token information/labels

* small enhancements suggested for PR

* rebasing

* improve optional downloading of spacy language model

* bugfix: handle tokens empty string ''

* add argparse for label_all_tokens.py script

* add tokenized dicts

* update notebook

* undo __init__

* change spacy model from "trf" to "sm"

* bug fix
Labels: feature (New feature or request)
Successfully merging this pull request may close these issues: categorize tokens
4 participants