add token labelling #21
Are the files that have been added stored in a good location?
@joshuawe I think you need to add |
Let's make a pickle with dict[int, dict[str, bool]] that labels every single token in the tokenizer.
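A minimal sketch of what building and pickling such a table could look like, assuming a toy vocabulary and a placeholder labeller. The names `build_label_table`, `toy_vocab`, and the single label `starts_with_space` are hypothetical and not taken from the PR.

```python
import pickle
from typing import Callable, Dict

LabelDict = Dict[str, bool]


def build_label_table(
    vocab: Dict[int, str],
    labeller: Callable[[str], LabelDict],
) -> Dict[int, LabelDict]:
    """Label every token id in the vocabulary."""
    return {token_id: labeller(token) for token_id, token in vocab.items()}


# Toy id -> string mapping standing in for the tokenizer's vocabulary (assumption).
toy_vocab = {0: "", 1: " The", 2: "cat"}

# Placeholder labeller with one illustrative label (assumption).
table = build_label_table(
    toy_vocab,
    lambda token: {"starts_with_space": token.startswith(" ")},
)

# Round-trip through pickle in memory; a script would write the bytes to a file.
restored = pickle.loads(pickle.dumps(table))
print(restored[1])
```

Keying by token id keeps the lookup independent of how a token string is rendered.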
src/delphi/eval/token_labelling.py
Outdated
print(" ", label.ljust(10), key)


def label_single_token(token: Token) -> List:
I would like to have a function that just takes a str and returns the token label. Should we return this as a dict?
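A sketch of the requested interface: a plain str goes in and a dict of booleans comes out, with the same keys for every token. The string heuristics below are a stand-in chosen to keep the example self-contained; they are not the spaCy-based labelling this PR uses, and the label names are hypothetical.

```python
from typing import Dict


def label_single_token(token: str) -> Dict[str, bool]:
    """Return one boolean per label for a single token string."""
    stripped = token.strip()
    return {
        "starts_with_space": token.startswith(" "),
        # Slicing instead of indexing keeps the empty string from raising.
        "is_capitalized": stripped[:1].isupper(),
        "is_alphabetic": stripped.isalpha(),
        "is_numeric": stripped.isdigit(),
    }


print(label_single_token(" Cat"))
print(label_single_token(""))
```

Returning a dict with a fixed key set means an empty token still gets a complete (all False) entry rather than a missing one.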
Have a look at the notebook, which shows how to call the functions and how to create the pickle object. The pickle is basically a lookup table: a dict with token_id keys and token_labels values. Note that token_labels is itself a dict with label_name keys and bool values. For example:
The question that remains is
|
Another interesting thing is the following:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-1M")
>>> vocab_size = tokenizer.vocab_size
>>> print("The vocab size is:", vocab_size)
50247
So, am I using the correct tokenizer? And are there multiple tokenizers?
The tokenizer is wrong; it's from the original TinyStories models. You need to pass the model name of any of the models on the delphi-suite Hugging Face to use our small tokenizer, which was trained on the TinyStories dataset.
src/delphi/__init__.py
Outdated
from . import eval
why?
I still require this so I can import everything under delphi. For WHATEVER reason an empty __init__.py file will not allow me to import it. (Of course, I used pip install -e .)
"""
Labels tokens in a sentence batchwise. Takes the context of the token into
account for dependency labels (e.g. subject, object, ...).

Parameters
----------
sentences : list
    A batch/list of sentences, each being a list of tokens.
tokenized : bool, optional
    Whether the sentences are already tokenized, by default True. If the sentences
    are full strings and not lists of tokens, then set to False. If True, then `sentences` must be list[list[str]].
verbose : bool, optional
    Whether to print the tokens and their labels to the console, by default False.

Returns
-------
list[list[dict[str, bool]]]
    Returns a list of sentences. Each sentence is a list with one entry per
    token, and each entry provides the labels/categories for that token.
    Sentence -> Token -> Labels
"""
My preferred docstring template is
"""Short one-line description

Optional longer description
"""
If you want to list and describe all arguments, that's fine, but specifying their type and optionality is redundant; you can see all of that in the function definition.
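Applied to a small function, the template above could look like the sketch below. The function `count_labels` is hypothetical and only serves to show the style: types and defaults live in the signature, and the docstring describes behaviour.

```python
from typing import Dict


def count_labels(labels: Dict[str, bool], only_true: bool = True) -> int:
    """Count the labels of one token.

    By default only labels set to True are counted. Pass only_true=False
    to count every label regardless of its value.
    """
    if only_true:
        return sum(labels.values())
    return len(labels)


print(count_labels({"is_noun": True, "is_verb": False}))
```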
You need to rebase on top of main.
The notebook is good as an explanation, but generating the labels shouldn't require running it. Let's add something in scripts/ for that.
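A rough sketch of the command-line entry point such a script could have, using argparse. The flags `--output` and `--verbose` and the default file name are assumptions for illustration; the options of the script in this PR may differ.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Label every token in the tokenizer vocabulary."
    )
    # Hypothetical flags, not taken from the PR.
    parser.add_argument(
        "--output",
        default="token_labels.pkl",
        help="Path of the pickle file to write.",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print each token with its labels.",
    )
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # Passing argv explicitly keeps the parser testable without touching sys.argv.
    return build_parser().parse_args(argv)


args = parse_args(["--output", "labels.pkl"])
print(args.output, args.verbose)
```

The labelling and pickling steps would follow the argument parsing; they are left out here on purpose.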
Is there any reason we don't have a function like |
For now, just store the pickle in the eval directory.
We should debug this. What is your exact setup: system, Python interpreter, how you make your venv, etc.? Also, can you run `import sys; print(sys.path)`?
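Alongside `sys.path`, it can help to see where Python would actually load a package from. The helper below is a sketch using only the standard library; `describe_import` is a hypothetical name, and `json` stands in for the package under investigation.

```python
import importlib.util
import sys
from typing import Dict


def describe_import(name: str) -> Dict[str, object]:
    """Report where Python would load a top-level module or package from."""
    # For a top-level name, find_spec locates the module without importing it.
    spec = importlib.util.find_spec(name)
    locations = []
    if spec is not None and spec.submodule_search_locations:
        locations = list(spec.submodule_search_locations)
    return {
        "found": spec is not None,
        "origin": spec.origin if spec is not None else None,
        "search_locations": locations,
        "executable": sys.executable,
    }


print(describe_import("json"))
```

If `origin` points somewhere other than the checked-out `src` directory, the editable install is probably not the copy being imported.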
On Fri, 9 Feb 2024, 08:39 Joshua We wrote:
In src/delphi/__init__.py:
> +from . import eval
> +
I still require this so I can import everything. For WHATEVER reason an empty __init__.py file will not allow me to import it. (Of course, I used pip install -e .)
My setup:
Here are the other outputs:
>>> print(dir(delphi))
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'beartype_this_package', 'eval']
>>> import sys; print(sys.path)
['c:\\"path to \\delphi\\notebooks', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\python310.zip', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\DLLs', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2', '', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages', 'c:\\users\\...\\joshua\\research\\2024_01_tinyevals\\delphi\\src', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\win32\\lib', 'c:\\Users\\...\\anaconda3\\envs\\delphi2\\lib\\site-packages\\Pythonwin'] |
The only thing that comes to mind is conda being weird. You could just download and install Python for Windows and make a venv that way.
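One general Python behaviour that may be relevant to the import problem, although nothing in this thread confirms it as the cause: importing a package does not import its subpackages, so with an empty `__init__.py` the attribute only exists after the subpackage has been imported explicitly. The sketch below reproduces this with a throwaway package; `demo_pkg` and `sub` are hypothetical names.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create demo_pkg/ with an empty __init__.py and a subpackage demo_pkg/sub/.
root = Path(tempfile.mkdtemp())
(root / "demo_pkg" / "sub").mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(root / "demo_pkg" / "sub" / "__init__.py").write_text("VALUE = 1\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg

# The parent package is loaded, but the subpackage has not been imported yet.
before = hasattr(demo_pkg, "sub")

# An explicit import loads the subpackage and binds it on the parent.
import demo_pkg.sub

after = hasattr(demo_pkg, "sub")
print(before, after)
```

If this is what is happening, `import delphi.eval` (or `from delphi import eval`) would work with an empty `__init__.py`, while `import delphi` followed by `delphi.eval` would not.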
Small todo's left for tomorrow:
|
Can we drop the _{huggingface_model_name}? All delphi models use the same tokenizer.
I assume this question only applies if you're implementing this function? I'd say don't implement it, at least for now. The pickle is enough for the demo. And in this case, you should never be loading the pickle in code included in this PR. |
branch is out-of-date with main, you need to |
were you using |
b851bea to c292da7
* add token labelling
* add explanation function
* add notebook
* test
* switch off dependency labels + add spacy to requirements
* small improvements
* improve notebook explanation
* fix errors
* add notebook
* test
* switch off dependency labels + add spacy to requirements
* small improvements
* improve notebook explanation
* fix errors
* complete UPOS tags for token labels
* add tests
* update requirements for delphi tokenizer
* added token label script
* add the files containing token information/labels
* small enhancements suggested for PR
* rebasing
* improve optional downloading of spacy language model
* bugfix: handle tokens empty string ''
* add argparse for label_all_tokens.py script
* add tokenized dicts
* update notebook
* undo __init__
* change spacy model from "trf" to "sm"
* bug fix
No description provided.