
Commit
add tokenized dicts
joshuawe committed Feb 15, 2024
1 parent b24b792 commit 95d01db
Showing 3 changed files with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion scripts/label_all_tokens.py
@@ -72,7 +72,7 @@ def main():
     # let's label each token
     labelled_token_ids_dict: dict[int, dict[str, bool]] = {} # token_id: labels
     max_token_id = tokenizer.vocab_size # stop at which token id, vocab size
-    # we iterate (batchwise) over all token_ids, individually takes too much time
+    # we iterate over all token_ids individually
     for token_id in tqdm(range(0, max_token_id), desc="Labelling tokens"):
         # decode the token_ids to get a list of tokens, a 'sentence'
         tokens = decode(tokenizer, token_id) # list of tokens == sentence
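The loop in the diff above builds a `dict[int, dict[str, bool]]` mapping each token id to a set of boolean labels. A minimal, self-contained sketch of that shape is below; `decode`, `label_token`, and the specific label names here are stand-ins (the real script uses the delphi tokenizer and its own token-category checks, and wraps the loop in `tqdm`):

```python
def decode(vocab: dict[int, str], token_id: int) -> str:
    """Stand-in for the script's decode(tokenizer, token_id) helper."""
    return vocab[token_id]


def label_token(token: str) -> dict[str, bool]:
    """Assign a few illustrative boolean labels to one token.

    The real script computes its own label set; these predicates are
    placeholders to show the structure.
    """
    return {
        "is_alpha": token.isalpha(),
        "is_digit": token.isdigit(),
        "starts_with_space": token.startswith(" "),
    }


def label_all_tokens(vocab: dict[int, str]) -> dict[int, dict[str, bool]]:
    """Iterate over all token ids individually, as in the diff."""
    labelled_token_ids_dict: dict[int, dict[str, bool]] = {}
    for token_id in range(0, len(vocab)):  # the script wraps this in tqdm
        labelled_token_ids_dict[token_id] = label_token(decode(vocab, token_id))
    return labelled_token_ids_dict


# Toy vocabulary standing in for tokenizer.vocab_size entries.
toy_vocab = {0: "hello", 1: " world", 2: "42"}
labels = label_all_tokens(toy_vocab)
```

The resulting dict is what the commit pickles to `labelled_token_ids_dict.pkl`: one labels dict per token id, covering the whole vocabulary.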
Binary file added src/delphi/eval/all_tokens_list.txt
Binary file not shown.
Binary file added src/delphi/eval/labelled_token_ids_dict.pkl
Binary file not shown.