Commit

Fix bug huggingface#35447 LlamaTokenizer does not split text according to newly added input tokens

The root cause is that the Trie.split method did not ignore partial matches that should be removed.
jiongjiongli committed Dec 29, 2024
1 parent 5c75087 commit 693e2c0
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion src/transformers/tokenization_utils.py
@@ -175,7 +175,10 @@ def split(self, text: str) -> List[str]:
                     # matches
                     # "[CLS]", "L", we need to match CLS even if L is special
                     for lookstart, looktrie_pointer in states.items():
-                        if lookstart > start:
+                        if lookstart in to_remove:
+                            # This partial match should be removed
+                            continue
+                        elif lookstart > start:
                             # This partial match is later, we can stop looking
                             break
                         elif lookstart < start:
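
The control-flow change in this hunk can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: the function name `lookstarts_to_examine` and the toy `states` contents are hypothetical, and the sketch only assumes (as the diff suggests) that `states` maps start offsets to trie pointers in ascending insertion order and that `to_remove` holds the offsets of partial matches that have already failed.

```python
def lookstarts_to_examine(states, to_remove, start):
    """Return the start offsets a look-ahead loop would examine.

    Hypothetical helper that mirrors only the branch order in the diff:
    skip offsets marked for removal, stop at offsets later than `start`.
    """
    examined = []
    # Dicts preserve insertion order, so offsets are visited in ascending order here.
    for lookstart in states:
        if lookstart in to_remove:
            # This partial match already failed; do not look ahead from it.
            continue
        elif lookstart > start:
            # This partial match is later; stop looking.
            break
        examined.append(lookstart)
    return examined


# Toy data (made up): offset 0 is a stale partial match, offset 2 is the current match.
states = {0: {"x": {}}, 2: {"": 1}, 5: {"y": {}}}

print(lookstarts_to_examine(states, to_remove={0}, start=2))    # stale offset 0 is skipped
print(lookstarts_to_examine(states, to_remove=set(), start=2))  # nothing marked, so 0 is examined
```

With offset 0 in `to_remove`, only offset 2 is examined. Without the membership check, the stale entry at offset 0 would also be followed, which matches the situation the commit message describes.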