
Commit

Cherry Pick For Release (#174)
* Fix Tokenization of Special Tokens in Sentencepiece

(cherry picked from commit 6093bd1)
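The fix above concerns how special tokens (e.g. `<s>`, `</s>`) are isolated so the SentencePiece model never tokenizes their characters as ordinary text. As a rough illustration of the general technique — not the actual openvino_tokenizers implementation, and with a hypothetical function name — a pre-split step could look like this:

```python
import re

def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    """Split text so that special tokens become standalone chunks.

    Longer tokens are matched first so that e.g. "</s>" is not
    partially consumed by a shorter overlapping token.
    """
    pattern = "(" + "|".join(
        re.escape(tok) for tok in sorted(special_tokens, key=len, reverse=True)
    ) + ")"
    # re.split with a capturing group keeps the matched special tokens;
    # empty chunks from adjacent matches are dropped.
    return [chunk for chunk in re.split(pattern, text) if chunk]

print(split_on_special_tokens("<s>Hello world</s>", ["<s>", "</s>"]))
# → ['<s>', 'Hello world', '</s>']
```

Only the non-special chunks would then be passed to the underlying tokenizer model, and the special chunks mapped directly to their reserved ids.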

* Add Left Padding and Padding to Max Length

(cherry picked from commit 128f7fc)
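Left padding and padding to max length, as named in the commit above, can be sketched in plain Python. This is an illustrative stand-in, not the repository's implementation; `pad_batch` and its parameters are hypothetical:

```python
def pad_batch(batch: list[list[int]], max_length: int,
              pad_id: int = 0, side: str = "left") -> list[list[int]]:
    """Pad (or truncate) every sequence in a batch to exactly max_length.

    Left padding right-aligns the real tokens, which is what
    decoder-only models typically expect at generation time.
    """
    padded = []
    for ids in batch:
        # Truncate overlong sequences, keeping the tail for left padding.
        ids = ids[-max_length:] if side == "left" else ids[:max_length]
        padding = [pad_id] * (max_length - len(ids))
        padded.append(padding + ids if side == "left" else ids + padding)
    return padded

print(pad_batch([[1, 2, 3], [4]], max_length=5))
# → [[0, 0, 1, 2, 3], [0, 0, 0, 0, 4]]
```

With `side="right"` the same helper produces conventional right padding, so one code path covers both modes.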
apaniukov authored Jun 12, 2024
1 parent e5cb83b commit c615ec5
Showing 10 changed files with 454 additions and 149 deletions.
164 changes: 82 additions & 82 deletions README.md
@@ -336,8 +336,8 @@ This report is autogenerated and includes tokenizers and detokenizers tests. The
<tbody>
<tr>
<td >BPE</td>
-<td >96.57</td>
-<td >4991</td>
+<td >94.45</td>
+<td >5535</td>
</tr>
<tr>
<td >SentencePiece</td>
@@ -346,13 +346,13 @@ This report is autogenerated and includes tokenizers and detokenizers tests. The
</tr>
<tr>
<td >Tiktoken</td>
-<td >98.17</td>
-<td >218</td>
+<td >93.98</td>
+<td >266</td>
</tr>
<tr>
<td >WordPiece</td>
-<td >94.97</td>
-<td >1053</td>
+<td >91.31</td>
+<td >1301</td>
</tr>
</tbody>
</table>
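The pass-rate figures in the table above appear to be passed-tests / total-tests × 100, assuming the second numeric column is the total test count; a quick sanity check against the updated BPE row:

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to two decimals as in the table."""
    return round(100 * passed / total, 2)

# Reading the BPE row backwards: 94.45% of 5535 tests ≈ 5228 passing tests.
print(round(0.9445 * 5535))   # → 5228
print(pass_rate(5228, 5535))  # → 94.45
```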
@@ -372,140 +372,140 @@ This report is autogenerated and includes tokenizers and detokenizers tests. The
<tr>
<td >BPE</td>
<td >EleutherAI/gpt-j-6b</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >EleutherAI/gpt-neo-125m</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >EleutherAI/gpt-neox-20b</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >EleutherAI/pythia-12b-deduped</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >KoboldAI/fairseq-dense-13B</td>
-<td >98.16</td>
-<td >217</td>
+<td >96.57</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >NousResearch/Meta-Llama-3-8B-Instruct</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >Salesforce/codegen-16B-multi</td>
-<td >99.08</td>
-<td >217</td>
+<td >95.98</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >Xenova/gpt-4o</td>
-<td >97.24</td>
-<td >217</td>
+<td >94.38</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >ai-forever/rugpt3large_based_on_gpt2</td>
-<td >96.31</td>
-<td >217</td>
+<td >90.36</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >bigscience/bloom</td>
-<td >99.08</td>
-<td >217</td>
+<td >97.42</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >databricks/dolly-v2-3b</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >facebook/bart-large-mnli</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >facebook/galactica-120b</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >facebook/opt-66b</td>
-<td >98.16</td>
-<td >217</td>
+<td >96.57</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >gpt2</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >laion/CLIP-ViT-bigG-14-laion2B-39B-b160k</td>
-<td >70.97</td>
-<td >217</td>
+<td >74.70</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >microsoft/deberta-base</td>
-<td >98.16</td>
-<td >217</td>
+<td >96.57</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >roberta-base</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >sentence-transformers/all-roberta-large-v1</td>
-<td >98.16</td>
-<td >217</td>
+<td >95.18</td>
+<td >249</td>
</tr>
<tr>
<td >BPE</td>
<td >stabilityai/stablecode-completion-alpha-3b-4k</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >stabilityai/stablelm-2-1_6b</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >stabilityai/stablelm-tuned-alpha-7b</td>
-<td >97.24</td>
-<td >217</td>
+<td >95.71</td>
+<td >233</td>
</tr>
<tr>
<td >BPE</td>
<td >tiiuae/falcon-7b</td>
-<td >97.24</td>
-<td >217</td>
+<td >94.38</td>
+<td >249</td>
</tr>
<tr>
<td >SentencePiece</td>
@@ -630,92 +630,92 @@ This report is autogenerated and includes tokenizers and detokenizers tests. The
<tr>
<td >Tiktoken</td>
<td >Qwen/Qwen-14B-Chat</td>
-<td >98.17</td>
-<td >109</td>
+<td >92.91</td>
+<td >141</td>
</tr>
<tr>
<td >Tiktoken</td>
<td >Salesforce/xgen-7b-8k-base</td>
-<td >98.17</td>
-<td >109</td>
+<td >95.20</td>
+<td >125</td>
</tr>
<tr>
<td >WordPiece</td>
<td >ProsusAI/finbert</td>
-<td >97.53</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >bert-base-multilingual-cased</td>
-<td >97.53</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >bert-base-uncased</td>
-<td >97.53</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >cointegrated/rubert-tiny2</td>
-<td >91.36</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >distilbert-base-uncased-finetuned-sst-2-english</td>
-<td >97.53</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >google/electra-base-discriminator</td>
-<td >97.53</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >google/mobilebert-uncased</td>
-<td >97.53</td>
-<td >81</td>
+<td >94.38</td>
+<td >89</td>
</tr>
<tr>
<td >WordPiece</td>
<td >jhgan/ko-sbert-sts</td>
-<td >87.65</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >prajjwal1/bert-mini</td>
-<td >97.53</td>
-<td >81</td>
+<td >94.38</td>
+<td >89</td>
</tr>
<tr>
<td >WordPiece</td>
<td >rajiv003/ernie-finetuned-qqp</td>
-<td >97.53</td>
-<td >81</td>
+<td >94.38</td>
+<td >89</td>
</tr>
<tr>
<td >WordPiece</td>
<td >rasa/LaBSE</td>
-<td >90.12</td>
-<td >81</td>
+<td >80.00</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >sentence-transformers/all-MiniLM-L6-v2</td>
-<td >87.65</td>
-<td >81</td>
+<td >91.43</td>
+<td >105</td>
</tr>
<tr>
<td >WordPiece</td>
<td >squeezebert/squeezebert-uncased</td>
-<td >97.53</td>
-<td >81</td>
+<td >94.38</td>
+<td >89</td>
</tr>
</tbody>
</table>
