*** docs

gbenson · Jun 1, 2024 · 3a9120d
1 parent 2c24a89
commit 3a9120d
Showing 1 changed file with 62 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -11,6 +11,68 @@
 
 DOM-aware tokenization for Hugging Face language models.
 
+## What?
+
+Natural language tokenizers are designed to
+a) group particles of meaning together, and
+b) discard unimportant details,
+so that models consuming sequences of token IDs are
+presented with what they need in the form they can most
+easily process.
+
+In theory, a model could consume a raw stream of UTF-8,
+but it would then have to learn everything the tokenizer
+does, consuming resources (layers, neurons, parameters)
+and potentially vastly extending training time.
+
+For example, tokenizers aimed at languages that delimit words
+with whitespace generally have features to discard or embed
+whitespace in their output, so that the model consuming it
+does not need to care about it.
+
+This tokenizer aims to do the same for HTML, such that:
+
+> <code>X</code>
+
+becomes:
+
+> <code style="background-color: #ccbfee;">&lt;</code>
+> <code style="background-color: #beedc6;">html</code>
+> <code style="background-color: #f6d9ab;">&gt;</code>
+> <code style="background-color: #f4aeb1;">&lt;</code>
+> <code style="background-color: #a4dcf3;">head</code>
+> <code style="background-color: #ccbfee;">&gt;</code>
+> <code style="background-color: #beedc6;">&lt;</code>
+> <code style="background-color: #f6d9ab;">meta</code>
+> <code style="background-color: #f4aeb1;">_</code>
+> <code style="background-color: #a4dcf3;">http</code>
+> <code style="background-color: #ccbfee;">equiv</code>
+> <code style="background-color: #beedc6;">=</code>
+> <code style="background-color: #f6d9ab;">utf</code>
+> <code style="background-color: #f4aeb1;">8</code>
+> <code style="background-color: #a4dcf3;">&gt;</code>
+> <code style="background-color: #ccbfee;">&lt;/</code>
+> <code style="background-color: #beedc6;">meta</code>
+> <code style="background-color: #f6d9ab;">&gt;</code>
+> <code style="background-color: #f4aeb1;">a</code>
+> <code style="background-color: #a4dcf3;">b</code>
+> <code style="background-color: #ccbfee;">c</code>
+> <code style="background-color: #beedc6;">d</code>
+> <code style="background-color: #f6d9ab;">e</code>
+> <code style="background-color: #f4aeb1;">f</code>
+> <code style="background-color: #a4dcf3;">g</code>
+> <code style="background-color: #ccbfee;">h</code>
+> <code style="background-color: #beedc6;">i</code>
+> <code style="background-color: #f6d9ab;">j</code>
+> <code style="background-color: #f4aeb1;">k</code>
+> <code style="background-color: #a4dcf3;">l</code>
+
+
+Tokenizers for generation need to be able to decode reversibly,
+but generation isn't a goal, for now at least, so this tokenizer
+will discard some of its input in order to better distil the
+meaning of what it's looking at.
+
 ## Installation
 
 ### With PIP