*** docs
gbenson committed Jun 1, 2024
1 parent 2c24a89 commit 3a9120d
Showing 1 changed file with 62 additions and 0 deletions.
62 changes: 62 additions & 0 deletions README.md
@@ -11,6 +11,68 @@

DOM-aware tokenization for Hugging Face language models.

## What?

Natural language tokenizers and tokenization schemes are designed
to (a) group particles of meaning together and (b) discard or hide
unimportant details, so that models consuming sequences of token IDs
are presented with what they need, in a form they can most easily
derive meaning from.
In theory, models could consume raw streams of UTF-8, but the model
would then have to learn everything the tokenizer does, consuming
resources (layers, neurons, parameters) and potentially vastly
extending training time.

For example, tokenizers aimed at languages that delimit words with
whitespace generally have features to discard, embed or otherwise
hide whitespace in their output, so the model consuming that output
does not need to care about it.
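
As a rough illustration of the whitespace case, here is a minimal sketch using only Python's standard library. The function name `whitespace_tokenize` is made up for this example, and it shows only the "discard" option; real tokenizers are considerably more involved.

```python
import re


def whitespace_tokenize(text):
    """Toy example: split text into words and punctuation marks.

    Whitespace only separates tokens; it never appears in the output,
    so a consumer of these tokens does not need to care about it.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)
```

Calling `whitespace_tokenize("Hello,  world!")` returns `["Hello", ",", "world", "!"]`: the doubled space leaves no trace in the result.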

This tokenizer aims to do the same for HTML, such that:

> <code>
X
becomes:

> <code style="background-color: #ccbfee;">&lt;</code>
> <code style="background-color: #beedc6;">html</code>
> <code style="background-color: #f6d9ab;">&gt;</code>
> <code style="background-color: #f4aeb1;">&lt;</code>
> <code style="background-color: #a4dcf3;">head</code>
> <code style="background-color: #ccbfee;">&gt;</code>
> <code style="background-color: #beedc6;">&lt;</code>
> <code style="background-color: #f6d9ab;">meta</code>
> <code style="background-color: #f4aeb1;">_</code>
> <code style="background-color: #a4dcf3;">http</code>
> <code style="background-color: #ccbfee;">equiv</code>
> <code style="background-color: #beedc6;">=</code>
> <code style="background-color: #f6d9ab;">utf</code>
> <code style="background-color: #f4aeb1;">8</code>
> <code style="background-color: #a4dcf3;">&gt;</code>
> <code style="background-color: #ccbfee;">&lt;/</code>
> <code style="background-color: #beedc6;">meta</code>
> <code style="background-color: #f6d9ab;">&gt;</code>
> <code style="background-color: #f4aeb1;">a</code>
> <code style="background-color: #a4dcf3;">b</code>
> <code style="background-color: #ccbfee;">c</code>
> <code style="background-color: #beedc6;">d</code>
> <code style="background-color: #f6d9ab;">e</code>
> <code style="background-color: #f4aeb1;">f</code>
> <code style="background-color: #a4dcf3;">g</code>
> <code style="background-color: #ccbfee;">h</code>
> <code style="background-color: #beedc6;">i</code>
> <code style="background-color: #f6d9ab;">j</code>
> <code style="background-color: #f4aeb1;">k</code>
> <code style="background-color: #a4dcf3;">l</code>

Tokenizers for generation need to be able to decode reversibly,
but generation isn't a goal for me, for now at least, so this
tokenizer will discard some of its input in order to better distil
the meaning of what it's looking at.
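
To make the lossy, DOM-aware idea concrete, here is a toy sketch built on Python's standard `html.parser` module. It is not this project's implementation: the names `SketchTokenizer` and `sketch_tokenize`, the rule of splitting on every non-alphanumeric character, and the use of `_` to mark the gap before an attribute are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption for this sketch: a "word" is a run of ASCII letters/digits.
_WORD = re.compile(r"[A-Za-z0-9]+")


class SketchTokenizer(HTMLParser):
    """Toy DOM-aware tokenizer; hypothetical, not the real implementation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens += ["<", tag]
        for name, value in attrs:
            # "_" stands in for the whitespace before each attribute.
            self.tokens += ["_", *_WORD.findall(name)]
            if value is not None:
                # Quotes and punctuation in the value are discarded.
                self.tokens += ["=", *_WORD.findall(value)]
        self.tokens.append(">")

    def handle_endtag(self, tag):
        self.tokens += ["</", tag, ">"]

    def handle_data(self, data):
        # Lossy step: whitespace and punctuation in text are dropped,
        # so the original markup cannot be rebuilt from the tokens.
        self.tokens += _WORD.findall(data)


def sketch_tokenize(html):
    parser = SketchTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

With this sketch, `sketch_tokenize('<p class="a-b">Hi,  there</p>')` yields `<`, `p`, `_`, `class`, `=`, `a`, `b`, `>`, `Hi`, `there`, `</`, `p`, `>`. The quotes, the hyphen, the comma and the doubled space are all gone, which is exactly why a tokenizer like this cannot decode reversibly.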

## Installation

### With PIP