Tokenizer enhancements #5597

solardiz opened this issue Dec 1, 2024 · 1 comment
solardiz commented Dec 1, 2024

This is to keep track of issues with and enhancement ideas for the tokenizer (currently tokenize.pl) and its uses.

solardiz added the enhancement, RFC / discussion, and Help or comments wanted labels on Dec 1, 2024

solardiz commented Dec 2, 2024

  1. Currently we separately determine the tokens to use (tokenize.pl) and perform actual tokenization (a generated sed program). It could improve ease of use (at the cost of reduced flexibility) and allow for further enhancements if actual tokenization is performed by the same program (which means it would have to load its input into memory or re-read it from a file - it would not be able to process it solely from stdin anymore).
  2. Currently the token allocation is performed without consideration for how some tokens (applied first) reduce the usage of other, overlapping tokens (applied later). This results in us wasting some token codes on rarely-used tokens (e.g. "lov" and "ove" when we also have "love"). We could perform some kind of iterated token reallocation before deciding on a better set of tokens to use (see the overlap-aware recounting sketch after this list). More than 2 passes over the input would be needed.
  3. The tokenizer currently uses a fixed set of 158 token codes. It could as well or instead use any codes that are unused in the training input, assuming that the same training input is used for both determining the tokens to use (tokenize.pl) and actual tokenization (sed). For example, for the Project Gutenberg Australia training mentioned in https://www.openwall.com/lists/john-users/2024/11/20/1, "We could have 40 more tokens here if the tokenizer dynamically allocated all unused codes as potential tokens." A counting pass like the one sketched after this list would identify such codes.
  4. We currently exclude training inputs that contain any of the 158 codes we intend to reuse as tokens. This is unnecessarily strict because some of those codes in the inputs could have been part of our token substrings and thus be replaced with proper token codes, potentially leaving no non-translated uses of the codes in the input string. For example, a multi-byte UTF-8 letter very common in the input would conceivably have been allocated a token code, but our current implementation unnecessarily excludes (and worse, only in sed) all input strings containing such a letter (even potentially wasting a token code that may have been allocated for it). We need to separately track original vs. replaced codes, so that we'd reject inputs only if any original codes clashing with token codes remain (see the original-vs-replaced tracking sketch after this list).
  5. It may happen that certain codes seen in training input are never seen after tokenization (are only ever seen as part of the token strings), in which case they should ideally be reused as extra token codes.
  6. Some codes may be seen in input so rarely - or become so rare after tokenization - that we'd prefer to omit such training inputs in order to free up and use more token codes (a configurable threshold on number or percentage of uses, or maybe a dynamic threshold matching the expected number of uses of the next extra token we could allocate).
  7. Currently the tokens represent constant substrings. We could also try having (and generating where appropriate) a few magic token codes that repeat/recall/reflect a preceding part of the training or candidate password (referred to either by token or character relative positions) - e.g., "repeat last 3 tokens" or "recall/insert here the 3 tokens seen 2 tokens earlier" or "insert here a reflected/reversed substring seen from 5 to 2 characters earlier". This may be crazy and add excessive complexity (and it could be tricky to choose an optimal set of just a few such magic tokens), but it's similar to how LZ77 compression works, so it could result in us capturing more password patterns in fewer bytes and in improved learning vs. memorizing.
  8. As long as we preserve individual characters as well as use (overlapping) tokens, the same strings may be represented in multiple ways. Incremental mode eventually tries all combinations, including those never seen in its training input. This means there are some duplicates after token expansion. We could detect and skip (some, most, or all) effectively-duplicate token strings (e.g., insist on certain ordering of tokens). This is complicated and slow - maybe not worth it. Technically, it could be implemented in the generated external mode (skipping of individual candidates only) or in modified incremental mode (potentially skipping of prefixes corresponding to groups of candidates at once).
  9. We should allow enabling of our generic opportunistic dupe suppressor along with external mode filters (More use of opportunistic dupe suppression #5055), but this would rarely be worth it since the percentage of dupes seen within a few billion is (luckily) very low (in testing of tokenized incremental mode runs so far) and this suppressor is ineffective across candidate password counts way in excess of its memory allocation.
  10. Our command-line length control options are confused by the external filter altering the length. Currently, specifying e.g. -inc=custom -ext=untokenize -length=8 appears to result in length 8 (tokens) generated by incremental mode, possibly longer passwords (characters) coming out of external mode, and those then getting truncated at 8 (a sub-optimal order and probably more dupes). This is likely tricky to correct. We may want to document the issue in the meantime. Maybe even print a warning (once) when these options are used and the filter alters the length.
  11. External mode support could be enhanced to allow for more than one character in a word element - essentially moving the "right shift by 8 until zero" loop from the Untokenize external mode into external.c (see the expansion-loop sketch after this list). This would also eliminate the need for the external mode to save a copy of the input word, because at that level the length (number of used elements) would remain unchanged, so all replacements could be made in-place. A drawback is that other external modes would become a tiny bit slower, especially if this extra logic is enabled unconditionally (we could instead enable it by a flag to be set by the external mode, similarly to how we already have utf32). Edit: alternatively or additionally, we could add native code support for the entire mapping currently performed by the filter - e.g., if the external mode defines int charmap[0x100], then our external.c detects that and does the right thing even in the absence of a filter (the array would still be initialized by init). Such a mode could also support an optional filter (called before or after the built-in mapping? depends on what we need for item 7 above).
  12. The entire logic could be rolled into charset.c and inc.c, keeping the mapping inside .chr files (of a new file format version). This would improve ease of use and speeds, but complicate the code and hurt flexibility (would be tied to incremental mode, although the existing tokenizer could still be maintained separately and used with other/third-party models).
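
A toy C illustration of the overlap-aware recounting idea from item 2: count how often each candidate token is actually used once longer candidates are applied first. The corpus, the token list, and the greedy longest-match-first rule are made up for illustration and are not how tokenize.pl currently decides:

```c
#include <stdio.h>
#include <string.h>

static const char *corpus[] = { "love2024", "lovely", "glove", "clover" };
static const char *tokens[] = { "love", "lov", "ove" }; /* longer candidates listed first */
#define NCORPUS (sizeof(corpus) / sizeof(corpus[0]))
#define NTOKENS (sizeof(tokens) / sizeof(tokens[0]))

int main(void)
{
	unsigned long used[NTOKENS] = {0};
	size_t i, j;

	for (i = 0; i < NCORPUS; i++) {
		const char *p = corpus[i];
		while (*p) {
			size_t matched = 0;
			for (j = 0; j < NTOKENS; j++) {
				size_t len = strlen(tokens[j]);
				if (!strncmp(p, tokens[j], len)) {
					used[j]++;
					matched = len;
					break; /* earlier (longer) candidates win */
				}
			}
			p += matched ? matched : 1;
		}
	}

	for (j = 0; j < NTOKENS; j++)
		printf("%-4s used %lu times after overlap-aware recounting\n",
		    tokens[j], used[j]);
	return 0;
}
```

With this input, "lov" and "ove" end up with zero realized uses because "love" absorbs them all, so a reallocation pass would drop them and free their codes for better candidates.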
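
A minimal counting pass in the spirit of items 3, 5, and 6, reading the training input from stdin one candidate per line; the rarity threshold is an arbitrary placeholder rather than anything the current scripts implement:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t count[0x100] = {0};
	uint64_t total = 0;
	double threshold = 1e-6; /* placeholder rarity cutoff, as a fraction of all bytes */
	int c, unused = 0, rare = 0;

	while ((c = getchar()) != EOF) {
		if (c == '\n')
			continue; /* line separators are not candidate password content */
		count[c]++;
		total++;
	}

	for (c = 1; c < 0x100; c++) {
		if (c == '\n')
			continue;
		if (!count[c])
			unused++; /* items 3 and 5: free to reuse as an extra token code */
		else if ((double)count[c] / total < threshold)
			rare++; /* item 6: maybe worth dropping the lines that use it */
	}

	printf("%d unused codes, %d rare codes out of %llu bytes\n",
	    unused, rare, (unsigned long long)total);
	return 0;
}
```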
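
A made-up sketch of the original-vs-replaced tracking from item 4: replace token substrings while remembering which output codes are token codes and which are untouched original bytes, and reject a line only when an untouched original byte collides with a token code. The token table, the assumption that codes 0x80 and above are the ones reused for tokens, and the function name are all hypothetical:

```c
#include <stdio.h>
#include <string.h>

#define TOKEN_CODE_FIRST 0x80 /* assume codes 0x80+ are the ones reused as tokens */

static const struct { const char *s; unsigned char code; } tok[] = {
	{ "\xc3\xa9", 0x80 }, /* a common multi-byte UTF-8 letter gets its own token code */
	{ "love", 0x81 },
};
#define NTOK (sizeof(tok) / sizeof(tok[0]))

/* Returns the tokenized length, or -1 if the line must still be rejected. */
static int tokenize_line(const char *in, unsigned char *out)
{
	int len = 0;

	while (*in) {
		size_t i, tl = 0;

		for (i = 0; i < NTOK; i++) {
			tl = strlen(tok[i].s);
			if (!strncmp(in, tok[i].s, tl))
				break;
		}
		if (i < NTOK) {
			out[len++] = tok[i].code; /* replaced: no clash possible here */
			in += tl;
		} else {
			if ((unsigned char)*in >= TOKEN_CODE_FIRST)
				return -1; /* an original byte clashes with a token code */
			out[len++] = (unsigned char)*in++;
		}
	}
	return len;
}

int main(void)
{
	unsigned char out[256];

	/* Accepted: the clashing UTF-8 bytes are consumed by the 0x80 token,
	   whereas the current sed-generated program rejects this line. */
	printf("caf + e-acute -> %d codes\n", tokenize_line("caf\xc3\xa9", out));
	/* Still rejected: a raw 0x90 byte remains that no token covers. */
	printf("raw 0x90      -> %d\n", tokenize_line("a\x90z", out));
	return 0;
}
```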
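
A rough sketch of what item 11's native expansion-loop support in external.c might look like, with each word element packing up to four characters the way the Untokenize external mode does; ext_word, charmap, ext_to_key and the constants are names chosen for this sketch, not the actual external.c interface:

```c
#include <stdio.h>

#define EXT_WORD_MAX 128
#define PLAINTEXT_MAX 125

/* Hypothetical per-mode table filled in by init(): maps each token code to
   up to four characters packed low byte first, or to itself for plain codes. */
static unsigned int charmap[0x100];

/* Candidate from incremental mode: one code per element, zero-terminated. */
static int ext_word[EXT_WORD_MAX];

/* Convert the tokenized word to a C string, expanding each element with the
   same "right shift by 8 until zero" loop the Untokenize filter uses now. */
static int ext_to_key(char *key)
{
	int i, len = 0;

	for (i = 0; ext_word[i]; i++) {
		unsigned int packed = charmap[ext_word[i] & 0xff];
		while (packed) {
			if (len >= PLAINTEXT_MAX)
				return -1; /* overlong after expansion */
			key[len++] = (char)(packed & 0xff);
			packed >>= 8;
		}
	}
	key[len] = 0;
	return len;
}

int main(void)
{
	char key[PLAINTEXT_MAX + 1];
	int c, len;

	/* Made-up mapping: code 0x80 expands to "love", everything else maps
	   to itself, as the external mode's init() would set up. */
	for (c = 1; c < 0x100; c++)
		charmap[c] = c;
	charmap[0x80] = 'l' | ('o' << 8) | ('v' << 16) | ('e' << 24);

	ext_word[0] = 0x80;
	ext_word[1] = '1';
	ext_word[2] = 0;

	len = ext_to_key(key);
	printf("expanded to \"%s\" (%d chars)\n", key, len);
	return 0;
}
```

Here the word array keeps one element per token, so its length never changes; the multi-character expansion only happens when converting to the key buffer, which is what would let the code-to-token replacements stay in-place.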
