rethinking the Trie #14

Open

jtauber opened this issue Oct 16, 2017 · 1 comment

jtauber (Owner) commented Oct 16, 2017

The key to any speedup and/or reduction in memory consumption is likely to involve replacing the Trie.

The Trie only exists because we need to be able to do longest-prefix retrieval on a dictionary.

We could use a regular Python dictionary if incoming strings were tokenised by collation key.

What I'm basically thinking is (a rough sketch follows the list):

  1. build a regular Python dictionary for the collation table (some keys of which will consist of multiple characters)
  2. build a regex using the keys in the collation table
  3. use that regex to tokenise incoming strings (this has the effect of a Trie)
  4. look the tokens up (some of which will be multiple characters) in the Python dictionary
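
A minimal sketch of steps 1–4, with a made-up toy table (the real collation table is far bigger). The key detail is sorting the alternation longest-first, since Python's regex alternation is leftmost-first rather than longest-match:

```python
import re

# Toy stand-in for the collation table (hypothetical values);
# note the multi-character key "ch".
table = {
    "a": [(0x15EF, 0x0020, 0x0002)],
    "b": [(0x1605, 0x0020, 0x0002)],
    "c": [(0x1616, 0x0020, 0x0002)],
    "ch": [(0x162E, 0x0020, 0x0002)],
}

# Step 2: build a regex from the keys, longest first, so that
# "ch" is tried before "c" -- this is what reproduces the
# longest-prefix behaviour of the Trie.
tokenizer = re.compile(
    "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
)

def collation_elements(s):
    # Steps 3 and 4: tokenise, then look each token up
    # in the plain dict.
    return [ce for token in tokenizer.findall(s) for ce in table[token]]

print(collation_elements("bach"))  # tokens 'b', 'a', 'ch' -- not 'b', 'a', 'c'
```

One wrinkle: `findall` silently drops characters with no table entry, so a real implementation would still need the fallback for unmatched characters (e.g. derived implicit weights).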

It will be worth having some sort of timed test to see whether this approach actually makes a (positive) difference, but I suspect it will.
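
Even a quick `timeit` run would settle it (a hypothetical harness, reusing `collation_elements` from the sketch above):

```python
import timeit

words = ["bach", "chab", "abc"] * 1000

# Time the regex+dict path; pointing the same harness at the existing
# Trie-based lookup gives the baseline it has to beat.
elapsed = timeit.timeit(
    lambda: [collation_elements(w) for w in words], number=100
)
print(f"regex+dict: {elapsed:.3f}s")
```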

jtauber (Owner, Author) commented Oct 20, 2018

I'm not actually sure this approach is feasible in the general case: the algorithm involves writing back to the string being tokenised, so the string can't be "pre-tokenised".
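
For context, my reading of UTS #10 is that the write-back comes from discontiguous contractions: step S2.1.3 extends a match S to S + C with a later unblocked non-starter C and then removes C from the string. Something like this happens mid-scan (an illustrative, assumed example, not code from this repo):

```python
# Suppose the table has a contraction for "a" + U+030A (ring above),
# and another combining mark of a lower combining class sits between.
text = list("a\u0323\u030A")   # 'a', COMBINING DOT BELOW, COMBINING RING ABOVE

# On matching "a", the UCA loop finds the unblocked ring further on,
# extends the match to the contraction, and removes the ring from
# the input -- mutating the very string being tokenised.
match = "a" + text.pop(2)      # match is now "a\u030A"
rest = "".join(text[1:])       # the loop continues over just "\u0323"

# A regex tokeniser over the original, unmutated string can never
# produce that contraction token.
```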
