rethinking the Trie #14

Open

jtauber opened this issue Oct 16, 2017 · 1 comment

jtauber (Owner) commented Oct 16, 2017

The key to any speedup and/or reduction in memory consumption is likely to involve replacing the Trie.

The Trie only exists because we need to be able to do longest-prefix retrieval on a dictionary.

We could use a regular Python dictionary if incoming strings were tokenised by collation key.

What I'm basically thinking is (a rough sketch follows the list):

  1. build a regular Python dictionary for the collation table (some keys of which will consist of multiple characters)
  2. build a regex using the keys in the collation table
  3. use that regex to tokenise incoming strings (this has the effect of a Trie)
  4. look the tokens up (some of which will be multiple characters) in the Python dictionary
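
A minimal sketch of steps 1–4, with a made-up toy table (the real collation table is far bigger). The key detail is sorting the alternation longest-first, since Python's regex alternation is leftmost-first rather than longest-match:

```python
import re

# Toy stand-in for the collation table (hypothetical values);
# note the multi-character key "ch".
table = {
    "a": [(0x15EF, 0x0020, 0x0002)],
    "b": [(0x1605, 0x0020, 0x0002)],
    "c": [(0x1616, 0x0020, 0x0002)],
    "ch": [(0x162E, 0x0020, 0x0002)],
}

# Step 2: build a regex from the keys, longest first, so that
# "ch" is tried before "c" -- this is what reproduces the
# longest-prefix behaviour of the Trie.
tokenizer = re.compile(
    "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
)

def collation_elements(s):
    # Steps 3 and 4: tokenise, then look each token up
    # in the plain dict.
    return [ce for token in tokenizer.findall(s) for ce in table[token]]

print(collation_elements("bach"))  # tokens 'b', 'a', 'ch' -- not 'b', 'a', 'c'
```

One wrinkle: `findall` silently drops characters with no table entry, so a real implementation would still need the fallback for unmatched characters (e.g. derived implicit weights).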

It will be worth having some sort of timed test to see whether this approach actually makes a (positive) difference, but I suspect it will.
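
Even a quick `timeit` run would settle it (a hypothetical harness, reusing `collation_elements` from the sketch above):

```python
import timeit

words = ["bach", "chab", "abc"] * 1000

# Time the regex+dict path; pointing the same harness at the existing
# Trie-based lookup gives the baseline it has to beat.
elapsed = timeit.timeit(
    lambda: [collation_elements(w) for w in words], number=100
)
print(f"regex+dict: {elapsed:.3f}s")
```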

jtauber (Owner, Author) commented Oct 20, 2018

I'm not actually sure this approach is feasible in the general case: the algorithm involves writing back to the string being tokenised, so the string can't be "pre-tokenised".
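
For context, my reading of UTS #10 is that the write-back comes from discontiguous contractions: step S2.1.3 extends a match S to S + C with a later unblocked non-starter C and then removes C from the string. Something like this happens mid-scan (an illustrative, assumed example, not code from this repo):

```python
# Suppose the table has a contraction for "a" + U+030A (ring above),
# and another combining mark of a lower combining class sits between.
text = list("a\u0323\u030A")   # 'a', COMBINING DOT BELOW, COMBINING RING ABOVE

# On matching "a", the UCA loop finds the unblocked ring further on,
# extends the match to the contraction, and removes the ring from
# the input -- mutating the very string being tokenised.
match = "a" + text.pop(2)      # match is now "a\u030A"
rest = "".join(text[1:])       # the loop continues over just "\u0323"

# A regex tokeniser over the original, unmutated string can never
# produce that contraction token.
```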
