
mnemonic bisect_left will fail (ungracefully) on other bip39 wordlists #132

Open
wizardofozzie opened this issue Nov 25, 2015 · 12 comments

@wizardofozzie
Contributor

Bisect only works on sorted wordlists. This isn't an issue if only english.txt is used; however, forks using other standard BIP39 wordlists will fail most ungracefully.

e.g. the Japanese wordlist (forum post)
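A minimal sketch of the failure mode: bisect_left assumes its input is sorted, so on an unsorted (or locale-permuted) wordlist it silently returns a wrong index instead of raising an error.

```python
from bisect import bisect_left

words = ["zoo", "abandon", "ability"]  # not in sorted order
idx = bisect_left(words, "zoo")
print(idx)  # 3 -- past the end, even though "zoo" is in the list
print(idx < len(words) and words[idx] == "zoo")  # False: "zoo" is reported missing
```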

@cryptid11

Yes, this line is wrong:

wordlist_english = list(open(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'english.txt'), 'r'))
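A hedged sketch of a more robust load (assuming english.txt ships next to the module): strip the trailing newline from each entry so lookups compare bare words rather than "word\n".

```python
import os

# Path relative to this module file; english.txt assumed to sit alongside it
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'english.txt')
with open(path, 'r') as f:
    wordlist_english = [w.strip() for w in f]
```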

@Steve132
Contributor

> however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard BIP39 English wordlist, and all the wordlists being lexicographically sorted is part of the BIP39 spec:

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
  • this also allows a trie (a prefix tree) to be used, e.g. for better compression

That said, you aren't wrong that the bisection algorithm is incorrect, just for several completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions.
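For instance, a minimal sketch (treating the exact signature as an assumption; forks differ, and the phrase below is only a placeholder, not a valid mnemonic):

```python
import bitcoin  # pybitcointools

# Load a non-English BIP39 list, preserving the file's own order
spanish = [w.strip() for w in open('spanish.txt')]
bitcoin.words_verify("ábaco ábaco ábaco", wordlist=spanish)  # placeholder phrase
```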

@vbuterin
Owner

Perhaps it might be best, for convenience reasons, to include an assert loop in the code that makes sure a list is sorted before using it? This seems to have come up many times now.
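Something like this minimal guard would do it (a sketch, not actual repo code):

```python
def assert_sorted(wordlist):
    # Fail loudly up front, before bisect can return silently wrong indices
    assert list(wordlist) == sorted(wordlist), "wordlist must be sorted for bisect lookup"
```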


@wizardofozzie
Contributor Author

@vbuterin In my fork there's an assert for bip39_check. I believe the checksum should flag mis-sorted wordlists. (FWIW, my code diverges for BIP39 because it was written prior to this branch's PR.)

@Steve132
Contributor

Actually, I've been doing some research on this, and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist... OR a hash table should be used to compute the index based on the position in the file.

This is because, in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE, whereas implementing binary search or re-sorting the input list will use the CURRENT LOCALE for the comparisons, possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

  1. Always implement linear search (this is what my current version does in my branch). Upside: this is simple and always works. Downside: it's O(n) for every lookup.
  2. Implement a hash table that maps words -> indices (sketched just below). Upside: this is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (e.g. have an internal wordlist type that carries a correct hash table, and if the wordlist argument is a plain list, convert it and cache the result).
  3. Detect whether sorted(list) == list when a non-standard list is given. If it does, use the binary-search path; otherwise, use a linear search. This result will also have to be cached.
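A rough sketch of option 2 (the names here are hypothetical, not from the repo): build a word -> index table once from the file order, cache it per list, and make every lookup O(1) with no locale-aware string comparisons.

```python
_index_cache = {}  # id(wordlist) -> {word: index}; assumes the list outlives the cache entry

def word_index(wordlist, word):
    table = _index_cache.get(id(wordlist))
    if table is None:
        # Indices come from the position in the file, as the BIP39 spec requires
        table = dict((w, i) for i, w in enumerate(wordlist))
        _index_cache[id(wordlist)] = table
    return table[word]
```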


@vbuterin
Owner

> 3. Detect whether sorted(list) == list when a non-standard list is given. If it does, use the binary-search path; otherwise, use a linear search. This result will also have to be cached.

If you're computing the sorted list anyway, would it not be simpler to just
cache the sorted list for each list?


@Steve132
Contributor

No, because then the result would be incorrect. The whole point is that the spec says the 11 bits correspond to the index of the word in the given wordlist file, in the order given in the file. This is important because the spec doesn't specify how the list is supposed to be searched, only that the result must be consistent with the order given in the standard file, for compatibility.

The words are supposed to be sorted in the file, but the bug appears because the sort order for comparing two strings is locale-dependent. That is, if we compute cached_lists[id(wordlist)] = sorted(wordlist) and the code runs in a locale different from the one that generated the original standard file, the sort order will be WRONG and produce the incorrect 11 bits on a lookup, because the words will appear at a different index than the one in the file, which is unacceptable for obvious reasons.

You could compute cached_hashtables[id(wordlist)] = dict((value, index) for index, value in enumerate(wordlist)) and that would be correct AND O(1), but that's significant complexity when a linear search through 2048 strings is probably not THAT expensive relative to how often it is likely to be needed, ESPECIALLY in Python, where this is unlikely to be a high-performance embedded application anyway.
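To see the locale dependence concretely, a minimal sketch (assuming the fr_FR.UTF-8 locale is installed): accented strings sort differently under French collation than under plain code-point order, so a "sorted" wordlist in one locale is unsorted in another.

```python
import locale

words = ["côté", "cote", "coté", "côte"]
print(sorted(words))  # code-point order: ['cote', 'coté', 'côte', 'côté']

locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
print(sorted(words, key=locale.strxfrm))  # French collation: ['cote', 'côte', 'coté', 'côté']
```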

@Steve132
Contributor

I fixed this with pull request #141.

It uses a hash table for the index-lookup and word-check operations, which is O(1) and doesn't require locale-dependent string comparisons for either operation. All functions that expect a wordlist can use this type as the wordlist for improved performance, but they all degrade gracefully to the O(n) case if the user provides a plain list instead of an instance of the type.

It also adds support for ALL the languages in BIP39, which are now selectable. English remains the default, but you can use any of the others without needing to load them yourself (e.g. bitcoin.words_verify(spanish_phrase, wordlist=bitcoin.wordlists['spanish'])).
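A hedged sketch of the idea (the class name is hypothetical, not the actual PR #141 API): a list subclass that precomputes the word -> index table from file order, while plain lists still work through list.index's O(n) scan.

```python
class Wordlist(list):
    """Wordlist that caches word -> file-order index for O(1) lookups."""

    def __init__(self, words):
        list.__init__(self, words)
        self._index = dict((w, i) for i, w in enumerate(self))

    def index(self, word):
        # Overrides list.index's O(n) scan; order matches the file, never a locale sort
        return self._index[word]

# Functions that call wordlist.index(word) get O(1) with a Wordlist,
# and still work (at O(n), via list.index) when given a plain list.
```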

@wizardofozzie
Contributor Author

@Steve132 I've just taken a cursory glance, but it looks fantastic so far, great work!
Take a look at this discussion. Unfortunately, wordlists will always have ongoing issues, both because of differing standards (Electrum 2.0 / BIP39) and because of locale.

@Steve132
Contributor

> Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

This patch fixes the locale-based search issues because it uses only the original index from the standard wordlist, NOT an index calculated from a sorted wordlist or a binary search with string comparisons. It stays fast by using a hash table.

I think someone should patch the BIP39 specification where it says the lists must be sorted so as to enable binary search, since, as we've seen, binary search cannot be done in a locale-independent way. Instead, perhaps the specification should recommend that a hash function be used, perhaps even including C code and lookup tables for a minimal-perfect-hash implementation for all languages.

@reiven
Contributor

reiven commented Jan 29, 2016

@wizardofozzie Hi! I'm trying to use your fork, but I found that the tests are failing. Can you enable issues on your project so users can report this kind of error? Thanks!

@wizardofozzie
Contributor Author

@reiven Absolutely, mate. I've been dabbling in iOS Pythonista 2.0, Pythonista 1.6, and another branch, so I'll be consolidating and testing the code ASAP (~1-2 days maximum).
