
mnemonic bisect_left will fail (ungracefully) on other bip39 wordlists #132

Open
wizardofozzie opened this issue Nov 25, 2015 · 12 comments

@wizardofozzie
Contributor

Bisect only works on sorted wordlists. This isn't an issue if only english.txt is used; however, forks using other standard BIP39 wordlists will fail most ungracefully.

e.g. the Japanese wordlist (forum post)
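A minimal sketch of the failure mode: bisect_left assumes its input is sorted, so on an unsorted (or locale-permuted) wordlist it silently returns a wrong index instead of raising an error.

```python
from bisect import bisect_left

words = ["zoo", "abandon", "ability"]  # not in sorted order
idx = bisect_left(words, "zoo")
print(idx)  # 3 -- past the end, even though "zoo" is in the list
print(idx < len(words) and words[idx] == "zoo")  # False: "zoo" is reported missing
```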

@cryptid11

Yes, this line is wrong:

wordlist_english = list(open(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'english.txt'), 'r'))
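A hedged sketch of a more robust load (assuming english.txt ships next to the module): strip the trailing newline from each entry so lookups compare bare words rather than "word\n".

```python
import os

# Path relative to this module file; english.txt assumed to sit alongside it
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'english.txt')
with open(path, 'r') as f:
    wordlist_english = [w.strip() for w in f]
```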

@Steve132
Contributor

> however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard BIP39 English wordlist, and all the wordlists being lexicographically sorted is part of the BIP39 spec:

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
  • this also allows a trie (a prefix tree) to be used, e.g. for better compression

That said, you aren't wrong that the bisection algorithm is incorrect, just for several completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions.
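For instance, a minimal sketch (treating the exact signature as an assumption; forks differ, and the phrase below is only a placeholder, not a valid mnemonic):

```python
import bitcoin  # pybitcointools

# Load a non-English BIP39 list, preserving the file's own order
spanish = [w.strip() for w in open('spanish.txt')]
bitcoin.words_verify("ábaco ábaco ábaco", wordlist=spanish)  # placeholder phrase
```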

@vbuterin
Owner

Perhaps it might be best, for convenience reasons, to include an assert loop in the code that makes sure a list is sorted before using it? This seems to have come up many times now.
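Something like this minimal guard would do it (a sketch, not actual repo code):

```python
def assert_sorted(wordlist):
    # Fail loudly up front, before bisect can return silently wrong indices
    assert list(wordlist) == sorted(wordlist), "wordlist must be sorted for bisect lookup"
```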


@wizardofozzie
Contributor Author

@vbuterin In my fork there's an assert for bip39_check. I believe the checksum should flag mis-sorted wordlists. (FWIW, my code diverges for BIP39 because it was written prior to this branch's PR.)

@Steve132
Contributor

Actually, I've been doing some research on this, and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist... OR a hash table should be used to compute the index based on the position in the file.

This is because, in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE, whereas implementing binary search or re-sorting the input list will use the CURRENT LOCALE for the comparisons, possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

  1. Always implement linear search (this is what my current version does in my branch). Upside: this is simple and always works. Downside: it's O(n) for every lookup.
  2. Implement a hash table that maps words -> indices (sketched just below). Upside: this is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (e.g. have an internal wordlist type that carries a correct hash table, and if the wordlist argument is a plain list, convert it and cache the result).
  3. Detect whether sorted(list) == list when a non-standard list is given. If it does, use the binary-search path; otherwise, use a linear search. This result will also have to be cached.
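A rough sketch of option 2 (the names here are hypothetical, not from the repo): build a word -> index table once from the file order, cache it per list, and make every lookup O(1) with no locale-aware string comparisons.

```python
_index_cache = {}  # id(wordlist) -> {word: index}; assumes the list outlives the cache entry

def word_index(wordlist, word):
    table = _index_cache.get(id(wordlist))
    if table is None:
        # Indices come from the position in the file, as the BIP39 spec requires
        table = dict((w, i) for i, w in enumerate(wordlist))
        _index_cache[id(wordlist)] = table
    return table[word]
```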


@vbuterin
Owner

> 3. Detect whether sorted(list) == list when a non-standard list is given. If it does, use the binary-search path; otherwise, use a linear search. This result will also have to be cached.

If you're computing the sorted list anyway, would it not be simpler to just
cache the sorted list for each list?


@Steve132
Contributor

No, because then the result would be incorrect. The whole point is that the spec says the 11 bits correspond to the index of the word in the given wordlist file, in the order given in the file. This is important because the spec doesn't specify how the list is supposed to be searched, only that the result must be consistent with the order given in the standard file, for compatibility.

The words are supposed to be sorted in the file, but the bug appears because the sort order for comparing two strings is locale-dependent. That is, if we compute cached_lists[id(wordlist)] = sorted(wordlist) and the code runs in a locale different from the one that generated the original standard file, the sort order will be WRONG and produce the incorrect 11 bits on a lookup, because the words will appear at a different index than the one in the file, which is unacceptable for obvious reasons.

You could compute cached_hashtables[id(wordlist)] = dict((value, index) for index, value in enumerate(wordlist)) and that would be correct AND O(1), but that's significant complexity when a linear search through 2048 strings is probably not THAT expensive relative to how often it is likely to be needed, ESPECIALLY in Python, where this is unlikely to be a high-performance embedded application anyway.
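To see the locale dependence concretely, a minimal sketch (assuming the fr_FR.UTF-8 locale is installed): accented strings sort differently under French collation than under plain code-point order, so a "sorted" wordlist in one locale is unsorted in another.

```python
import locale

words = ["côté", "cote", "coté", "côte"]
print(sorted(words))  # code-point order: ['cote', 'coté', 'côte', 'côté']

locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
print(sorted(words, key=locale.strxfrm))  # French collation: ['cote', 'côte', 'coté', 'côté']
```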

@Steve132
Contributor

I fixed this with pull request #141.

It uses a hash table for the index-lookup and word-check operations, which is O(1) and doesn't require locale-dependent string comparisons for either operation. All functions that expect a wordlist can use this type as the wordlist for improved performance, but they all degrade gracefully to the O(n) case if the user provides a plain list instead of an instance of the type.

It also adds support for ALL the languages in BIP39, which are now selectable. English remains the default, but you can use any of the others without needing to load them yourself (e.g. bitcoin.words_verify(spanish_phrase, wordlist=bitcoin.wordlists['spanish'])).
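A hedged sketch of the idea (the class name is hypothetical, not the actual PR #141 API): a list subclass that precomputes the word -> index table from file order, while plain lists still work through list.index's O(n) scan.

```python
class Wordlist(list):
    """Wordlist that caches word -> file-order index for O(1) lookups."""

    def __init__(self, words):
        list.__init__(self, words)
        self._index = dict((w, i) for i, w in enumerate(self))

    def index(self, word):
        # Overrides list.index's O(n) scan; order matches the file, never a locale sort
        return self._index[word]

# Functions that call wordlist.index(word) get O(1) with a Wordlist,
# and still work (at O(n), via list.index) when given a plain list.
```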

@wizardofozzie
Contributor Author

@Steve132 I've just taken a cursory glance, but it looks fantastic so far, great work!
Take a look at this discussion. Unfortunately, wordlists will always have ongoing issues, both because of differing standards (Electrum 2.0 / BIP39) and because of locale.

@Steve132
Contributor

> Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

This patch fixes the locale-based search issues because it uses only the original index from the standard wordlist, NOT an index calculated from a sorted wordlist or a binary search with string comparisons. It stays fast by using a hash table.

I think someone should patch the BIP39 specification where it says the lists must be sorted so as to enable binary search, since, as we've seen, binary search cannot be done in a locale-independent way. Instead, perhaps the specification should recommend that a hash function be used, perhaps even including C code and lookup tables for a minimal-perfect-hash implementation for all languages.

@reiven
Contributor

reiven commented Jan 29, 2016

@wizardofozzie Hi! I'm trying to use your fork, but I found that the tests are failing. Can you enable issues on your project so users can report this kind of error? Thanks!

@wizardofozzie
Contributor Author

@reiven Absolutely, mate. I've been dabbling in iOS Pythonista 2.0, Pythonista 1.6, and another branch, so I'll be consolidating and testing the code ASAP (~1-2 days maximum).
