use bytebuffer abstraction instead of byte[] #20

shlomiv · 2015-12-02T00:31:59Z

Hey,
I would love it if trie4j could use ByteBuffer instead of byte[]. This could be useful for avoiding heap space for large static tries, by using DirectByteBuffer, MappedByteBuffer etc..

I only looked at the code briefly, and found that it could be quite simple to modify BytesSuccinctBitVector.java to use ByteBuffer. Wanted to know if there are more places I should be looking at?

WDYT?

Thanks!

takawitter · 2015-12-02T03:57:45Z

I'm not sure, but using ByteBuffer may cause serious performance decrease because of copy operation from native memory to heap.
But, anyway, you can try and measure performance.
I recommend creating ByteBufferSuccinctBitVector class by copy and modify BytesSuccinctBitVector and use it in TailLOUDSTrie class such like:

    new TailLOUDSTrie(
        firstTrie,
        new LOUDSBvTree(new ByteBufferSuccinctBitVector(firstTrie.nodeSize() * 2)),
        new ConcatTailArrayBuilder(firstTrie.size() * 4));

shlomiv · 2015-12-02T04:32:55Z

There might be some performance decrease, but in some cases (like mine, where things are really huge) eliminating extra GC could prove to be beneficial in the big picture.. In any case this assumption has to be measured.

When Ill get to this, should I bother with the specialized BitVectors as well (such as BytesRank0OnlySuccinctBitVector)?

takawitter · 2015-12-02T07:12:39Z

I see.

FYI, according to the URL below, using Unsafe class and DirectBuffer class seem to provide faster access to natively allocated memory like this:

Field field = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
unsafe = (sun.misc.Unsafe) field.get(null);
sun.nio.ch.DirectBuffer buffer=(sun.nio.ch.DirectBuffer)ByteBuffer.allocateDirect(size);
unsafe.getInt(byteBuffer.address(), offset);

http://stackoverflow.com/questions/11174231/compare-direct-and-non-direct-bytebuffer-get-put-operations

I don't think you must consider about bit vector variations because these classes provides slightly low performance improvements (But if you need it, you can implement it :).

shlomiv · 2015-12-02T11:41:05Z

Thanks :)
Initial testing with AllTries.java and jawiki-20150805-all-titles-in-ns0.gz gives:

with 1564854 words and 12611354 chars in wikipedia titles.
warming up... running all tries 10 times.
TailLOUDSTrie(LOUDSBvTree(BytesSuccinctBitVector),ConcatTailArrayBuilder)
TailLOUDSTrie(LOUDSBvTree(ByteBufferSuccinctBitVector),ConcatTailArrayBuilder)
1 2 3 4 5 6 7 8 9 10 ... done.
executiong each process 10 times.
TailLOUDSTrie(LOUDSBvTree(BytesSuccinctBitVector),ConcatTailArrayBuilder), 265, 721, 38914280
TailLOUDSTrie(LOUDSBvTree(ByteBufferSuccinctBitVector),ConcatTailArrayBuilder), 247, 754, 38592176

So there is a little delay for a little improvement in memory consumption. Looking further with memory analyzer, it appears that TailLOUDSTrie.labels is actually taking the majority of space. Ill make a BBTailLOUDSTrie with similar optimizations and recheck.

takawitter · 2015-12-02T13:39:56Z

That's so nice!!!
I can't wait your pull-req (Do you plan to do that?)

Ahh.. Yes, indeed, the tail array is a memory eater. Though I want to improve that, currently no idea.

shlomiv · 2015-12-02T21:41:53Z

Thanks! Yes, I plan on sending a pull request, still some more things to check though.

Here is another attempt, this time adding a BBTailLOUDSTrie:

TailLOUDSTrie(LOUDSBvTree(BytesSuccinctBitVector),ConcatTailArrayBuilder), 226, 655, 39325432
BBTailLOUDSTrie(LOUDSBvTree(ByteBufferSuccinctBitVector),ConcatTailArrayBuilder), 244, 751, 35004480

We again see some slight degradation in speed but this time a more noticeable improvement in memory consumption! I am sure we could propagate these changes to more places and reduce memory usage even further..

Do you have any papers you were basing your implementation on that I could read? I would love improve my understanding of this project.

takawitter · 2015-12-03T16:04:25Z

That's great!
I'm looking forward to your pull req :)

I read Japanese book http://www.amazon.co.jp/gp/product/4774149934/ (Technologies to support Japanese input method editors)
and also gather information from web (all Japanese):

I think base implementation is similar to original papers because the authors of documents above refer those, but I add following some little optimizations:

use index to speed up bitvector access (BytesSuccinctBitVector.indexCache0)
use Character to code map to make DoubleArray dense (DoubleArray.charToCode)

and so on.
But I will not be surprised if other implementation has those kind of optimization because those are very simple.

knutwannheden · 2024-07-04T01:18:59Z

What ever happened to this? Would indeed be nice with an implementation that can use off-heap memory.

takawitter · 2024-07-16T04:55:29Z

@knutwannheden
I added ByteBufferSBV.
7c17679

But we still need to implement the ByteBuffer version of TailLOUDSTrie according to #20 (comment)

takawitter · 2024-07-16T09:41:55Z

@knutwannheden
To reduce heap usage, we need to implement the ByteBuffer version of TailLOUDSTrie(store labels on bytebuffer) and TailArrays(store tail chars on bytebuffer). It may cause huge performance loss and we still need heap memory enough to store PatriciaTrie because only PatriciaTrie can build trie from a set of strings and TailLOUDSTrie needs it.
Do you think bytebuffer versions are worth implementing even in this situation?

takawitter self-assigned this Jul 16, 2024

takawitter added the enhancement label Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use bytebuffer abstraction instead of byte[] #20

use bytebuffer abstraction instead of byte[] #20

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 3, 2015

knutwannheden commented Jul 4, 2024

takawitter commented Jul 16, 2024 •

edited

Loading

takawitter commented Jul 16, 2024 •

edited

Loading

use bytebuffer abstraction instead of byte[] #20

use bytebuffer abstraction instead of byte[] #20

Comments

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 2, 2015

shlomiv commented Dec 2, 2015

takawitter commented Dec 3, 2015

knutwannheden commented Jul 4, 2024

takawitter commented Jul 16, 2024 • edited Loading

takawitter commented Jul 16, 2024 • edited Loading

takawitter commented Jul 16, 2024 •

edited

Loading

takawitter commented Jul 16, 2024 •

edited

Loading