Read-only secondary indexes should get a tailored in-memory data structure #624
Comments
Still open? |
Yes |
@anmol797 we should probably chat about this |
Can I take it up? |
Sure |
Are you familiar with basic data structures, especially functional/persistent structures? I'd like to discuss this first with someone already familiar with these topics :-) |
Yes, I am aware of all of these. |
OK, so basically the main data structure in Sirix is a "keyed", persistent trie used to fetch pages (full pages) or page fragments (e.g. only the changed records plus records that fall out of a sliding window). The data is stored in the leaf pages of the tries. Currently, we store JSON nodes or XML nodes in a trie. Secondary indexes, based on red-black trees, are currently stored in other tries. Since red-black trees are not cache-line friendly (and hopefully Valhalla bears fruit sooner rather than later), we could alternatively store the secondary indexes as Adaptive Radix Trees (or the newer variant, Height Optimized Tries). That said, also incorporating the sliding snapshot algorithm, which is used to version the current leaf pages, is a lot of work... I'm currently not sure whether the rotations needed to balance a binary tree lead to a lot of copied tree nodes (which are currently stored in the trie leaf pages). Instead of implementing a persistent ART that also versions the leaf pages, it would of course be much simpler to just read the stored red-black tree nodes into an ART, but I'm not sure that would really make sense. On the other hand, it's currently a "nice to have", but a lot of work (maybe half a year, or maybe even much more, as it's done in spare time). Another pressing issue is why a full scan of a bigger resource (a 3.8 GB JSON file stored in Sirix), traversed in parallel by N read-only trxs, is much slower than with only one trx (when profiling, it's currently not at all obvious to me what the issue is). |
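To make the "simply read the stored red-black tree nodes into an ART" option a bit more concrete, here is a minimal sketch of the load-once, read-only idea. It is not Sirix code and not a real ART: it is a plain byte-wise radix trie with fixed 256-slot nodes, without adaptive node sizes or path compression. The class name `ReadOnlyRadixIndex`, the `String` keys and the `Long` values (standing in for node keys) are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch, not Sirix code and not a full ART:
// fixed 256-way nodes, no adaptive node sizes, no path compression.
final class ReadOnlyRadixIndex {
    private static final class Node {
        final Node[] children = new Node[256]; // one slot per possible key byte
        Long value;                            // non-null if a key ends at this node
    }

    private final Node root = new Node();

    // Bulk-load once, e.g. from an in-order walk over the stored tree nodes.
    // The map stands in for whatever iterator the real storage layer provides.
    ReadOnlyRadixIndex(Map<String, Long> entries) {
        for (Map.Entry<String, Long> entry : entries.entrySet()) {
            Node node = root;
            for (byte b : entry.getKey().getBytes(StandardCharsets.UTF_8)) {
                int slot = b & 0xFF; // treat the byte as unsigned
                if (node.children[slot] == null) {
                    node.children[slot] = new Node();
                }
                node = node.children[slot];
            }
            node.value = entry.getValue();
        }
    }

    // Read-only lookup: returns the stored value, or null if the key is absent.
    Long get(String key) {
        Node node = root;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            node = node.children[b & 0xFF];
            if (node == null) {
                return null;
            }
        }
        return node.value;
    }
}
```

Because nothing is inserted after construction, no balancing, versioning or copy-on-write is needed in this structure; the versioned red-black tree nodes would remain the on-disk source of truth.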
Okay, understood, so basically reads need to be optimized. "Instead of implementing a persistent ART with also versioning the leaf pages it would of course be much simpler to simply read the stored red black tree nodes into an ART, but not sure if that really would make sense." Can you please explain this a bit more? |
The point is that it's currently more of a "nice to have" thing, with a good chance that the work is never going to be finished; it's more like "this could be better" ;-) However, I think what would be valuable in any case would be to separate the
|
Oh and BTW: we have to fix this first as currently the CI build always fails due to an old docker image format used by Keycloak 7.0.1, which is of course super old and should be updated: #711 |
Yeah, sure, that will be more effective and helpful for me. Please let me know what to start with. |
okay |
You could start with the Keycloak issue and then with the proposed refactoring? |
OK. |
You have to update the docker files and check that it works again, as startup scripts to import the realm are not supported anymore |
How can I test that change? Is there any reference? |
Currently, we're reading the red-black tree nodes from the disk on the first load and putting the nodes into a global buffer manager/cache. Still, we could, for instance, also use the adaptive radix tree or an even better data structure for read-only...
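As a rough illustration of "load on first access, then serve reads from memory", the following sketch caches one read-only index per index ID. It swaps in a deliberately simple structure, a pair of sorted arrays searched by binary search, rather than an ART. The names `SortedArrayIndex` and `IndexCache`, the `long` keys and values, and the loader callback are hypothetical and do not reflect Sirix's actual buffer manager.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Hypothetical read-only index: two parallel arrays, keys sorted ascending.
final class SortedArrayIndex {
    private final long[] keys;   // sorted ascending
    private final long[] values; // values[i] belongs to keys[i]

    SortedArrayIndex(long[] sortedKeys, long[] values) {
        this.keys = sortedKeys.clone();
        this.values = values.clone();
    }

    // Returns the value for the key, or -1 if the key is absent
    // (assumes -1 is never a valid value in this sketch).
    long get(long key) {
        int position = Arrays.binarySearch(keys, key);
        return position >= 0 ? values[position] : -1L;
    }
}

// Hypothetical cache: builds each index once, on first access.
final class IndexCache {
    private final Map<Integer, SortedArrayIndex> cache = new ConcurrentHashMap<>();
    private final IntFunction<SortedArrayIndex> loader; // stands in for reading from disk

    IndexCache(IntFunction<SortedArrayIndex> loader) {
        this.loader = loader;
    }

    SortedArrayIndex get(int indexId) {
        // computeIfAbsent runs the loader at most once per index ID.
        return cache.computeIfAbsent(indexId, id -> loader.apply(id));
    }
}
```

Contiguous arrays avoid the pointer chasing of a node-based tree, which is the cache-friendliness concern raised in the thread; whether this or a radix tree fits the real index types better is left open here.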