/account.get_exitable_utxos should not call OMG.DB.utxos() directly #1257
Ah sorry, I didn't say why I thought they're related. Yup, agree that they're different endpoints. The only relation is that the discussions that took place were about utxo retrieval, but there are actually 2 separate places to fix. So I'm adding a relation here so we don't mistake fixing only one issue for "the utxo retrieval issue is solved". I've added this reasoning to my original reference above. Thanks! |
The Slack discussion has already led to a few points:
|
As discussed in Slack, I will not follow up on the issue. Papa has been assigned to the issue and will follow up. |
I don't think we should shard by address; it's there to separate logically different things, the way we already use key prefixes. To summarize:
|
I investigated endpoint optimization options; here is what I considered. Assumptions:
**Option: Partitioning the address set**

The idea here is to divide the address set into partitions.

**Storage considerations**

*Limits of RocksDB*
We can store ~250M utxo positions in a single RocksDB value, so the storage engine is not limiting us.

*Storage space estimation*

Assuming a 64-bit utxo position (enough to encode a year of blocks), 65k partitions of 250 utxo positions each add less than 200MB of additional storage space (65,536 partitions × 250 positions × 8 bytes ≈ 131MB). This corresponds to a UTXO set overestimated by 60% (16.38M utxos); the average space consumption should be significantly lower. A list of 250 64-bit utxo positions serialized to binary (

**Performance considerations**

Here I'm presenting output from a test I've written to measure times of fetching the utxo set.

*Performed tests*
*Test description*

*Outcomes*
**Partitioning strategy**

Because addresses are generated at random, we can assume uniform distribution of the addresses in the address space. Therefore we can calculate the partition key from the formula: For simplicity we can choose. I also considered consistent hashing as a partition strategy, but it gains us nothing: even if we would like to adjust the partition count, doing so changes every partition. So I think simpler strategies are more suitable here.

**Solution draft**

First of all, please mind this is just a supportive endpoint of the security Watcher, which aims to remind the address owner of the utxo positions (s)he owns. An aware owner can skip this step and start exiting utxos by position right away. This endpoint was never meant to be used for transacting. The following draft tries to answer the performance issues while supporting the entire address space. Other options were mentioned in other options. Let's assume we have a function:
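For illustration, a minimal sketch of one such function, assuming the partition key is simply derived from the address's trailing bytes (the concrete formula may differ):

```elixir
defmodule Partition do
  # 2^16 partitions: with the ~16.38M utxos estimated above, that is
  # roughly 250 positions per partition.
  @partition_count 65_536

  # Assumed formula, one simple choice: addresses are uniformly
  # distributed, so the trailing 16 bits of the 20-byte address already
  # pick a near-uniform partition.
  def key(<<_::binary-size(18), low::unsigned-integer-size(16)>>),
    do: rem(low, @partition_count)
end
```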
This function assigns each address in the system to one of the partitions. Next, let's compute a mapping and then compute the mapping

**Additional data in RocksDB**

For each of

**How to filter utxos owned by address**

This is the functionality behind the `account.get_exitable_utxos` endpoint.

*Given*
*Result*
*Algorithm*
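A minimal sketch of this lookup, reusing the `Partition.key/1` sketch above; `fetch_partition` and `fetch_utxo` are hypothetical stand-ins for the real storage calls:

```elixir
defmodule ExitableUtxos do
  # Lookup path: one partition read, then point lookups by position.
  def by_owner(address, fetch_partition, fetch_utxo) do
    address
    |> Partition.key()
    |> fetch_partition.()
    |> Enum.map(fetch_utxo)
    |> Enum.filter(fn
      # stale position of an already-spent utxo: skip (cleaned up lazily)
      nil -> false
      utxo -> utxo.owner == address
    end)
  end
end
```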
**How to keep ownership lists up-to-date**

We need to update each ownership list with newly created utxos just after a new block is consumed by the Watcher. DB updates with newly created utxos as well as spent utxo positions are kept up-to-date in

If we'd like to update partitions while applying the block, we are missing information about the spent utxo's owner (which is available only when the transaction is executed on state). However, keeping spent utxo positions in the partition is not much of an issue, because we then need to retrieve the utxo from the state to check the owner anyway, so it won't introduce more burden than consuming additional space. What's more, usually one of the transaction's outputs will contain the owner of the spent utxo, so in most cases we will manage to clear spent positions.
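A sketch of how the per-partition update list could be built while a block is applied; the `{:put, ...}` tuple shape and helper names are assumptions for illustration, not the actual OMG.DB schema:

```elixir
defmodule PartitionUpdates do
  # new_utxos: [{utxo_pos, owner_address}] created by the applied block.
  # fetch_partition: reads the current position list for a partition key.
  def from_block(new_utxos, fetch_partition) do
    new_utxos
    |> Enum.group_by(fn {_pos, owner} -> Partition.key(owner) end)
    |> Enum.map(fn {partition_key, entries} ->
      added = Enum.map(entries, fn {pos, _owner} -> pos end)

      # Spent positions are intentionally left in place: lookups hit the
      # state to check ownership anyway, so stale entries only cost space.
      {:put, :utxo_partition, {partition_key, fetch_partition.(partition_key) ++ added}}
    end)
  end
end
```
|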
Hey @pnowosie good stuff! Are there any drawbacks? |
Results are very promising.
|
I have a couple of questions I can't answer from the above, @pnowosie. What number of partitions is this aiming at? The idea is to have partitions evenly spread across keys, because if every key takes a fair share (let's say 10 keys), then in theory you should access the DB 10x faster. But if partitions are not evenly spread, then you can get skewed partitions (in our case, you'd have a large number of UTXOs under one key and fewer under others). Is this addressed? |
I'll work on describing the solution more clearly in the post above (in "Solution draft").
I'd say in the proposed solution there are 2 kinds of parameters:
Both are related to the protocol bottleneck, which is exitability, so the constraint is the total number of UTXOs. I don't have the answer to the above points 1 & 2; I stated my assumption that the number of utxos belonging to each address is small (which should also be the account owner's focus). This points me to additional drawbacks I missed:

🔗 **More drawbacks**
Actually it is much faster than iterating through the entire UTXO set and filtering by owner, because we fetch utxos by key (

🙏 Thank you for the constructive feedback ❗ |
BTW, it would be nice to have some documentation or investigation on the impact if the UTXO set actually goes over that. I think there are two possibilities for this to happen:
|
I don't quite get ... why don't we partition UTXOs generally in the Watcher? I don't think it's viable to maintain two UTXO ledgers the way this is suggesting (partitioning just for this endpoint and having the utxo set under the RocksDB key
It's not the UTXO set that is partitioned here; I'm introducing partitioning of the address space to allow quick(er) filtering of UTXOs by address (for the sake of the mentioned endpoint).
|
Pasting my inputs from Slack:
|
Let me provide a brief recap of the possible approaches and the discussion we had today:
The core problem with this endpoint currently is that when you're looking up UTXOs for an account, there's a lot you need to scan (imagine 10M utxos, as mentioned previously).

In addition to the linked text and paragraph:

Work impact:
The flow is as follows.
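The concrete flow isn't spelled out here; purely as an illustration of the event-based idea discussed next, a subscriber process could fold applied blocks into an owner index. All module and message names below are invented for the sketch:

```elixir
defmodule AddressIndexer do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

  @impl true
  def init(index), do: {:ok, index}

  # Assumes some publisher sends {:block_applied, block} after each block.
  @impl true
  def handle_info({:block_applied, block}, index) do
    new_index =
      block.transactions
      |> Enum.flat_map(& &1.outputs)
      |> Enum.reduce(index, fn out, acc ->
        Map.update(acc, out.owner, [out.utxo_pos], &[out.utxo_pos | &1])
      end)

    {:noreply, new_index}
  end
end
```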
It seems like a very cool approach, but the current codebase is very much not event-based and does not provide much reliable pub/sub. So the work impact is quite substantial (refactoring and then implementation). He he, brevity was a lie. |
@InoMurko Thanks for the summary! 💪 The summary looks good to me. While the research on the 3rd option is happening, I can explore the 4th option from the usage side. So, some follow-up questions for the 4th option:
|
**Update to point 3 of the recap**

First I'll describe a proposition which is a step back from the partitioning proposition described above. It's a simple additional mapping.

**Proposition 0**

Let's introduce a new dictionary.
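A sketch of the intended shape and read path; the key layout and the `db_get` helper are assumptions for illustration:

```elixir
# Assumed layout: the key is the owner address, the value a position list.
#   {:utxos_by_owner, owner_address} => [utxo_pos_1, utxo_pos_2, ...]
defmodule Proposition0 do
  # One key read returns every position the address owns; no scan of the
  # whole UTXO set is involved. `db_get` stands in for the storage call.
  def exitable_utxo_positions(address, db_get) do
    case db_get.({:utxos_by_owner, address}) do
      {:ok, positions} -> positions
      :not_found -> []
    end
  end
end
```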
The key is a 20-byte Ethereum address and the value contains a list of all utxo positions owned by that address.

**Storage considerations**

Because a utxo has a single owner, the above
**Performance considerations**

We can assume that fetching the utxo list by address is as fast as fetching the
Given that a full block of 65,535 transactions with 4 outputs each is consumed (262,140 new outputs), just fetching the data to update
Please mind this is a theoretical worst-case forecast; it would be worth checking in practice. Switching from the in-memory UTXO set to a RocksDB-backed one does not impact performance significantly.

👐 **Optimization ideas**

You might ask why the above calculation considers only outputs (newly created utxos we need to add to owner addresses). What about spent inputs: shouldn't we remove them from the list as well? Not really; what matters is adding the new utxo positions. Spent ones won't cause much trouble besides occupying additional disk space, and the corresponding utxos will be deleted from the UTXO set anyway. Further clean-up can be done in a background job.

👍 **Gains & Drawbacks** 👎

*Drawbacks*
*Gains*
I need to emphasize that the separate mapping for addresses does not impact the ability and performance of retrieving utxos by position. This is going to change... 👇

**Conclusion**

We can see that the approach described in Proposition 0 will address the issues of the

Also, we have already shown how to further optimize towards the partitioning approach described above. A similar additional mapping was also implemented in ERC-721 to retrieve tokens owned by an address.

**Option (kind of sharding approach)**

I do not recommend the following approach; it's described here to demonstrate that we can further merge the UTXO set and Proposition 0's

**Dataset construction**

Let's create a dictionary.
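A sketch of the merged shape (assumed, for illustration only): the owner address keys full utxo data, so the UTXO set and the ownership mapping become one dataset:

```elixir
# Assumed layout:
#   {:utxos_by_owner, owner} => [%{pos: utxo_pos, currency: c, amount: a}, ...]
#
# Fetching by owner stays a single key read, but fetching by position now
# means scanning owners, which is what the next paragraph warns about:
defmodule MergedDataset do
  def utxo_by_position(pos, all_owner_lists) do
    all_owner_lists
    |> Enum.flat_map(fn {_owner, utxos} -> utxos end)
    |> Enum.find(fn utxo -> utxo.pos == pos end)
  end
end
```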
Further, we can apply the partition function to the address to switch to the partitioned approach. By constructing this dataset we will lose the ability to quickly fetch a utxo by its position. This means the existing data-fetching functionality needs to be rewritten.

**Conclusion**

It's obvious that this dataset will resolve the performance issues of the

**How it would serve the other scenarios**
... I'm not going to analyze the other scenarios. It is obvious that this approach lacks flexibility and will not allow us to address future requirements. |
Regarding the
|
Recap from conversation off-:octocat:
Yes, with the proposed dataset we can later reduce the addresses to only the tracked ones, to save disk space on instances where there is such a need. Also, our global Watcher deployment should continue to handle all addresses.
Good point 👍 👏. What I've learnt about Bitcoin partitioning strategies, which are based on partitioning either the address space or the utxo set, is that the goal is to split the large network into smaller pieces (shards). Shards then operate separately, and participants need to store only the shard's data to validate the subnetwork (though miners still need to keep the entire network's data to mine blocks). |
Sharding is another name for partitioning. So when I mentioned Bitcoin's sharding I of course didn't mean we would
The point was to see how they split the UTXO set into shards (partitions) so that they maintain optimal balance across these shards (partitions). Their shards (partitions) would inform the way we define the key (partitioning/sharding) function in:
So if you have more information about their sharding/partitioning, I would love to read it. Naturally, this would mean stripping out the Bitcoin complexity of validation and mining (since we could have a distributed network but NOT a decentralized network (there are no byzantine cases; all participants are honest)). |
One of the benefits of partitioning is parallelism. In our case that doesn't apply, because RocksDB is single-threaded. I actually think this proposal is what we should do: #1257 (comment). But instead of maintaining two "ledgers", we refactor the current storage to use just the partitioned one. |
How.....do you do that though? BTW, I also want to bring this out early, before any design is committed. In ALD,
If output type is a problem, wouldn't this work too? Now, this is under the assumption that what you're saying, @boolafish, means that a new output type means you have N UTXO sets, correct? Currently, adding a new output type means two UTXO sets: Payment type v1 and the new one, right?
Well it's pretty simple, I think, lol.
So for the example above, when you write to RocksDB, you write under 6 different keys (that is, if the hashing function redirects the addresses into 6 different shards (keys!!!)). I hope this makes sense. If not, perhaps I need a POC to explain it better? :D Or think about an example more.
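A tiny sketch of the idea, assuming a hash-based shard function (all names are illustrative):

```elixir
# Each utxo is written under one of 6 shard keys derived from its owner,
# so one logical UTXO set is spread across a fixed number of RocksDB keys.
shard_count = 6
shard = fn address -> :erlang.phash2(address, shard_count) end

utxos = [
  %{owner: "0xaaa...", pos: 1_000_000_000},
  %{owner: "0xbbb...", pos: 2_000_000_000}
]

writes =
  utxos
  |> Enum.group_by(fn utxo -> shard.(utxo.owner) end)
  |> Enum.map(fn {shard_id, shard_utxos} ->
    {:put, {:utxos_shard, shard_id}, shard_utxos}
  end)
```
|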
As discussed on Slack, the problem is that ALD does not assume an "owner address" definitely exists for every output type. It is up to each output type to define its own data structure; ALD only asks for an output type field to enable dispatch to different modules/code/data. It does not mean we cannot shard by address, just that we need to design the DB to first dispatch by output type, and then each output type can use its own sharded DB.
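In other words (a sketch under the same illustrative assumptions as above), the storage key could dispatch on output type first and shard second:

```elixir
# First level: output type; second level: address shard. Output types that
# define no owner simply use a different second-level scheme (or none).
db_key = fn output ->
  case output do
    %{output_type: :payment_v1, owner: owner} ->
      {:payment_v1, :erlang.phash2(owner, 6)}

    %{output_type: other_type} ->
      # non-address-based output types get their own keying scheme
      {other_type, :unsharded}
  end
end
```
|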
I'll start with Proposition 0 and then we can consider moving to the next step...
I think I already elaborated that I have found no single-ledger solution. Can you describe how we'd proceed with the refactor? |
Can you pls repeat what the problem is with just one? |
Here is my recap of the Bitcoin (or, more specifically, UTXO-based ledger) sharding techniques, which will hopefully help answer the comment. As @InoMurko stated in the mentioned comment, sharding is a huge research space which also needs to address security concerns like shard takeover (similar to a 51% attack), cross-shard transactions, and so on.

**UTXO ledger sharding techniques**
This is quite an interesting approach, but it also has its drawbacks. It's interesting because it theoretically leads to a balanced utxo distribution. Also, because a transaction usually spends a single owner's funds, all inputs should land in the same partition. What's more, it allows taking advantage of usage patterns, e.g. addresses that initiate a large number of transactions. The drawbacks lie in the payment transaction structure, which identifies inputs by utxo position, while the owner can only be identified from the transaction hash and the signature (by computationally expensive recovery of the owner address). This means a large and risky refactor of the Watcher codebase.

*Readings*
This approach isn't something we can implement immediately. It assumes that a utxo is identified by the pair:

As a simple idea for ledger partitioning, given the utxo position we're using now, we might consider

*Readings*
During the investigation I also briefly skimmed the following papers. They focus on full sharding details, which are of limited usefulness in our case. To name a few sharding protocols working on the UTXO ledger model: RapidChain, OmniLedger, Elastico...

*Papers*
|
Relates to #1274 (in the sense that after fixing this issue, the other issue regarding utxo retrieval still remains to be fixed).
RE: #1256 (comment)
OMG.DB.utxos()
should not be used directly for serving utxo listings due to performance issues.