Quantifying CoinJoin inefficiencies #27
I think this isn't what we are looking for. What we are looking for is every normal-looking blockchain transaction that doesn't have anything to do with coinjoins, and from that set we should get percentages instead.
What does "right after" mean? I think this is somewhat messy, but I see that it makes sense. Maybe it should be a scale. A better question would be: among non-remixed UTXOs, how long did it take for them to be spent?
ACK. Very important.
i think both are in scope, but very different:
this question seems relevant, and i think more general, because right now users are using Wasabi with a certain distribution of input amounts. i also think it's potentially more than just a privacy leak to the coordinator, since it leaves an on-chain footprint. for example, input amounts of exactly 1.0 and 0.5 are very common in the data i looked at (from the beginning until sept 2019), presumably from unit bias, and this interacts with variable denominations to produce change output amounts. specific amounts created can also end up interacting with coin selection when users are remixing coins in the next denomination epoch, which can affect how much users pay in fees. i also believe (but have not yet been able to demonstrate) that this can be used for some privacy attacks. for example, suppose a user is now remixing after a denomination reset, or wants to mix change from a payment of a known amount, but chooses an input that is a bit larger than the optimal size for the now larger denomination - that allows an attacker to conclude that they are likely not one of the users who got a near-optimal change output from a previous mixing transaction. this is an issue because these change output sizes are not arbitrary: they are a function of the user's prior input and the denomination of the round that created it, i.e. a very strong temporal quasi-identifier.
It seems to me like there are several kinds of inefficiencies we might be concerned with:
and i'm not sure if efficiency is the right frame for privacy leaks, but all of these measures interact with it, potentially non-linearly: higher efficiency might reveal less information, so fewer quasi-identifiers to try and combine, but it might also stand out more. since Wasabi coinjoins are easily identified, this has more to do with the txs happening before a coinjoin, i.e. do TXOs that end up going into mixes have a non-typical distribution? and also with the txs happening after, both of which are constrained by the current round structure. framed more quantitatively: what is the degree of connectivity (e.g. average path length; maybe modular decomposition of graphs would be useful) between the coinjoin-related transaction subgraph and its complement in the entire tx graph?
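the "average path length" idea above could be sketched roughly like this. everything here is an assumption about data shapes, not existing code: `edges` is a list of `(spending_txid, funding_txid)` pairs and `coinjoin_txids` is the set of identified Wasabi coinjoin txids; distances are undirected hop counts via multi-source BFS.

```python
from collections import deque

def avg_distance_to_coinjoins(edges, coinjoin_txids):
    """Average undirected hop distance from each reachable non-coinjoin tx
    to the nearest coinjoin tx (multi-source BFS). Hypothetical metric
    sketch; unreachable txs are simply skipped."""
    # build undirected adjacency from the spend edges
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # seed BFS with all coinjoin txs at distance 0
    dist = {t: 0 for t in coinjoin_txids if t in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    non_cj = [d for n, d in dist.items() if n not in coinjoin_txids]
    return sum(non_cj) / len(non_cj) if non_cj else float("inf")
```

a small average would suggest the coinjoin subgraph is tightly connected to the rest of the tx graph; modular decomposition would need a proper graph library.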
You mean something along these lines? See Sergi's (@sr-gi) paper: Another coin bites the dust: An analysis of dust in UTXO based cryptocurrencies (https://eprint.iacr.org/2018/513.pdf)
Can you elaborate on this? How do you measure UTXO set size efficiency? What does it mean at all? How do you define it?
To me, block space efficiency means that many txs (especially payments) issued after CoinJoin txs could be squished into the CoinJoin itself... this could be quite easily measured.
Really interesting point. I'm not into optimization theory, but I guess this cannot be solved exactly in poly-time. Maybe you can approximate the best solution in poly-time, but I would suppose you cannot do this efficiently, given that the subset-sum problem can essentially be reduced to this optimization problem. Hence, I think this is out of scope... at least the optimal solution.
equal-valued coinjoin means that for an arbitrary amount you will typically end up with change, and then you will also typically end up with change if you later use such outputs for payments. overall, the lower the hamming weight of amounts, the more outputs need to be created to represent a given value ownership distribution.
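to make "hamming weight of amounts" concrete: reading an amount in satoshis as a binary number, its hamming weight is the number of set bits, i.e. the number of power-of-two denomination outputs needed to represent it exactly. a minimal illustration (the function names are mine, not from any existing codebase):

```python
def hamming_weight(sats: int) -> int:
    """Number of set bits in the satoshi amount, i.e. how many
    power-of-two denomination outputs an exact decomposition needs."""
    return bin(sats).count("1")

def pow2_decompose(sats: int) -> list:
    """Decompose an amount into power-of-two denominations,
    one output per set bit, smallest first."""
    return [1 << i for i in range(sats.bit_length()) if (sats >> i) & 1]
```

so an amount of 13 sats has hamming weight 3 and decomposes into outputs of 1, 4 and 8; hamming-weight-1 amounts are the single powers of two.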
yes, and it pertains to the UTXO efficiency i mentioned as well: the same situation that makes you end up with arbitrary amounts that are unlikely to be equal to your prior amount or the amounts you eventually need to spend, thereby increasing your average UTXO set size per wallet, also makes it so that you create and destroy more intermediate outputs, especially if you are trying to avoid linking e.g. change outputs. currently bitcoin privacy almost always requires a tradeoff in block space (or, mostly equivalently, fees paid). there is a perverse incentive to sacrifice privacy (and therefore also hurt fungibility) when trying to reduce transaction costs, so it would be nice to try and quantify this overhead.
if the denomination is fixed then the optimal solution is trivial: equal-valued coinjoins with no change outputs. this is the premise underlying whirlpool's tx design and fee structure, which i think makes a lot of sense since it also enables and incentivizes large graphs of coinjoins. for arbitrary denominations, only using amounts of hamming weight 1 seems hard to beat for fungibility, but obviously adds a large overhead (a roughly constant factor easily quantified by the hamming weight distribution of all outputs). however, arguing that that is optimal requires some sort of model/framework, and if that model also takes into account block space/fees then, like you said, it becomes non-linear and seems likely to be NP-hard. ideally, whatever tx structure we propose should align privacy and block space; these goals are theoretically very compatible: the fewer bits you write to the blockchain, the less there is to analyze in order to break your privacy. unfortunately in practice this is very far from being the case, but i don't think it's unrealistic. i think a pragmatic approach would be to aim for some robust minimal level of privacy by construction, and attempt to approach a pareto frontier of privacy and block space efficiency as a combined objective, but actually favoring block space efficiency (not privacy per block space efficiency), because making the tradeoff less painful seems more likely to change individual behaviour, and that will have compounding effects for privacy & fungibility that are positive sum.
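the subset-sum connection mentioned above can be made explicit: avoiding a change output means finding a subset of the wallet's UTXO values summing exactly to the payment target. a brute-force sketch (hypothetical helper, exponential time, which is exactly why the exact optimization is considered intractable):

```python
from itertools import combinations

def exact_subset(utxo_values, target):
    """Find a subset of UTXO values summing exactly to `target`,
    so the spend needs no change output. Brute force over all
    subsets: O(2^n), illustrating the subset-sum hardness."""
    for size in range(1, len(utxo_values) + 1):
        for combo in combinations(utxo_values, size):
            if sum(combo) == target:
                return list(combo)
    return None  # no changeless spend exists
```

in practice wallets use heuristics or bounded search for this (branch-and-bound style coin selection) rather than the exact exponential search.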
oops, forgot to bring all that rambling back to this issue - the point is we should try to measure this tradeoff, to try and estimate how much privacy wallet users are paying relative to privacy nihilists, and to what extent these function as separate economies (i.e. do those users subsidize the nihilists' privacy in some way?)
I'm making the decision on what should be in the serialized transactions. Before that, just to reiterate what kind of raw data I'm going to collect from the blockchain:
A transaction object would contain:
Is there anything else we care about in a transaction? It's very important to decide on these now, because it'll likely take a week to parse the blockchain to gather all this initial information. (Although the software will finish overnight, I expect it to fail, produce incorrect data, or for something else to come up that forces me to restart the run a few times - that's why my one-week estimate.)
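as a strawman for discussion, the serialized transaction object could look something like the record below. the exact field list is what this thread is deciding, so every field here is an assumption, not the final schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TxRecord:
    """Hypothetical serialized transaction record for the scan.
    Fields are illustrative assumptions, not the agreed schema."""
    txid: str
    block_height: int                # preserves on-chain ordering for metrics
    raw_hex: str                     # full tx hex; inputs/outputs recoverable
    is_wasabi_coinjoin: bool = False # flag set by the coinjoin detector
    input_txids: List[str] = field(default_factory=list)  # for walking tx chains
```

keeping the raw hex covers anything recoverable later (as noted below for script data), at the cost of disk space; the derived fields exist only to avoid re-parsing during metric runs.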
Unfortunately it needs to be. There are dependencies on exploring whole tx chains. For example, if you don't know all the Wasabi transactions, then you won't know which inputs of a specific Wasabi transaction come from other CoinJoin transactions.
I think you misunderstand it. The above data would be collected to speed up further data collection. I didn't mean that after this data collected we will never make a
That's in the transaction hex, so that's covered.
Would be faster and more convenient, but it will ultimately depend on whether I'll have enough free disk space :) For the record, asking for all the Wasabi txs by txid takes 12 minutes for me with an over-optimized algorithm that loses the order of transactions, which is a huge problem for some metrics. So it'd be really good to keep these on disk instead, otherwise we'll spend hours collecting statistics.
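since order matters for some metrics, the on-disk store can be as simple as an append-only file of txids in block order. a minimal sketch (the file layout and function names are assumptions for illustration, not the actual implementation):

```python
def append_txids(path, txids):
    """Append txids in discovery (block) order, one per line,
    so the file itself preserves transaction ordering."""
    with open(path, "a") as f:
        for txid in txids:
            f.write(txid + "\n")

def load_txids(path):
    """Read the txids back in the same order they were written."""
    with open(path) as f:
        return [line.strip() for line in f]
```

later metric runs can then iterate the file instead of re-querying the node, which is where the hours-vs-minutes difference comes from.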
ah, so this is more like an index? in that case yeah this all sounds right to me
I'm ready. Starting the algorithm and praying it won't bloat my disk or be too slow. Otherwise I'll have to optimize. |
Unfortunately it didn't work out. I encountered multiple issues, I'll work on them.
I think it should suffice to get a rough picture. We will not get the whole picture, but a rough lower bound could still be established for the number of "immediate post-mix payments".
Does your script fetch entire blocks containing mixing transactions? Can't you just fetch specific txs by transaction id? For Wasabi txs, I suppose you have all the txids, hence you could just fetch them directly without downloading the whole block. Personally, I would only be interested in Wasabi mixing txs (+ post/pre-mix txs), at least in the very short term; maybe we could extend our scope later.
Good work guys. I think it would be sufficient to have only Wasabi txs for now, and we already have all the txids, so this would make it a lot more efficient. Would still need to get the post-mix txs though...
Index is done and some checks suggest it's correct. I'm moving on to the metrics. |
LOL, I found this comment sitting here unsent for days. By tomorrow, hopefully all the stats I described will be published: the problem that WabiSabi solves, and what statistics can prove that problem exists.
We'll gather other statistics when we want to decide on denominations. I think the other suggestions here should come at that phase. It seems to me that most of @nothingmuch's suggestions are comparative metrics and not standalone ones. So we can score coinjoins and compare them to the new WabiSabi protocol once denominations are figured out, but for now it doesn't make sense to compare Wasabi, Scamourai and other coinjoins to each other. That's separate research.
For a proper research paper we need quantified arguments about CoinJoin inefficiencies. Our goal is that WabiSabi will enable us to somewhat alleviate these inefficiencies.
These are the statistics we were discussing with @nopara73 and @nothingmuch:
The number of input UTXOs in a coinjoin that have less value than the minimum denomination. In current CoinJoin implementations, if these UTXOs are merged, the coordinator can link them. This is a privacy leak we will be able to eliminate with WabiSabi's input merging.
The number of output UTXOs that were redeemed for payment purposes right after the CoinJoin tx. We could use the heuristic that every tx with 2 output UTXOs is a payment. This might be a good approximation of the payment transactions using CoinJoin outputs. With WabiSabi these payments could happen in the CoinJoin tx itself, therefore saving considerable blockspace for the whole community.
Calculating the ratio/number of unmixed change outputs in CoinJoin txs.
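the first two statistics above can be sketched as simple counting passes. the data shapes are assumptions for illustration: a tx is a dict whose "inputs" list holds `(prev_txid, index, value)` triples and whose "outputs" list holds output values.

```python
def count_small_inputs(coinjoin_tx, min_denomination):
    """Statistic 1: inputs worth less than the minimum denomination,
    i.e. the UTXOs that currently must be merged linkably."""
    return sum(1 for _, _, value in coinjoin_tx["inputs"]
               if value < min_denomination)

def looks_like_payment(tx):
    """Statistic 2 heuristic from above: treat any tx with exactly
    two outputs as a payment (payment + change). This is a rough
    lower-bound approximation, not a precise classifier."""
    return len(tx["outputs"]) == 2
```

running `looks_like_payment` over the txs that spend coinjoin outputs within some small block window would give the "immediate post-mix payments" lower bound discussed earlier in the thread.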
Please feel free to add below or edit this issue if you have any meaningful idea we should measure!