Modify to reduce data overhead and migration complexity #200
Note: the registry characteristic shares some similarity to what I described in #54, though in that case I proposed that they should aggressively expire/change as newer versions ship. |
The problem is this is adding more passive entropy to the User-Agent header and we're trying to accomplish the opposite with UA-CH. I appreciate the effort that went into this issue, but I think this is incompatible with the stated primary goal of this API. |
@miketaylr I was not aware the goals were ranked in order. Perhaps the document could be updated to clearly rank the relative priority and justification for each goal? In any case I'm unsure how this proposal is not compatible with the goal associated with so called "fingerprinting". Perhaps you can provide more detail? The keys could be subject to the same rules as the Sec-UA-CH-* headers. In raising the proposal I was mindful to avoid this contentious issue. See point 6.
Once migrated from the current User-Agent value string construction there is no difference in regard to entropy. I observe that this proposal adds entropy when deployed today. The proposal reduces the data overhead of any solution which improves the structure associated with the transmission of user agent data. As demonstrated when widely deployed the data savings are significant across the web. It also provides a simpler path to migration for the vast majority of web participants. A browser vendor can make a change in days that will result in 10s of millions of hours of developer effort for the ecosystem. I hope we can agree these stakeholders are significant and not represented in debates at W3C and WICG. There is agreement to improve the structure of the values. GREASE does not assist that. I would prefer a working group with scheduled meetings under the W3C process to progress these changes. However there is no consensus for proposals like UACH that seek to prevent future crime (think Minority Report dystopia). The UACH proposal was advanced without considering methods that might be used to detect and sanction bad actors committing crimes. Instead all use cases are impaired including those that support among others, analytics, performance, optimisation and fraud identification. These services are often provided by a vibrant supply chain of small businesses who enable smaller businesses to band together to compete with larger businesses. I have explained these issues in this thread last year. No evidence has been provided to support the UACH proposal or the position of TAG. A note on use of language. Words like "fingerprinting", "abuse", "threat" and "attack" have been adopted over a prolonged period of time within this community. These words apply neuro linguistic programming (NLP) and make the reader think of crime. Could we not move to adopt engineering words like "probabilistic identifier" instead of "fingerprinting" in future? 
This is particularly important as the debate is taken to a wider audience and we need to embrace the benefits of technologies, not just the minority of harms they might enable. |
The key distinction is active entropy (sent when requested) vs passive entropy (sent for all requests). Can you explain how it adds entropy? I'm not sure I follow that point. |
I'd be interested in seeing the justification for the goals also, particularly around mitigating the passive fingerprinting that is supposedly so widespread. As far as I can tell no one has provided any evidence for this to date. I raised this same point in the TAG review of the proposal to freeze the user-agent string. The response was to link to the EFF's Panopticon which 1) mixes both active and passive signals and, 2) says nothing about how widespread passive fingerprinting is, just that it is possible. If this practice is so widespread and harmful there ought to be evidence for it. It feels like a lot of work is being undertaken to fix an unproven problem. Also, if browser makers want to reduce the entropy available for fingerprinting they already have a perfectly good means for doing so—by changing their user-agent header to something simpler, an approach already adopted by Mozilla etc. |
Since Chrome 89 was released it's now possible with a high degree of certainty to detect the Brave web browser due to the absence of the "sec-ch-ua" header field name when a User-Agent field value is present that is identical to the one included in Chrome. |
Vivaldi also does not send "sec-ch-ua" headers and has a User-Agent string that is identical to Chrome. Vivaldi and Brave are indistinguishable from each other in this manner. |
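The detection described above can be sketched as a small header check. This is an illustration only: the function names, the bucket labels, and the rough "Chrome-like" test are assumptions made for the example, not anything defined by the UA-CH specification or by any browser.

```python
from typing import Mapping


def has_chrome_like_ua(user_agent: str) -> bool:
    """Very rough check for a Chrome-style User-Agent value (illustrative assumption)."""
    return "Chrome/" in user_agent


def classify(headers: Mapping[str, str]) -> str:
    """Bucket a request by whether a Chrome-style UA arrives with or without sec-ch-ua."""
    # HTTP field names are case-insensitive, so normalise before looking them up.
    normalized = {name.lower(): value for name, value in headers.items()}
    if not has_chrome_like_ua(normalized.get("user-agent", "")):
        return "other"
    if "sec-ch-ua" in normalized:
        return "chrome-like-with-hints"
    # Per the comments above, Brave and Vivaldi would land in this bucket.
    return "chrome-like-without-hints"
```

A server that logs the returned label would see one population split into two, which is the effect the preceding comments describe.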
Overall entropy has increased across the web because as a minimum we now have three web browsers that prior to the release of Chrome 89 appeared identical when HTTP headers were inspected. Now two of those three web browsers appear differently to the other. Also as Google opted web browsers at random into the trial prior to Chrome 89 entropy for all Chrome users increased. |
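To make the entropy point concrete, here is a minimal sketch that computes the Shannon entropy of what a server can observe before and after the split. The population shares are hypothetical placeholders chosen only for illustration; they are not measurements of real browser usage.

```python
import math
from typing import Dict, Iterable, List


def shannon_entropy_bits(shares: Iterable[float]) -> float:
    """Shannon entropy, in bits, of a distribution given as probabilities."""
    return sum(-p * math.log2(p) for p in shares if p > 0)


# Hypothetical shares among these three browsers only (NOT real data).
HYPOTHETICAL_SHARES: Dict[str, float] = {"Chrome": 0.90, "Brave": 0.06, "Vivaldi": 0.04}


def observed_entropy(groups: List[List[str]], shares: Dict[str, float]) -> float:
    """Entropy of the classes a server can tell apart; each group looks identical."""
    return shannon_entropy_bits(sum(shares[name] for name in group) for group in groups)


# Before Chrome 89: all three present identical headers, so there is one class.
before = observed_entropy([["Chrome", "Brave", "Vivaldi"]], HYPOTHETICAL_SHARES)
# After Chrome 89: Chrome sends sec-ch-ua, the other two do not, so two classes.
after = observed_entropy([["Chrome"], ["Brave", "Vivaldi"]], HYPOTHETICAL_SHARES)
```

With any shares of this shape, `before` is effectively zero and `after` is positive, because a single indistinguishable class has been split into two distinguishable ones.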
Sure, I'm happy to improve the spec in these areas, and link out to other guidance such as https://w3c.github.io/fingerprinting-guidance/ (which links out to some academic research, but there's plenty more out there).
@ronancremin can you clarify what Mozilla has done to simplify their UA string? |
@miketaylr the document referenced, and the documents it references, explain that probabilistic identifiers (aka fingerprints) exist, but contain very little to support the case for harms actually occurring as a result of the existence of probabilistic identifiers. I believe it was this evidence @ronancremin and I were asking for. In order to keep this document short I would prefer to see the actual case for harms occurring being made rather than the theoretical risk. Even if harm is occurring, which there undoubtedly will be whilst people have free will, we will need to consider remedies that help identify these harms and bring the perpetrators to justice according to the laws set by law makers rather than WICG or the W3C. For example, helping data protection regulators more easily identify the bad actors causing these harms would be a remedy to the problem. No consideration has been given by the proposers of this solution to such remedies. At the moment this document does the equivalent of stating that knives exist, identifying they could be used to threaten or injure people, before concluding the only remedy is to ban all knives without considering how we can deter people from using them for harm, and without recognizing they can be used to prepare and consume tasty meals and that this is the primary use. |
If the implication is that a knife is the User-Agent string in this metaphor, I don't think it works. UA-CH doesn't ban it, it aims to provide an alternative mechanism that doesn't suffer from the same privacy issues (passive entropy collection), and ideally a nicer API that still allows for default low-entropy collection and active higher-entropy collection in a way a user agent can observe and apply interventions on, if it so chooses. Is that a knife sheath? An edge guard? (It's not important, I don't think the metaphor works in any case). But we've veered into unproductive discussion that falls outside of the scope of the UA-CH API, and away from the original idea proposed in this issue. As to the original issue, it doesn't align with the stated goals of the API, so I'm inclined to close it. But thanks for opening it and starting the discussion. |
@miketaylr As you have opened a new issue to progress adding the justification for the proposal to the document I ask that you either re-open this issue so that the merits of the original issue can be debated, or explain exactly how the original issue does not meet the stated goals of the API. I believe the original issue does meet all the stated goals as explained earlier in this thread and also provides additional benefits concerning migration complexity and reducing the cost for the web ecosystem to migrate. I took care not to include the issue of justification for the proposal in the original issue. I also note that it was in answering your comment in relation to the stated goals that led to the supposed unproductive discussion. I and others answered you in good faith and we now have a new issue #214 to continue that discussion without encumbering this one. Thank you for separating them. |
Sure, as stated in #200 (comment), the proposal in this issue adds passive entropy to the UA string. The 2 goals of UA-CH are to add a new API that reduces default passive entropy, and to improve upon the UA string. The spec is pretty up front that the UA string isn't a great API. The current text in the spec abstract states:
Your proposal is essentially:
So, again, adding passive entropy to the UA string is out of scope for UA-CH, and not aligned with the goals of the spec. |
Thanks for the link. I scanned through all of the documents cited in the Research section of that page. These research papers did not demonstrate anything about how widespread passive fingerprinting is; they are almost entirely focused on active fingerprinting methods such as font enumeration, plugins, canvas etc. This only further highlights the oddness of the UA-CH focus on passive fingerprinting. Yes, there is real evidence for active tracking, and a browser's UA string can contribute entropy to that, but Google's own research suggests that active tracking is so much more powerful in its ability to fingerprint a user that a browser's UA string seems meaningless by comparison. From Google's Picasso research (which uses only canvas techniques): "Our JavaScript implementation of Picasso, when properly configured using the right graphical primitives, is able to successfully distinguish the browser family (Chrome, Firefox, etc.) and the OS family (Windows, iOS, OSX,etc.) of over 52 million clients with 100% accuracy."
On mobile devices Firefox opts to leave out any model identifier tokens such as "SAMSUNG SM-G920F" from the UA string, unlike Chrome, Samsung Browser etc. Given that there are at least tens of thousands of distinct Android models alone in the market this has a very meaningful impact on the entropy contained in the UA string. Recall that the User-Agent header is a SHOULD requirement in the HTTP RFCs: browser makers aren't obligated to send anything at all, yet they choose to because overall it is beneficial to their users and the web ecosystem. Browser makers are well-placed to understand their users' needs and some have already made changes to increase user privacy without requiring a new spec to do so. The biggest improvements to user privacy have been made by defaulting third-party cookies to off and other active tracking prevention systems such as those included in Safari and Firefox to counter more sophisticated tracking approaches. |
@cwilso @LJWatson @travisleithead @yoavweiss @dontcallmedom - as WICG chairs and W3C representative for community groups, please could you re-open this issue so that those participants who feel it has been closed prematurely can signal to the wider community that we consider it unresolved? I understand @miketaylr will be OOO for a few days, hence the request to you. @miketaylr welcome back. Let me clarify. There is a difference between the data model used to communicate the information, and the conditions under which that information can be communicated. I only addressed the former aspect, the data model, in this issue. I raised the engineering impact and cost for the existing ecosystem associated with the proposed data model, and the comparative inefficiency of the proposal's data model. Now that Chrome 89 contains the experiment many web participants are appreciating the impact first hand. Consider transmitting the platform version value. In this proposal the information would be appended to the User-Agent string in square brackets with a short prefix described in a registry. For example;
In the document as currently written it might appear as a new header as follows.
Sec-CH-UA-Platform-Version: "Platform Version";v="75.6", ";Not A Something";v="99"
This will require those that record this information to change their data model, involves more data being consumed, and the associated costs for both (developer effort and data costs). The proposed change does not require a change to the Accept-CH header, or any other method that would be used to determine the presence of the information. That is unrelated. I do have separate concerns over the need for those requirements of the UACH proposal. These are now raised in #214 and #215. That should not prevent this issue being debated purely from the perspective of efficiency and cost to all web participants, many of whom are not represented directly here. |
Reading through the comments, it appears that the solution proposed for this issue would not be compatible with the stated goals of the API. The justification for this was clearly given, and so I do not see a reason to reopen this issue. If it is felt that an API with a different set of goals is worth exploring, then forking and/or starting an alternative incubation is always an option. |
@LJWatson Thank you for taking the time to review the issue and respond. Please could you refer me to the clear justification? I'm unable to identify it. My analysis follows. Reading the 17th March 2021 version of the document I’ve been able to identify the following goals from the abstract.
The specification should be able to relate all its requirements to these goals. If this is not done the requirements are merely ancillary and not of importance. After the Abstract the document goes on to describe a solution where the information associated with content negotiation would appear in HTTP headers additional to the The content negotiation is performed using a further HTTP header named By requiring the server to request the additional information that information is no longer sent with every request in the The proposal retains the There will be a phase of parallel operation prior to implementers reaching a point where the This issue was raised so that this process can be optimised to reduce the amount of data required to achieve the stated goals and importantly reduce the migration complexity for the majority of web participants. I therefore conclude that there is a goal I’ve been unable to identify. Could the reasons for closure relate to the implementation consideration suggesting removal of the If this is true then the abstract should change to clearly state a goal of the proposal is to remove in its entirety the @LJWatson @miketaylr Please could you confirm this interpretation is correct? I will then raise an issue to modify the goals to make this clear and a further issue to include the evidence to justify removing the My preference is for this proposal to graduate to a W3C Working Group so that the W3C consensus based process can be applied. However @miketaylr does not believe the proposal is sufficiently advanced for this to commence as open issues remain. See #176. For the avoidance of doubt multiple WICG participants are requesting evidence from the proposer concerning the real harms, not theoretical risk, associated with passive probabilistic identifiers associated with the |
I suggest not using ClientHint, but simply reducing the user agent Remove As a result, we get a new format: this is much less than with CH-hints and a bunch of headers. PS my ClientHint |
@sanchezzzhak thank you for the additional analysis. Perhaps you could ask WICG chairs to re-open the issue? Alternatively post a new issue with your analysis referencing this one? |
Current User-Agent header values require an average of 174 bytes including the HTTP header name and value to be included in every request.
To communicate the equivalent information via UACH would require an initial request of approximately 1444 bytes, and an additional 470 bytes to be sent on every subsequent request. The saving of the 174 bytes associated with the current User-Agent header field name and value would only be achieved once the User-Agent field and value are removed.
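The figures above can be turned into a small worked calculation. This is a sketch that takes the three quoted byte counts at face value; the function names are introduced here for illustration, and the sketch does not try to resolve whether the UACH figures already include the existing User-Agent header.

```python
# Byte counts quoted in this issue (approximate averages).
UA_BYTES_PER_REQUEST = 174       # current User-Agent field name and value
UACH_FIRST_REQUEST_BYTES = 1444  # initial request when using UACH
UACH_SUBSEQUENT_BYTES = 470      # every subsequent request when using UACH


def user_agent_bytes(requests: int) -> int:
    """Total bytes spent on the User-Agent header over a number of requests."""
    return UA_BYTES_PER_REQUEST * max(requests, 0)


def uach_bytes(requests: int) -> int:
    """Total bytes spent on UACH over a number of requests to one origin."""
    if requests <= 0:
        return 0
    return UACH_FIRST_REQUEST_BYTES + UACH_SUBSEQUENT_BYTES * (requests - 1)


def overhead_ratio(requests: int) -> float:
    """How many times more bytes UACH uses than the User-Agent header alone."""
    return uach_bytes(requests) / user_agent_bytes(requests)
```

For example, over 10 requests the User-Agent header accounts for 10 × 174 = 1740 bytes, while UACH accounts for 1444 + 9 × 470 = 5674 bytes.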
The following table shows some of the assumptions behind the numbers.
A more efficient method of representing the data has already been proven by Facebook. The following is an example of a User-Agent field value provided by Facebook applications when accessing the web.
Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone8,1;FBMD/iPhone;FBSN/iOS;FBSV/14.2;FBSS/2;FBID/phone;FBLC/fr_FR;FBOP/5]
The structured information is appended to the User-Agent value within square brackets. Semicolons and forward slashes are used to delineate keys and values and as such would be reserved characters in any specification change. This approach has been proven to work well alongside existing conventions.
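As an illustration of how little parsing this format needs, the following is a minimal sketch of a parser for the bracketed suffix. The function name and the regular expression are assumptions made for this example; they are not taken from Facebook's implementation or from any specification.

```python
import re
from typing import Dict

# Matches a trailing "[...]" block at the end of a User-Agent value.
_BRACKET_SUFFIX = re.compile(r"\[([^\]]*)\]\s*$")


def parse_bracketed_tokens(user_agent: str) -> Dict[str, str]:
    """Return the key/value pairs found in a trailing [KEY/value;KEY/value] block."""
    match = _BRACKET_SUFFIX.search(user_agent)
    if match is None:
        return {}
    pairs: Dict[str, str] = {}
    # Semicolons separate pairs; the first forward slash separates key from value.
    for item in match.group(1).split(";"):
        key, separator, value = item.partition("/")
        if separator and key:
            pairs[key] = value
    return pairs


example = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone8,1;FBMD/iPhone;"
    "FBSN/iOS;FBSV/14.2;FBSS/2;FBID/phone;FBLC/fr_FR;FBOP/5]"
)
parsed = parse_bracketed_tokens(example)
```

Here `parsed["FBSV"]` is `"14.2"`, and a User-Agent value without a bracketed suffix yields an empty dictionary, so existing parsers that ignore the suffix are unaffected.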
The approach has the following benefits.
The issue of increasing data overhead is raised in other issues including 155, 179, 153, 169, 152, 127 and 66.
This approach improves parsing issues and therefore relates to 196 to 156.
This analysis does not consider the additional overhead needed to implement proposals like Critical-CH discussed in issue 66.
The structure of the values could be moved out of the specification and into a registry, providing an element of future proofing and flexibility for proposers of different keys and values. This relates to issue 133. Assuming this proposal progresses to a W3C working group prior to mass deployment, the approach of using registries will aid in recommendation track work. Once established as a recommendation the working group would continue to administer the registry.