Modify to reduce data overhead and migration complexity #200

jwrosewell · 2021-02-10T12:45:37Z

Current User-Agent header values require an average of 174 bytes including the HTTP header name and value to be included in every request.

To communicate the equivalent information via UACH would require an initial request of approximately 1,444 bytes, and an additional 470 bytes to be sent on every subsequent request. The saving of the 174 bytes associated with the current User-Agent header field name and value would only be achieved once the User-Agent field and value are removed.

The following table shows some of the assumptions behind the numbers.

Request	Bytes	Header name and value	Description
First	144	Accept-CH: Sec-CH-UA-Arch, Sec-CH-UA-Model, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version, Sec-CH-UA, Sec-CH-UA-Full-Version, Sec-CH-UA-Mobile	Header needing to be set on the first request.
First	~1,300	Other headers	The other headers sent and received in the first request assuming there is no body involved.
Second and subsequent	~470	Sec-CH-UA-Arch: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA-Model: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA-Platform: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA-Platform-Version: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA-Full-Version: "Key"; v="Value", "Not a key"; v="Not a Value" Sec-CH-UA-Mobile: "Key"; v="Value", "Not a key"; v="Not a Value"	Use of GREASE requires at least two keys, the HTTP name and the key within the resulting value. Then there is at least one fake key and value included in the value.
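The figures in the table can be re-derived by measuring each header line as it would appear on the wire. The following is a minimal sketch: the helper name `header_bytes` is hypothetical, and it assumes HTTP/1.1 framing with a trailing CRLF, which the table may or may not have counted, so totals can differ by a byte or so per header.

```python
def header_bytes(name: str, value: str) -> int:
    """Size in bytes of one HTTP/1.1 header line, including ': ' and CRLF."""
    return len(f"{name}: {value}\r\n".encode("utf-8"))


# Hint names taken from the Accept-CH row of the table.
HINTS = [
    "Sec-CH-UA-Arch",
    "Sec-CH-UA-Model",
    "Sec-CH-UA-Platform",
    "Sec-CH-UA-Platform-Version",
    "Sec-CH-UA",
    "Sec-CH-UA-Full-Version",
    "Sec-CH-UA-Mobile",
]

# Placeholder GREASE-style value copied from the table; real values differ.
PLACEHOLDER = '"Key"; v="Value", "Not a key"; v="Not a Value"'

# Cost of asking for the hints once.
accept_ch_size = header_bytes("Accept-CH", ", ".join(HINTS))

# Cost of sending every requested hint on each subsequent request.
per_request_size = sum(header_bytes(hint, PLACEHOLDER) for hint in HINTS)

print(accept_ch_size, per_request_size)
```

Swapping `PLACEHOLDER` for values captured from real traffic would give a more representative per-request figure.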

A more efficient method of representing the data has already been proven by Facebook. The following is an example of a User-Agent field value provided by Facebook applications when accessing the web.

Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone8,1;FBMD/iPhone;FBSN/iOS;FBSV/14.2;FBSS/2;FBID/phone;FBLC/fr_FR;FBOP/5]

The structured information is appended to the User-Agent value within square brackets. Semicolons and forward slashes are used to delineate keys and values and as such would be reserved characters in any specification change. This approach has been proven to work well alongside existing conventions.
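To illustrate how little parsing the bracketed form needs, here is a minimal sketch. The function name `parse_bracketed_suffix` is hypothetical, and the sketch assumes the bracketed block is the last thing in the User-Agent value and that neither keys nor values contain a semicolon.

```python
import re

# Matches a trailing "[...]" block at the end of the User-Agent value.
SUFFIX = re.compile(r"\[([^\[\]]*)\]\s*$")


def parse_bracketed_suffix(user_agent: str) -> dict:
    """Return the key/value pairs found in a trailing [K/V;K/V] block."""
    match = SUFFIX.search(user_agent)
    if match is None:
        return {}
    pairs = {}
    for item in match.group(1).split(";"):
        # Split on the first "/" only, so values may contain further slashes.
        key, sep, value = item.partition("/")
        if sep and key:
            pairs[key] = value
    return pairs


facebook_ua = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
    "[FBAN/FBIOS;FBDV/iPhone8,1;FBMD/iPhone;FBSN/iOS;FBSV/14.2;"
    "FBSS/2;FBID/phone;FBLC/fr_FR;FBOP/5]"
)
print(parse_bracketed_suffix(facebook_ua))
```

A User-Agent value without a bracketed block yields an empty dictionary, so existing values continue to pass through unchanged.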

The approach has the following benefits.

1. ~106 bytes required to transmit information that currently requires 174 bytes. This assumes that over time the current User-Agent value is removed leaving the characters within the square brackets.
2. No changes needed to data models built around a single value for all information. This results in significant implementation effort reduction for the entire web eco-system. The precise saving and cost effort is yet to be estimated but could reasonably be considered many millions of developer hours globally.
3. The order of the items could be varied at random and avoid the need to include bogus keys and values with GREASE where parser compliance is a concern.
4. Removal of legacy characters would be achieved in a controlled and gradual fashion once consumers of the data have been given time to migrate to a revised parsing approach for the single string value.
5. This approach provides an opportunity to create a registry of User-Agent keys of two or three characters. For example; HWM = Hardware Model, HWV = Hardware Vendor, BV = Browser Version. The approach extends to other proposals that utilize Client Hints simply by extending the registry. For example; keys for memory (MEM or RAM), battery capacity (BC), or any other value.
6. This modification is independent of other issues that relate to the circumstances which are required to be met before the information is made available.
7. The approach deals with the structure of the User-Agent value in a space efficient way. This is one of the two primary problems the proposal is seeking to solve.
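The registry and random-ordering ideas in the list above could be combined along the following lines. This is a sketch under stated assumptions: the `REGISTRY` contents, the `serialize` helper and the sample values are all hypothetical, with only the short keys taken from the examples in the list.

```python
import random

# Hypothetical registry of short keys; a real one would be maintained
# outside the specification.
REGISTRY = {
    "HWM": "Hardware Model",
    "HWV": "Hardware Vendor",
    "BV": "Browser Version",
    "MEM": "Memory",
    "BC": "Battery Capacity",
}

# Characters that delimit keys and values and so cannot appear in a value.
RESERVED = set(";/[]")


def serialize(values: dict, rng: random.Random) -> str:
    """Render registered key/value pairs as [K/V;K/V] in random order."""
    items = []
    for key, value in values.items():
        if key not in REGISTRY:
            raise KeyError(f"unregistered key: {key}")
        if RESERVED & set(value):
            raise ValueError(f"reserved character in value for {key}")
        items.append(f"{key}/{value}")
    # Shuffling means a parser cannot rely on position, which is the
    # property GREASE otherwise enforces with bogus keys and values.
    rng.shuffle(items)
    return "[" + ";".join(items) + "]"


print(serialize({"HWM": "Pixel 5", "HWV": "Google", "BV": "89.0"}, random.Random()))
```

Extending the scheme to a new hint is then a registry entry rather than a new header name.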

The issue of increasing data overhead is raised in other issues including 155, 179, 153, 169, 152, 127 and 66.

This approach eases parsing and therefore relates to 196 and 156.

This analysis does not consider the additional overhead needed to implement proposals like Critical-CH discussed in issue 66.

The structure of the values could be moved out of the specification and into a registry, providing an element of future proofing and flexibility for proposals that need different keys and values. This relates to issue 133. Assuming this proposal progresses to a W3C working group prior to mass deployment, the approach of using registries will aid in recommendation track work. Once established as a recommendation, the working group would continue to administer the registry.

erik-anderson · 2021-02-11T16:50:11Z

Note: the registry characteristic shares some similarity to what I described in #54, though in that case I proposed that they should aggressively expire/change as newer versions ship.

miketaylr · 2021-03-12T20:02:18Z

A more efficient method of representing the data has already been proven by Facebook. The following is an example of a User-Agent field value provided by Facebook applications when accessing the web.

Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone8,1;FBMD/iPhone;FBSN/iOS;FBSV/14.2;FBSS/2;FBID/phone;FBLC/fr_FR;FBOP/5]

The approach deals with the structure of the User-Agent value in a space efficient way. This is one of the two primary problems the proposal is seeking to solve.

The problem is this is adding more passive entropy to the User-Agent header and we're trying to accomplish the opposite with UA-CH. I appreciate the effort that went into this issue, but I think this is incompatible with the stated primary goal of this API.

jwrosewell · 2021-03-12T21:04:27Z

@miketaylr I was not aware the goals were ranked in order. Perhaps the document could be updated to clearly rank the relative priority and justification for each goal?

In any case I'm unsure how this proposal is not compatible with the goal associated with so-called "fingerprinting". Perhaps you can provide more detail? The keys could be subject to the same rules as the Sec-CH-UA-* headers. In raising the proposal I was mindful to avoid this contentious issue. See point 6.

This modification is independent of other issues that relate to the circumstances which are required to be met before the information is made available.

Once migrated from the current User-Agent value string construction there is no difference in regard to entropy. I observe that this proposal adds entropy when deployed today.

The proposal reduces the data overhead of any solution which improves the structure associated with the transmission of user agent data. As demonstrated when widely deployed the data savings are significant across the web. It also provides a simpler path to migration for the vast majority of web participants. A browser vendor can make a change in days that will result in 10s of millions of hours of developer effort for the ecosystem. I hope we can agree these stakeholders are significant and not represented in debates at W3C and WICG.

There is agreement to improve the structure of the values. GREASE does not assist that. I would prefer a working group with scheduled meetings under the W3C process to progress these changes.

However there is no consensus for proposals like UACH that seek to prevent future crime (think Minority Report dystopia).

The UACH proposal was advanced without considering methods that might be used to detect and sanction bad actors committing crimes. Instead all use cases are impaired including those that support among others, analytics, performance, optimisation and fraud identification. These services are often provided by a vibrant supply chain of small businesses who enable smaller businesses to band together to compete with larger businesses. I have explained these issues in this thread last year. No evidence has been provided to support the UACH proposal or the position of TAG.

A note on use of language. Words like "fingerprinting", "abuse", "threat" and "attack" have been adopted over a prolonged period of time within this community. These words apply neuro linguistic programming (NLP) and make the reader think of crime. Could we not move to adopt engineering words like "probabilistic identifier" instead of "fingerprinting" in future? This is particularly important as the debate is taken to a wider audience and we need to embrace the benefits of technologies, not just the minority of harms they might enable.

miketaylr · 2021-03-12T21:54:51Z

Once migrated from the current User-Agent value string construction there is no difference in regard to entropy. I observe that this proposal adds entropy when deployed today.

The key distinction is active entropy (sent when requested) vs passive entropy (sent for all requests).

Can you explain how it adds entropy? I'm not sure I follow that point.

ronancremin · 2021-03-15T11:12:41Z

I'd be interested in seeing the justification for the goals also, particularly around mitigating the passive fingerprinting that is supposedly so widespread. As far as I can tell no one has provided any evidence for this to date.

I raised this same point in the TAG review of the proposal to freeze the user-agent string. The response was to link to the EFF's Panopticon which 1) mixes both active and passive signals and, 2) says nothing about how widespread passive fingerprinting is, just that it is possible. If this practice is so widespread and harmful there ought to be evidence for it.

It feels like a lot of work is being undertaken to fix an unproven problem. Also, if browser makers want to reduce the entropy available for fingerprinting they already have a perfectly good means for doing so—by changing their user-agent header to something simpler, an approach already adopted by Mozilla etc.

jwrosewell · 2021-03-15T12:17:23Z

@miketaylr

Can you explain how it adds entropy? I'm not sure I follow that point.

Since Chrome 89 was released it's now possible with a high degree of certainty to detect the Brave web browser due to the absence of the "sec-ch-ua" header field name when a User-Agent field value is present that is identical to the one included in Chrome.

amtunlimited · 2021-03-15T15:30:41Z

the absence of the "sec-ch-ua" header field name when a User-Agent field value is present that is identical to the one included in Chrome.

Vivaldi also does not send "sec-ch-ua" headers and has a User-Agent string that is identical to Chrome's. Vivaldi and Brave are indistinguishable in this manner.
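The observation in the two comments above amounts to a simple server-side check. The sketch below is illustrative only: the function name is hypothetical, and it assumes that Chrome 89 and later always send "sec-ch-ua" on the connection being inspected, which may not hold in every configuration.

```python
import re

CHROME_VERSION = re.compile(r"Chrome/(\d+)\.")


def chromium_without_uach(headers: dict) -> bool:
    """True when a Chrome 89+ style User-Agent arrives without sec-ch-ua."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    match = CHROME_VERSION.search(lowered.get("user-agent", ""))
    if match is None or int(match.group(1)) < 89:
        return False
    return "sec-ch-ua" not in lowered


chrome_like_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
)
print(chromium_without_uach({"User-Agent": chrome_like_ua}))
```

As noted above, a positive result narrows the browser to a group such as Brave or Vivaldi rather than identifying a single one.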

jwrosewell · 2021-03-16T08:07:59Z

Overall entropy has increased across the web because as a minimum we now have three web browsers that prior to the release of Chrome 89 appeared identical when HTTP headers were inspected. Now two of those three web browsers appear differently to the other.

Also, as Google opted web browsers at random into the trial prior to Chrome 89, entropy for all Chrome users increased.

miketaylr · 2021-03-17T16:20:06Z

I'd be interested in seeing the justification for the goals also, particularly around mitigating the passive fingerprinting that is supposedly so widespread. As far as I can tell no one has provided any evidence for this to date.

Sure, I'm happy to improve the spec in these areas, and link out to other guidance such as https://w3c.github.io/fingerprinting-guidance/ (which links out to some academic research, but there's plenty more out there).

Also, if browser makers want to reduce the entropy available for fingerprinting they already have a perfectly good means for doing so—by changing their user-agent header to something simpler, an approach already adopted by Mozilla etc.

@ronancremin can you clarify what Mozilla has done to simplify their UA string?

jwrosewell · 2021-03-17T16:48:15Z

@miketaylr the document referenced, and the documents it references, explain that probabilistic identifiers (aka fingerprints) exist, but contain very little to support the case for harms actually occurring as a result of the existence of probabilistic identifiers. I believe it was this evidence that @ronancremin and I were asking for.

In order to keep this document short I would prefer to see the actual case for harms occurring being made rather than the theoretical risk.

Even if harm is occurring, which there undoubtedly will be whilst people have free will, we will need to consider remedies that help identify these harms and bring the perpetrators to justice according to the laws set by law makers rather than WICG or the W3C. For example; helping data protection regulators more easily identify the bad actors causing these harms would be a remedy to the problem. No consideration has been given by the proposers of this solution to such remedies.

At the moment this document does the equivalent of stating that knives exist, identifying they could be used to threaten or injure people, before concluding the only remedy is to ban all knives without considering how we can deter people from using them for harm, and without recognizing they can be used to prepare and consume tasty meals and that this is the primary use.

miketaylr · 2021-03-17T18:47:36Z

At the moment this document does the equivalent of stating that knives exist, identifying they could be used to threaten or injure people, before concluding the only remedy is to ban all knives without considering how we can deter people from using them for harm, and without recognizing they can be used to prepare and consume tasty meals and that this is the primary use.

If the implication is that a knife is the User-Agent string in this metaphor, I don't think it works. UA-CH doesn't ban it, it aims to provide an alternative mechanism that doesn't suffer from the same privacy issues (passive entropy collection), and ideally a nicer API that still allows for default low-entropy collection and active higher-entropy collection in a way a user agent can observe and apply interventions on, if it so chooses. Is that a knife sheath? An edge guard? (It's not important; I don't think the metaphor works in any case.)

But we've veered into unproductive discussion that falls outside of the scope of the UA-CH API, and away from the original idea proposed in this issue. As to the original issue, it doesn't align with the stated goals of the API, so I'm inclined to close it. But thanks for opening it and starting the discussion.

jwrosewell · 2021-03-17T19:16:35Z

@miketaylr As you have opened a new issue to progress adding the justification for the proposal to the document I ask that you either re-open this issue so that the merits of the original issue can be debated, or explain exactly how the original issue does not meet the stated goals of the API. I believe the original issue does meet all the stated goals as explained earlier in this thread and also provides additional benefits concerning migration complexity and reducing the cost for the web ecosystem to migrate.

I took care not to include the issue of justification for the proposal in the original issue. I also note that it was in answering your comment in relation to the stated goals that led to the supposed unproductive discussion. I and others answered you in good faith and we now have a new issue #214 to continue that discussion without encumbering this one. Thank you for separating them.

miketaylr · 2021-03-17T22:02:17Z

or explain exactly how the original issue does not meet the stated goals of the API

Sure, as stated in #200 (comment), the proposal in this issue adds passive entropy to the UA string. The 2 goals of UA-CH are to add a new API that reduces default passive entropy, and to improve upon the UA string. The spec is pretty up front that the UA string isn't a great API.

The current text in the spec abstract states:

This document defines a set of Client Hints that aim to provide developers with the ability to perform agent-based content negotiation when necessary, while avoiding the historical baggage and passive fingerprinting surface exposed by the venerable User-Agent header.

Your proposal is essentially:

...structured information is appended to the User-Agent value within square brackets

So, again, adding passive entropy to the UA string is out of scope for UA-CH, and not aligned with the goals of the spec.

ronancremin · 2021-03-18T10:28:07Z

I'd be interested in seeing the justification for the goals also, particularly around mitigating the passive fingerprinting that is supposedly so widespread. As far as I can tell no one has provided any evidence for this to date.

Sure, I'm happy to improve the spec in these areas, and link out to other guidance such as https://w3c.github.io/fingerprinting-guidance/ (which links out to some academic research, but there's plenty more out there).

Thanks for the link. I scanned through all of the documents cited in the Research section of that page. These research papers did not demonstrate anything about how widespread passive fingerprinting is—they are almost entirely focused on active fingerprinting methods such as font enumeration, plugins, canvas etc. This only further highlights the oddness of the UA-CH focus on passive fingerprinting.

Yes, there is real evidence for active tracking, and a browser's UA string can contribute entropy to that, but Google's own research suggests that active tracking is so much more powerful in its ability to fingerprint a user that a browser's UA string seems meaningless by comparison. From Google's Picasso research (which uses only canvas techniques): "Our JavaScript implementation of Picasso, when properly configured using the right graphical primitives, is able to successfully distinguish the browser family (Chrome, Firefox, etc.) and the OS family (Windows, iOS, OSX,etc.) of over 52 million clients with 100% accuracy."

Also, if browser makers want to reduce the entropy available for fingerprinting they already have a perfectly good means for doing so—by changing their user-agent header to something simpler, an approach already adopted by Mozilla etc.

@ronancremin can you clarify what Mozilla has done to simplify their UA string?

On mobile devices Firefox opts to leave out any model identifier tokens such as "SAMSUNG SM-G920F" from the UA string, unlike Chrome, Samsung Browser etc. Given that there are at least tens of thousands of distinct Android models alone in the market, this has a very meaningful impact on the entropy contained in the UA string. Recall that the User-Agent header is a SHOULD requirement in the HTTP RFCs; browser makers aren't obligated to send anything at all, yet they choose to because overall it is beneficial to their users and the web ecosystem.

Browser makers are well-placed to understand their users' needs and some have already made changes to increase user privacy without requiring a new spec to do so. The biggest improvements to user privacy have been made by defaulting third-party cookies to off and other active tracking prevention systems such as those included in Safari and Firefox to counter more sophisticated tracking approaches.

jwrosewell · 2021-03-18T12:26:43Z

@cwilso @LJWatson @travisleithead @yoavweiss @dontcallmedom - as WICG chairs and W3C representative for community groups, please could you re-open this issue so that those participants who feel it has been closed prematurely can signal to the wider community that we consider it unresolved? I understand @miketaylr will be OOO for a few days, hence the request to you.

@miketaylr welcome back. Let me clarify.

There is a difference between the data model used to communicate the information, and the conditions under which that information can be communicated. I only addressed the former aspect of the data model in this issue. I raised the engineering impact and cost for the existing ecosystem associated with the proposed data model, and the comparative inefficiency of the proposal's data model. Now that Chrome 89 contains the experiment, many web participants are appreciating the impact first hand.

Consider transmitting the platform version value. In this proposal the information would be appended to the User-Agent string in square brackets with a short prefix described in a registry. For example;

[PV/75.6]

In the document as currently written it might appear as a new header as follows.

Sec-CH-UA-Platform-Version: "Platform Version";v="75.6", ";Not A Something";v="99"

This will require those that record this information to change their data model, involves more data being consumed, and the associated costs for both (developer effort and data costs).
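The size difference between the two forms above can be checked directly. This is a minimal sketch that assumes the bracketed form rides on the existing User-Agent header, so it adds no header name or line terminator of its own, while the Client Hint form is a separate HTTP/1.1 header line ending in CRLF.

```python
# Bracketed form from the example above, appended to an existing header.
compact = "[PV/75.6]"

# Separate header form from the example above, including CRLF.
header_line = (
    'Sec-CH-UA-Platform-Version: "Platform Version";v="75.6", '
    '";Not A Something";v="99"\r\n'
)

compact_size = len(compact.encode("utf-8"))
header_size = len(header_line.encode("utf-8"))

print(compact_size, header_size, header_size - compact_size)
```

Header compression in HTTP/2 and HTTP/3 would change the on-the-wire numbers for both forms, so the output is best read as an uncompressed comparison.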

The proposed change does not require a change to the Accept-CH header, or any other method that would be used to determine the presence of the information. That is unrelated.

I do have separate concerns over the need for those requirements of the UACH proposal. These are now raised in #214 and #215. That should not prevent this issue being debated purely from the perspective of efficiency and cost to all web participants, many of whom are not represented directly here.

LJWatson · 2021-03-22T19:58:39Z

Reading through the comments, it appears that the solution proposed for this issue would not be compatible with the stated goals of the API. The justification for this was clearly given, and so I do not see a reason to reopen this issue.

If it is felt that an API with a different set of goals is worth exploring, then forking and/or starting an alternative incubation is always an option.

jwrosewell · 2021-03-25T21:17:24Z

@LJWatson Thank you for taking the time to review the issue and responding. Please could you refer me to the clear justification? I'm unable to identify it.

My analysis follows.

Reading the 17th March 2021 version of the document, I’ve been able to identify the following goals from the abstract.

to provide developers with the ability to perform agent-based content negotiation when necessary

while avoiding the historical baggage and

passive fingerprinting surface [aka probabilistic identifier] exposed by the venerable User-Agent header

The specification should be able to relate all its requirements to these goals. If this is not done the requirements are merely ancillary and not of importance.

After the Abstract the document goes on to describe a solution where the information associated with content negotiation would appear in HTTP headers additional to the User-Agent header.

The content negotiation is performed using a further HTTP header named Accept-CH which is used to indicate to the user agent that the server would like to receive additional information in the subsequent requests. I believe this is the negotiation referred to in the abstract.

By requiring the server to request the additional information that information is no longer sent with every request in the User-Agent value thus reducing the information available to form a probabilistic identifier.
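The negotiation described in the two paragraphs above can be sketched from the client's side. The names below are hypothetical, and the contents of `DEFAULT_HINTS` (the hints sent without being requested) are an assumption made for illustration rather than a statement of what the document requires.

```python
# Assumption: hints a client sends on every request without being asked.
DEFAULT_HINTS = {"sec-ch-ua", "sec-ch-ua-mobile"}


def parse_accept_ch(value: str) -> set:
    """Split an Accept-CH value into lower-cased hint names."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


def hints_for_next_request(accept_ch, available: dict) -> dict:
    """Select which of the client's hints go on the next request.

    accept_ch is the Accept-CH value from an earlier response, or None.
    available maps hint header names to the values the client could send.
    """
    requested = parse_accept_ch(accept_ch) if accept_ch else set()
    allowed = DEFAULT_HINTS | requested
    return {
        name: value
        for name, value in available.items()
        if name.lower() in allowed
    }


available = {
    "Sec-CH-UA": '"Chromium";v="89"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Model": '"Pixel 5"',
    "Sec-CH-UA-Arch": '"arm"',
}
print(hints_for_next_request(None, available))
print(hints_for_next_request("Sec-CH-UA-Model", available))
```

Only hints the server named, plus the assumed defaults, are sent, which is the mechanism by which less information is available on each request.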

The proposal retains the User-Agent header whilst the User Agent Client Hint solution operates in parallel. The proposal suggests implementers SHOULD deprecate usage of the User-Agent header. This is an implementation consideration and not a goal of the proposal. The proposal's goal seeks only to “avoid the historical baggage”.

There will be a phase of parallel operation prior to implementers reaching a point where the User-Agent field can be removed entirely. During this period the “historical baggage” may be “locked” or “ratcheted” over time.

This issue was raised so that this process can be optimised to reduce the amount of data required to achieve the stated goals and importantly reduce the migration complexity for the majority of web participants.

I therefore conclude that there is a goal I’ve been unable to identify.

Could the reasons for closure relate to the implementation consideration suggesting removal of the User-Agent field being considered a goal rather than a consideration?

If this is true then the abstract should change to clearly state that a goal of the proposal is to remove the User-Agent field in its entirety. Such clarity would indeed be justification for the closure of an issue that sought to retain the User-Agent field by modifying the data to reduce substantially the migration cost for the entire web ecosystem and reduce the data consumption of the web in practice whilst supporting all the other stated goals of the proposal.

@LJWatson @miketaylr Please could you confirm this interpretation is correct? I will then raise an issue to modify the goals to make this clear and a further issue to include the evidence to justify removing the User-Agent field. Or if this is not a correct interpretation please advise what I have missed in my analysis?

My preference is for this proposal to graduate to a W3C Working Group so that the W3C consensus based process can be applied. However @miketaylr does not believe the proposal is sufficiently advanced for this to commence as open issues remain. See #176.

For the avoidance of doubt, multiple WICG participants are requesting evidence from the proposer concerning the real harms, not theoretical risk, associated with passive probabilistic identifiers associated with the User-Agent. See #215. However, that is not relevant to this issue or this request. It is relevant to the suggestion to create an alternative proposal, as I do not find evidence that the problem being solved warrants, in practice, a substantial alteration to this de-facto standard of interoperability. In any case engineering solutions alone may not be appropriate remedies. I’m therefore not in a position to advance an API with a different set of goals.

sanchezzzhak · 2021-09-13T22:25:10Z

I suggest not using Client Hints, but simply reducing the User-Agent string.
For example, the current User-Agent:
Mozilla/5.0 (Linux; Android 10; Model Name) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.0.0 Mobile Safari/537.36

Remove Mozilla/5.0
Remove Linux;
Remove AppleWebKit/537.36
Remove (KHTML, like Gecko)
Remove Safari/537.36

As a result, we get a new format:
Android WebView
Android 10; Model Name; wv; Chrome/93.0.0.0; Blink; Mobile; (55 chars)
Android mobile:
Android 10; Model Name; Chrome/93.0.0.0; Blink; Mobile (53 chars)
Android tablet/TV/PC
Android 10; Model Name; Chrome/93.0.0.0; Blink; (45 chars)
Windows:
Windows 10; Chrome/93.0.0.0; Blink 93; (38 chars)
Linux:
Linux; Chrome/93.0.0.0; Blink (30 chars)
IOS
IOS 14_5; IPhone; Chrome/93.0.0.0; WebKit/537.36; Mobile (50 chars)
IPadOS
IPadOS 14_5; IPad; Chrome/93.0.0.0; WebKit/537.36
Mac
MacOs14_5; Mac; Chrome/93.0.0.0; Blink

This is much less than with Client Hints and a bunch of headers.
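A sketch of how a browser might assemble the reduced format listed above follows. The function name and parameters are hypothetical, and the token order (platform, optional model, browser, engine, optional "Mobile") is inferred from the Android and Linux examples rather than defined anywhere.

```python
def reduced_user_agent(platform, browser, engine, model=None, mobile=False):
    """Join the remaining tokens with '; ' in the order used by the examples."""
    parts = [platform]
    if model:
        parts.append(model)
    parts.extend([browser, engine])
    if mobile:
        parts.append("Mobile")
    return "; ".join(parts)


android_mobile = reduced_user_agent(
    "Android 10", "Chrome/93.0.0.0", "Blink", model="Model Name", mobile=True
)
linux_desktop = reduced_user_agent("Linux", "Chrome/93.0.0.0", "Blink")

# Print each value with its length so sizes can be compared directly.
for value in (android_mobile, linux_desktop):
    print(len(value), value)
```

The WebView, iOS and trailing-semicolon variants in the list would need extra tokens or rules that this sketch does not attempt to cover.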

PS my ClientHint
"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92" (64 chars)

jwrosewell · 2021-09-14T17:36:35Z

@sanchezzzhak thank you for the additional analysis. Perhaps you could ask WICG chairs to re-open the issue? Alternatively post a new issue with your analysis referencing this one?

This was referenced Feb 10, 2021

Consider limits on brand size (and possibly hints in general) #179

Closed

UA full version info is too tightly bound to primary brand instead of engine #196

Closed

miketaylr closed this as completed Mar 17, 2021

miketaylr mentioned this issue Mar 17, 2021

Link to https://w3c.github.io/fingerprinting-guidance/ for passive entropy references #214

Closed

jwrosewell mentioned this issue Jun 11, 2021

Amend proposal to comply with "Google Commitments Offer" to Competition and Markets Authority #244

Closed

sanchezzzhak mentioned this issue Sep 14, 2021

I suggest not using ClientHint, but simply reducing the user agent #256

Closed

jwrosewell mentioned this issue Feb 15, 2022

Can't detecting windows 11 by using MatchEvidence 51Degrees/device-detection-go#4

Closed

jwrosewell mentioned this issue Jul 1, 2022

Remove GREASE #127

Closed
