webconnectivity v0.5: A/B testing using netemx #2634
This diff introduces the `minipipeline` package and command. The general idea here is to analyze experiment results using an architecture that matches https://github.com/ooni/data as closely as possible. The advantage of using this architecture is that we can neatly divide the data analysis process into three phases:

1. ingestion converts an OONI measurement into observations, which are flat data structures;
2. analysis converts observations into an analysis structure, which is significantly simpler than observations;
3. scoring is the process (implemented by each experiment) that turns the analysis into top-level keys.

That is:

```mermaid
stateDiagram-v2
  state "{ingestion}" as ingestion
  MEASUREMENT --> ingestion
  ingestion --> OBSERVATIONS
  state "{analysis}" as analysis
  OBSERVATIONS --> analysis
  analysis --> ANALYSIS
  state "{scoring}" as scoring
  ANALYSIS --> scoring
  scoring --> TOP_LEVEL_KEYS
```

This diff in particular tackles points 1 and 2. We'll implement 3 for Web Connectivity LTE in a separate diff. The diff itself commits several observations and analysis files along with the measurements from which they originate, which we collectively use to write regression tests for the `minipipeline` package.

The motivation for implementing these changes is that I need a more clearly defined model to address the differences between Web Connectivity LTE and v0.4 (see also #1392). Because there is already an architectural model for implementing these changes in ooni/data, it seems logical to borrow liberally from ooni/data. Incidentally, this change has the benefit of making it easier, in the future, to align ooni/probe-cli and ooni/data. The reference issue is ooni/probe#2634. Implementing this change now seems a precondition for implementing extra changes that would address such an issue for good, by changing LTE's scoring.
Unlike the ooni/data pipeline, this pipeline is minimal (hence the name): it only aims to process measurements collected by a recent version of ooniprobe, rather than all measurements from the beginning of time. While there, disable signal integration tests because of ooni/probe#2636.
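The three phases described above can be sketched as follows. This is a minimal illustration with hypothetical type names (the real `minipipeline` types are richer and differ in detail):

```go
package main

import "fmt"

// Measurement is a stand-in for a raw OONI measurement (hypothetical).
type Measurement struct {
	DNSQueries []string
	Requests   []string
}

// Observation is a flat record derived from a measurement (hypothetical).
type Observation struct {
	Kind  string // e.g., "dns" or "http"
	Value string
}

// Analysis summarizes observations into a simpler structure (hypothetical).
type Analysis struct {
	NumDNS, NumHTTP int
}

// ingest converts a measurement into flat observations (phase 1).
func ingest(m *Measurement) (obs []Observation) {
	for _, q := range m.DNSQueries {
		obs = append(obs, Observation{Kind: "dns", Value: q})
	}
	for _, r := range m.Requests {
		obs = append(obs, Observation{Kind: "http", Value: r})
	}
	return
}

// analyze converts observations into an analysis structure (phase 2).
func analyze(obs []Observation) (a Analysis) {
	for _, o := range obs {
		switch o.Kind {
		case "dns":
			a.NumDNS++
		case "http":
			a.NumHTTP++
		}
	}
	return
}

// score turns the analysis into top-level keys (phase 3, experiment-specific).
func score(a Analysis) map[string]any {
	return map[string]any{"accessible": a.NumHTTP > 0}
}

func main() {
	m := &Measurement{
		DNSQueries: []string{"example.com"},
		Requests:   []string{"https://example.com/"},
	}
	fmt.Println(score(analyze(ingest(m))))
}
```

The point of the sketch is the strict layering: scoring only sees the analysis, and the analysis only sees flat observations, mirroring the ooni/data architecture.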
This diff introduces the qatool, originally developed in #1392. This tool allows running webconnectivitylte netemx-based QA tests and produces the measurement, observations, and analysis files. We can use this tool to run individual tests and inspect their results when they're failing. Additionally, we can use this tool to regenerate the test suite used by the minipipeline. When running the tool, we can select which QA tests to run, whether to rerun netemx, and whether to reprocess. This feature is handy because we don't need to rerun netemx every time, and sometimes we may only want to rerun netemx. This work is part of ooni/probe#2634. It makes it easier to manage and test the minipipeline and webconnectivitylte, and it enables further development and testing.
The new name is more accurate because we need to prepend to the requests list, such that the last request ends up being first in the list. Part of ooni/probe#2634
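The prepend-versus-append distinction can be illustrated with a small slice sketch (the function name is illustrative, not the actual probe-cli API):

```go
package main

import "fmt"

// prependRequest inserts r at the front of the list, so that the most
// recently observed request ends up being first in the final list.
func prependRequest(requests []string, r string) []string {
	return append([]string{r}, requests...)
}

func main() {
	var requests []string
	// Suppose we observe the redirect chain in this order:
	for _, r := range []string{"https://example.com/", "https://www.example.com/"} {
		requests = prependRequest(requests, r)
	}
	fmt.Println(requests) // the last observed request is first
}
```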
This diff adds more QA tests extracted from #1392. Part of ooni/probe#2634.
We're not saving "connect" network events when TCP connect fails. This diff fixes the codebase so that we always save these events. Fix extracted from #1392. Reference issue ooni/probe#2634.
Extracted from #1392. Part of ooni/probe#2634.
While there, make the code slightly more correct and/or compact:

1. do not extract the AS org name;
2. only extract the HTTP response if failure is nil;
3. only extract the control HTTP response if failure is nil;
4. do not include the zero ASN in results.

Also, write a script that updates all the minipipeline regression tests. Part of ooni/probe#2634
None means that an algorithm did not run or did not find enough data to produce a result. Empty means that the algorithm did run, did find enough data, and did produce an empty result. The difference between these two states, which generally is not important when writing Go code, is extremely important for taking the correct decisions when assigning results to web measurements. Accordingly, this diff goes through each algorithm and ensures we start with a None state and only switch to the Empty state when we have seen enough data to determine that the result is indeed empty.

Additionally, in this diff we also add the following new analysis rules:

- `ComputeDNSPossiblyInvalidAddrsClassic`, which is like `ComputeDNSPossiblyInvalidAddrs` but does not consider TLS, which in turn is useful to emulate the original Web Connectivity v0.4 behavior;

- `ComputeDNSPossiblyNonexistingDomains`, which tells us for which domains the probe and the control agree that those domains are undefined (i.e., they both get `NXDOMAIN`), which is useful to detect this specific case.

We also renamed `ComputeHTTPFinalResponses` to `ComputeHTTPFinalResponsesWithControl` and changed the definition such that we only include responses for which we have a control. This is the core rule for deciding whether we should move forward with considering the results of the HTTP diff set of algorithms. The reference issue is ooni/probe#2634.
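In Go, the None-versus-Empty distinction maps naturally onto nil versus allocated-but-empty maps (or slices). Here is a hedged sketch, not the actual minipipeline code:

```go
package main

import "fmt"

// computePossiblyInvalidAddrs returns nil ("None") when there is not
// enough data to run, and a non-nil, possibly empty map ("Empty") once
// the algorithm has actually run. The signature is hypothetical.
func computePossiblyInvalidAddrs(probeAddrs, controlAddrs []string) map[string]bool {
	if len(probeAddrs) <= 0 || len(controlAddrs) <= 0 {
		return nil // None: not enough data to produce a result
	}
	result := make(map[string]bool) // Empty until proven otherwise
	control := make(map[string]bool)
	for _, addr := range controlAddrs {
		control[addr] = true
	}
	for _, addr := range probeAddrs {
		if !control[addr] {
			result[addr] = true // possibly invalid: unknown to the control
		}
	}
	return result
}

func main() {
	// None: the algorithm could not run at all.
	fmt.Println(computePossiblyInvalidAddrs(nil, nil) == nil)
	// Empty: the algorithm ran and found nothing suspicious.
	r := computePossiblyInvalidAddrs([]string{"1.1.1.1"}, []string{"1.1.1.1"})
	fmt.Println(r != nil, len(r))
}
```

A scorer can then distinguish "we don't know" (nil) from "we checked and found nothing" (non-nil empty), which is exactly the distinction the diff enforces.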
Classic filtering is a `WebObservationsContainer` filtering technique that takes as input a `WebObservationsContainer` and only keeps DNS lookups using `getaddrinfo` and endpoints whose IP address has been discovered using `getaddrinfo`. By applying this technique, we reduce the richer dataset produced by Web Connectivity LTE to a smaller dataset comparable to what Web Connectivity v0.4 would return. In turn, by focusing the analysis on the reduced dataset, we hope to emulate the results produced by v0.4 for backward-compatible test keys. I named this feature "classic" because it's what we used to do and I don't want to call it legacy. Part of ooni/probe#2634.
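Conceptually, classic filtering keeps only getaddrinfo-derived entries. A simplified sketch, with hypothetical field names standing in for the real observation type:

```go
package main

import "fmt"

// WebObservation is a simplified stand-in for the real observation type.
type WebObservation struct {
	ResolvedBy string // e.g., "getaddrinfo", "doh"
	IPAddress  string
}

// filterClassic keeps only observations whose addresses were resolved
// using getaddrinfo, approximating what Web Connectivity v0.4 would see.
func filterClassic(input []WebObservation) (output []WebObservation) {
	for _, obs := range input {
		if obs.ResolvedBy == "getaddrinfo" {
			output = append(output, obs)
		}
	}
	return
}

func main() {
	obs := []WebObservation{
		{ResolvedBy: "getaddrinfo", IPAddress: "93.184.216.34"},
		{ResolvedBy: "doh", IPAddress: "93.184.216.34"},
	}
	// Only the getaddrinfo-derived entry survives the filter.
	fmt.Println(len(filterClassic(obs)))
}
```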
With ooni/probe-cli#1392 and all the other PRs referenced above, we are now able to produce the same top-level test keys for v0.4 and LTE as far as our QA suite is concerned. Hopefully and ideally, this addresses the concern that switching to LTE is going to cause many differences with respect to v0.4. I solved the issue by allowing LTE to choose different engines for data analysis. The previous engine is now called "orig" and is no longer the default. A new engine, called "classic", which is now the default, attempts to be very compatible with v0.4. As follow-up work, we should figure out a way to mix these two engines, such that we get both compatibility with v0.4 and more advanced flags that characterize all forms of blocking. In any case, we're now ready to release and to do #2555.
This commit modifies Web Connectivity LTE to include the depth and fetch_body tags. Together, these tags allow us to know the redirect depth of related transactions and whether a transaction intends to fetch the body. Then, we modify the minipipeline to process these tags and include them in observations. In turn, the availability of these bits of information inside the observations will allow us to more correctly parse and split data, and to more easily produce an analysis from which it's easier to emulate Web Connectivity v0.4. While there, make sure we extract `ControlDNSDomain` and `ControlDNSLookupFailure` for TCP endpoints. Also extracting these fields is going to be useful to more correctly emulate v0.4. Reference issue: ooni/probe#2634
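For illustration, tags of the form `depth=N` and `fetch_body=BOOL` could be parsed as sketched below; the tag format shown and the parsing helper are assumptions, not the actual minipipeline implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTags extracts the redirect depth and the fetch_body flag from a
// transaction's tags, returning ok=false when either tag is missing or
// malformed. Hypothetical helper for illustration only.
func parseTags(tags []string) (depth int64, fetchBody, ok bool) {
	var haveDepth, haveFetchBody bool
	for _, tag := range tags {
		if v, found := strings.CutPrefix(tag, "depth="); found {
			n, err := strconv.ParseInt(v, 10, 64)
			if err != nil {
				return 0, false, false
			}
			depth, haveDepth = n, true
		}
		if v, found := strings.CutPrefix(tag, "fetch_body="); found {
			b, err := strconv.ParseBool(v)
			if err != nil {
				return 0, false, false
			}
			fetchBody, haveFetchBody = b, true
		}
	}
	return depth, fetchBody, haveDepth && haveFetchBody
}

func main() {
	fmt.Println(parseTags([]string{"depth=1", "fetch_body=true"}))
}
```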
This information is useful for debugging. Also, I noticed an issue in how I compute DNS consistency, which suggests that I need to refactor how we do that. Having the control and probe resolved addresses is required for that. Part of ooni/probe#2634
We were running the algorithm comparing single IP addresses resolved by the probe against the set resolved by the control. This is incorrect, because we should compare the whole set resolved by the probe against the control set. This fact is relevant because v0.4 says there's consistency if a single IP address or ASN resolved by the probe intersects with the corresponding control set. Conversely, our comparison required that each IP address resolved by the probe belonged to the control set of addresses or ASNs. This diff solves the problem. While there, remove DNSPossiblyInvalidAddrs. I have determined we need to run DNSDiff independently of bogons checks. Incidentally, the new DNSDiff code I am adding is straight from Web Connectivity v0.4, so we should be good. Part of ooni/probe#2634
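The corrected comparison is set-against-set: the lookup is consistent as soon as the probe's address (or ASN) set intersects the control's set. A minimal sketch, with hypothetical helper names:

```go
package main

import "fmt"

// setsIntersect reports whether the two string sets share any element.
func setsIntersect(a, b []string) bool {
	seen := make(map[string]bool)
	for _, x := range a {
		seen[x] = true
	}
	for _, y := range b {
		if seen[y] {
			return true
		}
	}
	return false
}

// dnsConsistent mimics the v0.4 rule described above: consistent if
// either the IP sets or the ASN sets intersect (hypothetical helper).
func dnsConsistent(probeIPs, controlIPs, probeASNs, controlASNs []string) bool {
	return setsIntersect(probeIPs, controlIPs) || setsIntersect(probeASNs, controlASNs)
}

func main() {
	// A single shared address suffices, even though 1.2.3.4 is
	// unknown to the control.
	fmt.Println(dnsConsistent(
		[]string{"1.2.3.4", "5.6.7.8"}, []string{"5.6.7.8"},
		nil, nil,
	))
}
```

Note how the buggy per-address version would have flagged this lookup as inconsistent, because 1.2.3.4 does not belong to the control set.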
Here we start using sets and we are consolidating the algorithm to say that there are unexpected addresses. Part of ooni/probe#2634
We want to know if the failure occurs while fetching a body or while just performing a connectivity check. Part of ooni/probe#2634
(ooni#1413) While there, add additional metrics to understand what we were doing when the redirect failed: were we checking for connectivity or attempting to fetch a webpage? See ooni/probe#2634
The general idea here is to have more orthogonal metrics and to have simpler code to implement them. Part of ooni/probe#2634
(ooni#1415) This diff implements the following useful metrics:

* DNSLookupSuccessWithValidAddress
* DNSLookupSuccessWithValidAddressClassic
* DNSLookupExpectedFailure
* DNSLookupExpectedSuccess
* HTTPFinalResponseSuccessTCPWithoutControl

It also renames the HTTPDiff metrics for consistency, by adding an HTTPFinalResponse prefix to their names. It also reimplements DNSExperimentFailure in terms of a generic DNS data collection function, and moves the definition of this variable near the top of the structure. I have also removed some tests. While in general this goes against my """religion""", I think I had added tests too prematurely, and I need to move a bit faster to make this work converge. (Also, there are still many tests that deal with producing data, which is why we have so much churn at every diff that changes the minipipeline. Also, the tests I am removing only covered corner cases, so I think removing them is overall justifiable.) Part of ooni/probe#2634
(#1416) What we can do with DNSPossiblyNonexistingDomains, we can also do with DNSLookupExpectedFailure. Part of ooni/probe#2634
(#1418) In classic mode, we should restrict ourselves to what Web Connectivity v0.4 uses to take decisions. Because we're not using information about fetch_body=false observations, let's zap them from classic mode. Ironically, this PR is a ten-line diff with 9k additions and 12k deletions triggered by regenerating test data. Part of ooni/probe#2634.
(#1419) Up until now, we were setting the control expectation (i.e., the final result of accessing the website from the control) only for TCP endpoint observations. However, there are cases involving DNS lookup failures where we also need this information. So, let's just make sure every observation we produce knows the final expectation according to the test helper. Part of ooni/probe#2634
The linear analysis takes advantage of recent changes in Web Connectivity LTE to build a linear chain of observations: the entries at the beginning of the chain come from the latest redirect, followed by the penultimate redirect, and so on, with the chain ending with observations from the initial request. Within each redirect, entries are sorted by type, such that HTTP appears before TLS, which appears before TCP, which appears before DNS. Within each protocol, entries without errors sort before entries with errors. Finally, if two entries have the same redirect depth, protocol type, and error, we sort using the transaction ID (which is enough to avoid making tests nondeterministic). With this functionality, we should be able to write an analysis algorithm for Web Connectivity LTE relatively easily. Part of ooni/probe#2634
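The ordering just described (deeper redirects first, then HTTP before TLS before TCP before DNS, then successes before failures, ties broken by transaction ID) can be sketched with a multi-key comparator; the `Entry` type and numeric protocol ranks below are illustrative, not the real minipipeline types:

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a simplified stand-in for a linear-analysis observation.
type Entry struct {
	Depth   int64  // redirect depth: higher means a later redirect
	Type    int    // protocol rank: 4=HTTP, 3=TLS, 2=TCP, 1=DNS
	Failure string // empty string means success
	TxID    int64  // transaction ID, for deterministic tie-breaking
}

// sortLinear sorts entries so the latest redirect comes first; within a
// redirect, HTTP before TLS before TCP before DNS; successes before
// failures; ties broken by transaction ID.
func sortLinear(entries []Entry) {
	sort.SliceStable(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if a.Depth != b.Depth {
			return a.Depth > b.Depth
		}
		if a.Type != b.Type {
			return a.Type > b.Type
		}
		if (a.Failure == "") != (b.Failure == "") {
			return a.Failure == ""
		}
		return a.TxID < b.TxID
	})
}

func main() {
	entries := []Entry{
		{Depth: 0, Type: 1, TxID: 1},
		{Depth: 1, Type: 4, Failure: "connection_reset", TxID: 3},
		{Depth: 1, Type: 4, TxID: 2},
	}
	sortLinear(entries)
	fmt.Println(entries)
}
```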
This diff modifies Web Connectivity LTE to support different data analysis engines, refactors existing code as the "orig" engine, sets "orig" as the default, and introduces the "classic" analysis engine. This new engine aims to produce the same results as Web Connectivity v0.4 (as far as our test suite can tell). Part of ooni/probe#2634
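Selecting among analysis engines could be modeled with a small interface; the names below are illustrative, not the real webconnectivitylte API:

```go
package main

import "fmt"

// analysisEngine abstracts over how test keys are computed (hypothetical).
type analysisEngine interface {
	Name() string
	Analyze(observations []string) map[string]any
}

// origEngine stands in for the refactored pre-existing analysis code.
type origEngine struct{}

func (origEngine) Name() string { return "orig" }
func (origEngine) Analyze(obs []string) map[string]any {
	return map[string]any{"engine": "orig", "count": len(obs)}
}

// classicEngine stands in for the new v0.4-compatible analysis code.
type classicEngine struct{}

func (classicEngine) Name() string { return "classic" }
func (classicEngine) Analyze(obs []string) map[string]any {
	return map[string]any{"engine": "classic", "count": len(obs)}
}

// newEngine selects an engine by name, falling back to "orig" for any
// unrecognized name (mirroring the diff's choice of "orig" as default).
func newEngine(name string) analysisEngine {
	if name == "classic" {
		return classicEngine{}
	}
	return origEngine{}
}

func main() {
	fmt.Println(newEngine("classic").Name(), newEngine("").Name())
}
```

The advantage of this shape is that experiments can switch engines without touching ingestion, and both engines can be exercised side by side in QA.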
This diff switches LTE to use the classic "engine", removes restrictions regarding comparing test keys, and ensures with the existing test suite that LTE and v0.4 produce the same top-level test keys under the same conditions. While there, let's also bump LTE's version number, lest we forget. See ooni/probe#2634
We have a common test suite for webconnectivity v0.4 and v0.5 based on netemx. We should perform the following activities: (1) write more test cases and (2) make sure the differences between the two versions are not regressions. I already know that some differences are regressions. We should try to act upon those differences in a short time frame (e.g., a sprint) and document as follow-up issues any other differences we could not fix. Once this issue is done, we can proceed to A/B testing in the real world through #2555.