webconnectivity v0.5: A/B testing using netemx #2634
This diff introduces the `minipipeline` package and command. The general idea here is to analyze experiment results using an architecture that matches https://github.com/ooni/data as closely as possible. The advantage of using this architecture is that we can neatly divide the data analysis process into three phases:

1. ingestion converts an OONI measurement into observations, which are flat data structures;
2. analysis converts observations into an analysis structure, which is significantly simpler than observations;
3. scoring is the process (implemented by each experiment) that turns the analysis into top-level keys.

That is:

```mermaid
stateDiagram-v2
  state "{ingestion}" as ingestion
  MEASUREMENT --> ingestion
  ingestion --> OBSERVATIONS
  state "{analysis}" as analysis
  OBSERVATIONS --> analysis
  analysis --> ANALYSIS
  state "{scoring}" as scoring
  ANALYSIS --> scoring
  scoring --> TOP_LEVEL_KEYS
```

This diff in particular tackles points 1 and 2. We'll implement 3 for Web Connectivity LTE in a separate diff. The diff itself commits several observations and analysis files along with the measurements from which they originate, which we collectively use to write regression tests for the `minipipeline` package.

The motivation for implementing these changes is that I need a more clearly defined model to address the differences between Web Connectivity LTE and v0.4 (see also #1392). Because there is already an architectural model for implementing these changes in ooni/data, it seems logical to borrow liberally from ooni/data. Incidentally, this change has the benefit of making it easier, in the future, to align ooni/probe-cli and ooni/data. The reference issue is ooni/probe#2634. Implementing this change now seems a precondition for implementing extra changes that would address such an issue for good, by changing LTE's scoring.
Unlike the ooni/data pipeline, this pipeline is minimal (hence the name): it only aims to process measurements collected by a recent version of ooniprobe, rather than all measurements from the beginning of time. While there, disable signal integration tests because of ooni/probe#2636.
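The three phases described above can be sketched as follows. This is a minimal illustration with hypothetical type names (the real `minipipeline` types are richer and differ in detail):

```go
package main

import "fmt"

// Measurement is a stand-in for a raw OONI measurement (hypothetical).
type Measurement struct {
	DNSQueries []string
	Requests   []string
}

// Observation is a flat record derived from a measurement (hypothetical).
type Observation struct {
	Kind  string // e.g., "dns" or "http"
	Value string
}

// Analysis summarizes observations into a simpler structure (hypothetical).
type Analysis struct {
	NumDNS, NumHTTP int
}

// ingest converts a measurement into flat observations (phase 1).
func ingest(m *Measurement) (obs []Observation) {
	for _, q := range m.DNSQueries {
		obs = append(obs, Observation{Kind: "dns", Value: q})
	}
	for _, r := range m.Requests {
		obs = append(obs, Observation{Kind: "http", Value: r})
	}
	return
}

// analyze converts observations into an analysis structure (phase 2).
func analyze(obs []Observation) (a Analysis) {
	for _, o := range obs {
		switch o.Kind {
		case "dns":
			a.NumDNS++
		case "http":
			a.NumHTTP++
		}
	}
	return
}

// score turns the analysis into top-level keys (phase 3, experiment-specific).
func score(a Analysis) map[string]any {
	return map[string]any{"accessible": a.NumHTTP > 0}
}

func main() {
	m := &Measurement{
		DNSQueries: []string{"example.com"},
		Requests:   []string{"https://example.com/"},
	}
	fmt.Println(score(analyze(ingest(m))))
}
```

The point of the sketch is the strict layering: scoring only sees the analysis, and the analysis only sees flat observations, mirroring the ooni/data architecture.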
This diff introduces the qatool, originally developed in #1392. This tool allows running webconnectivitylte netemx-based QA tests and produces the measurement, observations, and analysis files. We can use this tool to run individual tests and inspect their results when they're failing. Additionally, we can use this tool to regenerate the test suite used by the minipipeline. When running the tool, we can select which QA tests to run, whether to rerun netemx, and whether to reprocess. This feature is handy because we don't need to rerun netemx every time, and sometimes we may only want to rerun netemx. This work is part of ooni/probe#2634. It makes it easier to manage and test the minipipeline and webconnectivitylte, and it enables further development and testing.
The new name is more accurate because we need to prepend to the requests list, such that the last request ends up being first in the list. Part of ooni/probe#2634
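The prepend-versus-append distinction can be illustrated with a small slice sketch (the function name is illustrative, not the actual probe-cli API):

```go
package main

import "fmt"

// prependRequest inserts r at the front of the list, so that the most
// recently observed request ends up being first in the final list.
func prependRequest(requests []string, r string) []string {
	return append([]string{r}, requests...)
}

func main() {
	var requests []string
	// Suppose we observe the redirect chain in this order:
	for _, r := range []string{"https://example.com/", "https://www.example.com/"} {
		requests = prependRequest(requests, r)
	}
	fmt.Println(requests) // the last observed request is first
}
```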
This diff adds more QA tests extracted from #1392. Part of ooni/probe#2634.
We're not saving "connect" network events when TCP connect fails. This diff fixes the codebase so that we always save these events. Fix extracted from #1392. Reference issue ooni/probe#2634.
Extracted from #1392. Part of ooni/probe#2634.
While there, make the code slightly more correct and/or compact:

1. do not extract the AS org name;
2. only extract the HTTP response if failure is nil;
3. only extract the control HTTP response if failure is nil;
4. do not include the zero ASN in results.

Also, write a script that updates all the minipipeline regression tests. Part of ooni/probe#2634
None means that an algorithm did not run or did not find enough data to produce a result. Empty means that the algorithm did run, did find enough data, and did produce an empty result. The difference between these two states, which generally is not important when writing Go code, is extremely important for taking the correct decisions when assigning results to web measurements. Accordingly, this diff goes through each algorithm and ensures we start with a None state and only switch to the Empty state when we have seen enough data to determine that the result is indeed empty.

Additionally, in this diff we also add the following new analysis rules:

- `ComputeDNSPossiblyInvalidAddrsClassic`, which is like `ComputeDNSPossiblyInvalidAddrs` but does not consider TLS, which in turn is useful to emulate the original Web Connectivity v0.4 behavior;

- `ComputeDNSPossiblyNonexistingDomains`, which tells us for which domains the probe and the control agree that those domains are undefined (i.e., they both get `NXDOMAIN`), which is useful to detect this specific case.

We also renamed `ComputeHTTPFinalResponses` to `ComputeHTTPFinalResponsesWithControl` and changed the definition such that we only include responses for which we have a control. This is the core rule for deciding whether we should move forward with considering the results of the HTTP diff set of algorithms. The reference issue is ooni/probe#2634.
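In Go, the None-versus-Empty distinction maps naturally onto nil versus allocated-but-empty maps (or slices). Here is a hedged sketch, not the actual minipipeline code:

```go
package main

import "fmt"

// computePossiblyInvalidAddrs returns nil ("None") when there is not
// enough data to run, and a non-nil, possibly empty map ("Empty") once
// the algorithm has actually run. The signature is hypothetical.
func computePossiblyInvalidAddrs(probeAddrs, controlAddrs []string) map[string]bool {
	if len(probeAddrs) <= 0 || len(controlAddrs) <= 0 {
		return nil // None: not enough data to produce a result
	}
	result := make(map[string]bool) // Empty until proven otherwise
	control := make(map[string]bool)
	for _, addr := range controlAddrs {
		control[addr] = true
	}
	for _, addr := range probeAddrs {
		if !control[addr] {
			result[addr] = true // possibly invalid: unknown to the control
		}
	}
	return result
}

func main() {
	// None: the algorithm could not run at all.
	fmt.Println(computePossiblyInvalidAddrs(nil, nil) == nil)
	// Empty: the algorithm ran and found nothing suspicious.
	r := computePossiblyInvalidAddrs([]string{"1.1.1.1"}, []string{"1.1.1.1"})
	fmt.Println(r != nil, len(r))
}
```

A scorer can then distinguish "we don't know" (nil) from "we checked and found nothing" (non-nil empty), which is exactly the distinction the diff enforces.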
Classic filtering is a `WebObservationsContainer` filtering technique that takes as input a `WebObservationsContainer` and only keeps DNS lookups using `getaddrinfo` and endpoints whose IP address has been discovered using `getaddrinfo`. By applying this technique, we reduce the richer dataset produced by Web Connectivity LTE to a smaller dataset comparable to what Web Connectivity v0.4 would return. In turn, by focusing the analysis on the reduced dataset, we hope to emulate the results produced by v0.4 for backward-compatible test keys. I named this feature "classic" because it's what we used to do and I don't want to call it legacy. Part of ooni/probe#2634.
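Conceptually, classic filtering keeps only getaddrinfo-derived entries. A simplified sketch, with hypothetical field names standing in for the real observation type:

```go
package main

import "fmt"

// WebObservation is a simplified stand-in for the real observation type.
type WebObservation struct {
	ResolvedBy string // e.g., "getaddrinfo", "doh"
	IPAddress  string
}

// filterClassic keeps only observations whose addresses were resolved
// using getaddrinfo, approximating what Web Connectivity v0.4 would see.
func filterClassic(input []WebObservation) (output []WebObservation) {
	for _, obs := range input {
		if obs.ResolvedBy == "getaddrinfo" {
			output = append(output, obs)
		}
	}
	return
}

func main() {
	obs := []WebObservation{
		{ResolvedBy: "getaddrinfo", IPAddress: "93.184.216.34"},
		{ResolvedBy: "doh", IPAddress: "93.184.216.34"},
	}
	// Only the getaddrinfo-derived entry survives the filter.
	fmt.Println(len(filterClassic(obs)))
}
```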
With ooni/probe-cli#1392 and all the other PRs referenced above, we are now able to produce the same top-level test keys for v0.4 and LTE as far as our QA suite is concerned. Hopefully and ideally, this addresses the concern that switching to LTE is going to cause many differences with respect to v0.4. I solved the issue by allowing LTE to choose different engines for data analysis. The previous engine is now called "orig" and is no longer the default. A new engine, called "classic", which is now the default, attempts to be very compatible with v0.4. As follow-up work, we should figure out a way to mix these two engines, such that we get both compatibility with v0.4 and more advanced flags that characterize all forms of blocking. In any case, we're now ready to release and to do #2555.
This commit modifies Web Connectivity LTE to include the depth and fetch_body tags. Together, these tags allow us to know the redirect depth of related transactions and whether a transaction intends to fetch the body. Then, we modify the minipipeline to process these tags and include them in observations. In turn, the availability of these bits of information inside the observations will allow us to more correctly parse and split data, and to more easily produce an analysis from which it's easier to emulate Web Connectivity v0.4. While there, make sure we extract `ControlDNSDomain` and `ControlDNSLookupFailure` for TCP endpoints. Also extracting these fields is going to be useful to more correctly emulate v0.4. Reference issue: ooni/probe#2634
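For illustration, tags of the form `depth=N` and `fetch_body=BOOL` could be parsed as sketched below; the tag format shown and the parsing helper are assumptions, not the actual minipipeline implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTags extracts the redirect depth and the fetch_body flag from a
// transaction's tags, returning ok=false when either tag is missing or
// malformed. Hypothetical helper for illustration only.
func parseTags(tags []string) (depth int64, fetchBody, ok bool) {
	var haveDepth, haveFetchBody bool
	for _, tag := range tags {
		if v, found := strings.CutPrefix(tag, "depth="); found {
			n, err := strconv.ParseInt(v, 10, 64)
			if err != nil {
				return 0, false, false
			}
			depth, haveDepth = n, true
		}
		if v, found := strings.CutPrefix(tag, "fetch_body="); found {
			b, err := strconv.ParseBool(v)
			if err != nil {
				return 0, false, false
			}
			fetchBody, haveFetchBody = b, true
		}
	}
	return depth, fetchBody, haveDepth && haveFetchBody
}

func main() {
	fmt.Println(parseTags([]string{"depth=1", "fetch_body=true"}))
}
```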
This information is useful for debugging. Also, I noticed an issue in how I compute DNS consistency, which suggests that I need to refactor how we do that. Having the control and probe resolved addresses is required for that. Part of ooni/probe#2634
We were running the algorithm comparing single IP addresses resolved by the probe against the set resolved by the control. This is incorrect, because we should compare the whole set resolved by the probe against the control set. This fact is relevant because v0.4 says there's consistency if a single IP address or ASN resolved by the probe intersects with the corresponding control set. Conversely, our comparison required that each IP address resolved by the probe belonged to the control set of addresses or ASNs. This diff solves the problem. While there, remove DNSPossiblyInvalidAddrs. I have determined we need to run DNSDiff independently of bogons checks. Incidentally, the new DNSDiff code I am adding is straight from Web Connectivity v0.4, so we should be good. Part of ooni/probe#2634
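The corrected comparison is set-against-set: the lookup is consistent as soon as the probe's address (or ASN) set intersects the control's set. A minimal sketch, with hypothetical helper names:

```go
package main

import "fmt"

// setsIntersect reports whether the two string sets share any element.
func setsIntersect(a, b []string) bool {
	seen := make(map[string]bool)
	for _, x := range a {
		seen[x] = true
	}
	for _, y := range b {
		if seen[y] {
			return true
		}
	}
	return false
}

// dnsConsistent mimics the v0.4 rule described above: consistent if
// either the IP sets or the ASN sets intersect (hypothetical helper).
func dnsConsistent(probeIPs, controlIPs, probeASNs, controlASNs []string) bool {
	return setsIntersect(probeIPs, controlIPs) || setsIntersect(probeASNs, controlASNs)
}

func main() {
	// A single shared address suffices, even though 1.2.3.4 is
	// unknown to the control.
	fmt.Println(dnsConsistent(
		[]string{"1.2.3.4", "5.6.7.8"}, []string{"5.6.7.8"},
		nil, nil,
	))
}
```

Note how the buggy per-address version would have flagged this lookup as inconsistent, because 1.2.3.4 does not belong to the control set.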
Here we start using sets and we are consolidating the algorithm to say that there are unexpected addresses. Part of ooni/probe#2634
We want to know if the failure occurs while fetching a body or while just performing a connectivity check. Part of ooni/probe#2634
(ooni#1413) While there, add additional metrics to understand what we were doing when the redirect failed: were we checking for connectivity or attempting to fetch a webpage? See ooni/probe#2634
The general idea here is to have more orthogonal metrics and to have simpler code to implement them. Part of ooni/probe#2634
(ooni#1415) This diff implements the following useful metrics:

* DNSLookupSuccessWithValidAddress
* DNSLookupSuccessWithValidAddressClassic
* DNSLookupExpectedFailure
* DNSLookupExpectedSuccess
* HTTPFinalResponseSuccessTCPWithoutControl

It also renames the HTTPDiff metrics for consistency, by adding an HTTPFinalResponse prefix to their names. It also reimplements DNSExperimentFailure in terms of a generic DNS data collection function, and moves the definition of this variable near the top of the structure. I have also removed some tests. While in general this goes against my """religion""", I think I had added tests too prematurely, and I need to move a bit faster to make this work converge. (Also, there are still many tests that deal with producing data, which is why we have so much churn at every diff that changes the minipipeline. Also, the tests I am removing only covered corner cases, so I think removing them is overall justifiable.) Part of ooni/probe#2634
(#1416) What we can do with DNSPossiblyNonexistingDomains, we can also do with DNSLookupExpectedFailure. Part of ooni/probe#2634
(#1418) In classic mode, we should restrict ourselves to what Web Connectivity v0.4 uses to take decisions. Because we're not using information about fetch_body=false observations, let's zap them from classic mode. Ironically, this PR is a ten-line diff with 9k additions and 12k deletions triggered by regenerating test data. Part of ooni/probe#2634.
(#1419) Up until now, we were setting the control expectation (i.e., the final result of accessing the website from the control) only for TCP endpoint observations. However, there are cases involving DNS lookup failures where we also need this information. So, let's just make sure every observation we produce knows the final expectation according to the test helper. Part of ooni/probe#2634
The linear analysis takes advantage of recent changes in Web Connectivity LTE to build a linear chain of observations: the entries at the beginning of the chain come from the latest redirect, followed by the penultimate redirect, and so on, with the chain ending with observations from the initial request. Within each redirect, entries are sorted by type, such that HTTP appears before TLS, which appears before TCP, which appears before DNS. Within each protocol, entries without errors sort before entries with errors. Finally, if two entries have the same redirect depth, protocol type, and error, we sort using the transaction ID (which is enough to avoid making tests nondeterministic). With this functionality, we should be able to write an analysis algorithm for Web Connectivity LTE relatively easily. Part of ooni/probe#2634
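The ordering just described (deeper redirects first, then HTTP before TLS before TCP before DNS, then successes before failures, ties broken by transaction ID) can be sketched with a multi-key comparator; the `Entry` type and numeric protocol ranks below are illustrative, not the real minipipeline types:

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a simplified stand-in for a linear-analysis observation.
type Entry struct {
	Depth   int64  // redirect depth: higher means a later redirect
	Type    int    // protocol rank: 4=HTTP, 3=TLS, 2=TCP, 1=DNS
	Failure string // empty string means success
	TxID    int64  // transaction ID, for deterministic tie-breaking
}

// sortLinear sorts entries so the latest redirect comes first; within a
// redirect, HTTP before TLS before TCP before DNS; successes before
// failures; ties broken by transaction ID.
func sortLinear(entries []Entry) {
	sort.SliceStable(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if a.Depth != b.Depth {
			return a.Depth > b.Depth
		}
		if a.Type != b.Type {
			return a.Type > b.Type
		}
		if (a.Failure == "") != (b.Failure == "") {
			return a.Failure == ""
		}
		return a.TxID < b.TxID
	})
}

func main() {
	entries := []Entry{
		{Depth: 0, Type: 1, TxID: 1},
		{Depth: 1, Type: 4, Failure: "connection_reset", TxID: 3},
		{Depth: 1, Type: 4, TxID: 2},
	}
	sortLinear(entries)
	fmt.Println(entries)
}
```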
This diff modifies Web Connectivity LTE to support different data analysis engines, refactors existing code as the "orig" engine, sets "orig" as the default, and introduces the "classic" analysis engine. This new engine aims to produce the same results as Web Connectivity v0.4 (as far as our test suite can tell). Part of ooni/probe#2634
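Selecting among analysis engines could be modeled with a small interface; the names below are illustrative, not the real webconnectivitylte API:

```go
package main

import "fmt"

// analysisEngine abstracts over how test keys are computed (hypothetical).
type analysisEngine interface {
	Name() string
	Analyze(observations []string) map[string]any
}

// origEngine stands in for the refactored pre-existing analysis code.
type origEngine struct{}

func (origEngine) Name() string { return "orig" }
func (origEngine) Analyze(obs []string) map[string]any {
	return map[string]any{"engine": "orig", "count": len(obs)}
}

// classicEngine stands in for the new v0.4-compatible analysis code.
type classicEngine struct{}

func (classicEngine) Name() string { return "classic" }
func (classicEngine) Analyze(obs []string) map[string]any {
	return map[string]any{"engine": "classic", "count": len(obs)}
}

// newEngine selects an engine by name, falling back to "orig" for any
// unrecognized name (mirroring the diff's choice of "orig" as default).
func newEngine(name string) analysisEngine {
	if name == "classic" {
		return classicEngine{}
	}
	return origEngine{}
}

func main() {
	fmt.Println(newEngine("classic").Name(), newEngine("").Name())
}
```

The advantage of this shape is that experiments can switch engines without touching ingestion, and both engines can be exercised side by side in QA.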
This diff switches LTE to use the classic "engine", removes restrictions regarding comparing test keys, and ensures with the existing test suite that LTE and v0.4 produce the same top-level test keys under the same conditions. While there, let's also bump LTE's version number, lest we forget. See ooni/probe#2634
We have a common test suite for webconnectivity v0.4 and v0.5 based on netemx. We should perform the following activities: (1) write more test cases and (2) make sure the differences between the two versions are not regressions. I already know that some differences are regressions. We should try to act upon those differences in a short time frame (e.g., a sprint) and document as follow-up issues any other differences we could not fix. Once this issue is done, we can proceed to A/B testing in the real world through #2555.