Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

webconnectivity v0.5: A/B testing using netemx #2634

Closed
bassosimone opened this issue Nov 21, 2023 · 1 comment
Closed

webconnectivity v0.5: A/B testing using netemx #2634

bassosimone opened this issue Nov 21, 2023 · 1 comment

Comments

@bassosimone
Copy link
Contributor

We have a common test suite for webconnectivity v0.4 and v0.5 based on netemx. We should perform the following set of activities: (1) write more test cases and (2) make sure the differences between them are not regressions. I already know that there are differences that are regressions. We should try to act upon those differences in a short time frame (e.g., a sprint) and we should document as follow-up issues any other differences we could not fix. Once this issue is done, we can done proceed to A/B testing in the real world through #2555.

@bassosimone bassosimone self-assigned this Nov 21, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 22, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
This diff introduces the `minipipeline` package and command. The general
idea here is to analyze experiments results using an architecture that
matches https://github.com/ooni/data as closely as possible.

The advantage of using this architecture is that we can neatly divide
the data analysis process into three phases:

1. ingestion converts an OONI measurement into observations, which are
flat data structures;

2. analysis converts observations into an analysis structure, which is
significantly simpler than observations;

3. scoring is the process (implemented by each experiment) to turn the
analysis into top-level keys.

That is:

```mermaid
stateDiagram-v2
  state "{ingestion}" as ingestion
  MEASUREMENT --> ingestion
  ingestion --> OBSERVATIONS
  state "{analysis}" as analysis
  OBSERVATIONS --> analysis
  analysis --> ANALYSIS
  state "{scoring}" as scoring
  ANALYSIS --> scoring
  scoring --> TOP_LEVEL_KEYS
```

This diff in particular tackles points 1 and 2. We'll implement 3 for
Web Connectivity LTE in a separate diff.

The diff itself commits several observations and analysis files along
with the measurements from which they originate, which we all
collectively use to write regression tests for the `minipipeline`
package.

The motivation for implementing these changes is that I need a more
clearly defined model to address the differences between Web
Connectivity LTE and v0.4 (see also
#1392). Because there is already
an architectural model for implementing these changes in ooni/data, it
seems logical to borrow liberally from ooni/data.

Incidentally, this change has the benefit of making it easier, in the
future, to align ooni/probe-cli and ooni/data.

The reference issue is ooni/probe#2634.
Implementing this change seems now a precondition for implementing extra
changes that would address such an issue for good, by changing LTE's
scoring.

Unlike the ooni/data pipeline, this pipeline is minimal (hence the
name): it only aims to process measurements collected by a recent
version of ooniprobe, rather than all measurements from the beginning of
time.

While there, disable signal integration tests because of
ooni/probe#2636.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
This diff introduces the qatool, originally developed in
#1392. This tool allows to run
webconnectivitylte netemx-based QA tests and produces the measurement,
observations, and analysis files.

We can use this tool to run individual tests and inspect their results
when they're failing. Additionally, we can use this tool to regenerate
the test suite used by the minipipeline.

When running the tool, we can select which QA tests to run, whether to
rerun netemx and to reprocess. This feature is handy because we don't
need to rerun netemx every time and sometimes we may only want to rerun
netemx.

This work is part of ooni/probe#2634. It helps
us to manage and test the minipipeline and webconnectivitylte in an
easier way, and it enables further development and testing.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
The new name is more accurate because we need to prepend to requests
such that the last request ends up being first in the list.

Part of ooni/probe#2634
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
The new name is more accurate because we need to prepend to requests
such that the last request ends up being first in the list.

Part of ooni/probe#2634
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
This diff adds more QA tests extracted from
#1392.

Part of ooni/probe#2634.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
We're not saving "connect" events when TCP connect fails. This
diff fixes the codebase so that we always save these events.

Fix extracted from #1392.

Reference issue ooni/probe#2634.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
We're not saving "connect" network events when TCP connect fails. This
diff fixes the codebase so that we always save these events.

Fix extracted from #1392.

Reference issue ooni/probe#2634.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 28, 2023
While there, make the code slightly more correct and/or compact:

1. do not extract the AS org name
2. only extract the HTTP response if failure is nil
3. only extract the control HTTP response if failure is nil
4. do not include the zero ASN in results

Also, write a script that updates all the minipipeline regression tests.

Part of ooni/probe#2634
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 29, 2023
None means that an algorithm did not run or did not find enough data to
produce a result. Empty means that the algorithm did run, did find
enough data, and did produce an empty result.

The difference between these two states, which generally is not
important when writing Go code, is extremely important to take the
correct decisions when assigning the results of web measurements.

Accordingly, this diff goes through each algorithm and ensures we start
with a None state and only switch to the empty state when we have seen
enough data to determine that the result is indeed empty.

Additionally, in this diff we also add the following new analysis rules:

- `ComputeDNSPossiblyInvalidAddrsClassic`, which is like
`ComputeDNSPossiblyInvalidAddrs` but does not consider TLS, which in
turn is useful to emulate the original Web Connectivity v0.4 behavior;
- `ComputeDNSPossiblyNonexistingDomains`, which tells us for which
domains the probe and the control agree that those domains are undefined
(i.e., they both get `NXDOMAIN`), which is useful to detect this
specific case;

We also renamed `ComputeHTTPFinalResponses` to
`ComputeHTTPFinalResponsesWithControl` and changed the definition such
that we only include responses for which we have a control. This is the
core rule to decide whether we should move forward with considering the
results of the HTTP diff set of algorithms.

The reference issue is ooni/probe#2634.
bassosimone added a commit to ooni/probe-cli that referenced this issue Nov 29, 2023
Classic filtering is an `WebObservationsContainer` filtering technique
that takes in input a `WebObservationsContainer` and only keeps DNS
lookups using `getaddrinfo` and endpoints whose IP address has been
discovered using `getaddrinfo`. By applying this technique, we reduce
the richer dataset produced by Web Connectivity LTE to a smaller dataset
comparable to what Web Connectivity v0.4 would return. In turn, by
focusing the analysis on the reduced dataset, we hope to emulate the
results produced by v0.4 for backward compatible test keys.

I named this feature "classic" because it's what we used to do and I
don't want to call it legacy.

Part of ooni/probe#2634.
@bassosimone
Copy link
Contributor Author

With ooni/probe-cli#1392 and all the other PRs referenced above, we are now able to produce the same top-level test keys for v0.4 and LTE as far as our QA suite is concerned. Hopefully and ideally this addresses the concern that switching to LTE is going to cause many differences with respect to v0.4.

I solved the issue by allowing LTE to choose different engines for data analysis. The previous code and engine is now called "orig" and is not the default. A new engine, called "classic", which is now the default, attempts to be very compatible with v0.4. As follow up work, we should figure out a way to mix these two engines, such that we get both compatibility with v0.4 and more advanced flags that characterize all forms of blocking.

In any case, we're now ready for releasing and doing #2555.

Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff introduces the `minipipeline` package and command. The general
idea here is to analyze experiments results using an architecture that
matches https://github.com/ooni/data as closely as possible.

The advantage of using this architecture is that we can neatly divide
the data analysis process into three phases:

1. ingestion converts an OONI measurement into observations, which are
flat data structures;

2. analysis converts observations into an analysis structure, which is
significantly simpler than observations;

3. scoring is the process (implemented by each experiment) to turn the
analysis into top-level keys.

That is:

```mermaid
stateDiagram-v2
  state "{ingestion}" as ingestion
  MEASUREMENT --> ingestion
  ingestion --> OBSERVATIONS
  state "{analysis}" as analysis
  OBSERVATIONS --> analysis
  analysis --> ANALYSIS
  state "{scoring}" as scoring
  ANALYSIS --> scoring
  scoring --> TOP_LEVEL_KEYS
```

This diff in particular tackles points 1 and 2. We'll implement 3 for
Web Connectivity LTE in a separate diff.

The diff itself commits several observations and analysis files along
with the measurements from which they originate, which we all
collectively use to write regression tests for the `minipipeline`
package.

The motivation for implementing these changes is that I need a more
clearly defined model to address the differences between Web
Connectivity LTE and v0.4 (see also
ooni#1392). Because there is already
an architectural model for implementing these changes in ooni/data, it
seems logical to borrow liberally from ooni/data.

Incidentally, this change has the benefit of making it easier, in the
future, to align ooni/probe-cli and ooni/data.

The reference issue is ooni/probe#2634.
Implementing this change seems now a precondition for implementing extra
changes that would address such an issue for good, by changing LTE's
scoring.

Unlike the ooni/data pipeline, this pipeline is minimal (hence the
name): it only aims to process measurements collected by a recent
version of ooniprobe, rather than all measurements from the beginning of
time.

While there, disable signal integration tests because of
ooni/probe#2636.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff introduces the qatool, originally developed in
ooni#1392. This tool allows to run
webconnectivitylte netemx-based QA tests and produces the measurement,
observations, and analysis files.

We can use this tool to run individual tests and inspect their results
when they're failing. Additionally, we can use this tool to regenerate
the test suite used by the minipipeline.

When running the tool, we can select which QA tests to run, whether to
rerun netemx and to reprocess. This feature is handy because we don't
need to rerun netemx every time and sometimes we may only want to rerun
netemx.

This work is part of ooni/probe#2634. It helps
us to manage and test the minipipeline and webconnectivitylte in an
easier way, and it enables further development and testing.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
The new name is more accurate because we need to prepend to requests
such that the last request ends up being first in the list.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff adds more QA tests extracted from
ooni#1392.

Part of ooni/probe#2634.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
)

We're not saving "connect" network events when TCP connect fails. This
diff fixes the codebase so that we always save these events.

Fix extracted from ooni#1392.

Reference issue ooni/probe#2634.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
While there, make the code slightly more correct and/or compact:

1. do not extract the AS org name
2. only extract the HTTP response if failure is nil
3. only extract the control HTTP response if failure is nil
4. do not include the zero ASN in results

Also, write a script that updates all the minipipeline regression tests.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…1401)

None means that an algorithm did not run or did not find enough data to
produce a result. Empty means that the algorithm did run, did find
enough data, and did produce an empty result.

The difference between these two states, which generally is not
important when writing Go code, is extremely important to take the
correct decisions when assigning the results of web measurements.

Accordingly, this diff goes through each algorithm and ensures we start
with a None state and only switch to the empty state when we have seen
enough data to determine that the result is indeed empty.

Additionally, in this diff we also add the following new analysis rules:

- `ComputeDNSPossiblyInvalidAddrsClassic`, which is like
`ComputeDNSPossiblyInvalidAddrs` but does not consider TLS, which in
turn is useful to emulate the original Web Connectivity v0.4 behavior;
- `ComputeDNSPossiblyNonexistingDomains`, which tells us for which
domains the probe and the control agree that those domains are undefined
(i.e., they both get `NXDOMAIN`), which is useful to detect this
specific case;

We also renamed `ComputeHTTPFinalResponses` to
`ComputeHTTPFinalResponsesWithControl` and changed the definition such
that we only include responses for which we have a control. This is the
core rule to decide whether we should move forward with considering the
results of the HTTP diff set of algorithms.

The reference issue is ooni/probe#2634.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…1402)

Classic filtering is an `WebObservationsContainer` filtering technique
that takes in input a `WebObservationsContainer` and only keeps DNS
lookups using `getaddrinfo` and endpoints whose IP address has been
discovered using `getaddrinfo`. By applying this technique, we reduce
the richer dataset produced by Web Connectivity LTE to a smaller dataset
comparable to what Web Connectivity v0.4 would return. In turn, by
focusing the analysis on the reduced dataset, we hope to emulate the
results produced by v0.4 for backward compatible test keys.

I named this feature "classic" because it's what we used to do and I
don't want to call it legacy.

Part of ooni/probe#2634.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This commit modifies Web Connectivity LTE to include the depth and
fetch_body tags. Together, these tags allow us to know the redirect
depth of related transactions and whether a transaction intends to fetch
the body. Then, we modify the minipipeline to process these tags and
include them into observations.

In turn, the availability of these bits of information inside of the
observations will allow us to more correctly parse and split data and
more easily produce an analysis from which it's easier to emulate Web
Connectivity v0.4.

While there, make sure we extract `ControlDNSDomain` and
`ControlDNSLookupFailure` for TCP endpoints. Also extracting these
fields is going to be useful to more correctly emulate v0.4.

Reference issue: ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This information is useful for debugging. Also, I noticed an issue in
how I compute DNS consistency, which suggests that I need to refactor
how to do that. Having the control/probe resolved addrs is required to
do that.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
We were running the algorithm on single IP addresses resolved by the
probe against the set resolved by the control.

This is incorrect, because we should compare to the whole set resolved
by the probe. This fact is relevant because v0.4 says there's
consistency if a single IP address or ASN resolved by the probe
intersects with the corresponding control set.

Conversely, our comparison required that each IP address resolved by the
probe belonged to the control set of addresses or ASNs.

This diff solves the problem.

While there, remove DNSPossiblyInvalidAddrs. I have determined we need
to run DNSDiff independently of bogons checks.

Incidentally, the new DNSDiff code I am adding is straight from Web
Connectivity v0.4, so we should be good.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Here we start using sets and we are consolidating the algorithm to say
that there are unexpected addresses.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
We want to know if the failure occurs while fetching a body or while
just performing a connectivity check.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…ni#1413)

While there, add additional metrics to understand what we were doing
when the redirect failed. Were we checking for connectivity or
attempting to fetch a webpage?

See ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
The general idea here is to have more orthogonal metrics and to have
simpler code to implement them.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…oni#1415)

This diff implements the following useful metrics:

* DNSLookupSuccessWithValidAddress
* DNSLookupSuccessWithValidAddressClassic
* DNSLookupExpectedFailure
* DNSLookupExpectedSuccess
* HTTPFinalResponseSuccessTCPWithoutControl

It also renames the HTTPDiff metrics for consistency, by adding an
HTTPFinalResponse prefix to names.

It also reimplements DNSExperimentFailure in terms of a generic DNS data
collection function. And it also moves the definition of this variable
to be near the top of the structure.

I have also removed some tests. While in general this goes against my
"""religion""", I think I had added tests too prematurely and I need to
move a bit faster to make this work converge. (Also, there are still
several many tests that deal with producing data, which is why we have
so much churn at every diff that changes the minipipeline. Also, the
tests I am removing were only to cover corner cases, so I think it's
overall justifiable to do this.)

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…#1416)

What we can do with DNSPossiblyNonexistingDomains, we can also do with
DNSLookupExpectedFailure.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…1418)

In classic mode, we should restrict to what Web Connectivity v0.4 is
using to take decisions. Because we're not using information about
fetch_body=false observations, let's zap them from classic mode.

Ironically, this PR is a ten lines diff with 9k additions and 12k
deletions triggered by regenerating tests data.

Part of ooni/probe#2634.
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
…#1419)

Up until now, we were setting the control expectation (i.e., the final
result of accessing the website from the control) only for TCP endpoints
observations. However, there are cases where we're considering DNS
lookup failures where we also need this information.

So, let's just make sure every observation we produce knows about what's
the final expectation according to the test helper.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
The linear analysis takes advantage of recent changes in WebConnectivity
LTE to build a linear chain of observations where the entries at the
beginning of the chain are from the latest redirect followed by the
penultimate with the chain ending with observations from the initial
request. Within each redirect, entries are sorted by type such that HTTP
appears before TLS which appears before TCP which appears before DNS.
Within each protocol, entries without errors sort before entries with
errors. Finally, if two entries have the same redirect depth, protocol
type and error, we sort using the transaction ID (which is enough to
avoid making tests nondeterministic).

With this functionality, we should be able to write relatively easily an
analysis algorithm for Web Connectivity LTE.

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff modifies Web Connectivity LTE to support different data
analysis engines, refactors existing code as the "orig" engine, sets
"orig" as the default, and introduces the "classic" analysis engine.
Such a new engine aims to produce the same results of Web Connectivity
v0.4 (as far as our test suite can test).

Part of ooni/probe#2634
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff switches LTE to use the classic "engine", removes restrictions
regarding comparing test keys, and ensures with the existing test suite
that LTE and v0.4 produce the same top-level test keys under the same
conditions.

While there, let's also bump LTE's version number, lest we forget.

See ooni/probe#2634
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant