Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automod: test capture framework #470

Merged
merged 8 commits into from
Dec 15, 2023
Merged

automod: test capture framework #470

merged 8 commits into from
Dec 15, 2023

Conversation

bnewbold
Copy link
Collaborator

@bnewbold bnewbold commented Dec 9, 2023

This PR is currently rebased on top of #466, to demonstrate testing that rule. UPDATE: that PR merged, so now against main

Adds a hepa command to "capture" the current state of a real-world account: currently some account metadata (identity, profile, etc), plus some recent post records. This gets serialized to JSON for easy dumping to file, like:

go run ./cmd/hepa/ capture-recent atproto.com > automod/testdata/capture_atprotocom.json

Then, a test helper function which loads this file, and processes all the post records using an engine fixture.

Combined, these fixtures make it easy to do test-driven-development of new rules. You find an account which recently sent spam or violated some policy, take a capture snapshot, set up a test case, and then write a rule which triggers and satisfies the test.

Some notes:

  • tried moving the "test helpers" in to a sub-package (indigo/automod/automodtest) but hit a circular import, so left where it is
  • this won't work with all rule types, and some captures/rules may need additional mocking (eg, additional identities in the mock directory), but that should be fine
  • it usually isn't appropriate to capture real-world content in to public code. we can be careful about what we add in this repo (indigo); the "hackerdarkweb" example included in this PR seems fine to snapshot to me. the code does strip "Private" account metadata by default.
  • probably could use docs/comments. i'm not sure where best to put effort, feedback welcome!

@bnewbold bnewbold requested a review from warpfork December 9, 2023 06:35
@warpfork
Copy link
Contributor

fwiw about packages and cyclic imports: since that issue comes up lot as code grows, and file moving is such a bummer for git history, I have kind of a semi-standard convention for preemptively avoiding that:

  • foo package -- as empty as possible! contains mostly helper functions, usage examples, and aliases. top (e.g. largest transitives) of the import graph.
  • foo/core package -- most of the type definitions and interfaces. bottom (e.g. fewest transitives) of the import graph.
  • foo/feature_a -- imports foo/core. Maybe imported and used directly by library consumers; maybe just exposed via foo's use of it and aliasing of it.

There's lots of ways to lay out packages, but I like the above because people usually start looking for docs and examples at the shorter paths into the package tree.

Putting all the most essential types in the rootwards packages is a natural thing to do... but almost invariably creates frustration in the long run. Having the highest level usages and the coremost type defns in the same package is pretty much guaranteed to result in cycle issues when attempting to extract things.

Aliasing (especially nowadays that we have type aliasing) also makes it pretty easy to have the rootmost package expose "everything" a downstream consumer needs, without necessarily exposing them to the internal package graph details. (So for example, feature_a package refers to core.WhateverType a lot, but... foo can still expose func(WhateverType) without tossing more package imports at its consumers by also exporting type WhateverType = core.WhateverType.)

(Maybe this is all familiar old hat to you, but, 2c :))

@bnewbold
Copy link
Collaborator Author

package structure: makes sense! I'd be open to refactoring automod to fit that pattern. I think we should wait until a moment when there are not other PRs in flight though.

Manually resolved conficts:
  automod/engine_test.go
  automod/rules/fixture_test.go
@bnewbold
Copy link
Collaborator Author

Merged main, resolving conflicts.

Copy link
Contributor

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, including all the caveats you mentioned. One question about the schema.

@@ -198,12 +200,42 @@ var runCmd = &cli.Command{
},
}

// for simple commands, not long-running daemons
func configEphemeralServer(cctx *cli.Context) (*Server, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// Test helper which processes all the records from a capture. Intentionally exported, for use in other packages.
//
// This method replaces any pre-existing directory on the engine with a mock directory.
func ProcessCaptureRules(e *Engine, capture AccountCapture) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely feeling the lack of packages here or other strong organizational cues for what's testing-land and what's not. Understood that that's a future PR topic, but just want to ratify that outloud :)

The comment being explicit that this function is rewiring the engine is definitely very very good and appreciated 👍 , because I wouldn't necessarily presume that from the signature or name otherwise.


type AccountCapture struct {
CapturedAt syntax.Datetime `json:"capturedAt"`
AccountMeta AccountMeta `json:"accountMeta"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, main question: if there will be rules that want to look at multiple identities, should we brace for that now by making this a list or a map? Or is that overkill.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a reasonable question, but I think it is less ergonomic and maybe overkill. For simple rules it is nice to have a direct struct field to get access to the account meta for the owner/creator/author of the repo/record, which is by far the common access case.

For fetching more account meta, I think the pattern we should use is tiered caching. The rule code should just call evt.GetAccountMeta(did-or-handle), and the engine should read-through whatever layers of cache to get that info. That caching may even include per-event caching (not implemented currently!) which would be 1) fast and 2) ensure consistent behavior between rules executed on the same event (aka, the whole event should see the same account meta for all accounts).

This is following the facebook FXL functional pattern of basically memoizing all function calls within the scope of an event.

(side note: still mulling a better name than "event")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I guess we could have an internal accountMetaMap and have a helper method (CurrentAccount()) which would access the current account?

@bnewbold bnewbold merged commit 705a15d into main Dec 15, 2023
7 checks passed
@bnewbold bnewbold deleted the bnewbold/test-capture branch December 15, 2023 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants