
Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Request #716

Merged
merged 12 commits into from
May 23, 2024

Conversation

amydevs
Contributor

@amydevs amydevs commented May 9, 2024

Description

The goal of this PR is to refactor the Discovery domain so that discoverVertexHandler does not time out when making multiple GET requests for claims on the identity provider.

This will fix the listed issue, as currently what is failing the discovery of multiple claims is purely the fact that Discovery.verifyIdentityClaims is not atomized: the method only returns once every ClaimId and its associated Claim has been retrieved from the identity provider.

In the case of GitHub, where n is the number of claims, the number of requests made for the discovery of a given identity will be ceiling(n / 10) + n. The reason for the ceiling(n / 10) requests is that GitHub will only show 10 Gists per page, meaning that we have to paginate to get all of the ClaimIds.

Given that the number of requests, and therefore the time taken, scales linearly with the number of claims, discoverVertexHandler times out when processing more than 1 Claim.
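For example, with n = 25 claims this comes to ceiling(25 / 10) + 25 = 3 + 25 = 28 sequential requests, all of which must complete within a single task deadline.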

Task Aggregation

Upon realizing that debugging many, many, many tasks might be a pain, I've settled on a more painless solution.

Instead of scheduling a task for each request made, we can simply refresh the ctx to emulate this behaviour. Refreshing a ctx will not reset the timer if the timer has already hit its deadline, which makes this behaviour very easy to emulate. Furthermore, since the task is rescheduled when lazily cancelled by the task manager stopping, we can just parameterize our progress before the task manager shuts down.
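As a rough sketch of the idea (the types and names here are illustrative stand-ins, assuming a timer with a Node-style refresh(); this is not the exact Polykey ctx API):

```ts
// Minimal illustrative types; the real ctx comes from the tasks/contexts
// machinery, this is just a sketch.
type TimerLike = { refresh(): void };
type CtxLike = { timer: TimerLike; signal: AbortSignal };

// Process every page of claim IDs, refreshing the deadline after each
// request so the timeout applies per request rather than to the whole loop.
async function getAllClaimIds(
  ctx: CtxLike,
  getPage: (token?: string) => Promise<{ ids: Array<string>; next?: string }>,
): Promise<Array<string>> {
  const ids: Array<string> = [];
  let token: string | undefined;
  do {
    // If the deadline has already been hit, refresh() won't resurrect the
    // timer, so bail out before making another request.
    if (ctx.signal.aborted) throw ctx.signal.reason;
    const page = await getPage(token);
    ids.push(...page.ids);
    token = page.next;
    // Reset the deadline so it applies to the next request individually.
    ctx.timer.refresh();
  } while (token != null);
  return ids;
}
```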

From here on out, please keep this in mind when I talk about creating new tasks, as I am now doing this instead.

Claim Pagination

Given that paginating across a Gist search to find all ClaimIds takes ceiling(n / 10) requests, we could run into a timeout on this task. This is what would happen with the current implementation:

[Excalidraw diagram: all page requests processed sequentially under one task deadline]

In order to apply our timeout deadline to each request individually, we can instead have the task that processes the first page schedule a task to process the second page once it is done:

[Excalidraw diagram: each page-processing task schedules a task for the next page]

In order to do this we need a way to paginate through each page of Claims that is:

  1. the least common denominator of all provider APIs, and
  2. serializable.

The way we can do this is by returning a generic Opaque String that represents the PaginationToken to be handed to the next task to execute its query. This type will have no meaning other than being what should be provided to the next function to get the next page.
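For instance, the token could be sketched with the usual branded/opaque type pattern (names are illustrative, not necessarily the final ones in the codebase):

```ts
// Branded string: structurally a string, but nominally distinct, so it can
// only be obtained from a provider and handed back to it unmodified.
type Opaque<K extends string, T> = T & { readonly __brand: K };
type ProviderPaginationToken = Opaque<'ProviderPaginationToken', string>;

// A page of claim IDs plus the token for the next page (if any). Each
// provider encodes whatever it needs into the token: a page index for
// GitHub, a cursor or timestamp for Twitter.
type IdentityClaimIdsPage = {
  claimIds: Array<string>;
  nextPaginationToken?: ProviderPaginationToken;
};
```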

Claim Requests

Each task that processes a page can represent a pipeline that retrieves all of the contents of the claims on that particular page. The directed graph of the order in which tasks are scheduled can be represented as so:

[Excalidraw diagram: directed graph of the claim-request task pipeline]

This way, we can streamline the meaning of Discovery.discoverVertexTimeoutTime to be the timeout of each individual request in the Discovery task pipelines. By binding the timeout to a specific expectation, we can expect fewer unexpected errors such as the one found in the issue.
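Conceptually, each step of the pipeline then looks something like the following, reusing the token types sketched above (keep in mind the earlier note that new "tasks" are emulated by refreshing the ctx; scheduleTask and getClaimIdsPage are hypothetical stand-ins, not the actual TaskManager API):

```ts
declare function scheduleTask(
  handlerId: string,
  parameters: Array<unknown>,
): Promise<void>;
declare function getClaimIdsPage(
  identityId: string,
  token?: ProviderPaginationToken,
): Promise<IdentityClaimIdsPage>;

// Handle one page under its own deadline, then hand off each claim request
// and the next page as separately scheduled work.
async function handleClaimPage(
  identityId: string,
  token?: ProviderPaginationToken,
): Promise<void> {
  const page = await getClaimIdsPage(identityId, token);
  for (const claimId of page.claimIds) {
    // One step per claim request, each with its own deadline.
    await scheduleTask('requestClaim', [identityId, claimId]);
  }
  if (page.nextPaginationToken != null) {
    // The pipeline continues: the next page gets its own deadline too.
    await scheduleTask('processClaimPage', [
      identityId,
      page.nextPaginationToken,
    ]);
  }
}
```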

Identities Provider Abstract Class Interface

Because we now have to account for pagination as well as getting claim IDs in order to schedule tasks for requesting each claim, we need a getClaimIds async generator method on the Provider interface.

We will also need to add paginated versions of the getClaims and getClaimIds methods that also return a paginationToken. The reason for this is so that the task processing each request triggered by a call to a paginated get... method can resume if the task is stopped. This will be a bit messy, as we cannot follow our existing pagination conventions, considering that different providers will use different pagination methods: GitHub uses a page index, but Twitter uses tokens and timestamps.
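As a sketch of what the additions to the abstract class might look like (signatures are illustrative, reusing the token types sketched earlier, not the final API in src/identities/Provider.ts):

```ts
abstract class Provider {
  // Generator-style convenience: yields every claim ID, paginating internally.
  public abstract getClaimIds(
    authIdentityId: string,
    identityId: string,
  ): AsyncGenerator<string>;

  // Paginated variant: returns a single page plus the opaque token needed to
  // resume from the next page, so a stopped task can pick up where it left off.
  public abstract getClaimIdsPage(
    authIdentityId: string,
    identityId: string,
    paginationToken?: ProviderPaginationToken,
  ): Promise<IdentityClaimIdsPage>;
}
```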

Issues Fixed

Tasks

  • 1. Update the Identities.Provider abstract class to include paginated API options
  • 2. Change Discovery.verifyIdentityClaims so that we can schedule a task for requesting and processing each claim separately
  • 3. Implement progress saving in the Discovery handler
  • 4. Implement a flag that will prefer usage of getClaimPages directly in cases where data can be aggregated more efficiently

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@amydevs amydevs self-assigned this May 9, 2024
@amydevs amydevs changed the title Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim May 9, 2024
@amydevs amydevs marked this pull request as draft May 9, 2024 04:03
@tegefaulkes
Contributor

This is really well explained, good work!

@CMCDragonkai
Member

CMCDragonkai commented May 9, 2024

Nice! I want to point out that we do have some notion of "time destructuring". https://github.com/MatrixAI/MatrixAI-Graph/issues/42

https://github.com/MatrixAI/MatrixAI-Graph/issues/43 - this is the right one.

@CMCDragonkai
Member

Regarding your ideas on parallelising the request process, just beware of a couple of things:

  1. Third-party IdPs may change their UI - so it's best to rely on things that are unlikely to change a lot (prefer things like URL patterns over specific buttons/text located in specific HTML elements)
  2. So if you want the count of gists... that may not always be the most reliable way of doing this.
  3. Beware of "scraping" detectors, you really need to balance the throughput with reliability.
  4. At the end of the day, I do believe in step-through timing; things need to run in the background with minimal CPU usage, etc. - inch by inch is a better idea.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from d4c82b9 to 033a7e2 on May 9, 2024 05:01
@amydevs
Contributor Author

amydevs commented May 9, 2024

hmm alright, I think I'll just stick with the step-through timing then

@amydevs
Contributor Author

amydevs commented May 9, 2024

Parallelism Draft

However, this is still kind of slow. We would ideally want the pages to be processed in parallel. GH search will provide the total number of Gists returned by a specific query:

[screenshot: GitHub Gist search showing the total result count]

We can simply take this number as n and compute ceiling(n / 10) to figure out how many pages we need to request.

[Excalidraw diagram: all page-request tasks scheduled in parallel]

This will dramatically cut down on the amount of time required to get the ClaimIds.

Note, when there are no Gist results, a different message is shown instead, so we should expect that the element showing the number of matching Gists may not exist.

[screenshot: GitHub Gist search showing the no-results message]
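A sketch of this parallel draft (function names are illustrative), including the case where the result-count element is missing because there are no results:

```ts
// Derive the page count from the scraped total, then fetch all pages at once.
async function getClaimIdsParallel(
  scrapeTotalGists: () => Promise<number | undefined>,
  getPage: (pageIndex: number) => Promise<Array<string>>,
): Promise<Array<string>> {
  // The count element is absent when there are no results, so default to 0.
  const total = (await scrapeTotalGists()) ?? 0;
  const pageCount = Math.ceil(total / 10); // GitHub shows 10 Gists per page
  const pages = await Promise.all(
    Array.from({ length: pageCount }, (_, i) => getPage(i + 1)),
  );
  return pages.flat();
}
```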

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 2 times, most recently from db6d2a7 to 7a2c868 on May 10, 2024 03:35
@amydevs
Contributor Author

amydevs commented May 10, 2024

According to the new implementation notes in the PR description, I have solved the issue of multiple claims. There are still some optimizations that need to be done though (listed in tasks) :333

[screenshot: multiple claims discovered successfully]

@amydevs
Contributor Author

amydevs commented May 10, 2024

I see a bunch of warnings about dangling tasks being put into the queue each time I:

  1. run polykey identities discover
  2. run polykey identities list to see that the discovery background task has completed
  3. ctrl-c in the agent window
  4. the agent shuts down gracefully
  5. run polykey agent start again

[screenshot: warnings about dangling tasks on agent startup]

I'm not exactly sure what these tasks are, given that the task manager is shutting down gracefully and that the discovery background task had already long-completed before I shut down the agent. This problem only happens sometimes, so I'm not sure what it could be... I was consistently getting it, but after my latest commits, I am not.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from 719d047 to a73e196 on May 13, 2024 06:31
@CMCDragonkai
Member

I think in general tasks need to be checked for any weird errors. The latest staging version still comes up with things like this:

WARN:polykey.PolykeyAgent.task v0pi41vb17do01e9e5btfbaadtc:Failed - Reason: ErrorUtilsUndefinedBehaviour

@amydevs
Contributor Author

amydevs commented May 16, 2024

THIS IS NOT A PROBLEM, DISREGARD THIS! THE ROOT CAUSE HAS BEEN IDENTIFIED IN THE FOLLOWING COMMENT

For discovery resumption there need to be changes in the code path. I have implemented the ProviderPaginationToken in order to support resumption of identity vertex discovery if the TaskManager is shut down. However, I have come to realize that the discovery of a Node Vertex and all connected Vertices is performed in a single Task. Therefore, passing a single ProviderPaginationToken around is not enough, as the discovery of a Node Vertex will potentially cause the discovery of multiple other Identity Vertices in one Task.

The solutions are either:

  • Split the discovery of each Vertex into its own Task. This might also be useful for streamlining the queueDiscoveryByIdentity and queueDiscoveryByNode methods, as it will allow us to avoid rescheduling discovery for an identity if it is already scheduled to be discovered by the discovery task of a neighbouring Vertex.
  • Pass around a map of ProviderPaginationTokens instead, one for the claims discovery of each Identity Vertex.
  • Store ProviderPaginationTokens in the database, so that the same Task can be run multiple times but will resume from the ProviderPaginationToken in the DB.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from fb75acd to 2f7e77b on May 16, 2024 05:05
@amydevs
Contributor Author

amydevs commented May 16, 2024

The main problem right now is that I've found that discovering a Node vertex makes a Polykey agent request and iterate through the pages of claims on an identity just to obtain the providerIdentityClaimId used to index the claim on the Provider service. This is very inefficient, and it has been a blocker, as it would require us to keep a PaginationToken for the Discovery resumption logic per Identity Claim on a given discovered node. Ideally we would want to be able to obtain the providerIdentityClaimId from the peer node directly.

There are 3 options for this that I have discussed with @tegefaulkes:

  1. Create the Claim locally, upload the Claim to the provider, get the ProviderIdentityClaimId, append it to the Claim locally, edit the corresponding remote claim on the provider, then publish the claim. This means that the ProviderIdentityClaimId is stored identically both on the local claim in the SigChain and on the Provider. However, this relies on the Provider having the ability to edit a Claim, which may not exist on a service like Twitter. Furthermore, the ProviderIdentityClaimId is redundant when stored on the Provider, as there is always some way to retrieve it from the Provider already.
  2. Create the Claim locally, publish the Claim to the provider, get the ProviderIdentityClaimId, append it to the Claim locally. This solves the problem of depending on the Provider having functionality to edit a Claim. However, it means that the contents of the remote Claim deviate from the local copy in the SigChain.
  3. Create the Claim locally, publish the Claim to the provider, get the ProviderIdentityClaimId, and store it locally in the database. When a peer node calls our nodesClaimsGet RPC handler, we respond not only with the ClaimIdEncoded and SignedTokenEncoded, but also the ProviderIdentityClaimId. However, the issue with this is that this information is not a part of the provenance history in the SigChain.

Edit:

It has been deemed that solution 2 is probably the way to go.
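A rough sketch of solution 2's flow (all interfaces here are hypothetical simplifications, not the actual SigChain or Provider APIs):

```ts
type LocalClaims = {
  create(data: object): Promise<{ id: string }>;
  appendProviderClaimId(id: string, providerClaimId: string): Promise<void>;
};
type RemoteProvider = {
  publishClaim(data: object): Promise<{ providerIdentityClaimId: string }>;
};

// Publish first, then record the returned ProviderIdentityClaimId locally
// only; the remote copy is never edited, so local and remote contents
// intentionally deviate.
async function claimIdentity(
  local: LocalClaims,
  provider: RemoteProvider,
  claimData: object,
): Promise<void> {
  const claim = await local.create(claimData);
  const published = await provider.publishClaim(claimData);
  await local.appendProviderClaimId(
    claim.id,
    published.providerIdentityClaimId,
  );
}
```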

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 6 times, most recently from 2decf7e to dbf2363 on May 17, 2024 04:01
@amydevs
Contributor Author

amydevs commented May 20, 2024

@tegefaulkes roger, wanted to let you know that the PR can be handed off to you. The last part is just fixing #723, and then it should be ready to merge.

@tegefaulkes
Contributor

Sure thing, I'll get to it after addressing #720 and #651

@tegefaulkes
Contributor

Alright, I've fixed the main problem that was left. The main solution was to add a flag to scheduleDiscoveryForVertex called ignoreActive to ignore any active tasks when scheduling a new one. We only use this when rescheduling the task inside the task handler.
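In sketch form (scheduleDiscoveryForVertex and ignoreActive come from the comment above, but the helper names and exact signature here are illustrative):

```ts
declare function hasActiveDiscoveryTask(vertex: string): Promise<boolean>;
declare function scheduleTask(
  handlerId: string,
  parameters: Array<unknown>,
): Promise<void>;

// Skip scheduling when a task for this vertex is already active, unless the
// caller (the task handler rescheduling itself) asks to ignore that check.
async function scheduleDiscoveryForVertex(
  vertex: string,
  ignoreActive: boolean = false,
): Promise<void> {
  if (!ignoreActive && (await hasActiveDiscoveryTask(vertex))) return;
  await scheduleTask('discoverVertex', [vertex]);
}
```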

Besides that I had to apply some small fixes here and there.

This should be ready for review now.

@tegefaulkes tegefaulkes marked this pull request as ready for review May 22, 2024 07:56
Review threads (outdated, resolved): src/discovery/Discovery.ts, src/identities/Provider.ts, tests/discovery/Discovery.test.ts, src/identities/types.ts
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 2 times, most recently from 0d0f537 to 6657dda on May 23, 2024 02:17
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 3 times, most recently from 97dad8b to b442e54 on May 23, 2024 02:26
amydevs and others added 3 commits May 23, 2024 12:44
fix: registering `checkRediscoveryHandler`

fix: discovery ignores active tasks when scheduling a new task

fix: fixed problem with `verifyIdentityClaims` handling abortion

[ci skip]
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from b442e54 to 4d98422 on May 23, 2024 02:44
@amydevs amydevs changed the title Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim May 23, 2024
@amydevs
Contributor Author

amydevs commented May 23, 2024


everything seems good, merging now

@amydevs amydevs merged commit cc06e16 into staging May 23, 2024
1 check passed
@amydevs amydevs changed the title Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Request May 23, 2024