
Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Request #716

Merged
merged 12 commits into from
May 23, 2024

Conversation

amydevs
Contributor

@amydevs amydevs commented May 9, 2024

Description

The goal of this PR is to refactor the Discovery domain so that discoverVertexHandler does not time out when making multiple GET requests for claims on the identity provider.

This will fix the listed issue, as currently what is failing the discovery of multiple claims is purely the fact that Discovery.verifyIdentityClaims is not atomized: the method only returns once every ClaimId and its associated Claim has been retrieved from the identity provider.

In the case of GitHub, where n is the number of claims, the number of requests made for the discovery of a given identity will be ceiling(n / 10) + n. The reason for the ceiling(n / 10) requests is that GitHub will only show 10 Gists per page, meaning that we have to paginate to get all of the ClaimIds.

Given that the number of requests, and therefore the time taken, scales linearly with the number of claims, discoverVertexHandler times out when processing more than 1 Claim.
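For example, with n = 25 claims this comes to ceiling(25 / 10) + 25 = 3 + 25 = 28 sequential requests, all of which must complete within a single task deadline.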

Task Aggregation

Upon realizing that debugging many, many, many tasks might be a pain, I've settled on a more painless solution.

Instead of scheduling a task for each request made, we can simply refresh the ctx to emulate this behaviour. Refreshing a ctx will not reset the timer if the timer has already hit its deadline, which makes this behaviour very easy to emulate. Furthermore, since the task is rescheduled when lazily cancelled by the task manager stopping, we can just parameterize our progress before the task manager shuts down.
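As a rough sketch of the idea (the types and names here are illustrative stand-ins, assuming a timer with a Node-style refresh(); this is not the exact Polykey ctx API):

```ts
// Minimal illustrative types; the real ctx comes from the tasks/contexts
// machinery, this is just a sketch.
type TimerLike = { refresh(): void };
type CtxLike = { timer: TimerLike; signal: AbortSignal };

// Process every page of claim IDs, refreshing the deadline after each
// request so the timeout applies per request rather than to the whole loop.
async function getAllClaimIds(
  ctx: CtxLike,
  getPage: (token?: string) => Promise<{ ids: Array<string>; next?: string }>,
): Promise<Array<string>> {
  const ids: Array<string> = [];
  let token: string | undefined;
  do {
    // If the deadline has already been hit, refresh() won't resurrect the
    // timer, so bail out before making another request.
    if (ctx.signal.aborted) throw ctx.signal.reason;
    const page = await getPage(token);
    ids.push(...page.ids);
    token = page.next;
    // Reset the deadline so it applies to the next request individually.
    ctx.timer.refresh();
  } while (token != null);
  return ids;
}
```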

From here on out, please keep this in mind when I talk about creating new tasks, as I am now doing this instead.

Claim Pagination

Given that paginating across a Gist search to find all ClaimIds takes ceiling(n / 10) requests, we could run into a timeout on this task. This is what would happen with the current implementation:

[Excalidraw diagram: all page requests processed sequentially under one task deadline]

In order to apply our timeout deadline to each request individually, we can instead have the task that processes the first page schedule a task to process the second page once it is done:

[Excalidraw diagram: each page-processing task schedules a task for the next page]

In order to do this we need a way to paginate through each page of Claims that is:

  1. the least common denominator of all provider APIs, and
  2. serializable.

The way we can do this is by returning a generic Opaque String that represents the PaginationToken to be handed to the next task to execute its query. This type will have no meaning other than being what should be provided to the next function to get the next page.
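For instance, the token could be sketched with the usual branded/opaque type pattern (names are illustrative, not necessarily the final ones in the codebase):

```ts
// Branded string: structurally a string, but nominally distinct, so it can
// only be obtained from a provider and handed back to it unmodified.
type Opaque<K extends string, T> = T & { readonly __brand: K };
type ProviderPaginationToken = Opaque<'ProviderPaginationToken', string>;

// A page of claim IDs plus the token for the next page (if any). Each
// provider encodes whatever it needs into the token: a page index for
// GitHub, a cursor or timestamp for Twitter.
type IdentityClaimIdsPage = {
  claimIds: Array<string>;
  nextPaginationToken?: ProviderPaginationToken;
};
```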

Claim Requests

Each task that processes a page can represent a pipeline that retrieves all of the contents of the claims on that particular page. The directed graph of the order in which tasks are scheduled can be represented as so:

[Excalidraw diagram: directed graph of the claim-request task pipeline]

This way, we can streamline the meaning of Discovery.discoverVertexTimeoutTime to be the timeout of each individual request in the Discovery task pipelines. By binding the timeout to a specific expectation, we can expect fewer unexpected errors such as the one found in the issue.
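Conceptually, each step of the pipeline then looks something like the following, reusing the token types sketched above (keep in mind the earlier note that new "tasks" are emulated by refreshing the ctx; scheduleTask and getClaimIdsPage are hypothetical stand-ins, not the actual TaskManager API):

```ts
declare function scheduleTask(
  handlerId: string,
  parameters: Array<unknown>,
): Promise<void>;
declare function getClaimIdsPage(
  identityId: string,
  token?: ProviderPaginationToken,
): Promise<IdentityClaimIdsPage>;

// Handle one page under its own deadline, then hand off each claim request
// and the next page as separately scheduled work.
async function handleClaimPage(
  identityId: string,
  token?: ProviderPaginationToken,
): Promise<void> {
  const page = await getClaimIdsPage(identityId, token);
  for (const claimId of page.claimIds) {
    // One step per claim request, each with its own deadline.
    await scheduleTask('requestClaim', [identityId, claimId]);
  }
  if (page.nextPaginationToken != null) {
    // The pipeline continues: the next page gets its own deadline too.
    await scheduleTask('processClaimPage', [
      identityId,
      page.nextPaginationToken,
    ]);
  }
}
```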

Identities Provider Abstract Class Interface

Because we now have to account for pagination as well as getting claim IDs in order to schedule tasks for requesting each claim, we need a getClaimIds async generator method on the Provider interface.

We will also need to add paginated versions of the getClaims and getClaimIds methods that also return a paginationToken. The reason for this is so that the task processing each request triggered by a call to a paginated get... method can resume if the task is stopped. This will be a bit messy, as we cannot follow our existing pagination conventions, considering that different providers will use different pagination methods: GitHub uses a page index, but Twitter uses tokens and timestamps.
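As a sketch of what the additions to the abstract class might look like (signatures are illustrative, reusing the token types sketched earlier, not the final API in src/identities/Provider.ts):

```ts
abstract class Provider {
  // Generator-style convenience: yields every claim ID, paginating internally.
  public abstract getClaimIds(
    authIdentityId: string,
    identityId: string,
  ): AsyncGenerator<string>;

  // Paginated variant: returns a single page plus the opaque token needed to
  // resume from the next page, so a stopped task can pick up where it left off.
  public abstract getClaimIdsPage(
    authIdentityId: string,
    identityId: string,
    paginationToken?: ProviderPaginationToken,
  ): Promise<IdentityClaimIdsPage>;
}
```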

Issues Fixed

Tasks

  • 1. Update the Identities.Provider abstract class to include paginated API options
  • 2. Change Discovery.verifyIdentityClaims so that we can schedule a task for requesting and processing each claim separately
  • 3. Implement progress saving in the Discovery handler
  • 4. Implement a flag that will prefer usage of getClaimPages directly in cases where data can be aggregated more efficiently

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@amydevs amydevs self-assigned this May 9, 2024
@amydevs amydevs changed the title Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim May 9, 2024
@amydevs amydevs marked this pull request as draft May 9, 2024 04:03
@tegefaulkes
Contributor

This is really well explained, good work!

@CMCDragonkai
Member

CMCDragonkai commented May 9, 2024

Nice! I want to point out that we do have some notion of "time destructuring". https://github.com/MatrixAI/MatrixAI-Graph/issues/42

https://github.com/MatrixAI/MatrixAI-Graph/issues/43 - this is the right one.

@CMCDragonkai
Member

Regarding your ideas on parallelising the request process, just beware of a couple of things:

  1. Third-party IdPs may change their UI - so it's best to rely on things that are unlikely to change a lot (prefer things like URL patterns over specific buttons/text located in specific HTML elements)
  2. So if you want the count of gists... that may not always be the most reliable way of doing this.
  3. Beware of "scraping" detectors, you really need to balance the throughput with reliability.
  4. At the end of the day, I do believe in step-through timing; things need to run in the background with minimal CPU usage, etc. - inch by inch is a better idea.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from d4c82b9 to 033a7e2 on May 9, 2024 05:01
@amydevs
Contributor Author

amydevs commented May 9, 2024

hmm alright, I think I'll just stick with the step-through timing then

@amydevs
Contributor Author

amydevs commented May 9, 2024

Parallelism Draft

However, this is still kind of slow. We would ideally want the pages to be processed in parallel. GH search will provide the total number of Gists returned by a specific query:

[screenshot: GitHub Gist search showing the total result count]

We can simply take this number as n and compute ceiling(n / 10) to figure out how many pages we need to request.

[Excalidraw diagram: all page-request tasks scheduled in parallel]

This will dramatically cut down on the amount of time required to get the ClaimIds.

Note, when there are no Gist results, a different message is shown instead, so we should expect that the element showing the number of matching Gists may not exist.

[screenshot: GitHub Gist search showing the no-results message]
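A sketch of this parallel draft (function names are illustrative), including the case where the result-count element is missing because there are no results:

```ts
// Derive the page count from the scraped total, then fetch all pages at once.
async function getClaimIdsParallel(
  scrapeTotalGists: () => Promise<number | undefined>,
  getPage: (pageIndex: number) => Promise<Array<string>>,
): Promise<Array<string>> {
  // The count element is absent when there are no results, so default to 0.
  const total = (await scrapeTotalGists()) ?? 0;
  const pageCount = Math.ceil(total / 10); // GitHub shows 10 Gists per page
  const pages = await Promise.all(
    Array.from({ length: pageCount }, (_, i) => getPage(i + 1)),
  );
  return pages.flat();
}
```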

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 2 times, most recently from db6d2a7 to 7a2c868 on May 10, 2024 03:35
@amydevs
Contributor Author

amydevs commented May 10, 2024

According to the new implementation notes in the PR description, I have solved the issue of multiple claims. There are still some optimizations that need to be done though (listed in tasks) :333

[screenshot: multiple claims discovered successfully]

@amydevs
Contributor Author

amydevs commented May 10, 2024

I see a bunch of warnings about dangling tasks being put into the queue each time I:

  1. run polykey identities discover
  2. run polykey identities list to see that the discovery background task has completed
  3. ctrl-c in the agent window
  4. the agent shuts down gracefully
  5. run polykey agent start again

[screenshot: warnings about dangling tasks on agent startup]

I'm not exactly sure what these tasks are, given that the task manager is shutting down gracefully and that the discovery background task had already long-completed before I shut down the agent. This problem only happens sometimes, so I'm not sure what it could be... I was consistently getting it, but after my latest commits, I am not.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from 719d047 to a73e196 on May 13, 2024 06:31
@CMCDragonkai
Member

I think in general tasks need to be checked for any weird errors. The latest staging version still comes up with things like this:

WARN:polykey.PolykeyAgent.task v0pi41vb17do01e9e5btfbaadtc:Failed - Reason: ErrorUtilsUndefinedBehaviour

@amydevs
Contributor Author

amydevs commented May 16, 2024

THIS IS NOT A PROBLEM, DISREGARD THIS! THE ROOT CAUSE HAS BEEN IDENTIFIED IN THE FOLLOWING COMMENT

For discovery resumption there need to be changes in the code path. I have implemented the ProviderPaginationToken in order to support resumption of identity vertex discovery if the TaskManager is shut down. However, I have come to realize that the discovery of a Node Vertex and all connected Vertices is performed in a single Task. Therefore, passing a single ProviderPaginationToken around is not enough, as the discovery of a Node Vertex will potentially cause the discovery of multiple other Identity Vertices in one Task.

The solutions are either:

  • Split the discovery of each Vertex into its own Task. This might also be useful for streamlining the queueDiscoveryByIdentity and queueDiscoveryByNode methods, as it will allow us to avoid rescheduling discovery for an identity if it is already scheduled to be discovered by the discovery task of a neighbouring Vertex.
  • Pass around a map of ProviderPaginationTokens instead, one for the claims discovery of each Identity Vertex.
  • Store ProviderPaginationTokens in the database, so that the same Task can be run multiple times but will resume from the ProviderPaginationToken in the DB.

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from fb75acd to 2f7e77b on May 16, 2024 05:05
@amydevs
Contributor Author

amydevs commented May 16, 2024

The main problem right now is that I've found that discovering a Node vertex makes a Polykey agent request and iterate through the pages of claims on an identity just to obtain the providerIdentityClaimId used to index the claim on the Provider service. This is very inefficient, and it has been a blocker, as it would require us to keep a PaginationToken for the Discovery resumption logic per Identity Claim on a given discovered node. Ideally we would want to be able to obtain the providerIdentityClaimId from the peer node directly.

There are 3 options for this that I have discussed with @tegefaulkes:

  1. Create the Claim locally, upload the Claim to the provider, get the ProviderIdentityClaimId, append it to the Claim locally, edit the corresponding remote claim on the provider, then publish the claim. This means that the ProviderIdentityClaimId is stored identically both on the local claim in the SigChain and on the Provider. However, this relies on the Provider having the ability to edit a Claim, which may not exist on a service like Twitter. Furthermore, the ProviderIdentityClaimId is redundant when stored on the Provider, as there is always some way to retrieve it from the Provider already.
  2. Create the Claim locally, publish the Claim to the provider, get the ProviderIdentityClaimId, append it to the Claim locally. This solves the problem of depending on the Provider having functionality to edit a Claim. However, it means that the contents of the remote Claim deviate from the local copy in the SigChain.
  3. Create the Claim locally, publish the Claim to the provider, get the ProviderIdentityClaimId, and store it locally in the database. When a peer node calls our nodesClaimsGet RPC handler, we respond not only with the ClaimIdEncoded and SignedTokenEncoded, but also the ProviderIdentityClaimId. However, the issue with this is that this information is not a part of the provenance history in the SigChain.

Edit:

It has been deemed that solution 2 is probably the way to go.
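A rough sketch of solution 2's flow (all interfaces here are hypothetical simplifications, not the actual SigChain or Provider APIs):

```ts
type LocalClaims = {
  create(data: object): Promise<{ id: string }>;
  appendProviderClaimId(id: string, providerClaimId: string): Promise<void>;
};
type RemoteProvider = {
  publishClaim(data: object): Promise<{ providerIdentityClaimId: string }>;
};

// Publish first, then record the returned ProviderIdentityClaimId locally
// only; the remote copy is never edited, so local and remote contents
// intentionally deviate.
async function claimIdentity(
  local: LocalClaims,
  provider: RemoteProvider,
  claimData: object,
): Promise<void> {
  const claim = await local.create(claimData);
  const published = await provider.publishClaim(claimData);
  await local.appendProviderClaimId(
    claim.id,
    published.providerIdentityClaimId,
  );
}
```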

@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 6 times, most recently from 2decf7e to dbf2363 on May 17, 2024 04:01
@amydevs
Contributor Author

amydevs commented May 20, 2024

@tegefaulkes roger, wanted to let you know that the PR can be handed off to you. The last part is just fixing #723, and then it should be ready to merge.

@tegefaulkes
Contributor

Sure thing, I'll get to it after addressing #720 and #651

@tegefaulkes
Contributor

Alright, I've fixed the main problem that was left. The main solution was to add a flag to scheduleDiscoveryForVertex called ignoreActive to ignore any active tasks when scheduling a new one. We only use this when rescheduling the task inside the task handler.
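In sketch form (scheduleDiscoveryForVertex and ignoreActive come from the comment above, but the helper names and exact signature here are illustrative):

```ts
declare function hasActiveDiscoveryTask(vertex: string): Promise<boolean>;
declare function scheduleTask(
  handlerId: string,
  parameters: Array<unknown>,
): Promise<void>;

// Skip scheduling when a task for this vertex is already active, unless the
// caller (the task handler rescheduling itself) asks to ignore that check.
async function scheduleDiscoveryForVertex(
  vertex: string,
  ignoreActive: boolean = false,
): Promise<void> {
  if (!ignoreActive && (await hasActiveDiscoveryTask(vertex))) return;
  await scheduleTask('discoverVertex', [vertex]);
}
```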

Besides that I had to apply some small fixes here and there.

This should be ready for review now.

@tegefaulkes tegefaulkes marked this pull request as ready for review May 22, 2024 07:56
Review threads (outdated, resolved): src/discovery/Discovery.ts, src/identities/Provider.ts, tests/discovery/Discovery.test.ts, src/identities/types.ts
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 2 times, most recently from 0d0f537 to 6657dda on May 23, 2024 02:17
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch 3 times, most recently from 97dad8b to b442e54 on May 23, 2024 02:26
amydevs and others added 3 commits May 23, 2024 12:44
fix: registering `checkRediscoveryHandler`

fix: discovery ignores active tasks when scheduling a new task

fix: fixed problem with `verifyIdentityClaims` handling abortion

[ci skip]
@amydevs amydevs force-pushed the feature-discovery-task-atomization branch from b442e54 to 4d98422 on May 23, 2024 02:44
@amydevs amydevs changed the title Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim May 23, 2024
@amydevs
Contributor Author

amydevs commented May 23, 2024


everything seems good, merging now

@amydevs amydevs merged commit cc06e16 into staging May 23, 2024
1 check passed
@amydevs amydevs changed the title Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim Request May 23, 2024