feat: improve onboarding_cli resilience #376

Closed
wants to merge 10 commits into from

Conversation

srieteja
Contributor

@srieteja srieteja commented Jul 5, 2023

- What I did

  • Removed qr_code-related components from the code
  • Renamed OnboardingUtil to RegistrarApiUtil and updated its callers
  • Delete the .atKeys file if the onboarding process encounters an unrecoverable exception/error
  • Retry each onboarding step if it encounters an exception

- How I did it

  • Introduced _deleteAtKeys(), which checks whether the atsign is already onboarded and, if it is not, deletes the atKeys file that has been created
  • Introduced OnboardingTask, an abstract class that contains the structure to retry a specific task if it has encountered an Exception/Error or has not fetched the necessary data. All the onboarding tasks have been rewritten to extend this class, with varying maxRetries counts, to improve resilience.
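
A minimal sketch of the retry structure described above, written in Python purely for illustration and not taken from this PR's code. The method names `run` and `execute`, the `attempts` counter, and the `FlakyTask` subclass are all hypothetical; only the idea of an abstract task with a per-task maxRetries count comes from the description.

```python
from abc import ABC, abstractmethod


class OnboardingTask(ABC):
    """Hypothetical base class: retries run() up to max_retries times."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = 0

    @abstractmethod
    def run(self):
        """Perform one attempt; return the data, or None if it was not fetched."""

    def execute(self):
        last_error = None
        for _ in range(self.max_retries):
            self.attempts += 1
            try:
                result = self.run()
            except Exception as error:  # the attempt raised: remember it and retry
                last_error = error
                continue
            if result is not None:  # the attempt fetched the necessary data
                return result
        # Every attempt failed; the caller can treat this as unrecoverable.
        raise RuntimeError(
            f"{type(self).__name__} failed after {self.attempts} attempts"
        ) from last_error


class FlakyTask(OnboardingTask):
    """Illustrative task that fails a fixed number of times before succeeding."""

    def __init__(self, fail_times, max_retries):
        super().__init__(max_retries)
        self._remaining_failures = fail_times

    def run(self):
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise ConnectionError("simulated failure")
        return "ok"
```

In this sketch, `FlakyTask(2, 3).execute()` succeeds on the third attempt, while a task that keeps failing raises once its maxRetries count is used up, which is the point at which the atKeys file would be cleaned up.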

- How to verify it

  • activate_cli should work as usual
  • If a certain task (for example, updating pkamPubKey or encPubKey on the remote) fails, onboarding_cli should retry that specific task. If the maxRetries count is exceeded, the .atKeys file should be deleted.
  • Unit tests on the way

- Description for the changelog

  • Removed qr_code-related components from the code
  • Renamed OnboardingUtil to RegistrarApiUtil and updated its callers
  • Delete the .atKeys file if the onboarding process encounters an unrecoverable exception/error
  • Retry each onboarding step if it encounters an exception

@srieteja srieteja self-assigned this Jul 5, 2023
@srieteja
Contributor Author

srieteja commented Jul 5, 2023

Unit tests on the way

@gkc
Contributor

gkc commented Feb 15, 2024

@srieteja What's the status of this PR?

@srieteja
Contributor Author

@gkc this PR is from before at_auth was introduced, so it will take some capacity to bring it up to speed. And since it's no longer linked to a ticket, it's low priority for me. I will bring it up to speed and get it closed as soon as I find some time.

@gkc
Contributor

gkc commented Feb 15, 2024

Makes sense, thanks @srieteja

@srieteja srieteja marked this pull request as draft April 17, 2024 15:50
@gkc
Contributor

gkc commented Oct 8, 2024

@srieteja @sitaram-kalluri This PR is now very old. Should it be abandoned?

@srieteja
Contributor Author

I think so, @gkc. It is pre-at_auth, so it would take too much effort to make it relevant again. But if you think this kind of modularity is needed in onboarding_cli/at_auth, I can implement this style with the current code.

I will close this PR now.

@srieteja srieteja closed this Oct 11, 2024
@srieteja srieteja deleted the failure_del_atkeys branch October 11, 2024 06:49
@gkc
Contributor

gkc commented Oct 11, 2024

Thanks @srieteja

@sitaram-kalluri - since you are currently working on another aspect of CLI resilience (not proceeding if it doesn't have permission to create the atKeys file), can you also add resilience in other places? We need to ensure:

  1. atKeys file is created once all of the keys have been cut
  2. But the atKeys file is deleted on exit if the public keys have not been successfully saved to the atServer
  3. We are left with a conundrum regarding the CRAM key. The final step should be to delete the CRAM key from the atServer; but what to do if that fails after several retries? The most sensible thing I can think to do is to implement logic in the CramVerbHandler in atServer such that if the public encryption key and at least one PKAM public key are present, then the CramVerbHandler should reject the attempt.
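
A rough sketch of the decision logic in items 2 and 3, in Python for illustration only (not atServer or onboarding_cli code). Both function names and all parameters are hypothetical; the real check would live in CramVerbHandler, and the real cleanup in the CLI's exit path.

```python
import os


def delete_atkeys_if_unsaved(atkeys_path, public_keys_saved):
    """Item 2: on exit, remove the atKeys file unless the public keys
    were successfully saved to the atServer. Returns True if a file was deleted."""
    if public_keys_saved:
        return False
    if os.path.exists(atkeys_path):
        os.remove(atkeys_path)
        return True
    return False


def should_reject_cram(has_public_encryption_key, pkam_public_key_count):
    """Item 3: reject CRAM authentication once the atSign is effectively
    onboarded, i.e. the public encryption key and at least one PKAM
    public key are already present on the server."""
    return has_public_encryption_key and pkam_public_key_count >= 1
```

With a guard like this on the server side, a CRAM key that could not be deleted after several retries would no longer be usable once onboarding has completed.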

Please also ensure that the retries are done with exponential backoff. I'd suggest up to 5 retries with delays of 5, 8, 13, 21, and 34 seconds between attempts (resetting state whenever a request succeeds).
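
A minimal sketch of the suggested retry schedule, in Python for illustration only. The function name `call_with_backoff` and its parameters are hypothetical; the delays are the ones proposed above (they grow along a Fibonacci-style sequence rather than doubling). The `sleep` parameter is injectable so the schedule can be exercised without actually waiting.

```python
import time

# Delays (in seconds) between attempts, as suggested in the comment above.
RETRY_DELAYS_SECONDS = (5, 8, 13, 21, 34)


def call_with_backoff(request, delays=RETRY_DELAYS_SECONDS, sleep=time.sleep):
    """Call request(); on failure wait delays[i] and retry, up to len(delays) retries.

    The attempt counter is local to each call, so the schedule starts again
    from the first delay after any request has succeeded.
    """
    last_error = None
    for attempt in range(len(delays) + 1):  # one initial try plus the retries
        try:
            return request()
        except Exception as error:
            last_error = error
            if attempt < len(delays):
                sleep(delays[attempt])
    raise RuntimeError(f"request failed after {len(delays)} retries") from last_error
```

Each onboarding step (for example, saving a public key) would be wrapped in its own call, so a success in one step does not carry accumulated delay into the next.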

@sitaram-kalluri
Member

sitaram-kalluri commented Oct 14, 2024

Sure, @gkc. I will create tickets for each item and work on them.

3 participants