[21pt] AWS Step Function notification system #1367

Open
RobHanna-NOAA opened this issue Dec 10, 2024 · 0 comments
Labels
AWS Fix or Contribution for running HAND FIM in AWS CI/CD CI/CD - devOps related Sys Admin

Comments


RobHanna-NOAA commented Dec 10, 2024

Note: Now part of 1377 EPIC: FIM Sys Admin Tasks (and a few related FIM tasks)

As it stands currently, Rob has to do some manual editing of code in various AWS tools before kicking off a "Step Function" run. The biggest problem here is that we have no notification system in place. I am pretty sure AWS has this type of system that can be bolted on.

The system can fail in a number of ways:

  • Incorrect adjustments or arguments to the AWS code that kicks off Step Functions. Most of these show themselves in the AWS Step Function window within 10 to 20 mins, but you have to watch that specific AWS Step Function window.
  • A FIM code level error occurs. We generally don't see these until the AWS Step Function starts processing the first HUC, which is 20 to 30 mins after starting. Most of these errors do not show up in the EFS FIM error log systems, but some do, and they generally do not have much detail. It is pretty tricky to drill down and find the correct AWS logs to help show what went wrong.
  • A HUC level issue. 95% of these show up in the EFS error logs, but you want to watch to make sure at least a few HUCs make it this far, in case all of them are failing for various reasons such as bad input data, pre-clips, etc. You have to manually watch various screens in AWS to ensure you get past this milestone, and this can take 30 to 40 mins for the first couple of HUCs to fully complete.
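
For the failures that roll up to the Step Function execution status, one possible starting point is an EventBridge rule that forwards unsuccessful executions to an SNS topic. The following is only a sketch under assumptions, not how the FIM pipeline is set up: the rule name, target Id, and both ARNs are hypothetical placeholders, and `events_client` is assumed to be `boto3.client("events")`. It would not catch HUC level errors that only appear in the EFS logs.

```python
import json

# Statuses Step Functions reports for an execution that did not succeed.
FAILURE_STATUSES = ["FAILED", "TIMED_OUT", "ABORTED"]


def build_failure_pattern(state_machine_arn):
    """Return an EventBridge event pattern matching unsuccessful executions."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": FAILURE_STATUSES,
            "stateMachineArn": [state_machine_arn],
        },
    }


def install_failure_rule(events_client, rule_name, state_machine_arn, topic_arn):
    """Create or update the rule and point it at an SNS topic.

    events_client is assumed to be boto3.client("events"); it is passed in
    so the function can be exercised without AWS credentials.
    """
    pattern = build_failure_pattern(state_machine_arn)
    events_client.put_rule(
        Name=rule_name,  # hypothetical name chosen by the caller
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-sns", "Arn": topic_arn}],  # hypothetical target Id
    )
    return pattern
```

The SNS topic would also need a resource policy that lets EventBridge publish to it, and email or other subscriptions would be added on the topic itself.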

When Rob kicks off a run, he has to watch it quite closely for up to 30 mins to make sure it hits key milestones and has made it to the point of successfully completing at least one HUC. The rest of the run can then take up to 10 hrs (or more, depending on whether it is a UAT, BED or other run). The system can fail at various points, as seen above. 95% of all errors listed above will roll up to the master "Step Function Run" page, but it has been noticed that on rare occasions the step function hangs with little or no explanation. The guess is that this is usually due to an AWS update, a HydroVIS region update or something similar. These hangs are very hard to detect and can happen at any time. My hunch is that there will be no way to detect this type of failure short of some sort of separate system that can check periodically, or check once the period in which the run is expected to be done has passed. Not sure if AWS has this type of functionality, but that would be a different issue.
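
As a rough illustration of the "check after x period" idea, a small watchdog could list the executions that are still `RUNNING` and flag any that have been going longer than expected. This is a sketch under assumptions: `sfn_client` is assumed to be `boto3.client("stepfunctions")`, the function names and the runtime limit are hypothetical, and only the first page of `list_executions` results is examined.

```python
from datetime import datetime, timedelta, timezone


def find_overdue(executions, now, max_runtime):
    """Return the executions that have been running longer than max_runtime."""
    return [e for e in executions if now - e["startDate"] > max_runtime]


def check_for_hung_runs(sfn_client, state_machine_arn, max_runtime, now=None):
    """List RUNNING executions and return those past the expected runtime.

    sfn_client is assumed to be boto3.client("stepfunctions"); it is passed
    in so the check can be exercised without AWS credentials.
    """
    now = now or datetime.now(timezone.utc)
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="RUNNING",
    )
    # Only the first page of results is inspected in this sketch.
    return find_overdue(response["executions"], now, max_runtime)
```

Something like `check_for_hung_runs(client, arn, timedelta(hours=12))` could be run on a schedule, with the limit picked per run type, and any result sent out as a notification. Separately, the Amazon States Language has a top-level `TimeoutSeconds` field; an execution that exceeds it ends with a `TIMED_OUT` status, which may cover some hangs without a separate checker.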

This whole system of monitoring Step Function runs is very labor intensive, and it is easy to miss problems.

Adding this system will open up the option of developers kicking off their own runs, as mentioned in 1366.
