Note: Now part of 1377 EPIC: FIM Sys Admin Tasks (and a few related FIM tasks)
As it stands, Rob has to manually edit code in various AWS tools before kicking off a "Step Function" run. The biggest problem is that we have no notification system in place. I am pretty sure AWS offers this kind of system and that it can be bolted on.
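One possible way to bolt this on, offered as a sketch rather than a design decision: Step Functions emits "Step Functions Execution Status Change" events to EventBridge, so a rule matching failed runs could forward them to an SNS topic or similar target. The snippet below only builds the event pattern for such a rule. The function name and the state machine ARN are hypothetical placeholders, and the choice of EventBridge itself is an assumption, not something already set up for FIM. A rule like this would only catch runs that reach a terminal status; it would not catch a run that hangs.

```python
import json


def build_failure_event_pattern(state_machine_arn):
    """Return an EventBridge event pattern matching unsuccessful terminal
    states of a single Step Functions state machine."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            # Terminal statuses that mean the run did not succeed.
            "status": ["FAILED", "TIMED_OUT", "ABORTED"],
            "stateMachineArn": [state_machine_arn],
        },
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:states:us-east-1:123456789012:stateMachine:example-fim-run"
    print(json.dumps(build_failure_event_pattern(arn), indent=2))
```

The printed JSON could be pasted into the event pattern field of an EventBridge rule, with the notification target chosen separately.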
The system can fail in a number of ways:
Incorrect adjustments or arguments in the AWS code that kicks off the Step Function. Most of these show themselves in the AWS Step Function window within 10 to 20 mins, but only if you are watching that specific window.
A FIM code-level error occurs. We generally don't see these until the Step Function starts processing the first HUC, which is 20 to 30 mins after starting. Most of these errors do not show up in the EFS FIM error logs, though some do, and those that do generally lack detail. It is pretty tricky to drill down and find the AWS logs that show what went wrong.
A HUC-level issue. About 95% of these show up in the EFS error logs, but you still want to confirm that at least a few HUCs make it this far, in case all of them are failing for reasons such as bad input data or pre-clips. You have to manually watch various screens in AWS to ensure the run gets past this milestone, and it can take 30 to 40 mins for the first couple of HUCs to fully complete.
When Rob kicks off a run, he has to watch it quite closely for up to 30 mins to make sure it hits key milestones and has successfully completed at least one HUC. The full run can then take up to 10 hrs (or more, depending on whether it is a UAT, BED or other run). The system can fail at various points, as seen above. About 95% of the errors listed above will roll up to the master "Step Function Run" page, but on rare occasions the Step Function hangs with little or no explanation. Our guess is that this is usually due to an AWS update, a HydroVIS region update, or something similar. These hangs are very hard to detect and can happen at any time. My hunch is that the only way to detect this type of failure is a separate system that checks periodically, or checks after some period x by which the run is expected to be done. I am not sure whether AWS has this type of functionality, but that would be a different issue.
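As a rough illustration of the "check after x period" idea, here is a minimal sketch of the comparison such a checker would make. It assumes each running execution is described by a dict with `name` and `startDate` keys, which is the shape returned by boto3's `list_executions` call for Step Functions; the function name and the 12 hour limit are hypothetical and would need tuning per run type (UAT, BED, etc.). Separately, the Amazon States Language has a top-level `TimeoutSeconds` field that may cover part of this need, though whether it suits FIM runs has not been checked here.

```python
from datetime import datetime, timedelta, timezone


def find_overdue_runs(running, now, max_runtime):
    """Return the names of executions that have been running longer
    than max_runtime.

    running: iterable of dicts with 'name' and 'startDate' (tz-aware datetime).
    now: tz-aware datetime to compare against.
    max_runtime: timedelta after which a run counts as overdue.
    """
    return [r["name"] for r in running if now - r["startDate"] > max_runtime]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up executions, for illustration only.
    sample = [
        {"name": "run-a", "startDate": now - timedelta(hours=13)},
        {"name": "run-b", "startDate": now - timedelta(hours=2)},
    ]
    # Assumed limit of 12 hours; only run-a exceeds it.
    print(find_overdue_runs(sample, now, timedelta(hours=12)))
```

A scheduled job could feed this the currently running executions and send a notification for anything it returns.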
This whole process of monitoring Step Function runs is very labor-intensive, and it is easy to miss problems.
Adding a notification system would open up the option of developers kicking off their own runs, as mentioned in 1366.