[Optimization] Look into if it's worth switching some/most of watchdog to use ec2 networking metrics #21

Closed
Cameronsplaze opened this issue Aug 9, 2024 · 3 comments · Fixed by #74
Labels
enhancement New feature or request

Comments

@Cameronsplaze
Owner

Cameronsplaze commented Aug 9, 2024

Is your feature request related to a problem? Please describe.
Right now, the watchdog works by running SSM commands on the host w/ lambda, and pushing the results to custom metrics. If you look in the console, there are already EC2 metrics you can use and filter by Auto Scaling group (CloudWatch Metrics => EC2 => By Auto Scaling Group => <ASG Name>, then check NetworkIn and NetworkOut). We can add these two together w/ metric math too.
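As a rough illustration of the metric math idea, here is a minimal sketch that builds the query list you could pass as `MetricDataQueries` to CloudWatch's `GetMetricData` (for example via boto3's `get_metric_data`). The function name, the query IDs, the 5-minute period, and the `Sum` statistic are assumptions for illustration, not something taken from this repo. The sketch only builds plain dicts, so it runs without AWS credentials.

```python
def build_network_queries(asg_name: str, period_seconds: int = 300) -> list[dict]:
    """Build GetMetricData queries that sum NetworkIn and NetworkOut for one ASG.

    Hypothetical helper: names, period, and statistic are illustrative choices.
    """

    def stat_query(query_id: str, metric_name: str) -> dict:
        # One raw EC2 metric, aggregated across the Auto Scaling group.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "AutoScalingGroupName", "Value": asg_name}
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            # Only the combined expression is returned, not the raw inputs.
            "ReturnData": False,
        }

    return [
        stat_query("net_in", "NetworkIn"),
        stat_query("net_out", "NetworkOut"),
        {
            # Metric math: add the two series together by their query IDs.
            "Id": "net_total",
            "Expression": "net_in + net_out",
            "Label": "NetworkIn + NetworkOut",
            "ReturnData": True,
        },
    ]


if __name__ == "__main__":
    for query in build_network_queries("my-asg-name"):
        print(query["Id"])
```

The same `net_in + net_out` expression should also work in a CloudWatch alarm's metric list, which is where a watchdog threshold would live.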

I don't think we can get rid of the lambda, so is it worth it? (I don't want the container to spin down if someone is ssh'd in, but I'd be surprised if ssh showed much traffic at all. Worth looking into though.)

  • If we can get rid of the lambda, then this simplifies the architecture and it's easier to justify. And maybe if they're ssh'd in but not doing anything (even sftp would spike the traffic), then spinning down is okay?
  • If we CAN'T get rid of the lambda, there's still an argument for making it simpler and moving out the non-ssh checks.

To get further, I need to spin up Minecraft/Valheim instances, and see what the metrics look like both with and without players connected. Test the following:

  • Normal connection w/ Minecraft/Valheim. (Both w/ players, then idle)
  • Check trying to connect to a container, no players. Is there enough traffic to keep the container up? (Maybe the container is done downloading the update, so there's no network traffic on its end, but it's still unpacking/installing the update. You don't want it to go down in the middle of this.)
  • Same, but use hotspot for bad connection. (How much is packet count affected?)
  • Test idle SSH.
  • Test SSH, with SFTP.
  • Test SSH, just doing ls/cd/cat/etc in home, then inside EFS mount.
  • Is EFS used in this metric? Is S3 with #10?

Describe the solution you'd like
If this makes the architecture simpler/cheaper, do it.

Describe alternatives you've considered
Keeping the architecture the way it is now.

Additional context
N/A

@Cameronsplaze Cameronsplaze added the enhancement New feature or request label Aug 9, 2024
@Cameronsplaze
Owner Author

Cameronsplaze commented Nov 1, 2024

For Minecraft, metric info:

  • SSH'd into Host, doing nothing, sum of traffic over 5 minutes:
    • Traffic out: 711k
    • Traffic in: 247k
      • Total: 958k
    • Packets out: 502
    • Packets in: 449
      • Total: 951

For Valheim, the numbers were a lot less stable, so these are the lowest values:

  • SSH'd into Host, do nothing, sum of traffic over 5 minutes:
    • Traffic out: 143k
    • Traffic in: 56k
      • Total: 200k
    • Packets out: 379
    • Packets in: 309
      • Total: 688

@Cameronsplaze
Owner Author

Note: if this DOESN'T change, look at the Watchdog Errors (ContainerManager-*-Stack) alarm. It can't get 3 errors in a row, because the container will reset and push one green status on retry.

@Cameronsplaze
Owner Author

Cameronsplaze commented Nov 20, 2024

Decided watching SSH traffic isn't worth it. You can just connect to the container the "normal" way at the same time. The architecture becomes MUCH simpler (removing lambda stuff) and more flexible (can do TCP and UDP at the same time out of the box) by making this switch.

The trick is to only look at traffic going INTO the container. If you add IN and OUT traffic together, it's too noisy. For example, if the container sends metrics somewhere, it can do that whenever, and trip the OUT threshold. By only watching IN, you only see people connecting, or the container downloading (updating) something. In either case, you don't want to shut down.
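The inbound-only idea can be sketched as a small check. This is a minimal illustration, not the code from the linked PR: the function name, the threshold, and the number of required quiet periods are all hypothetical, and in practice a CloudWatch alarm on the inbound metric would likely do this evaluation for you.

```python
def is_idle(network_in_sums: list[float],
            threshold_bytes: float,
            required_periods: int) -> bool:
    """Return True when inbound traffic has stayed low long enough to shut down.

    network_in_sums: per-period sums of inbound bytes, oldest first.
    threshold_bytes: hypothetical cutoff; tune it per game.
    required_periods: how many consecutive quiet periods are needed.
    """
    if required_periods <= 0:
        raise ValueError("required_periods must be positive")
    # Not enough history yet: stay up rather than guess.
    if len(network_in_sums) < required_periods:
        return False
    recent = network_in_sums[-required_periods:]
    # Outbound traffic is deliberately ignored: the container can send data
    # whenever it likes, but inbound means players or an update download.
    return all(value < threshold_bytes for value in recent)


if __name__ == "__main__":
    # Made-up datapoints: players leave, then three quiet periods follow.
    samples = [900_000, 750_000, 4_000, 3_500, 2_800]
    print(is_idle(samples, threshold_bytes=10_000, required_periods=3))
```

Requiring several consecutive quiet periods, rather than one, keeps a single slow interval (for example during a bad connection) from shutting the container down.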

@Cameronsplaze Cameronsplaze linked a pull request Nov 24, 2024 that will close this issue
8 tasks