Move test helpers from digital ocean to AWS #47

Closed
15 tasks done
hellais opened this issue Apr 22, 2024 · 2 comments
Labels: priority/high, technical task (technical tasks e.g. deployment)

Comments

@hellais
Member

hellais commented Apr 22, 2024

The test helper rotation script is broken, and manual changes were made to DNS on 18th March 2024 to unbrick it: https://openobservatory.slack.com/archives/C38EJ0CET/p1710780947922739.

Following this incident, the NS delegation of th.ooni.org has been migrated over to AWS, which currently hosts the following A and AAAA records:
0.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
1.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
2.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
3.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000

Note that 1 and 2 point to one IP and 0 and 3 point to another, because only 2 running VPSes were left unbroken by the automatic rotation script.
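To make the sharing explicit, here is a minimal sketch that groups the four hostnames by the addresses they resolve to. The data is copied from the record list above; the function name `group_by_backend` is hypothetical and only used for illustration.

```python
from collections import defaultdict

# Records copied from the list above: hostname -> (IPv4, IPv6).
RECORDS = {
    "0.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
    "1.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "2.th.ooni.org": ("161.35.89.250", "2a03:b0c0:2:d0::1768:9001"),
    "3.th.ooni.org": ("146.190.119.3", "2604:a880:4:1d0::69e:f000"),
}


def group_by_backend(records):
    """Group hostnames that share the same (IPv4, IPv6) pair."""
    groups = defaultdict(list)
    for host, addrs in sorted(records.items()):
        groups[addrs].append(host)
    return dict(groups)


backends = group_by_backend(RECORDS)
for addrs, hosts in backends.items():
    print(", ".join(hosts), "->", ", ".join(addrs))
```

Four names collapsing into two groups means that losing either VPS would take out half of the advertised test helpers.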

Plan for migration

We plan to migrate all these test helpers over to the AWS ECS based configuration, see: https://github.com/ooni/devops/blob/main/tf/environments/prod/main.tf#L505.

All the previous addresses will be configured as aliases pointing to the ALB entry (see: https://github.com/ooni/devops/blob/main/tf/modules/oonith_service/main.tf#L176) for the oonith_service. An alias effectively behaves like a CNAME, but costs less: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html.
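The actual records are managed in Terraform in the linked repository. Purely as an illustration of what an alias record looks like, the sketch below builds the change batch structure that the Route 53 ChangeResourceRecordSets API accepts. The ALB DNS name and hosted zone ID are hypothetical placeholders, and the helper names are not taken from the OONI codebase.

```python
import json

# Hypothetical placeholders: the real values come from the ALB itself.
ALB_DNS_NAME = "example-alb.elb.example.invalid"
ALB_ZONE_ID = "ZEXAMPLEZONEID"


def alias_change(name, alb_dns_name, alb_zone_id, record_type="A"):
    """Build one UPSERT change for a Route 53 alias record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": record_type,
            # An alias record carries an AliasTarget instead of a TTL
            # and a list of ResourceRecords.
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,  # zone ID of the ALB, not of th.ooni.org
                "DNSName": alb_dns_name,
                "EvaluateTargetHealth": False,
            },
        },
    }


def alias_change_batch(names, alb_dns_name, alb_zone_id):
    """Build a change batch pointing every name at the same ALB."""
    return {
        "Comment": "Point test helper names at the ALB (illustrative)",
        "Changes": [alias_change(n, alb_dns_name, alb_zone_id) for n in names],
    }


batch = alias_change_batch(
    [f"{i}.th.ooni.org" for i in range(4)], ALB_DNS_NAME, ALB_ZONE_ID
)
print(json.dumps(batch["Changes"][0], indent=2))
```

Unlike a CNAME, the alias keeps record type A, so clients still see a plain address lookup. Serving IPv6 the same way would need a matching AAAA alias and an ALB that accepts IPv6, which this sketch does not assume.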

Checklist

  • Add support for IPv6 connectivity on test helpers
  • Setup 4.th.ooni.org on AWS (done 19th April 2024)
  • Update check-in to return 4.th.ooni.org (done 22nd April 2024)
  • Drop test helper migration script from backend-fsn (done 22nd April 2024)
  • Drop 3.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service: Move 3.th.ooni.org over to AWS #48
  • Monitor failure rate for 3.th.ooni.org
  • Monitor load of test helper to see if capacity is enough
  • Bump up capacity of machine and ensure that it’s increased with zero downtime
  • Edit backend config to return only 1,2,3,4.th.ooni.org (done 10:45 23rd April 2024 CEST): Random sort of test helper addresses backend#838
  • Drop 0.th.ooni.org from prod/dns_records.tf and have it point to aws_alb.oonith_service
  • Monitor failure rate for 0.th.ooni.org
  • Edit backend config to return only 0,3,4.th.ooni.org: Th full migrate backend#840
  • Drop 1-2.th.ooni.org from prod/dns_records.tf and have them point to aws_alb.oonith_service: Point 1,2.th.ooni.org to the AWS instances #52
  • Monitor failure rate for 1-2.th.ooni.org
  • Delete all test helper related hosts from digital ocean
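The checklist changes which hostnames the check-in returns and references a change titled "Random sort of test helper addresses". A minimal sketch of that idea follows, assuming the backend holds a plain list of helper hostnames. The function name `randomized_helpers` is hypothetical and this is not the backend's actual code.

```python
import random


def randomized_helpers(helpers, rng=random):
    """Return a shuffled copy so no single helper is always listed first."""
    shuffled = list(helpers)  # copy: leave the configured list untouched
    rng.shuffle(shuffled)
    return shuffled


# Hostnames from the checklist step "return only 1,2,3,4.th.ooni.org".
HELPERS = ["1.th.ooni.org", "2.th.ooni.org", "3.th.ooni.org", "4.th.ooni.org"]

# A seeded generator keeps this example repeatable; production code
# would use the default generator.
order = randomized_helpers(HELPERS, random.Random(0))
print(order)
```

Removing a hostname from `HELPERS` is what drains traffic from it before its DNS record is changed.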
@hellais hellais self-assigned this Apr 22, 2024
@hellais hellais added the technical task technical tasks e.g. deployment label Apr 23, 2024
@hellais
Member Author

hellais commented Apr 24, 2024

The charts for the migration show that it worked well without any major issue.

The jumps in the chart were caused by two incidents during the migration:

  1. When we flipped 3.th, we hadn't dropped it from the returned addresses, so there were about 15 minutes of unavailability because the flip took some time to complete.
  2. We learned from that and flipped 0.th only after dropping it from the rotation and ensuring its traffic had dropped to near zero (Random sort of test helper addresses backend#838). However, a bug in the availability zone mapping led to 2 minutes of downtime: Fix bug in availability zone mapping #49.
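The lesson from the two incidents can be expressed as a pre-flip check: a hostname should be out of the returned addresses and its traffic near zero before its DNS record is changed. The sketch below is one way to encode that rule. The function name, the 5% threshold, and the request counts are all hypothetical and not taken from the OONI tooling.

```python
def safe_to_flip(host, returned_hosts, recent_counts, baseline, max_fraction=0.05):
    """Return True when `host` is drained enough to change its DNS record.

    recent_counts: requests per interval observed for `host` after the drain.
    baseline: typical requests per interval before the drain.
    """
    if host in returned_hosts:
        # Incident 1: the host was flipped while still being returned.
        return False
    if not recent_counts or baseline <= 0:
        # Without data we cannot conclude that traffic has drained.
        return False
    return max(recent_counts) <= baseline * max_fraction


rotation = ["1.th.ooni.org", "2.th.ooni.org", "3.th.ooni.org", "4.th.ooni.org"]

# Illustrative numbers only.
still_returned = safe_to_flip("3.th.ooni.org", rotation, [980, 1010, 995], baseline=1000)
still_draining = safe_to_flip("0.th.ooni.org", rotation, [400, 120, 30], baseline=1000)
drained = safe_to_flip("0.th.ooni.org", rotation, [12, 8, 3], baseline=1000)
print(still_returned, still_draining, drained)
```

A check like this would not have caught the availability zone mapping bug from the second incident, which was a configuration error rather than a sequencing one.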

Below I share the latest charts of the failure rates and measurement counts for the historical record:
[Six chart images: visualization (37) through visualization (42)]

I am now going to move forward with the final step, which is destroying the 2 remaining hosts on DigitalOcean.

@hellais
Member Author

hellais commented Apr 24, 2024

The droplets have been deleted on DigitalOcean.

@hellais hellais closed this as completed Apr 24, 2024