
[get.jenkins.io, azure.updates.jenkins.io] MaxMind GeoIP Rate Limit hit when redeploying/upgrading mirrorbits chart #4240

Closed
dduportal opened this issue Aug 14, 2024 · 10 comments

Comments

@dduportal
Contributor

Service(s)

get.jenkins.io, mirrors.jenkins.io, Other

Summary

We have recently been hit by the MaxMind GeoIP API rate limit.

Recently, @timja received alerts about this on his own account (which, we realized, we were using in production; fixed in #4195).

We keep receiving these alerts by email every day whenever we perform more than 2 deployments per day of mirrorbits (either get.jenkins.io or the new azure.updates.jenkins.io Update Center system).

These rate limits are blocking our production mirrorbits instances and threaten the service with outages.


The root cause is located in the "GeoIP" companion containers running in each mirrorbits pod:

  • The GeoIP init container is expected to initialize the GeoIP database directory at startup. If it fails (which happens when we hit the API rate limit), the Pod fails to start with the InitError state. If it succeeds, it downloads the database once, stops, and then the other pod containers start.
  • The GeoIP side container then starts, in parallel with the mirrorbits container. If it fails (which happens when we hit the API rate limit), the Pod fails to start with the Error state.
  • We set up GeoIP to update every 24 hours, while the MaxMind documentation recommends twice a week.
  • Also, each mirrorbits pod (we are running 3 of these) has its own GeoIP database living in an emptyDir (ref. https://github.com/jenkins-infra/helm-charts/blob/mirrorbits-2.4.5/charts/mirrorbits/templates/deployment.yaml#L115-L116) => It means we do not keep the data when a pod is upgraded or restarted on another node, and we do not share the data between the mirrorbits instances: it adds a 4x constraint on the update frequency.

Ref. https://support.maxmind.com/hc/en-us/articles/4408216129947-Download-and-Update-Databases
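For illustration, the problematic per-pod layout described above looks roughly like this (a trimmed sketch; container names, the image location, and values are illustrative, not the chart's exact content):

```yaml
# Sketch of the current per-pod GeoIP setup (illustrative, fields trimmed).
# Each of the 3 mirrorbits replicas carries its own copy of this, so every
# restart or reschedule re-downloads the database from MaxMind.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mirrorbits
spec:
  replicas: 3
  template:
    spec:
      initContainers:
        - name: geoipupdate-init   # one-shot download; InitError on rate limit
          image: ghcr.io/maxmind/geoipupdate   # assumed image location
          volumeMounts:
            - name: geoip-data
              mountPath: /usr/share/GeoIP
      containers:
        - name: geoipupdate        # sidecar refreshing every 24h; Error on rate limit
          image: ghcr.io/maxmind/geoipupdate
          volumeMounts:
            - name: geoip-data
              mountPath: /usr/share/GeoIP
        - name: mirrorbits
          image: mirrorbits        # placeholder image reference
          volumeMounts:
            - name: geoip-data
              mountPath: /usr/share/GeoIP
              readOnly: true
      volumes:
        - name: geoip-data
          emptyDir: {}             # data lost on every pod restart/reschedule
```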

Reproduction steps

No response

@dduportal
Contributor Author

The idea would be to have a persistent volume to store the GeoIP data and to remove the init + side containers.

=> Shared between the mirrorbits instances, it avoids duplicating the downloads, and we keep the data instead of downloading it from MaxMind on each pod restart/re-creation.

Sharing a database such as this one between pods means we have to mount it as read-only in mirrorbits to avoid any write attempt.

=> It's already the case for the emptyDir, but we should also set up the PV/PVC as ReadOnlyMany.

It means we need a way to populate and update the PV data content: the GeoIP side container should not run duplicated in each instance.

=> We need to run it as a deployment separate from mirrorbits, with only 1 replica and with the PV mounted read+write. This separate deployment would take care of initializing and updating the database, replacing the init and side containers in our pods.
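A minimal sketch of that single-writer deployment, assuming illustrative resource and claim names (the real chart may differ):

```yaml
# Single-replica updater Deployment: the only writer of the shared volume,
# so MaxMind is contacted once per update instead of once per mirrorbits pod.
# All names are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geoipupdate
spec:
  replicas: 1                # exactly one writer for the shared database
  selector:
    matchLabels:
      app: geoipupdate
  template:
    metadata:
      labels:
        app: geoipupdate
    spec:
      containers:
        - name: geoipupdate
          image: ghcr.io/maxmind/geoipupdate   # assumed image location
          volumeMounts:
            - name: geoip-data
              mountPath: /usr/share/GeoIP      # mounted read+write here only
      volumes:
        - name: geoip-data
          persistentVolumeClaim:
            claimName: geoip-data              # illustrative claim name
```

The mirrorbits pods would then mount the same claim with `readOnly: true`, dropping their init and side containers entirely.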

@dduportal
Contributor Author

Two other challenges:

  • The lifecycle of the mirrorbits pods might be weird (restart loops) when starting from scratch (not a problem, but worth checking)
  • The cost, in our deployment, of the PV. We need to use Azurefile to ensure Read*Many, but either we use a Premium tier (which requires provisioning 100 GiB) or a Standard tier (paid per request).

@dduportal
Contributor Author

Proposal: let's start with a PV on non-Premium Azurefile and see how it behaves. If it costs too much, then we'll move to Premium.

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 16, 2024
Ref. jenkins-infra/helpdesk#4240

This PR adds a new PVC, statically provisioned on top of an Azurefile, to
ensure a `Read*Many` access mode.
The goal is to have a centralized data directory for the GeoIP database.

- Using a storage account reusable for the whole cluster, hence the
naming. Type is Storage v2, which means billing per request: only define file
shares with low workload!

Signed-off-by: Damien Duportal <[email protected]>
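A statically provisioned Azure Files PV/PVC pair of this kind might look roughly as follows (a sketch under assumptions: the share name, secret name, and sizes are placeholders; the real resources live in jenkins-infra/azure):

```yaml
# Static Azure Files provisioning sketch (placeholders throughout).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: geoip-data
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany            # Azure Files supports Read*Many access modes
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: file.csi.azure.com
    volumeHandle: geoip-data   # must be unique within the cluster
    volumeAttributes:
      shareName: geoip-data    # placeholder file share name
    nodeStageSecretRef:
      name: azure-storage-account   # placeholder secret (account name/key)
      namespace: default
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geoip-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # empty class => bind to the static PV above
  volumeName: geoip-data
  resources:
    requests:
      storage: 1Gi
```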
@dduportal
Contributor Author

Update:

=> Tested manually and it worked (for populating the data). We still need to validate the mirrorbits 4.x chart once installed.

@dduportal
Contributor Author

=> Manual test on updates.jenkins.io did work 👍 Let's roll!

@dduportal
Contributor Author

Update: let's roll for updates.jenkins.io first: jenkins-infra/kubernetes-management#5565

@dduportal
Contributor Author

This caused #4261 due to the PVC errors:

The geoipupdate pod had been stuck in CrashLoopBackOff since yesterday's cluster upgrade #4161, but was also failing every 72 hours when trying to update the database.

The database files were stuck with SMB file handles in delete/concurrent-write state 😡. Visible in Azure Storage Explorer:

(Screenshots from Azure Storage Explorer, 2024-08-24)

and on the geoipdata Linux container with weird errors such as cp: can't create '/usr/share/GeoIP/GeoLite2-City.mmdb': No such file or directory

@dduportal dduportal removed this from the infra-team-sync-2024-08-20 milestone Aug 24, 2024
dduportal pushed a commit to jenkins-infra/azure that referenced this issue Sep 23, 2024
@dduportal
Contributor Author

Closing as per #4278 (comment)

  • We now have a cron job running the update with a large enough interval to avoid triggering the rate limit
  • The database has been updated
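That cron-based replacement could be sketched as a Kubernetes CronJob (schedule, names, and the license secret are illustrative assumptions; MaxMind recommends updating no more than twice a week):

```yaml
# Sketch of a twice-weekly GeoIP update job writing to the shared PVC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: geoipupdate
spec:
  schedule: "0 3 * * 2,5"     # Tuesdays and Fridays at 03:00, well under the rate limit
  concurrencyPolicy: Forbid   # never two updates writing the share at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: geoipupdate
              image: ghcr.io/maxmind/geoipupdate   # assumed image location
              envFrom:
                - secretRef:
                    name: maxmind-license   # placeholder: account ID + license key
              volumeMounts:
                - name: geoip-data
                  mountPath: /usr/share/GeoIP
          volumes:
            - name: geoip-data
              persistentVolumeClaim:
                claimName: geoip-data       # the shared Azure Files PVC
```

A failed run then surfaces as a failed Job rather than a crash-looping production pod, which matches the outcome described above.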
