[Update Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs
(aka. corrupted files some time to time)
#4402
It looks like the most probable culprit would be
Alas switching to We did a design mistake: => it means we might not need to revamp the |
Update: working on the jenkins-infra/docker-rsyncd Docker image to support rsync over SSH instead of rsyncd, so that connections are encrypted (rsyncd is not) and secured (key-based authentication instead of a plain-text user/password...). Then we'll install a new rsyncd helm release for the httpd, which will mount the Azure file share in R/W and be exposed through an Azure PLS. Ref initial work:
|
Disabled the rollout restart cronjob to avoid unmount/remount problems with disk access. |
#878) Related to jenkins-infra/helpdesk#4402 The new RsyncD service aimed at replacing the `azcopy` service will have to write data, so we want a new `RWX` persistent volume. This new PV/PVC will also replace the 2 existing (long term) with sub-directories (so we will provision only 1 time a premium 100 Gb instead of 2 today). Signed-off-by: Damien Duportal <[email protected]>
…tes-jenkins-io-data' (#879) Related to jenkins-infra/helpdesk#4402 Requires jenkins-infra/kubernetes-management#5933 Signed-off-by: Damien Duportal <[email protected]>
Update:
Test is successful from the trusted.ci permanent agent:
$ rsync -av --rsh='ssh -i .ssh/id_new_uc' [email protected]:/updates-jenkins-io-data # /update-jenkins-io-data/
receiving incremental file list
drwxrwxrwx 0 2024/11/25 09:36:02 updates-jenkins-io-data
sent 25 bytes received 82 bytes 71.33 bytes/sec
total size is 0 speedup is 0.00
=> proceeding to update the ZIP credentials on trusted.ci.jenkins.io (needs a message in Matrix + a careful check of the next update_center2 run) |
Update: tried to add the new
|
I wonder if the PVC should not be changed to an NFS one instead of this awful SMB |
Update:
|
Ow yeah, NFS + rsync works like a charm \o/ Next steps: send PR to configure all of this with IaC |
…s PV/PVC (#887) Related to jenkins-infra/helpdesk#4402 Tested manually with success! --------- Signed-off-by: Damien Duportal <[email protected]>
Update:
|
Update:
|
Update:
=> crawler blocks everything as httpd shows HTTP/404 for https://updates.jenkins.io/updates/. Fix needs:
=> tests on trusted.ci with a replay of the crawler job (curl-ing the temp sh script) show that we need to set up the private link + DNS in the subnet of the ephemeral VMs, which is blocking everything. WiP on this as top priority |
…#892) Ref. jenkins-infra/helpdesk#4402 (comment) The goal is to fix the error `ssh: connect to host updates.jenkins.io-data.trusted.ci.jenkins.io port 22: Connection timed out` --------- Signed-off-by: Damien Duportal <[email protected]>
Update:
|
Update: we've fully switched to the NFS filesystem with success.
Both jenkins-infra/crawler#155 and jenkins-infra/update-center2#830 have been merged and we see file updates.
Both https://updates.jenkins.io/updates/ and https://updates.jenkins.io/current/updates/ are now HTTP/200 (however, https://updates.jenkins.io/current/updates/ links are HTTP/404 now since they do not exist on mirrorbits...) Next steps:
|
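A quick way to re-check the two URLs mentioned above is to ask `curl` for the status code only. This is a minimal sketch, not the check the team actually ran: the function names `http_status` and `report_status` and the overridable `CURL` variable are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: print "<status> <url>" for each URL given as an argument.
# CURL can be overridden (e.g. with a stub); it defaults to the real curl binary.
CURL="${CURL:-curl}"

http_status() {
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # Redirects are reported as-is (3xx); add -L to follow them instead.
  "$CURL" -s -o /dev/null -w '%{http_code}' "$1"
}

report_status() {
  local url
  for url in "$@"; do
    printf '%s %s\n' "$(http_status "$url")" "$url"
  done
}
```

Possible usage: `report_status https://updates.jenkins.io/updates/ https://updates.jenkins.io/current/updates/`.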
The
=> we need to migrate this last service before being able to clean up the PVC. The
|
…d adapt PV/PVC namings (#894) Related to jenkins-infra/helpdesk#4402 Both PVCs `updates-jenkins-io` and `updates-jenkins-io-redirects` are not used anymore. Also adapting resource and outputs namings. Should resolve jenkins-infra/kubernetes-management#5962 and jenkins-infra/kubernetes-management#5969 Signed-off-by: Damien Duportal <[email protected]>
Update: Done:
Impacts: The number of errored requests to the storage account drastically decreased by switching to NFS:
=> httpd containers have been using a bit more CPU (~30 mcore to ~65 mcore) since the change. Other than that, we haven't seen any visible resource usage change: switching to NFS keeps the same performance
=> The httpd error rate (unexpected HTTP/404) is low and stays low! We'll have to wait a few days for confirmation though |
Update: credentials for Builds triggered for both update_center2 and crawler to validate |
Update: we cleaned up the old resources - jenkins-infra/azure#895
|
#2649 introduced a new Update Center system built on top of mirrorbits + httpd.
During the brownouts, and since the production deployment, we see a few HTTP/500 errors in the httpd logs:
error.log, where the `xxx` of the `Invalid command 'xxx'` part is any of the htaccess keywords, but truncated.
=> of course, if we look at the `azcopy` mechanism we use to fill the shared volume, we see the following:
We have to find a way to get rid of these errors
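To gauge how often these errors occur and which keywords get truncated, the `Invalid command '...'` entries can be tallied from the httpd error log. The following is a rough sketch under assumptions: the function name `tally_invalid_commands` is made up for illustration, and it only relies on the `Invalid command 'xxx'` fragment quoted above, not on the rest of the log line format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally the (possibly truncated) directive names that httpd rejected.
# Reads an error log on stdin and prints "<count> <directive>" lines, most frequent first.
tally_invalid_commands() {
  # Keep only the "Invalid command '...'" fragment of each matching line,
  # strip everything but the quoted name, then count identical names.
  grep -o "Invalid command '[^']*'" \
    | sed "s/^Invalid command '//; s/'\$//" \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Possible usage, assuming the log is available locally as `error.log`: `tally_invalid_commands < error.log`.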