[Update Center] "eventually consistent" Azure Shared File storage for HTTPD `htdocs` (aka. corrupted files some time to time) #4402

dduportal · 2024-11-19T15:44:31Z

#2649 introduced a new Update Center system built on top of mirrorbits + httpd.

During the brownouts, and since the production deployment, we see a few HTTP/500 errors in the httpd logs:

In the access logs, we have an HTTP/500 on a URL expected to work:

10.100.0.4 - - [19/Nov/2024:15:18:44 +0000] "GET /updates/hudson.plugins.sonar.SonarRunnerInstaller.json.html HTTP/1.1" 500 548

It correlates with a line like this in the error.log, where the xxx in the Invalid command 'xxx' part is one of the .htaccess keywords, but truncated:

[Tue Nov 19 15:18:44.031172 2024] [core:alert] [pid 10:tid 43] [client 10.100.0.4:46862] /usr/local/apache2/htdocs/.htaccess: Invalid command 'RewriteRul', perhaps misspelled or defined by a module not included in the server configuration

=> of course, if we look at the azcopy mechanism we use to fill the shared volume, we see the following:

WARN: AzCopy sync is supported but not fully recommended for Azure Files. AzCopy sync doesn't support differential copies at scale, and some file fidelity might be lost.

We have to find a way to get rid of these errors
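
The truncated-keyword symptom described above can be spotted automatically in the error log. The following is a minimal sketch, not something used in this issue: the `KNOWN_DIRECTIVES` list and the function name are assumptions made for illustration, and the regular expression only targets the message format quoted above.

```python
import re

# Matches the Apache error-log message quoted above:
#   <file>: Invalid command '<command>', perhaps misspelled ...
INVALID_CMD = re.compile(r"(?P<file>\S+): Invalid command '(?P<command>[^']+)'")

# Assumption: directives the .htaccess is expected to contain (illustrative list).
KNOWN_DIRECTIVES = ("RewriteEngine", "RewriteCond", "RewriteRule", "Redirect")


def find_truncated_directive(log_line, known=KNOWN_DIRECTIVES):
    """Return (file, command, candidates) when the invalid command is a
    strict prefix of a known directive, else None."""
    match = INVALID_CMD.search(log_line)
    if match is None:
        return None
    command = match.group("command")
    # A truncated keyword is a prefix of a directive without being equal to it.
    candidates = [d for d in known if d.startswith(command) and d != command]
    if not candidates:
        return None
    return match.group("file"), command, candidates
```

A hit means httpd read a partially written `.htaccess`, which points at the file sync rather than at the httpd configuration.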


dduportal · 2024-11-19T16:30:44Z

It looks like the most probable culprit is azcopy sync, which corrupts the files.

Alas, switching to azcopy cp (https://learn.microsoft.com/en-us/azure/storage/common/storage-ref-azcopy-copy) does not allow deleting existing files.

We made a design mistake: rsync clearly seems a better fit.

=> it means we might not need to revamp the httpd architecture (using a shared htdocs file share). Let's keep the shared file system for now, and add an rsync pod which mounts the filesystem in R/W and is restricted to trusted.ci only (network + auth).

dduportal · 2024-11-19T20:27:12Z

Update: working on the jenkins-infra/docker-rsyncd Docker image to support rsync over SSH instead of rsyncd, to ensure connections are encrypted (rsyncd is not) and secured (key-based authentication instead of plain-text user/password...).

Then we'll install a new rsyncd helm release for the httpd, which will mount the Azure file share in R/W and be exposed through an Azure PLS.

Ref initial work:

Support SVC annotations for rsyncd chart (for Azure PLS): feat(rsyncd) support annotations on Service helm-charts#1422
Bump mirrorbits-parent subchart versions helm-charts#1425 -> Bump rsyncd docker images and helm chart versions helm-charts#1424 -> Docker image for rsyncd: bump dependencies (required for ssh support) and updatecli cleanup

smerle33 · 2024-11-21T16:50:24Z

disabled the rollout restart cronjob to avoid umount/remount problem with disk access.

#878) Related to jenkins-infra/helpdesk#4402 The new RsyncD service aimed at replacing the `azcopy` service will have to write data, so we want a new `RWX` persistent volume. This new PV/PVC will also replace the 2 existing (long term) with sub-directories (so we will provision only 1 time a premium 100 Gb instead of 2 today). Signed-off-by: Damien Duportal <[email protected]>

…tes-jenkins-io-data' (#879) Related to jenkins-infra/helpdesk#4402 Requires jenkins-infra/kubernetes-management#5933 Signed-off-by: Damien Duportal <[email protected]>

dduportal · 2024-11-25T16:20:18Z

Update:

A new rsyncd service has been created, only reachable privately or from trusted.ci permanent agent
- RsyncD Docker Image, Helm Chart:
  - feat!: add SSHD support as alternative to the default rsyncd server docker-rsyncd#24
  - feat!(rsyncd): add sshd as an alternative Rsync Daemon to rsyncd helm-charts#1434
- New PVC: feat(updates.jenkins.io) add a new PV/PVC in RWX to allow rsync write… azure#878
- Deployment: feat(publick8s) add a new rsyncd-data internal release for updates.jenkins.io kubernetes-management#5933
- Azure PLS/Private Endpoint (for trusted.ci access):
  - feat(trusted.ci) add an Azure Private Endpoint to reach the PLS 'updates-jenkins-io-data' azure#879
- SOPS updates:
  - Adding the SSH private key for this new service: https://github.com/jenkins-infra/charts-secrets/commit/5ce8882cadb3bf86745cca8148194eb9920a5415
  - Updated the trusted.ci credentials ZIP generation: https://github.com/jenkins-infra/charts-secrets/commit/3692e20fda960d6ba260f0361ba88d73f857dbf8

Test is successful from the trusted.ci permanent agent:

$ rsync -av --rsh='ssh -i .ssh/id_new_uc' [email protected]:/updates-jenkins-io-data # /update-jenkins-io-data/
receiving incremental file list
drwxrwxrwx              0 2024/11/25 09:36:02 updates-jenkins-io-data

sent 25 bytes  received 82 bytes  71.33 bytes/sec
total size is 0  speedup is 0.00

=> proceeding to update the ZIP credentials on trusted.ci.jenkins.io (needs a message in matrix + a careful check of the next update_center2 run)

dduportal · 2024-11-25T17:36:52Z

Update: tried to add the new rsync targets with jenkins-infra/update-center2#825 but had to roll back (see jenkins-infra/update-center2#826) due to rsync errors, which are most probably related to the combination of an unprivileged container and a non-POSIX file system (using SMB):

Change group errors:

rsync: [generator] chgrp "/updates-jenkins-io-data/content/." failed: Operation not permitted (1)

Set time errors (on inodes):

rsync: [generator] failed to set times on "/updates-jenkins-io-data/content/.": Operation not permitted (1)

and:

rsync: [receiver] mkstemp "/updates-jenkins-io-data/content/dynamic-stable-2.452.1/.update-center.actual.json.aKVdCl" failed: Operation not permitted (1)
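
To see at a glance which operations fail on the SMB share, the rsync error lines above can be grouped by process role and operation. This is a minimal sketch under assumptions: the function names and labels are made up for illustration, and only the three message shapes quoted above are recognized.

```python
import re
from collections import Counter

# rsync error lines quoted above start with "rsync: [<role>] <message>".
RSYNC_ERROR = re.compile(r"^rsync: \[(?P<role>\w+)\] (?P<rest>.*)$")

# Message prefixes seen in this issue, mapped to illustrative labels.
OPERATIONS = (
    ("chgrp ", "change group"),
    ("failed to set times on ", "set times"),
    ("mkstemp ", "create temporary file"),
)


def classify_rsync_error(line):
    """Return (role, operation) for an rsync error line, or None otherwise."""
    match = RSYNC_ERROR.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    for prefix, label in OPERATIONS:
        if rest.startswith(prefix):
            return match.group("role"), label
    return match.group("role"), "other"


def summarize(lines):
    """Count rsync errors per (role, operation), ignoring non-error lines."""
    classified = (classify_rsync_error(line) for line in lines)
    return Counter(item for item in classified if item is not None)
```

Seeing all three categories fail with "Operation not permitted" is consistent with a file system that rejects POSIX metadata operations rather than with a single bad path.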

dduportal · 2024-11-25T17:37:12Z

I wonder if the PVC should not be changed to an NFS one instead of this awful SMB

dduportal · 2024-11-26T18:29:24Z

Update:

These errors are due to the SMB file system:
- Cannot reproduce on a local k3s (same rsyncd release, same dataset, same rsync version, same options)
- Can reproduce against the production rsyncd
- The rsync command works as expected without the following 3 rsync flags: --chown, --perms and --times
Since this is a brand new file share, let's try with NFS.
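
The diagnostic step of retrying without the metadata flags can be sketched as a small helper that filters an rsync argument list. The real command line used by update-center2 is not shown in this issue, so the function name and any argument list passed to it are hypothetical.

```python
# The three flags named above; they ask rsync to preserve or set
# ownership, permissions and modification times on the destination.
METADATA_FLAGS = {"--chown", "--perms", "--times"}


def without_metadata_flags(argv):
    """Return a copy of argv without the metadata-related flags.

    Only the "--flag" and "--flag=value" spellings are handled; a value
    passed as a separate argument would be left in place.
    """
    return [arg for arg in argv if arg.split("=", 1)[0] not in METADATA_FLAGS]
```

For example, a hypothetical `["rsync", "--recursive", "--perms", "--times", "--chown=www:www", "src/", "dest/"]` is reduced to `["rsync", "--recursive", "src/", "dest/"]`. This only helps to isolate the failure; it is not a fix, since the published files would lose their ownership and timestamps.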

dduportal · 2024-11-26T19:01:00Z

Ow yeah, NFS + rsync works like a charm \o/ Next steps: send PR to configure all of this with IaC

…s PV/PVC (#887) Related to jenkins-infra/helpdesk#4402 Tested manually with success! --------- Signed-off-by: Damien Duportal <[email protected]>

dduportal · 2024-11-28T14:20:10Z

Update:

- Rsync is up and running
- update center 2 is copying to the new rsync destinations with success
- Let's roll for httpd on NFS: feat(publick8s/updates.jenkins.io) use the new NFS PVC with subdir mounts for HTTPD kubernetes-management#5961

dduportal · 2024-11-28T16:26:53Z

Update:

httpd is now using the new NFS PVC.
Next candidates:
- crawler, which needs to populate the NFS data over rsync
- mirrorbits to use NFS

dduportal · 2024-11-28T17:28:42Z

Update:

PRs to finalize removing azcopy on both crawler and update-center2 are ready but set as draft:
- chore(publish) switch azcopy to rsync tasks crawler#155
- chore(wrappers/publish) remove all azcopy/azsync tasks (replaced by rsync on NFS) as it is unreliable update-center2#830
The PR to switch mirrorbits to NFS is ready (in draft): feat(publick8s/updates.jenkins.io) switch mirrorbits instances to the new NFS PVC kubernetes-management#5964

=> crawler blocks everything as httpd shows HTTP/404 for https://updates.jenkins.io/updates/. The fix needs:

- crawler to publish with rsync (instead of azcopy)
- The symlink (easy with rsync on update center2)

=> tests on trusted.ci with a replay of the crawler job (curl-ing the temp sh script) show that we need to set up the private link + DNS in the subnet of the ephemeral VMs, which blocks everything. WiP on this as top priority.

…#892) Ref. jenkins-infra/helpdesk#4402 (comment) The goal is to fix the error `ssh: connect to host updates.jenkins.io-data.trusted.ci.jenkins.io port 22: Connection timed out` --------- Signed-off-by: Damien Duportal <[email protected]>

dduportal · 2024-11-29T06:56:38Z

Update:

Private endpoint added for ephemeral agents:
- feat(trusted.ci) set up PLS and PEs for agents azure#891
- fix(trusted.ci/ephemeral-agents) add NSG rules to allow access the PE azure#892
Crawler is now tested with success in chore(publish) switch azcopy to rsync tasks crawler#155
- Initial manual replay worked very well: https://updates.jenkins.io/updates/ is not HTTP/404 anymore

dduportal · 2024-11-29T07:35:42Z

Update: we've fully switched to the NFS filesystem with success

- Both jenkins-infra/crawler#155 and jenkins-infra/update-center2#830 have been merged and we see file updates

- feat(publick8s/updates.jenkins.io) switch mirrorbits instances to the new NFS PVC kubernetes-management#5964 has been deployed

Both https://updates.jenkins.io/updates/ and https://updates.jenkins.io/current/updates/ are now HTTP/200 (however, the links on https://updates.jenkins.io/current/updates/ are HTTP/404 for now, since they do not exist on mirrorbits...)

Next steps:

- Remove the 2 former PVCs and PV (data is retained) from publick8s to ensure we don't use them anymore in the cluster
- Remove azcopy credentials from trusted.ci
- Remove SP (and other AzureAD resources) associated to these credentials in jenkins-infra/azure
- Remove the file shares

dduportal · 2024-11-29T08:58:54Z

The updates-jenkins-io PVC is still used by the "dummy" rsync server (used to mimic Cloudflare data for mirrorbits scan):

$ kubectl -n updates-jenkins-io describe pvc updates-jenkins-io                      
Name:          updates-jenkins-io
Namespace:     updates-jenkins-io
StorageClass:  statically-provisioned
Status:        Bound
Volume:        updates-jenkins-io
# ...
Used By:       updates-jenkins-io-rsync-rsyncd-745644d955-jh8sv

=> we need to migrate this last service before being able to clean up the PVC.

The updates-jenkins-io-redirects PVC is not used anymore:

Name:          updates-jenkins-io-redirects
Namespace:     updates-jenkins-io
#...
Used By:       <none>

…d adapt PV/PVC namings (#894) Related to jenkins-infra/helpdesk#4402 Both PVCs `updates-jenkins-io` and `updates-jenkins-io-redirects` are not used anymore. Also adapting resource and outputs namings. Should resolve jenkins-infra/kubernetes-management#5962 and jenkins-infra/kubernetes-management#5969 Signed-off-by: Damien Duportal <[email protected]>

dduportal · 2024-11-29T13:30:39Z

Update:

Done:

"Dummy-S3" rsync service has been migrated to NFS:
- feat(rsyncd) allow specifying a sub directory for each component to mount (instead of the root) helm-charts#1458
- feat(publick8s/updates.jenkins.io) switch dummy-s3 rsync to NFS volume kubernetes-management#5967
The 2 PVs and PVCs have been removed: cleanup(publick8s + updates.jenkins.io) Removed unused SMB volumes and adapt PV/PVC namings azure#894
The rsyncd-data service has been given more resources, to decrease the update_center2 execution time: fix(updates.jenkins.io) add resources to rsyncd-data kubernetes-management#5970
/current/latest endpoint is now answering with a directory listing: symlink support on NFS is great!
- New monitoring endpoints have been added on datadog: feat(synthetics_updatecenter) add more endpoints to monitor datadog#277 (including /current/latest)

Impacts:

The number of errored requests to the storage account drastically decreased after switching to NFS.

=> httpd containers have been using a bit more CPU (~30 mcore to ~65 mcore) since the change. Other than that, we haven't seen any visible change in resource usage: switching to NFS keeps the same performance.

=> The httpd error rate (unexpected HTTP/404) is low and stays low! We'll have to wait a few days for confirmation though.

dduportal · 2024-11-29T13:37:29Z

On agent-1 on trusted.ci (where the Update Center JSON metadata is generated), we are back to the (new) "nominal" average build time (around 5:30, less than 6 min), thanks to the Cloudflare R2 errors stopping today (high latency and a lot of HTTP/500 preventing mirrors from being updated), adding more resources to rsync, and the cleanup of azcopy and SMB.

The agent now has a lower load average and less CPU usage.

dduportal · 2024-11-29T14:00:44Z

Update: credentials for azcopy removed from trusted.ci (and from the UC ZIP generator).

Builds triggered for both update_center2 and crawler to validate.

dduportal · 2024-11-29T15:10:02Z

Update: we cleaned up the old resources - jenkins-infra/azure#895

No more file shares with SMB only content
No more Azcopy-only SPs with expiration

dduportal mentioned this issue Nov 19, 2024

[INFRA-3100] Migrate updates.jenkins.io to another Cloud #2649

Open

dduportal added the updateCenter label Nov 19, 2024

dduportal self-assigned this Nov 19, 2024

dduportal added this to the infra-team-sync-2024-11-26 milestone Nov 19, 2024

dduportal changed the title ~~[Upodate Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs (aka. corrupted files some time to time)~~ [Update Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs (aka. corrupted files some time to time) Nov 19, 2024

dduportal mentioned this issue Nov 19, 2024

feat(rsyncd) support annotations on Service jenkins-infra/helm-charts#1422

Merged

smerle33 mentioned this issue Nov 21, 2024

feat(update.jio) disable httpd Restart cronjob jenkins-infra/kubernetes-management#5921

Merged

This was referenced Nov 25, 2024

feat(publick8s) add a new rsyncd-data internal release for updates.jenkins.io jenkins-infra/kubernetes-management#5933

Merged

feat(trusted.ci) add an Azure Private Endpoint to reach the PLS 'updates-jenkins-io-data' jenkins-infra/azure#879

Merged

dduportal mentioned this issue Nov 25, 2024

chore(wrappers/publish) add 3 new rsync targets on the new data service jenkins-infra/update-center2#825

Merged

dduportal modified the milestones: infra-team-sync-2024-11-26, infra-team-sync-2024-12-03 Nov 26, 2024

dduportal mentioned this issue Nov 27, 2024

feat(updates.jio) add a new NFS storage share and associated publick8s PV/PVC jenkins-infra/azure#887

Merged

This was referenced Nov 27, 2024

feat(publick8s/updates.jenkins.io) restore the rsyncd-data service jenkins-infra/kubernetes-management#5945

Merged

feat(trusted.ci.jenkins.io) re enable updates.jio data Private Link jenkins-infra/azure#889

Merged

dduportal mentioned this issue Nov 28, 2024

Revert "Revert "chore(wrappers/publish) add 3 new rsync targets on the new data service"" jenkins-infra/update-center2#829

Merged

dduportal mentioned this issue Nov 28, 2024

fix(mirrorbits) correct volume mount subpath for repository jenkins-infra/helm-charts#1456

Merged

This was referenced Nov 28, 2024

feat(trusted.ci) set up PLS and PEs for agents jenkins-infra/azure#891

Merged

fix(trusted.ci/ephemeral-agents) add NSG rules to allow access the PE jenkins-infra/azure#892

Merged

dduportal mentioned this issue Nov 29, 2024

feat(synthetics_updatecenter) add more endpoints to monitor jenkins-infra/datadog#277

Merged

This was referenced Nov 29, 2024

HTTP errors on multiple sites #4428

Closed

Wrong update-center returned for 2.452.4 #4427

Closed

dduportal closed this as completed Nov 29, 2024
