[Update Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs (aka. corrupted files some time to time) #4402

Closed
dduportal opened this issue Nov 19, 2024 · 20 comments
@dduportal
Contributor

#2649 introduced a new Update Center system built on top of mirrorbits + httpd.

During the brownouts, and since the production deployment, we see a few HTTP/500 errors in the httpd logs:

  • In the access logs, we have an HTTP/500 on a URL expected to work:
10.100.0.4 - - [19/Nov/2024:15:18:44 +0000] "GET /updates/hudson.plugins.sonar.SonarRunnerInstaller.json.html HTTP/1.1" 500 548
  • It correlates with a line like this in the error.log, where the xxx in Invalid command 'xxx' is a truncated .htaccess directive (here RewriteRul instead of RewriteRule):
[Tue Nov 19 15:18:44.031172 2024] [core:alert] [pid 10:tid 43] [client 10.100.0.4:46862] /usr/local/apache2/htdocs/.htaccess: Invalid command 'RewriteRul', perhaps misspelled or defined by a module not included in the server configuration

=> of course, if we look at the azcopy mechanism we use to fill the shared volume, we see the following:

WARN: AzCopy sync is supported but not fully recommended for Azure Files. AzCopy sync doesn't support differential copies at scale, and some file fidelity might be lost.

We have to find a way to get rid of these errors
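
To spot these truncated reads in the logs, a minimal Python sketch could look like the one below. It is not part of the actual setup: the function name and the directive list are assumptions for illustration, and the list should be replaced by whatever the real .htaccess contains.

```python
import re

# Hypothetical list of directives used in the .htaccess file (assumption:
# adjust it to the directives that really appear in the file).
KNOWN_DIRECTIVES = {"RewriteEngine", "RewriteRule", "RewriteCond", "Redirect", "RedirectMatch"}

# Matches the httpd error message quoted in the issue.
INVALID_CMD = re.compile(r"Invalid command '([^']+)'")


def truncated_directive(log_line):
    """Return the known directive that the invalid command is a strict prefix of, else None."""
    match = INVALID_CMD.search(log_line)
    if not match:
        return None
    word = match.group(1)
    for directive in sorted(KNOWN_DIRECTIVES):
        # A strict prefix suggests httpd read a partially written file.
        if directive != word and directive.startswith(word):
            return directive
    return None


if __name__ == "__main__":
    sample = (
        "[Tue Nov 19 15:18:44.031172 2024] [core:alert] [pid 10:tid 43] "
        "[client 10.100.0.4:46862] /usr/local/apache2/htdocs/.htaccess: "
        "Invalid command 'RewriteRul', perhaps misspelled or defined by a "
        "module not included in the server configuration"
    )
    print(truncated_directive(sample))
```

An invalid command that is a strict prefix of a real directive points to a partially written or partially read file rather than a genuine typo in the .htaccess.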

@dduportal
Contributor Author

It looks like the most probable culprit would be azcopy sync which corrupts the files:

Alas, switching to azcopy cp (https://learn.microsoft.com/en-us/azure/storage/common/storage-ref-azcopy-copy) does not allow deleting files that already exist on the destination.

We made a design mistake: rsync clearly seems a better fit.

=> it means we might not need to revamp the httpd architecture (using a shared htdocs file share). Let's keep the shared file system for now, and add an rsync pod which mounts the filesystem in R/W and is restricted to trusted.ci only (network + auth).

@dduportal dduportal self-assigned this Nov 19, 2024
@dduportal dduportal added this to the infra-team-sync-2024-11-26 milestone Nov 19, 2024
@dduportal dduportal changed the title [Upodate Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs (aka. corrupted files some time to time) [Update Center] "eventually consistent" Azure Shared File storage for HTTPD htdocs (aka. corrupted files some time to time) Nov 19, 2024
@dduportal
Contributor Author

Update: working on the jenkins-infra/docker-rsyncd Docker image to support rsync over SSH instead of rsyncd, so that connections are encrypted (rsyncd is not) and secured (key-based authentication instead of a plain-text user/password...).

Then we'll install a new rsyncd helm release for the httpd, which will mount the azure file in R/W and is exposed through an Azure PLS

Ref initial work:

@smerle33
Contributor

disabled the rollout restart cronjob to avoid umount/remount problem with disk access.

dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 25, 2024
#878)

Related to jenkins-infra/helpdesk#4402

The new RsyncD service aimed at replacing the `azcopy` service will have
to write data, so we want a new `RWX` persistent volume.

This new PV/PVC will also replace the 2 existing (long term) with
sub-directories (so we will provision only 1 time a premium 100 Gb
instead of 2 today).

Signed-off-by: Damien Duportal <[email protected]>
dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 25, 2024
@dduportal
Contributor Author

dduportal commented Nov 25, 2024

Update:

Test is successful from the trusted.ci permanent agent:

$ rsync -av --rsh='ssh -i .ssh/id_new_uc' [email protected]:/updates-jenkins-io-data # /update-jenkins-io-data/
receiving incremental file list
drwxrwxrwx              0 2024/11/25 09:36:02 updates-jenkins-io-data

sent 25 bytes  received 82 bytes  71.33 bytes/sec
total size is 0  speedup is 0.00

=> proceeding to update the ZIP credentials on trusted.ci.jenkins.io (needs a message in matrix + a careful check of the next update_center2 run)

@dduportal
Contributor Author

Update: tried to add the new rsync targets with jenkins-infra/update-center2#825 but had to roll back (see jenkins-infra/update-center2#826) due to rsync errors which are most probably related to the combination of unprivileged container and a non POSIX file system (using SMB):

  • Change group errors:
rsync: [generator] chgrp "/updates-jenkins-io-data/content/." failed: Operation not permitted (1)
  • Set time errors (on inodes):
rsync: [generator] failed to set times on "/updates-jenkins-io-data/content/.": Operation not permitted (1)
  • and:
rsync: [receiver] mkstemp "/updates-jenkins-io-data/content/dynamic-stable-2.452.1/.update-center.actual.json.aKVdCl" failed: Operation not permitted (1)

@dduportal
Contributor Author

I wonder if the PVC should not be changed to an NFS one instead of this awful SMB

@dduportal
Contributor Author

Update:

  • These errors are due to the SMB file system:
    • Cannot reproduce on a local k3s (same rsyncd release, same dataset, same rsync version, same options)
    • Can reproduce against the production rsyncd
    • The rsync command works as expected without the 3 following rsync flags: --chown, --perms and --times
  • Since this is a brand new file share, let's try with NFS
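
To illustrate the workaround found above, here is a hedged Python sketch that builds an rsync command line with or without the three metadata flags. The function name, the `www-data:www-data` owner, and the `--recursive --links --delete` base options are assumptions for illustration; the real options used by update_center2 are not shown in this thread.

```python
import shlex


def build_rsync_args(source, destination, ssh_key, preserve_metadata):
    """Build an rsync argv list; metadata flags are only added for POSIX-like targets."""
    args = [
        "rsync",
        "--recursive",
        "--links",
        "--delete",  # assumption: mirror deletions, which azcopy cp could not do
        f"--rsh=ssh -i {shlex.quote(ssh_key)}",
    ]
    if preserve_metadata:
        # These are the flags reported as failing with "Operation not permitted"
        # against the SMB-backed volume. The owner value is hypothetical.
        args += ["--perms", "--times", "--chown=www-data:www-data"]
    args += [source, destination]
    return args


if __name__ == "__main__":
    # Hypothetical paths and host, for illustration only.
    print(build_rsync_args("./www/", "user@host:/data/", ".ssh/id_new_uc", False))
```

Dropping `--times` has a cost: without preserved modification times, rsync's default quick check (size + mtime) finds fewer files unchanged and transfers more on each run, which is one more reason to prefer a file system where the flags work.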

@dduportal
Contributor Author

Ow yeah, NFS + rsync works like a charm \o/ Next steps: send PR to configure all of this with IaC

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

Update:

  • httpd is now using the new NFS PVC.
  • Next candidates
    • crawler, which needs to populate the NFS data over rsync
    • mirrorbits to use NFS

@dduportal
Contributor Author

Update:

=> crawler blocks everything as httpd shows HTTP/404 for https://updates.jenkins.io/updates/. Fix needs:

  • crawler to publish on rsync (instead of azcopy)
  • The symlink (easy with rsync on update center2)

=> tests on trusted.ci with a replay (curl-ing the temp sh script) of the crawler job show that we need to set up the private link + DNS in the subnet of the ephemeral VMs, which is blocking everything. WiP on this as top priority.

dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 29, 2024
…#892)

Ref.
jenkins-infra/helpdesk#4402 (comment)

The goal is to fix the error `ssh: connect to host
updates.jenkins.io-data.trusted.ci.jenkins.io port 22: Connection timed
out`

---------

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

dduportal commented Nov 29, 2024

Update: we've fully switched to the NFS filesystem with success

Both jenkins-infra/crawler#155 and jenkins-infra/update-center2#830 have been merged and we see file updates.

Both https://updates.jenkins.io/updates/ and https://updates.jenkins.io/current/updates/ are now HTTP/200 (however, the links listed on https://updates.jenkins.io/current/updates/ are HTTP/404 since they do not exist on mirrorbits...)

Next steps:

  • Remove the 2 former PVCs and PV (data is retained) from publick8s to ensure we don't use them anymore in the cluster
  • Remove azcopy credentials from trusted.ci
  • Remove SP (and other AzureAD resources) associated to these credentials in jenkins-infra/azure
  • Remove the file shares

@dduportal
Contributor Author

The updates-jenkins-io PVC is still used by the "dummy" rsync server (used to mimic Cloudflare data for mirrorbits scan):

$ kubectl -n updates-jenkins-io describe pvc updates-jenkins-io                      
Name:          updates-jenkins-io
Namespace:     updates-jenkins-io
StorageClass:  statically-provisioned
Status:        Bound
Volume:        updates-jenkins-io
# ...
Used By:       updates-jenkins-io-rsync-rsyncd-745644d955-jh8sv

=> we need to migrate this last service before being able to clean up the PVC.

The updates-jenkins-io-redirects PVC is not used anymore:

Name:          updates-jenkins-io-redirects
Namespace:     updates-jenkins-io
#...
Used By:       <none>
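
This "is the PVC still mounted?" check could be scripted before deleting anything. The Python sketch below parses the text output of `kubectl describe pvc` shown above; the function name is an assumption, and it only looks at the "Used By" field.

```python
def pods_using_pvc(describe_output):
    """Return the pod names listed under 'Used By:' in `kubectl describe pvc` output."""
    pods = []
    in_used_by = False
    for line in describe_output.splitlines():
        if line.startswith("Used By:"):
            in_used_by = True
            value = line[len("Used By:"):].strip()
        elif in_used_by and line[:1].isspace() and line.strip():
            # Additional pods are printed on indented continuation lines.
            value = line.strip()
        else:
            in_used_by = False
            continue
        if value and value != "<none>":
            pods.append(value)
    return pods


if __name__ == "__main__":
    sample = (
        "Name:          updates-jenkins-io-redirects\n"
        "Namespace:     updates-jenkins-io\n"
        "Used By:       <none>\n"
    )
    # An empty list means no pod mounts the PVC anymore.
    print(pods_using_pvc(sample))
```

An empty result corresponds to the `<none>` case above and suggests the PVC is safe to clean up; any pod name in the list means a service still has to be migrated first.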

@dduportal
Contributor Author

Update:

Done:

Impacts:

The amount of errored requests to the storage account drastically decreased by switching to NFS:

Screenshot 2024-11-29 at 14 20 11

=> httpd containers are using a bit more CPU (~30 mcore to ~65 mcore) since the change. Other than that, we haven't seen any visible change in resource usage: switching to NFS keeps the same performance.

=> The httpd (unexpected) HTTP/404 error rate is low and stays low! We'll have to wait a few days for confirmation though.

@dduportal
Contributor Author

On agent-1 on trusted.ci (where the Update Center JSON metadata is generated), we are back to the (new) "nominal" average build time (around 5:30, less than 6 min). This follows the Cloudflare R2 errors stopping today (high latency and a lot of HTTP/500 preventing mirrors from being updated), adding more resources to rsync, and the cleanup of azcopy and SMB:

Screenshot 2024-11-29 at 11 49 11

The agent now shows a lower load average and lower CPU usage:

Screenshot 2024-11-29 at 14 33 19 Screenshot 2024-11-29 at 14 36 48

@dduportal
Contributor Author

Update: credentials for azcopy removed from trusted.ci (and from the UC ZIP generator).

Builds triggered for both update_center2 and crawler to validate.

@dduportal
Contributor Author

Update: we cleaned up the old resources - jenkins-infra/azure#895

  • No more file shares with SMB only content
  • No more Azcopy-only SPs with expiration
