
Cert Renewal script #14667

Closed
nevoodoo wants to merge 773 commits from cert-renewal-script

Conversation

nevoodoo

This adds a script that automates some of the steps from https://populationgenomics.readthedocs.io/en/latest/hail.html#updating-tls-https-certificates and can be fetched and run directly.

Daniel King and others added 30 commits May 13, 2022 10:24
OK, there were two problems:

1. A timeout of 5s now appears to be too short for Google Cloud Storage. I am not sure why, but we
   time out substantially more frequently. I have observed this myself on my laptop; just this
   morning I saw it happen to Daniel. (A sketch of raising the timeout follows this list.)

2. When using an `aiohttp.AsyncIterablePayload`, it is *critical* to always check whether the
   coroutine which actually writes to GCS (stashed in the variable `request_task`) is still
   alive. In the current `main`, we do not do this, which causes hangs (in particular, the timeout
   exceptions are never thrown, so we never retry).
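
For context, raising the client-side timeout with plain `aiohttp` looks roughly like the sketch
below. This is not the actual hailtop/aiogoogle configuration; `make_gcs_session` and the specific
values are only illustrative.

```python
import aiohttp

async def make_gcs_session() -> aiohttp.ClientSession:
    # Illustrative values only: no overall deadline, but give each connect and
    # socket read considerably more than the old 5s budget.
    timeout = aiohttp.ClientTimeout(total=None, sock_connect=10, sock_read=30)
    # The caller is responsible for closing this session.
    return aiohttp.ClientSession(timeout=timeout)
```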

To understand the second problem, you must first recall how writing works in aiogoogle. There are
two Tasks and an `asyncio.Queue`. The terms "writer" and "reader" are somewhat confusing, so let's
use left and right. The left Task has the owning reference to both the source "file" and the
destination "file". In particular, it is the *left* Task which closes both "files". Moreover, the
left Task reads chunks from the source file and places those chunks on the `asyncio.Queue`. The
right Task takes chunks off the queue and writes those chunks to the destination file.

This situation can go awry in two ways.

First, if the right Task encounters any kind of failure, it will stop taking chunks off the
queue. When the queue (which has a size limit of one) is full, the left Task will hang. The system
is stuck: the left Task will wait forever for the right Task to empty the queue.
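
A minimal, self-contained illustration of that hang (this is not the aiogoogle code; the queue,
the tasks, and every name here are invented for the example):

```python
import asyncio

async def left(queue: asyncio.Queue) -> None:
    # The left side: keeps putting chunks without watching the right Task.
    await queue.put(b'chunk-1')   # fills the one-slot queue
    await queue.put(b'chunk-2')   # blocks forever: nothing will ever drain the queue

async def right(queue: asyncio.Queue) -> None:
    # In the real code this would loop over queue.get() and write chunks to the
    # destination; here it fails before taking anything off the queue.
    raise RuntimeError('simulated write failure')

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)   # the same one-slot limit described above
    right_task = asyncio.create_task(right(queue))
    try:
        await asyncio.wait_for(left(queue), timeout=2)
    except asyncio.TimeoutError:
        print('left side is stuck; the right task failed with:', right_task.exception())

asyncio.run(main())
```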

The second scenario is exactly the same except that the left Task is trying to add the "stop"
message to the queue rather than a chunk.

In either case, it is critical that the left Task waits simultaneously on the queue operation *and*
on the right Task completing. If the right Task has died, no further writes can occur and the left
Task must raise an exception. In the first scenario, we do not observe the right Task's exception
at that point, because that is done when we close the `InsertObjectStream` (which represents the
destination "file").

---

I also added several types, assertions, and a few missing `async with ... as resp:` blocks.
[copy] fix the TimeoutError and ServerDisconnected issues in copy
Allow selecting a pool for a job through a label
* Revert "Sort"

This reverts commit c08b295.

* Revert "Use spot machines on GCP"

This reverts commit 13c377b.
* Revert "Sort"

This reverts commit c08b295.

* Revert "Use spot machines on GCP"

This reverts commit 13c377b.

* Revert "Revert "Use spot machines on GCP""

This reverts commit b4f4371.

* Only set instanceTerminationAction for Spot VMs

* Use dict.update
…s file (#203)

* ORGANIZATION_DOMAIN --> GITHUB_ORGANIZATION

* Add github_organization to global.tfvars template

* Use github_organization in main.tf for credentials paths

* Update credentials paths
milo-hyben and others added 27 commits May 14, 2024 16:36
Merge upstream HEAD(b7bde56, 2024-05-14) Stop writing to V2 tables
Merge upstream HEAD(e68103e, 2024-05-14) Remove V2 tables
Merge upstream HEAD(dc7fce0, 2024-05-14) Use CI's credentials for image pushing instead of gcr-push
Merge upstream HEAD(13de4e6, 2024-05-14) Add job groups [migration might take a while!]
Merge upstream HEAD(6a6c38d, 2024-05-21) Expose HAIL_CI_STORAGE_URI as ci_storage_uri in CI Steps
Merge upstream HEAD(bea04d9, 2024-05-21) [release] 0.2.130 (hail-is#14454)
* Try adding extra route

* Add another route

---------

Co-authored-by: Michael Franklin <[email protected]>
…l causing Error: Column 'time_completed' in where clause is ambiguous. (#340)
…342)

Pool labels are a CPG-local addition (PR #197) and are considered
in select_inst_coll(), so they should be displayed in this error message.
This 22.04.4 version updates the kernel from 5.19 to 6.5, which is not
supported by NVIDIA-Linux 530.30.2. Update that driver to the current latest
version, and verify that it supports the L4 GPUs used by G2 VMs.
…tials

This secret was superseded in upstream PR hail-is#14031 and later deleted.
That PR replaced use of /registry-push-credentials/credentials.json
with $GOOGLE_APPLICATION_CREDENTIALS instead, which is presumably
already activated for gcloud purposes.
Recent upstream changes include a rewrite of several of the web page
templates. Hence this merge drops the functionality locally added
by PRs #270, #272, and #311 to make the command `<pre>` block resizeable
and to add job state quick links. We may re-add these improvements
later by reimplementing them in the new template code.
Merge upstream 0.2.132 release (678e1f5, 2024-07-09).
…hen collecting batches and jobs. (#345)

* Fix for /api/v1alpha/batches/completed, picking only ROOT_JOB_GROUP when collecting batches and jobs.

* Removing DISTINCT
* Add support for public-ip-address in dataproc

* Use the same code style as previous code

---------

Co-authored-by: Michael Franklin <[email protected]>
@nevoodoo nevoodoo closed this Aug 22, 2024
@nevoodoo (Author)

Sorry, mistakenly tried to merge upstream facepalm

@nevoodoo nevoodoo deleted the cert-renewal-script branch September 23, 2024 04:56