EWMS Pilot Updates (#182)
Co-authored-by: wipacdevbot <[email protected]>
ric-evans and wipacdevbot authored Apr 28, 2023
1 parent f2de17c commit 0ffca3d
Showing 15 changed files with 31 additions and 29 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/tests.yml
@@ -19,7 +19,7 @@ env:
 SKYSCAN_OUTPUT_DIR: output-dir
 SKYSCAN_BROKER_CLIENT: rabbitmq
 # note: auth env vars are in job(s)
-EWMS_PILOT_SUBPROC_TIMEOUT: 600
+EWMS_PILOT_TASK_TIMEOUT: 600
 SKYSCAN_DEBUG_DIR: debug-pkl-dir
 SKYSCAN_MQ_TIMEOUT_TO_CLIENTS: 60
 # SKYSCAN_MQ_TIMEOUT_FROM_CLIENTS: 60 # use default
@@ -129,7 +129,7 @@ jobs:
 nclients=$(( $CLIENTS_PER_CPU * $(nproc) ))
 echo "Launching $nclients clients"
 mkdir $SKYSCAN_DEBUG_DIR
-export EWMS_PILOT_SUBPROC_TIMEOUT=1800 # 30 mins
+export EWMS_PILOT_TASK_TIMEOUT=1800 # 30 mins
 for i in $( seq 1 $nclients ); do
 singularity run skymap_scanner.sif \
 python -m skymap_scanner.client \
@@ -227,7 +227,7 @@ jobs:
 nclients=$(( $CLIENTS_PER_CPU * $(nproc) ))
 echo "Launching $nclients clients"
 mkdir $SKYSCAN_DEBUG_DIR
-export EWMS_PILOT_SUBPROC_TIMEOUT=1800 # 30 mins
+export EWMS_PILOT_TASK_TIMEOUT=1800 # 30 mins
 for i in $( seq 1 $nclients ); do
 ./resources/launch_scripts/docker/launch_client.sh \
 --client-startup-json ./startup.json \
@@ -332,7 +332,7 @@ jobs:
 nclients=$(( $CLIENTS_PER_CPU * $(nproc) ))
 echo "Launching $nclients clients"
 mkdir $SKYSCAN_DEBUG_DIR
-export EWMS_PILOT_SUBPROC_TIMEOUT=1800 # 30 mins
+export EWMS_PILOT_TASK_TIMEOUT=1800 # 30 mins
 for i in $( seq 1 $nclients ); do
 ./resources/launch_scripts/docker/launch_client.sh \
 --client-startup-json ./startup.json \
16 changes: 8 additions & 8 deletions README.md
@@ -22,10 +22,10 @@ Env variables
 export SKYSCAN_BROKER_ADDRESS=<hostname>/<vhost>
 export SKYSCAN_BROKER_AUTH=<token>
 export EWMS_PILOT_QUARANTINE_TIME=1200 # helps decrease condor blackhole nodes
-export EWMS_PILOT_SUBPROC_TIMEOUT=1200
+export EWMS_PILOT_TASK_TIMEOUT=1200
 ```
 
-Currently, RabbitMQ uses URL parameters for the hostname, virtual host, and port (`[https://]HOST[:PORT][/VIRTUAL_HOST]`). The heartbeat is configured by `EWMS_PILOT_SUBPROC_TIMEOUT`. This may change in future updates.
+Currently, RabbitMQ uses URL parameters for the hostname, virtual host, and port (`[https://]HOST[:PORT][/VIRTUAL_HOST]`). The heartbeat is configured by `EWMS_PILOT_TASK_TIMEOUT`. This may change in future updates.
 
 Python install:
 ```
@@ -40,7 +40,7 @@ export SKYSCAN_BROKER_CLIENT=pulsar
 export SKYSCAN_BROKER_ADDRESS=<ip address>
 export SKYSCAN_BROKER_AUTH=<token>
 export EWMS_PILOT_QUARANTINE_TIME=1200 # helps decrease condor blackhole nodes
-export EWMS_PILOT_SUBPROC_TIMEOUT=1200
+export EWMS_PILOT_TASK_TIMEOUT=1200
 ```
 
 Python install:
@@ -63,7 +63,7 @@ The server can be launched from anywhere with a stable network connection. You c
 export SKYSCAN_BROKER_ADDRESS=BROKER_ADDRESS
 # export SKYSCAN_BROKER_CLIENT=rabbitmq # rabbitmq is the default so env var is not needed
 export SKYSCAN_BROKER_AUTH=$(cat ~/skyscan-broker.token) # obfuscated for security
-export EWMS_PILOT_SUBPROC_TIMEOUT=1200
+export EWMS_PILOT_TASK_TIMEOUT=1200
 ```
 ###### Command-Line Arguments
 ```
@@ -94,7 +94,7 @@ _NOTE: By default the launch script will pull, build, and run the latest image f
 ```
 export SKYSCAN_DOCKER_IMAGE_TAG='x.y.z' # defaults to 'latest'
 export SKYSCAN_DOCKER_PULL_ALWAYS=0 # defaults to 1 which maps to '--pull=always'
-export EWMS_PILOT_SUBPROC_TIMEOUT=1200
+export EWMS_PILOT_TASK_TIMEOUT=1200
 ```
 
 #### 2. Launch Each Client
@@ -107,7 +107,7 @@ export SKYSCAN_BROKER_ADDRESS=BROKER_ADDRESS
 # export SKYSCAN_BROKER_CLIENT=rabbitmq # rabbitmq is the default so env var is not needed
 export SKYSCAN_BROKER_AUTH=$(cat ~/skyscan-broker.token) # obfuscated for security
 export EWMS_PILOT_QUARANTINE_TIME=1200 # helps decrease condor blackhole nodes
-export EWMS_PILOT_SUBPROC_TIMEOUT=1200
+export EWMS_PILOT_TASK_TIMEOUT=1200
 ```
 ###### Command-Line Arguments
 _See notes about `--client-startup-json` below. See `client.py` for additional optional args._
@@ -171,7 +171,7 @@ ls /scratch/$USER/run*.condor | head -nN | xargs -I{} condor_submit {}
 executable = /bin/sh
 arguments = /usr/local/icetray/env-shell.sh python -m skymap_scanner.client --client-startup-json ./client-startup.json
 +SingularityImage = "/cvmfs/icecube.opensciencegrid.org/containers/realtime/skymap_scanner:x.y.z"
-environment = "SKYSCAN_BROKER_AUTH=AUTHTOKEN SKYSCAN_BROKER_ADDRESS=BROKER_ADDRESS EWMS_PILOT_SUBPROC_TIMEOUT=1200 EWMS_PILOT_QUARANTINE_TIME=1200"
+environment = "SKYSCAN_BROKER_AUTH=AUTHTOKEN SKYSCAN_BROKER_ADDRESS=BROKER_ADDRESS EWMS_PILOT_TASK_TIMEOUT=1200 EWMS_PILOT_QUARANTINE_TIME=1200"
 Requirements = HAS_CVMFS_icecube_opensciencegrid_org && has_avx
 output = /scratch/$USER/UID.out
 error = /scratch/$USER/UID.err
@@ -214,7 +214,7 @@ The Skymap Scanner is designed to have realistic timeouts for HTCondor. That sai
 # - normal expiration scenario: server died (ex: tried to read corrupted event file), otherwise never
 SKYSCAN_MQ_CLIENT_TIMEOUT_WAIT_FOR_FIRST_MESSAGE: int = 60 * 60 # 60 mins
 ```
-Relatedly, the environment variable `EWMS_PILOT_SUBPROC_TIMEOUT` & `EWMS_PILOT_QUARANTINE_TIME` can also be configured (see [1. Launch the Server](#1-launch-the-server) and [2. Launch Each Client](#2-launch-each-client)).
+Relatedly, the environment variable `EWMS_PILOT_TASK_TIMEOUT` & `EWMS_PILOT_QUARANTINE_TIME` can also be configured (see [1. Launch the Server](#1-launch-the-server) and [2. Launch Each Client](#2-launch-each-client)).
 
 #### Command-Line Arguments
 There are more command-line arguments than those shown in [Example Startup](#example-startup). See `skymap_scanner.server.start_scan.main()` and `skymap_scanner.client.client.main()` for more detail.
2 changes: 1 addition & 1 deletion requirements-all.txt
@@ -28,7 +28,7 @@ cycler==0.11.0
     # via matplotlib
 ed25519==1.5
     # via nkeys
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
2 changes: 1 addition & 1 deletion requirements-client-starter.txt
@@ -24,7 +24,7 @@ cryptography==40.0.2
     # via pyjwt
 cycler==0.11.0
     # via matplotlib
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
2 changes: 1 addition & 1 deletion requirements-nats.txt
@@ -26,7 +26,7 @@ cycler==0.11.0
     # via matplotlib
 ed25519==1.5
     # via nkeys
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
2 changes: 1 addition & 1 deletion requirements-pulsar.txt
@@ -26,7 +26,7 @@ cryptography==40.0.2
     # via pyjwt
 cycler==0.11.0
     # via matplotlib
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
2 changes: 1 addition & 1 deletion requirements-rabbitmq.txt
@@ -24,7 +24,7 @@ cryptography==40.0.2
     # via pyjwt
 cycler==0.11.0
     # via matplotlib
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
2 changes: 1 addition & 1 deletion requirements.txt
@@ -24,7 +24,7 @@ cryptography==40.0.2
     # via pyjwt
 cycler==0.11.0
     # via matplotlib
-ewms-pilot==0.6.0
+ewms-pilot==0.9.1
     # via skymap-scanner (setup.py)
 fonttools==4.39.3
     # via matplotlib
4 changes: 2 additions & 2 deletions resources/client_starter.py
@@ -72,8 +72,8 @@ def make_condor_job_description( # pylint: disable=too-many-arguments
     # Build the environment specification for condor
     vars = []
     # EWMS_* are inherited via condor `getenv`, but we have default in case these are not set.
-    if not os.getenv("EWMS_PILOT_SUBPROC_TIMEOUT"):
-        vars.append("EWMS_PILOT_SUBPROC_TIMEOUT=1200")
+    if not os.getenv("EWMS_PILOT_TASK_TIMEOUT"):
+        vars.append("EWMS_PILOT_TASK_TIMEOUT=1200")
     if not os.getenv("EWMS_PILOT_QUARANTINE_TIME"):
         vars.append("EWMS_PILOT_QUARANTINE_TIME=1800")
     # The container sets I3_DATA to /opt/i3-data, however `millipede_wilks` requires files (spline tables) that are not available in the image. For the time being we require CVFMS and we load I3_DATA from there. In order to override the environment variables we need to prepend APPTAINERENV_ or SINGULARITYENV_ to the variable name. There are site-dependent behaviour but these two should cover all cases. See https://github.com/icecube/skymap_scanner/issues/135#issuecomment-1449063054.
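The hunk above adds a default only when a variable is unset or empty in the submitting environment. A minimal, self-contained sketch of that pattern follows. The helper name `default_env_vars` and the `DEFAULTS` mapping are hypothetical and do not appear in `client_starter.py`; the default values are copied from the diff.

```python
import os
from typing import Dict, List, Mapping, Optional

# Defaults copied from the diff above; the mapping itself is illustrative.
DEFAULTS: Dict[str, str] = {
    "EWMS_PILOT_TASK_TIMEOUT": "1200",
    "EWMS_PILOT_QUARANTINE_TIME": "1800",
}


def default_env_vars(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return 'NAME=value' entries for variables that are unset or empty."""
    env = os.environ if environ is None else environ
    # `not env.get(name)` is true for both a missing key and an empty string,
    # matching the `if not os.getenv(...)` checks in the hunk.
    return [f"{name}={value}" for name, value in DEFAULTS.items() if not env.get(name)]


if __name__ == "__main__":
    # Only the quarantine default is added, because the timeout is already set.
    print(" ".join(default_env_vars({"EWMS_PILOT_TASK_TIMEOUT": "600"})))
```

Passing the environment as an argument keeps the helper testable without mutating `os.environ`.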
2 changes: 1 addition & 1 deletion resources/k8s/k8s_skydriver_worker_job.yaml
@@ -84,7 +84,7 @@ spec:
   value: "/cvmfs/icecube.opensciencegrid.org/data/i3-test-data-svn/trunk"
 - name: I3_DATA
   value: "/cvmfs/icecube.opensciencegrid.org/data"
-- name: EWMS_PILOT_SUBPROC_TIMEOUT
+- name: EWMS_PILOT_TASK_TIMEOUT
   value: "600"
 image: icecube/skymap_scanner:3.0.68
 imagePullPolicy: Always
2 changes: 1 addition & 1 deletion resources/launch_scripts/docker/launch_client.sh
@@ -74,7 +74,7 @@ docker run --network="host" $pull_policy --rm -i \
 --env PY_COLORS=1 \
 $(env | grep '^SKYSCAN_' | awk '$0="--env "$0') \
 $(env | grep '^EWMS_' | awk '$0="--env "$0') \
---env "EWMS_PILOT_SUBPROC_TIMEOUT=${EWMS_PILOT_SUBPROC_TIMEOUT:-900}" \
+--env "EWMS_PILOT_TASK_TIMEOUT=${EWMS_PILOT_TASK_TIMEOUT:-900}" \
 icecube/skymap_scanner:${SKYSCAN_DOCKER_IMAGE_TAG:-"latest"} \
 python -m skymap_scanner.client \
 $PY_ARGS
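The shell expansion `${EWMS_PILOT_TASK_TIMEOUT:-900}` in the hunk above falls back to `900` when the variable is unset or empty. The sketch below mirrors that behaviour in Python for illustration only; `resolve_timeout` is a hypothetical helper, not part of the launch scripts or of `ewms-pilot`.

```python
import os
from typing import Mapping, Optional


def resolve_timeout(environ: Optional[Mapping[str, str]] = None, default: int = 900) -> int:
    """Mimic `${EWMS_PILOT_TASK_TIMEOUT:-900}`: use the default if unset or empty."""
    env = os.environ if environ is None else environ
    raw = env.get("EWMS_PILOT_TASK_TIMEOUT")
    # The `:-` form treats an empty string like an unset variable.
    return int(raw) if raw else default


if __name__ == "__main__":
    print(resolve_timeout({}))  # falls back to the default
    print(resolve_timeout({"EWMS_PILOT_TASK_TIMEOUT": "1800"}))
```

Because the script now reads only the new name, a value exported as `EWMS_PILOT_SUBPROC_TIMEOUT` no longer influences this default.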
2 changes: 1 addition & 1 deletion resources/launch_scripts/docker/launch_server.sh
@@ -78,7 +78,7 @@ docker run --network="host" $pull_policy --rm -i \
 --env PY_COLORS=1 \
 $(env | grep '^SKYSCAN_' | awk '$0="--env "$0') \
 $(env | grep '^EWMS_' | awk '$0="--env "$0') \
---env "EWMS_PILOT_SUBPROC_TIMEOUT=${EWMS_PILOT_SUBPROC_TIMEOUT:-900}" \
+--env "EWMS_PILOT_TASK_TIMEOUT=${EWMS_PILOT_TASK_TIMEOUT:-900}" \
 icecube/skymap_scanner:${SKYSCAN_DOCKER_IMAGE_TAG:-"latest"} \
 python -m skymap_scanner.server \
 $PY_ARGS
@@ -82,7 +82,7 @@ spec:
   value: "/cvmfs/icecube.opensciencegrid.org/data/i3-test-data-svn/trunk"
 - name: I3_DATA
   value: "/cvmfs/icecube.opensciencegrid.org/data"
-- name: EWMS_PILOT_SUBPROC_TIMEOUT
+- name: EWMS_PILOT_TASK_TIMEOUT
   value: "300"
 image: icecube/skymap_scanner:3.0.68
 imagePullPolicy: Always
@@ -82,7 +82,7 @@ spec:
   value: "/cvmfs/icecube.opensciencegrid.org/data/i3-test-data-svn/trunk"
 - name: I3_DATA
   value: "/cvmfs/icecube.opensciencegrid.org/data"
-- name: EWMS_PILOT_SUBPROC_TIMEOUT
+- name: EWMS_PILOT_TASK_TIMEOUT
   value: "300"
 image: icecube/skymap_scanner:3.2.1
 imagePullPolicy: Always
10 changes: 6 additions & 4 deletions skymap_scanner/client/client.py
@@ -69,10 +69,10 @@ def main() -> None:
         raise FileNotFoundError(startup_json_dict["baseline_GCD_file"])
 
     cmd = (
-        f"python -m skymap_scanner.client.reco_icetray "
-        f" --in-pkl in.pkl"
-        f" --out-pkl out.pkl"
-        f" --gcdqp-packet-json GCDQp_packet.json"
+        "python -m skymap_scanner.client.reco_icetray "
+        " --in-pkl {{INFILE}}" # no f-string b/c want to preserve '{{..}}'
+        " --out-pkl {{OUTFILE}}" # ^^^
+        " --gcdqp-packet-json GCDQp_packet.json"
         f" --baseline-gcd-file {startup_json_dict['baseline_GCD_file']}"
     )
 
@@ -88,6 +88,8 @@ def main() -> None:
         auth_token=cfg.ENV.SKYSCAN_BROKER_AUTH,
         queue_incoming=f"to-clients-{startup_json_dict['mq_basename']}",
         queue_outgoing=f"from-clients-{startup_json_dict['mq_basename']}",
+        ftype_to_subproc=".pkl",
+        ftype_from_subproc=".pkl",
         timeout_incoming=cfg.ENV.SKYSCAN_MQ_TIMEOUT_TO_CLIENTS,
         timeout_outgoing=cfg.ENV.SKYSCAN_MQ_TIMEOUT_FROM_CLIENTS,
         timeout_wait_for_first_message=cfg.ENV.SKYSCAN_MQ_CLIENT_TIMEOUT_WAIT_FOR_FIRST_MESSAGE,
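The `client.py` change drops the `f` prefix on two literals so that `{{INFILE}}` and `{{OUTFILE}}` survive verbatim. In an f-string, `{{` is an escape that collapses to a single `{`, while a plain literal keeps both braces; adjacent literals are concatenated with each part handled according to its own prefix. The sketch below demonstrates this. The path in `gcd` and the final `str.replace` substitution are assumptions for illustration; the source does not show how the pilot actually fills the placeholders.

```python
gcd = "/path/to/gcd.i3"  # hypothetical path, for illustration only

# Mixed plain and f-string literals, as in the updated client.py.
cmd = (
    "python -m skymap_scanner.client.reco_icetray "
    " --in-pkl {{INFILE}}"  # plain literal: double braces are preserved
    " --out-pkl {{OUTFILE}}"
    f" --baseline-gcd-file {gcd}"  # f-string: {gcd} is interpolated
)

# The same text written as an f-string loses one level of braces.
as_fstring = f" --in-pkl {{INFILE}}"

# Assumed stand-in for the placeholder substitution done downstream.
filled = cmd.replace("{{INFILE}}", "in.pkl").replace("{{OUTFILE}}", "out.pkl")

if __name__ == "__main__":
    print(cmd)
    print(as_fstring)
    print(filled)
```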