-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scion.sh, topogen: adopt docker compose plugin and cleanup #4398
Conversation
…position. The new name is "monitoring". This affects the usage of tools/dc and scion.sh in the following way: * monitoring is composed of both jaeger and prometheus. * scion.sh no-longer accepts the commands "start_yeager" and "stop_yeager". * scion.sh accepts two new commands: "start_monitoring" and "stop monitoring" that do what the name implies. * prom and jeager are no-longer valid services for tools/dc * tools/dc accepts a new service: "monitoring" * jeager is no-longer started automatically by "scions.sh run" or "scion.sh start". * jeager is no-longer stopped automatically by "scions.sh stop". * monitoring is **not** started or stopped automatically by "scion.sh start/stop". * monitoring can safely stay up accross restarts of scion. * monitoring *is* stopped automatically by "scion.sh topology".
...plus other minor improvements.
…" as shim. The opposite of "stop" is "start". So the standard start commands are now all named "m?start[_something]". The run command is an alias to "start", not the other way around.
* Severed our dependency on docker-compose V1 (including any shim script in /usr/bin): updated installation instructions and other docs accordingly. * Gave a $HOME dir to integration tests. It's not actually used, but docker-compose v2 fails (in a very cryptic way) without it. * Deflaked some integration tests (timing issues). Those were playing with my head when I was chasing other issues. * Clarified the instructions for running tests wrt to supervisor mode vs docker mode (e.g. the e2e integration tests must be run with -d in the docker mode).
…ges. The script was parsing the output of docker-compose. This was already incompatible with the latest docker-compose-v1 (possibly not affecting the buildkite environment) not mentionning docker compose v2. I preserved the original mstat behavior, although not all of the intent is very obvious: why output something only about what is *not* running?
That test was broken when using docker; possibly when using supervisor too: * It was failing to pass the "-d" arg when invoking end2end_integration while using docker. $DOCKER_ARGS got lost through prior refactorings. * It stopped the border routers but never started them again. * The $SLEEP variable, also lost to refactorings was still being used (cosmetic only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 20 files at r2.
Reviewable status: 1 of 26 files reviewed, 9 unresolved discussions (waiting on @jiceatscion)
a discussion (no related file):
Reading through the docker docs, I could not find out what the --compatibility
flag really does. Is this just defensive, or do we need this?
And I believe this is related, but I'm not sure: if we are now doing a big pass over all the compose files, would it make sense to change to the compose file version 3?
-- commits
line 61 at r3:
FWIW, I don't get that either. The behavior is the same for supervisor and the status subcommand. Personally, I've never liked this; when everything is Ok, the output is empty and that is just not very informative.
scion.sh
line 145 at r3 (raw file):
rscount=$(./tools/dc scion ps --status=running --format "{{.Name}}" $services | wc -l) tscount=$(echo "$services" | wc -w) # Number of all globed services ./tools/dc scion ps -a \
This file seems to use space for indention
scion.sh
line 279 at r3 (raw file):
$PROGRAM help Show this text. $PROGRAM traces [folder]
I'm not entirely sure, but isn't traces
/stop_traces
now "covered" by the start/stop monitoring subcommands?
scion.sh
line 294 at r3 (raw file):
case "$COMMAND" in help|start|start_monitoring|mstart|mstatus|mstop|stop|stop_monitoring|status|topology|sciond-addr|traces|stop_traces|topo_clean|topodot|bazel_remote)
Can we please use kebab-case for cli subcommands and flags?
acceptance/cert_renewal/test.py
line 62 at r3 (raw file):
# Services in their containers need time to start. time.sleep(5)
I hate these sleeps. IMO, the right way to go about this is to explicitly wait for the required condition. It's usually much quicker, more reliable and explicitly documents what needs to happen to continue.
In many cases (e.g. here), a sensible condition is that the path discovery process has run for enough time so that at least one path between any two ASes have been found. This is implemented in tools/await-connectivity
, but so far none of the acceptance tests have been modifed to use this. Note that the bazel harness requires to jump through some hoops to use that script. So, unless the flakyness is so bad that we cannot continue without this, I'd prefer to keep this flaky and fix it properly in a separate PR.
tools/dc
line 40 at r3 (raw file):
- [GROUP]: - scion: For all scion services. - monitoring: For the prometheus service.
Nit: "For the monitoring services, i.e. prometheus and jaeger."
tools/integration/revocation_test.sh
line 41 at r3 (raw file):
# Was there a time when end2end_integration would start the # scion service containers? ./scion.sh mstart $REV_BRS
No this is the point of this test, that the routers are down 😄
The test is slightly flaky though, see #4215.
tools/topology/monitoring.py
line 162 at r3 (raw file):
def _write_dc_file(self): # Merged yeager and prometheus files. # FIXME: should be a single container
Why?
* scion.sh commands all kebab-case. untabified. * removed wrong fix to revocation test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 26 files reviewed, 7 unresolved discussions (waiting on @matzf)
a discussion (no related file):
Previously, matzf (Matthias Frei) wrote…
Reading through the docker docs, I could not find out what the
--compatibility
flag really does. Is this just defensive, or do we need this?
And I believe this is related, but I'm not sure: if we are now doing a big pass over all the compose files, would it make sense to change to the compose file version 3?
Yeah, it is mostly defensive. Some doc says that this is needed to preserve some container naming compatibility (_ vs - something like that). It's included in the shim script that has replaced /usr/bin/docker-compose. Eventually we'll want to figure out the naming business and adjust if needed.
Previously, matzf (Matthias Frei) wrote…
FWIW, I don't get that either. The behavior is the same for supervisor and the status subcommand. Personally, I've never liked this; when everything is Ok, the output is empty and that is just not very informative.
Any concern about fixing it? May be outputing to stderr instead of stdout, too. To protect invokers that expect silence when stuff works.
scion.sh
line 145 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
This file seems to use space for indention
done
scion.sh
line 279 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
I'm not entirely sure, but isn't
traces
/stop_traces
now "covered" by the start/stop monitoring subcommands?
I think not. This seems to start/stop jaeger in some ad-hoc manner to retrieve traces. It existed independently of run_yaeger/stop_yeager which was called automatically by run_scion/stop_scion. So, I figured it was orthogonal business and left it alone. However, its usefullness probably depends yaeger having been started when starting scion. May be some added doc would be needed. I haven't seen any mention of the start_traces in the doc.
scion.sh
line 294 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
Can we please use kebab-case for cli subcommands and flags?
skewered everything.
done
acceptance/cert_renewal/test.py
line 62 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
I hate these sleeps. IMO, the right way to go about this is to explicitly wait for the required condition. It's usually much quicker, more reliable and explicitly documents what needs to happen to continue.
In many cases (e.g. here), a sensible condition is that the path discovery process has run for enough time so that at least one path between any two ASes have been found. This is implemented intools/await-connectivity
, but so far none of the acceptance tests have been modifed to use this. Note that the bazel harness requires to jump through some hoops to use that script. So, unless the flakyness is so bad that we cannot continue without this, I'd prefer to keep this flaky and fix it properly in a separate PR.
The flakyness is really bad. At least one test had a near 100% failure rate for me. If you file a bug about these added delays and assign to me, I'll try and fix it. In the meantime we do need a work-around.
tools/dc
line 40 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit: "For the monitoring services, i.e. prometheus and jaeger."
done
tools/integration/revocation_test.sh
line 41 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
No this is the point of this test, that the routers are down 😄
The test is slightly flaky though, see #4215.
Oops. Had not understood the test design. Removed that.
tools/topology/monitoring.py
line 162 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
Why?
Uh...why not. Just a small resource saving. I'm happy to give up on that though.
done
Tabified here docs are rendered with *all* leading tabs removed. Adjust accordingly to keep the output legible.
Matthias' is about to commit a better deflaking.
Add a low-priority slow path for special case packet handling, in particular SCMP errors and traceroutes, for the the asynchronous packet processing pipeline in the router added in scionproto#4351. This is implementation part three of three described in the design document (scionproto#4339, doc/dev/design/BorderRouter.rst). Closes scionproto#4334
These tests have become even more flaky with the updated CI runners; it seems that this is triggered due to using the significantly quicker docker-compose v2. Use the existing await-connectivity script to determine when a test can safely start to run, instead of sleeping for arbitrary number of seconds.
* Severed our dependency on docker-compose V1 (including any shim script in /usr/bin): updated installation instructions and other docs accordingly. * Gave a $HOME dir to integration tests. It's not actually used, but docker-compose v2 fails (in a very cryptic way) without it. * Deflaked some integration tests (timing issues). Those were playing with my head when I was chasing other issues. * Clarified the instructions for running tests wrt to supervisor mode vs docker mode (e.g. the e2e integration tests must be run with -d in the docker mode).
Matthias' is about to commit a better deflaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 32 files reviewed, 7 unresolved discussions (waiting on @matzf)
acceptance/cert_renewal/test.py
line 62 at r3 (raw file):
Previously, jiceatscion wrote…
The flakyness is really bad. At least one test had a near 100% failure rate for me. If you file a bug about these added delays and assign to me, I'll try and fix it. In the meantime we do need a work-around.
Discussion now moot. Issue fixed better in truck. My fix removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 32 files reviewed, 7 unresolved discussions (waiting on @matzf)
Previously, jiceatscion wrote…
Any concern about fixing it? May be outputing to stderr instead of stdout, too. To protect invokers that expect silence when stuff works.
Any objection to fixing that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 9 files at r1, 8 of 20 files at r2, 1 of 3 files at r3, 3 of 8 files at r4, 2 of 6 files at r6, 8 of 8 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: 28 of 32 files reviewed, 7 unresolved discussions (waiting on @jiceatscion)
Previously, jiceatscion wrote…
Any objection to fixing that way?
This seems like a fine idea. If you want to fix this, I'd perhaps go even further and simplify the status/mstatus commands to just print the (filtered) output of supervisor or docker compose.
I'm not aware of anything that uses the output of "./scion.sh [m]status" in a script. The only usage in this repo is in the revocation test, but that one only looks at exit status, as a sanity check to verify that a service exists at all.
scion.sh
line 279 at r3 (raw file):
Previously, jiceatscion wrote…
I think not. This seems to start/stop jaeger in some ad-hoc manner to retrieve traces. It existed independently of run_yaeger/stop_yeager which was called automatically by run_scion/stop_scion. So, I figured it was orthogonal business and left it alone. However, its usefullness probably depends yaeger having been started when starting scion. May be some added doc would be needed. I haven't seen any mention of the start_traces in the doc.
Right, I do remember now -- I think the idea was that the traces
subcommand helps to look at traces that were collected somewhere else.
The difference to the jaeger started in start-monitoring
is that this one (the traces
one) does not open the port to allow collecting new traces.
I don't think this really belongs here, it would make more sense as a separate script eg. in tools/, or get rid of it entirely, idk.
At minimum, I'd suggest to add a short sentence here in the documentation to clarify that this is intended to be used to look at the traces gathered elsewhere. To look at the traces collected locally, just open http://localhost:16868
while monitoring
is up.
scion.sh
line 294 at r3 (raw file):
Previously, jiceatscion wrote…
skewered everything.
done
Excellent, bon appetit!
scion.sh
line 50 at r9 (raw file):
start_scion echo "Note that jaeger is not included anymore." echo "To run jaeger and prometheus, use run_monitoring."
"To run jaeger and prometheus, use '$PROGRAM start-monitoring'."
Also nit: "not started automatically" instead of "not included"? Otherwise it makes it sound like we've removed it entirely.
scion.sh
line 66 at r9 (raw file):
fi echo "Running monitoring..." ./tools/quiet ./tools/dc monitoring up -d
Maybe this could echo the URLs for the local web-UIs of prometheus and jaeger, so people can actually find this?
Running monitoring...
Jaeger UI: http://localhost:16686
Prometheus UI: http://locahlhost:<port?>
scion.sh
line 122 at r9 (raw file):
stop_scion echo "Note that jaeger is not included anymore." echo "To stop jaeger and prometheus, use stop-monitoring."
Ditto, same as cmd_start.
doc/dev/run.rst
line 19 at r9 (raw file):
- `docker compose <https://docs.docker.com/compose/>`_. Runs individual processes in separate containers connected with docker network bridges. Only this mode supports running a "SCION-IP gateway". Before attempting to use the docker compose mode, be sure to build the necessary docker images with:
Nit: maybe indent this so it is part of the docker compose
bullet point. Could make it a "hint" admonition
- `docker compose <...>`_. blabla
.. hint:: Before attempting blablabl run `make docker-images`
(I'm suggesting to use inline code style instead of a code-block, because the admonition looks oddly big otherwise).
doc/dev/run.rst
line 91 at r9 (raw file):
corresponding :doc:`scion daemon </manuals/daemon>` instance. If :option:`scion.sh topology -d` command is used, additional configuration files are created to
Nit: I think "additional" is not completely accurate. It's more like selecting one of two variants (docker vs supervisor), both of which generate somewhat different things. I'd just leave the word "additional" away.
doc/dev/run.rst
line 198 at r9 (raw file):
Stop the monitoring services.
Nit. Trailing whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 32 files reviewed, 8 unresolved discussions (waiting on @jiceatscion)
doc/dev/run.rst
line 93 at r9 (raw file):
If :option:`scion.sh topology -d` command is used, additional configuration files are created to enable running the SCION services in docker containers (see `docker`_). Otherwise, a configuration file is created to enable running the SCION services as plain processes (see `supervisor`_)
Oops, sphinx creates a link to supervisor.org
!?
Add labels and use :ref:
instead to ensure this stuff points to where we want (see https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#role-ref)
* More scion.sh cleanup * Doc fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 32 files reviewed, 4 unresolved discussions (waiting on @matzf)
Previously, matzf (Matthias Frei) wrote…
This seems like a fine idea. If you want to fix this, I'd perhaps go even further and simplify the status/mstatus commands to just print the (filtered) output of supervisor or docker compose.
I'm not aware of anything that uses the output of "./scion.sh [m]status" in a script. The only usage in this repo is in the revocation test, but that one only looks at exit status, as a sanity check to verify that a service exists at all.
Ok, then I keep the output on stdout. Doing a formated ps of everything in scope, running or otherwise. Made the output for the docker case similar to that of the supervisor case. I cannot make them strictly equivalent though. There's nothing relevant I can output for a pid and the names aren't structured the same way.
scion.sh
line 279 at r3 (raw file):
Previously, matzf (Matthias Frei) wrote…
Right, I do remember now -- I think the idea was that the
traces
subcommand helps to look at traces that were collected somewhere else.
The difference to the jaeger started instart-monitoring
is that this one (thetraces
one) does not open the port to allow collecting new traces.
I don't think this really belongs here, it would make more sense as a separate script eg. in tools/, or get rid of it entirely, idk.
At minimum, I'd suggest to add a short sentence here in the documentation to clarify that this is intended to be used to look at the traces gathered elsewhere. To look at the traces collected locally, just openhttp://localhost:16868
whilemonitoring
is up.
Now that I paid attention to it, I agree. It doesn't really belong there. Moved it to it's own tool.
done
scion.sh
line 50 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
"To run jaeger and prometheus, use '$PROGRAM start-monitoring'."
Also nit: "not started automatically" instead of "not included"? Otherwise it makes it sound like we've removed it entirely.
done
scion.sh
line 66 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Maybe this could echo the URLs for the local web-UIs of prometheus and jaeger, so people can actually find this?
Running monitoring... Jaeger UI: http://localhost:16686 Prometheus UI: http://locahlhost:<port?>
done
scion.sh
line 122 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Ditto, same as cmd_start.
done
doc/dev/run.rst
line 19 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit: maybe indent this so it is part of the
docker compose
bullet point. Could make it a "hint" admonition- `docker compose <...>`_. blabla .. hint:: Before attempting blablabl run `make docker-images`(I'm suggesting to use inline code style instead of a code-block, because the admonition looks oddly big otherwise).
done
doc/dev/run.rst
line 91 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit: I think "additional" is not completely accurate. It's more like selecting one of two variants (docker vs supervisor), both of which generate somewhat different things. I'd just leave the word "additional" away.
Yeah. I hesitated on that one. I wanted to convey that there are additional files (different ones) depending on the docker/supervisor option, but if I used "additional" twice it became even more confusing. Giving up. Better without it.
doc/dev/run.rst
line 198 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit. Trailing whitespace
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 27 of 33 files reviewed, 4 unresolved discussions (waiting on @matzf)
doc/dev/run.rst
line 93 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
Oops, sphinx creates a link to
supervisor.org
!?
Add labels and use:ref:
instead to ensure this stuff points to where we want (see https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#role-ref)
done
Had missed that file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r4, 3 of 4 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: 31 of 33 files reviewed, 2 unresolved discussions (waiting on @jiceatscion)
scion.sh
line 279 at r3 (raw file):
Previously, jiceatscion wrote…
Now that I paid attention to it, I agree. It doesn't really belong there. Moved it to it's own tool.
done
Ah sweet, thanks!
doc/dev/run.rst
line 93 at r9 (raw file):
Previously, jiceatscion wrote…
done
The trailing underscore is a different syntax, it should be removed now (doc/dev/run.rst:87: WARNING: Mismatch: both interpreted text role prefix and reference suffix.
)
:ref:`docker-section`
NOT
:ref:`docker-section`_
Note: you can see the rendered page and the build logs linked from the PR; in the "Checks" section with the statuses from the different integrated systems click on the "Details" link for read the docs.
Note 2: I need to review the settings for read the docs builds; these warnings should fail the build.
doc/dev/run.rst
line 93 at r11 (raw file):
.. _supervisor-section: supervisor
Needs a blank line between the label definition and the section. (doc/dev/run.rst:93: WARNING: Explicit markup ends without a blank line; unexpected unindent
)
Nit: we've
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 31 of 33 files reviewed, 3 unresolved discussions (waiting on @jiceatscion)
tools/integration/revocation_test.sh
line 22 at r11 (raw file):
for br in $REV_BRS; do if ! ./scion.sh mstatus "$br"; then
This currently fails, not sure why. Is the exit code different now for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 30 of 33 files reviewed, 3 unresolved discussions (waiting on @matzf)
doc/dev/run.rst
line 93 at r9 (raw file):
Previously, matzf (Matthias Frei) wrote…
The trailing underscore is a different syntax, it should be removed now (
doc/dev/run.rst:87: WARNING: Mismatch: both interpreted text role prefix and reference suffix.
):ref:`docker-section` NOT :ref:`docker-section`_
Note: you can see the rendered page and the build logs linked from the PR; in the "Checks" section with the statuses from the different integrated systems click on the "Details" link for read the docs.
Note 2: I need to review the settings for read the docs builds; these warnings should fail the build.
oops, forgot to remove it.
done.
doc/dev/run.rst
line 93 at r11 (raw file):
Previously, matzf (Matthias Frei) wrote…
Needs a blank line between the label definition and the section. (
doc/dev/run.rst:93: WARNING: Explicit markup ends without a blank line; unexpected unindent
)
Nit: we've
done
tools/integration/revocation_test.sh
line 22 at r11 (raw file):
Previously, matzf (Matthias Frei) wrote…
This currently fails, not sure why. Is the exit code different now for some reason?
not sure either...works for me. Digging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 30 of 33 files reviewed, 3 unresolved discussions (waiting on @matzf)
tools/integration/revocation_test.sh
line 22 at r11 (raw file):
Previously, jiceatscion wrote…
not sure either...works for me. Digging.
Ah...it fails only when using supervisor, and not with docker... stay tuned.
Also simplified a bit. No need to special case an empty argument list or compute the result differently in supervisor vs docker cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 29 of 33 files reviewed, 3 unresolved discussions (waiting on @matzf)
tools/integration/revocation_test.sh
line 22 at r11 (raw file):
Previously, jiceatscion wrote…
Ah...it fails only when using supervisor, and not with docker... stay tuned.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r12, 1 of 1 files at r13, 1 of 1 files at r14, all commit messages.
Reviewable status: 31 of 33 files reviewed, all discussions resolved
…pose". This is due to a bug in some version of docker-compose v2 that causes an unused .cache directory to be created in $HOME. Due to a bug in Bazel, tests don't have a HOME variable (it should be set to $TEST_TMPDIR but isn't set at all). So we must provide one ourselves. My previous work around of using /tmp had the potential of creating collisions in the future. This method is cleaner. It uses an output location internal to the test. This is the simplest I could manage. Tools are supposed to be enablers, but Bazel prefers to be a disabler it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r13, 5 of 5 files at r15, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)
This gets outdated too fast. Better refer the user to docker install instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)
Main points of this PR:
Make the build and test environment easier to setup and use:
This change is