Generalize integration tests to other queue systems (SGE) and test more Python versions #160

ml-evs · 2024-08-02T14:36:41Z

Pending e.g. #159, I have started to generalise the integration tests so we can make a mega combined Dockerfile that runs e.g., Slurm, SGE and potentially other queuing systems in the same container for testing purposes. This PR is the first step in that. Probably we could test remote shell execution first too. Depending on how awkward it is to set up multiple queues together, it might be that each one runs in a different container.

EDIT (3 months on...):

This PR introduces a multistage docker build that can be used to spin up Slurm and SGE queues. Setting up SGE ended up being significant effort, given the scant documentation (and was only achievable via help from Claude).

It also:

Adds Python 3.12 as a tested version
Attempts to more aggressively cache the docker builds in the CI

Important notes/issues:

QToolKit is pinned to develop
- This should be updated as soon as QToolKit is released.
Integration tests are slow
- This PR comes with a fairly significant performance cost on the integration tests. These are not run by default in the local test suite (can be forced with env var CI=1), but this should probably be revisited soon. There is discussion in the comments below, but it seems the main slowdown is the fact that docker-py does not seem to be make good use of the existing cache, perhaps due to using some features not fully supported (docker-py cannot use buildkit). We can switch over to something that maps more closely to the Docker CLI, which should improve performance. More generally, the integration tests can be a bit slow as the containers are simulating real queues that have their own polling times. At the very least, we can parallelize the integration tests over Slurm/SGE, though this will be a bit fiddly with the current setup. I've added the pytest-durations plugin to track these slow steps.

codecov-commenter · 2024-08-03T14:16:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.18%. Comparing base (fb0b196) to head (77ee908).

Additional details and impacted files

@@           Coverage Diff            @@
##           develop     #160   +/-   ##
========================================
  Coverage    73.18%   73.18%           
========================================
  Files           48       48           
  Lines         6347     6347           
  Branches      1012     1012           
========================================
  Hits          4645     4645           
- Misses        1340     1341    +1     
+ Partials       362      361    -1

see 2 files with indirect coverage changes

gpetretto · 2024-08-07T15:47:17Z

Hi @ml-evs, thanks for setting this up. Since the dependence on the queue system is only in the submission and check of the jobs I wonder if it is worth running all the tests for all the systems.
Could it be more effective to have these tests on all the queue systems directly in qtoolkit?

ml-evs · 2024-08-22T17:31:45Z

Hi @ml-evs, thanks for setting this up. Since the dependence on the queue system is only in the submission and check of the jobs I wonder if it is worth running all the tests for all the systems. Could it be more effective to have these tests on all the queue systems directly in qtoolkit?

Hi @gpetretto, sorry I missed this. Happy to have this migrated wherever you see fit, though I think having e2e tests for jobflow-remote are still appropriate here.

A status update on this PR in particular: SGE is proving pretty nasty to set up, attempting to follow the few guides I can find online and sadly the main init_cluster configuration script is segfaulting on what should be a straightforward operation (setting the manager's username). I'll keep probing with more recent versions and see how far I can get.

gpetretto · 2024-08-23T08:23:09Z

Hi @gpetretto, sorry I missed this. Happy to have this migrated wherever you see fit, though I think having e2e tests for jobflow-remote are still appropriate here.

A status update on this PR in particular: SGE is proving pretty nasty to set up, attempting to follow the few guides I can find online and sadly the main init_cluster configuration script is segfaulting on what should be a straightforward operation (setting the manager's username). I'll keep probing with more recent versions and see how far I can get.

Indeed it might be good to have a few different queue managers to test. Especially if it is confirmed that for example SGE does not support query by list of ids.

ml-evs · 2024-10-11T14:28:22Z

Hi @gpetretto, I think this is now 99% of the way there... at least locally, I can now SSH into the SGE container and run jobs. Any remaining issues might be related to @QuantumChemist's qtoolkit PR, but it is hard to say at this point. This was really painful to implement, but I think the pain was specific to SGE itself.

The main issue was needing to reverse engineer how to set up user permissions for the SGE queue (see cc613ba) without being able to use the bundled qmon GUI, and without there being any online documentation. With a lot of rubber-ducking with Claude (hours) I was able to get it working so the jobflow user can submit jobs.

Locally, I see the jobs being submitted and the usual JFR processes "working", but for some reason no output is being written or copied back to the runner at the moment.

As a general point, we might consider entirely splitting the integration tests and unit tests into separate jobs. I'm not sure how much the integration tests are really contributing to the codecov, which is the only real reason to run them together, but as the integration tests are slow (both building the Docker containers and actually running them), it's probably beneficial to split them and figure out a way to combine the coverage stats down the line (e.g., the jobs could both create a GH actions artifact containing the coverage, then a separate job can combine them and upload to codecov).

gpetretto

Thanks @ml-evs for all the work! The PR looks good to me.

As for the coverage, indeed most will be from the standard tests, but there are definitely some functionalities that are only tested in the integration tests (e.g. the batch submission). If the merging of the coverage files is done as in #180 it could be fine running the tests separately.

davidwaroquiers · 2024-11-08T14:10:43Z

From discussion in #201, should we add python 3.12 here as part of the github ci testing workflow.

ml-evs · 2024-11-16T16:37:45Z

Hi @gpetretto, the remaining issue I'm facing here seems to be that jobs are not retried with SGE in these tests. The test for whether the job has outputs triggers before the output has been written, though if I stick in a breakpoint and inspect the container, I can definitely see the output. For the simple test_submit_flow test in test_workers.py, job_1 ends up with this state:

{'_id': ObjectId('6738c69ccd47701991707182'), 'job': {'@module': 'jobflow.core.job', '@class': 'Job', '@version': '0.1.18', 'function': {'@module': 'jobflow_remote.testing', '@callable': 'add', '@bound': None}, 'function_args': [1, 5], 'function_kwargs': {}, 'output_schema': None, 'uuid': 'b4be80b1-e272-4694-a2fc-ba7f697e06bc', 'index': 1, 'name': 'add', 'metadata': {}, 'config': {'@module': 'jobflow.core.job', '@class': 'JobConfig', '@version': '0.1.18', 'resolve_references': True, 'on_missing_references': 'error', 'manager_config': {}, 'expose_store': False, 'pass_manager_config': True, 'response_manager_config': {}}, 'hosts': ['5aa4a8ef-79d1-42b6-822e-647ccd8e5f0f'], 'metadata_updates': [], 'config_updates': [], 'name_updates': []}, 'uuid': 'b4be80b1-e272-4694-a2fc-ba7f697e06bc', 'index': 1, 'db_id': '1', 'worker': 'test_remote_sge_worker', 'state': 'REMOTE_ERROR', 'remote': {'step_attempts': 0, 'queue_state': None, 'process_id': '1', 'retry_time_limit': None, 'error': 'Remote error: file /home/jobflow/jfr/b4/be/80/b4be80b1-e272-4694-a2fc-ba7f697e06bc_1/jfremote_out.json for job b4be80b1-e272-4694-a2fc-ba7f697e06bc does not exist'}, 'parents': [], 'previous_state': 'TERMINATED', 'error': None, 'lock_id': None, 'lock_time': None, 'run_dir': '/home/jobflow/jfr/b4/be/80/b4be80b1-e272-4694-a2fc-ba7f697e06bc_1', 'start_time': None, 'end_time': None, 'created_on': datetime.datetime(2024, 11, 16, 16, 21, 48, 423000), 'updated_on': datetime.datetime(2024, 11, 16, 16, 21, 50, 502000), 'priority': 0, 'exec_config': None, 'resources': None, 'stored_data': None}

I've tried resetting all of the conftest settings to the defaults, to hopefully allow for much longer retry steps, but I'm concerned about step_attempts being 0. Any ideas why this might not be working? I don't think it would be related to #210, but perhaps it is? i.e., the job is marked as finished before the file itself is written -- given that this is all local, it seems strange that the file would not be present at the moment jobflow checks for it.

gpetretto · 2024-11-18T09:26:05Z

Hi @ml-evs, thanks for all the efforts spent on this.
Indeed it seems weird that this would depend on the system finishing the Job before the files have been written to the file system. But if you it can help I can try to implement #210 right away, so that we can rule that out.

I also wanted to test it locally to investigate this further but in my case the Job fails before even running. It cannot submit the SGE job to the queue. I also tried to connect to the container and submit a job manually but it fails with the same error:

Unable to initialize environment because of error: cell directory "/opt/sge/default" doesn't exist
Exiting.

Did this ever happen to you?

gpetretto · 2024-11-22T11:06:21Z

Hi @ml-evs,
I made some progress on this. First, the error above was due to my configuration. I am running on a Mac with an M processor and so there was an error in the container log, stating that the there is no SGE version available for ARM. My solution was to explicitly ask docker for amd64:

FROM --platform=linux/amd64 ubuntu:22.04 as sge

Not sure if this could/should be added to the Dockerfile for everybody. For some reason the Slurm container was already coming up with amd64. Maybe because of some settings in the nathanhess/slurm image?

Then I did some tests and indeed the problem lies in the qtk implementation. Or at least in connection with this version of SGE. The problem seems to be that the -j option to get the list of jobs based on job ids returns something that is not consistent with what the parser expects. The option exists, but the output is a very long list of properties that do not even contain the state. The code interprets this bunch of information as if there were no jobs in the list, so it considers the job as finished and tries to download right away, even before the job is started, thus leading to the failure. I would suggest discussing this in an issue on qtoolkit. I will open it providing more details.

I then managed to run the tests by setting scheduler_username="jobflow" in the test_remote_sge_worker configuration. This uses the username instead of the job id to fetch the jobs and apparently works fine in this case. Before getting the tests to pass I also had to switch runner.run(ticks=10) to runner.run_all_jobs(max_seconds=120), because the sge jobs are relatively slow to start and it took a while for them to be completed. Not sure if this is just because I am running emulating the amd64 architecture, but it may be better to do that in any case. It may even speed up the tests for the local worker.

I have a few more notes/questions related to the container. Just starting the containers takes ~10 minutes every time on my laptop. It does not seem to benefit much from caching. Again, I don't know if this is because of emulation required, but it might be a problem when developing, as running the integration tests actually helps intercepting some kind of failures and waiting ~15/20 minutes for them to complete would be quite annoying. Also every execution leaves behind a number of dangling images and volumes, that I have to remember to clean up. At some point I reached the maximum space available and I started getting different errors. Does this also happen to you, or is there some configuration that I can set to improve speed, caching and cleanup?
(just to clarify, this also happened before with just the slurm container, but with less containers and less dangling images left)
I also suppose that running on the CI will not be much faster either. I am wondering if it is worth having all the different schedulers should always be tested. Is there any option to run certain tests only on demand?
Lastly, the error above would have been much easier to spot in an integration test in qtoolkit rather than here. Do you think we could just copy this configuration there and start writing a few integration tests as well?

ml-evs · 2024-11-22T11:49:23Z

Hi @ml-evs, I made some progress on this. First, the error above was due to my configuration. I am running on a Mac with an M processor and so there was an error in the container log, stating that the there is no SGE version available for ARM. My solution was to explicitly ask docker for amd64:

Not sure if this could/should be added to the Dockerfile for everybody. For some reason the Slurm container was already coming up with amd64. Maybe because of some settings in the nathanhess/slurm image?

Thanks a lot for investigating! I guess the slurm image is only available on amd64, so it was just defaulting to that, but the base ubuntu image can be used with apple silicon. Happy to add that platform line (will do it now).

Then I did some tests and indeed the problem lies in the qtk implementation. Or at least in connection with this version of SGE. The problem seems to be that the -j option to get the list of jobs based on job ids returns something that is not consistent with what the parser expects. The option exists, but the output is a very long list of properties that do not even contain the state. The code interprets this bunch of information as if there were no jobs in the list, so it considers the job as finished and tries to download right away, even before the job is started, thus leading to the failure. I would suggest discussing this in an issue on qtoolkit. I will open it providing more details.

I then managed to run the tests by setting scheduler_username="jobflow" in the test_remote_sge_worker configuration. This uses the username instead of the job id to fetch the jobs and apparently works fine in this case.

It's more likely just how this SGE has been configured, rather than version. You really have to set everything manually, and I have no idea where I would control the default qstat output. If the user workaround is robust then we should just use that, I think.

Before getting the tests to pass I also had to switch runner.run(ticks=10) to runner.run_all_jobs(max_seconds=120), because the sge jobs are relatively slow to start and it took a while for them to be completed.

There might be some tweaking here regarding the internal polling rate of SGE -- I've tried to do this in 73d2ce6, let's see how it affects the test times.

I have a few more notes/questions related to the container. Just starting the containers takes ~10 minutes every time on my laptop.

Running the tests "from scratch" for me takes around 10 minutes without any cache, with the container builds taking a couple of minutes (similar to the "build test bench job" in the CI). The caching should be fairly reasonable -- I'm using a separate build stage for jobflow and the queues, so once the queues are built they should be essentially frozen. You can at least select either sge or slurm for the testing part (-k sge or -k slurm), but yes this is still a bit painful for development, hence why the tests are disabled locally unless CI=1.

It does not seem to benefit much from caching.

Could you try building the images natively outside of Python? I know people have performance issues with Docker on Mac (which is why things like https://orbstack.dev/ exist), but generally the docker-python connection is also not ideal (cannot use buildkit features, for example) so we might be able to improve that by just manually running docker as subcommands (recommended in docker/docker-py#2230) Here's my timings for the relevant command:

$ time docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hc  0.97s user 0.62s system 1% cpu 2:24.77 total

# then, edit a jobflow src file (add a new line) in a minor way and run with cache to mimic development -- should show only jobflow being reinstalled
$ time docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl  0.15s user 0.09s system 1% cpu 21.058 total

Also every execution leaves behind a number of dangling images and volumes, that I have to remember to clean up. At some point I reached the maximum space available and I started getting different errors.

The teardown methods of the tests should remove the container, but your cache may continue to grow if you are making lots of edits to the queues themselves. I do tend to let my docker cache get quite large on my local machine (few hundred GBs) but that's across all my projects.

Does this also happen to you, or is there some configuration that I can set to improve speed, caching and cleanup? (just to clarify, this also happened before with just the slurm container, but with less containers and less dangling images left) I also suppose that running on the CI will not be much faster either. I am wondering if it is worth having all the different schedulers should always be tested. Is there any option to run certain tests only on demand? Lastly, the error above would have been much easier to spot in an integration test in qtoolkit rather than here. Do you think we could just copy this configuration there and start writing a few integration tests as well?

It's not exactly fast for me, but I haven't seen these issues with space. One way we could consider getting around this is pushing the final built containers to ghcr.io (and labelling them as such) so most of the time will just be spent downloading them. In this way, qtoolkit could just pull the containers and write custom tests against them.

ml-evs · 2024-11-22T11:58:26Z

I think we could also consider only running the integration tests for the minimum Python version, and perhaps running the tests for each queue in parallel to speed things up.

QuantumChemist · 2024-11-22T14:51:54Z

Hey @gpetretto and @ml-evs . Pls let me know if there is anything I can also help you with, like making fixes to the SGE qtk implementation or more testing with my colleague!

gpetretto · 2024-11-22T17:04:25Z

Thanka @ml-evs for the reply. I tried running the container without python, while being slower than for you, it is indeed way faster than usually, especially the cached version

time docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hc  1,79s user 1,83s system 1% cpu 3:50,59 total

time docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl  0,26s user 0,35s system 2% cpu 24,922 total

After this I tried running the tests and it ended up build everything again, taking almost 10 minutes as well.

I have seen in the issue that you linked that there is a library wrapping the docker CLI https://github.com/gabrieldemarmiesse/python-on-whales. Maybe I should try if switching to that avoids the issues that I encountered?

ml-evs · 2024-11-22T20:56:02Z

Hmm okay, I think wit mine it at least uses the cache in Python too. Can we get this PR passing and merged first, then we can try to split it up and optimise it?

gpetretto · 2024-11-26T09:37:52Z

Hi @ml-evs, thanks for rebasing, I have seen that there are no conflict anymore (also adding a merge commit would have been fine on my side. Sorry for the late reply). But now the tests are failing. Apparently in the configuration of the SGE worker the keys for the resources do not match the values in the SGEIO template: https://github.com/Matgenix/qtoolkit/blob/62b683f345d8652a672cfec91949d353ad7a410d/src/qtoolkit/io/sge.py#L106.

However it is weird, as it does not seem that you changed them in the last commit and the tests were passing yesterday.

ml-evs · 2024-11-26T16:03:16Z

Hi @ml-evs, thanks for rebasing, I have seen that there are no conflict anymore (also adding a merge commit would have been fine on my side. Sorry for the late reply). But now the tests are failing. Apparently in the configuration of the SGE worker the keys for the resources do not match the values in the SGEIO template: Matgenix/qtoolkit@62b683f/src/qtoolkit/io/sge.py#L106.

However it is weird, as it does not seem that you changed them in the last commit and the tests were passing yesterday.

I'm also confused why this didn't fail before. I've just removed the misunderstood resource keys and just set walltime. Are these not mappable by QToolkit via QResources? I can't see for e.g., how to set the SGE equivalent of ntasks anywhere...

gpetretto · 2024-11-26T17:02:26Z

The convertion to the QResources is in the PBE base class https://github.com/Matgenix/qtoolkit/blob/62b683f345d8652a672cfec91949d353ad7a410d/src/qtoolkit/io/pbs_base.py#L158

However it seems that we are back to the output files not being generated (or downloaded at the proper time). Unless you have any guess of what could have changed, I will need to get the latest updates and check what is happening now.

ml-evs · 2024-11-27T11:26:49Z

However it seems that we are back to the output files not being generated (or downloaded at the proper time). Unless you have any guess of what could have changed, I will need to get the latest updates and check what is happening now.

Given how hard this is to debug, I think this should even be classed a jobflow-remote bug, I'm watching the logs as the tests run and nothing seems to indicate that it will fail in this manner. I don't have a huge amount of time to probe this deeper at the moment, but when I do I will return to first principles of checking that I can actually submit a job in the container independently of jobflow-remote...

davidwaroquiers · 2024-11-27T12:22:14Z

Maybe to distinguish different possible sources, one good thing could be to "replicate" the integration tests in qtoolkit directly. I think it makes sense to test queue systems with integration tests directly in qtoolkit. That does not mean it's not needed to do it also in jobflow-remote and I think it is needed. But maybe this will help find out if some of the issue(s) are only related to queue systems interfaces or with jfr itself. Do you think it would be difficult to put that in place in qtoolkit ?

gpetretto · 2024-11-27T13:57:08Z

Hi @ml-evs, I looked again at the error message of the tests and I think I have found the problem (or at least one) that is causing the failure: it seems that during one of the merges the resources for test_remote_sge_worker have reverted back to those for slurm. Can you try changing them again so that they match the keys for SGE?

ml-evs · 2024-11-27T13:58:02Z

Maybe to distinguish different possible sources, one good thing could be to "replicate" the integration tests in qtoolkit directly. I think it makes sense to test queue systems with integration tests directly in qtoolkit. That does not mean it's not needed to do it also in jobflow-remote and I think it is needed. But maybe this will help find out if some of the issue(s) are only related to queue systems interfaces or with jfr itself. Do you think it would be difficult to put that in place in qtoolkit ?

One thing we can do is build the latest jobflow-remote test bench (including queues) in the CI and push it to the GitHub container registry (ghcr.io) so that qtoolkit can simply download it. We could then run the jobflow-remote integration test suite in the CI of qtoolkit.

ml-evs · 2024-11-27T14:00:05Z

Hi @ml-evs, I looked again at the error message of the tests and I think I have found the problem (or at least one) that is causing the failure: it seems that during one of the merges the resources for test_remote_sge_worker have reverted back to those for slurm. Can you try changing them again so that they match the keys for SGE?

Hi @gpetretto, yes, I reverted the commit the updated the resources as I'm 90% sure it was working with it originally, I've just reintroduced it again but still expect it to fail.

davidwaroquiers · 2024-11-27T14:10:22Z

Maybe to distinguish different possible sources, one good thing could be to "replicate" the integration tests in qtoolkit directly. I think it makes sense to test queue systems with integration tests directly in qtoolkit. That does not mean it's not needed to do it also in jobflow-remote and I think it is needed. But maybe this will help find out if some of the issue(s) are only related to queue systems interfaces or with jfr itself. Do you think it would be difficult to put that in place in qtoolkit ?

One thing we can do is build the latest jobflow-remote test bench (including queues) in the CI and push it to the GitHub container registry (ghcr.io) so that qtoolkit can simply download it. We could then run the jobflow-remote integration test suite in the CI of qtoolkit.

I was more referring to have integration tests in qtoolkit with the different queues, not necessarily run jobflow-remote tests when qtoolkit has updates. Do you think this would be needed ? I guess we could have something where it's not "forbidden" to introduce backward incompatibilities in qtoolkit and then dependabot on jobflow-remote will just assess that the tests do not pass anymore on jfr and then we need to do something in jfr itself. What do you think ?

ml-evs · 2024-11-27T14:27:02Z

For everyone's information, the problem seemed to be that SGE was back to defaulting to the wrong shell again, so it could not load the pre_run properly. Probably we should check that pre_run can be loaded at submission time (or does it need to be run in the job environment?) and improve the error handling around it (will raise issue).

ml-evs · 2024-11-27T14:45:50Z

Maybe to distinguish different possible sources, one good thing could be to "replicate" the integration tests in qtoolkit directly. I think it makes sense to test queue systems with integration tests directly in qtoolkit. That does not mean it's not needed to do it also in jobflow-remote and I think it is needed. But maybe this will help find out if some of the issue(s) are only related to queue systems interfaces or with jfr itself. Do you think it would be difficult to put that in place in qtoolkit ?

One thing we can do is build the latest jobflow-remote test bench (including queues) in the CI and push it to the GitHub container registry (ghcr.io) so that qtoolkit can simply download it. We could then run the jobflow-remote integration test suite in the CI of qtoolkit.

I was more referring to have integration tests in qtoolkit with the different queues, not necessarily run jobflow-remote tests when qtoolkit has updates. Do you think this would be needed ? I guess we could have something where it's not "forbidden" to introduce backward incompatibilities in qtoolkit and then dependabot on jobflow-remote will just assess that the tests do not pass anymore on jfr and then we need to do something in jfr itself. What do you think ?

Then it should be straight forward just to take the container definition (or built images) from here, then people can write the tests themselves! I'm not going to suggest that I immediately try to add PBS to the container, given how much work this one was...

davidwaroquiers · 2024-11-29T10:48:03Z

Maybe to distinguish different possible sources, one good thing could be to "replicate" the integration tests in qtoolkit directly. I think it makes sense to test queue systems with integration tests directly in qtoolkit. That does not mean it's not needed to do it also in jobflow-remote and I think it is needed. But maybe this will help find out if some of the issue(s) are only related to queue systems interfaces or with jfr itself. Do you think it would be difficult to put that in place in qtoolkit ?

One thing we can do is build the latest jobflow-remote test bench (including queues) in the CI and push it to the GitHub container registry (ghcr.io) so that qtoolkit can simply download it. We could then run the jobflow-remote integration test suite in the CI of qtoolkit.

I was more referring to have integration tests in qtoolkit with the different queues, not necessarily run jobflow-remote tests when qtoolkit has updates. Do you think this would be needed ? I guess we could have something where it's not "forbidden" to introduce backward incompatibilities in qtoolkit and then dependabot on jobflow-remote will just assess that the tests do not pass anymore on jfr and then we need to do something in jfr itself. What do you think ?

Then it should be straight forward just to take the container definition (or built images) from here, then people can write the tests themselves! I'm not going to suggest that I immediately try to add PBS to the container, given how much work this one was...

I think that would be great. We can start with slurm and sge, having people writing their own tests of course. When you have some time to just take the current container definitions and add this to qtoolkit (maybe with one small example test ?), that would be great! Then other contributors can add integration tests for the different functionalities. I'll maybe add an issue in qtoolkit itself for keeping track of that in the right place.

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 78f6054 to 35d0e40 Compare August 3, 2024 14:07

ml-evs force-pushed the ml-evs/generalize-integration-tests branch 2 times, most recently from 0e54d6b to ae798fd Compare September 1, 2024 13:31

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 7577c6c to 4062fdf Compare September 9, 2024 10:22

ml-evs mentioned this pull request Sep 17, 2024

New CLI functionalities: tree, report, job info #180

Merged

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from e311fb0 to b1ee827 Compare September 19, 2024 14:15

ml-evs mentioned this pull request Sep 20, 2024

Implementation of SGE interface Matgenix/qtoolkit#43

Merged

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 645d9cb to 8637ef5 Compare September 25, 2024 09:32

ml-evs force-pushed the ml-evs/generalize-integration-tests branch 2 times, most recently from 4c7c7ea to e3cd1a3 Compare October 11, 2024 14:06

ml-evs marked this pull request as ready for review October 11, 2024 14:23

ml-evs requested a review from gpetretto October 11, 2024 14:23

gpetretto approved these changes Oct 14, 2024

View reviewed changes

gpetretto mentioned this pull request Nov 8, 2024

Delete job files when rerunning; customizable execution command #201

Merged

ml-evs mentioned this pull request Nov 15, 2024

Optionally delay download after job end #210

Open

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 578da9a to 242148b Compare November 22, 2024 21:00

ml-evs added 4 commits November 25, 2024 22:19

Test all remote workers with scheduler username config

46954f7

Switch fixed number of runner ticks to with a timeout

c712755

Add pytest-durations plugin to see slowest steps

1c04630

Try to remove containers after test run

4476ce3

ml-evs force-pushed the ml-evs/generalize-integration-tests branch 3 times, most recently from 02dbef9 to c30f3e5 Compare November 25, 2024 22:57

Remove try/finally in pytest fixture to simplify PR diff

a1c9399

ml-evs force-pushed the ml-evs/generalize-integration-tests branch from c30f3e5 to a1c9399 Compare November 25, 2024 22:59

Re-disable remove and autoremove

4ee17a8

Remove resources for SGE

ea0ae36

Force bash shell in SGE

77ee908

This was referenced Nov 27, 2024

QToolKit is currently pinned to development version #218

Open

Integration tests are very slow and test bench build is not cached properly locally #219

Open

ml-evs mentioned this pull request Nov 27, 2024

Better error handling for pre_run #220

Open

ml-evs merged commit 41bf0b2 into develop Nov 27, 2024
7 checks passed

ml-evs deleted the ml-evs/generalize-integration-tests branch November 27, 2024 14:45

davidwaroquiers mentioned this pull request Nov 29, 2024

Integration tests Matgenix/qtoolkit#51

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalize integration tests to other queue systems (SGE) and test more Python versions #160

Generalize integration tests to other queue systems (SGE) and test more Python versions #160

ml-evs commented Aug 2, 2024 •

edited

Loading

codecov-commenter commented Aug 3, 2024 •

edited

Loading

gpetretto commented Aug 7, 2024

ml-evs commented Aug 22, 2024

gpetretto commented Aug 23, 2024

ml-evs commented Oct 11, 2024

gpetretto left a comment

davidwaroquiers commented Nov 8, 2024

ml-evs commented Nov 16, 2024

gpetretto commented Nov 18, 2024

gpetretto commented Nov 22, 2024 •

edited

Loading

ml-evs commented Nov 22, 2024

ml-evs commented Nov 22, 2024

QuantumChemist commented Nov 22, 2024 •

edited

Loading

gpetretto commented Nov 22, 2024

ml-evs commented Nov 22, 2024

gpetretto commented Nov 26, 2024

ml-evs commented Nov 26, 2024

gpetretto commented Nov 26, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 27, 2024

gpetretto commented Nov 27, 2024

ml-evs commented Nov 27, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 27, 2024

ml-evs commented Nov 27, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 29, 2024

Generalize integration tests to other queue systems (SGE) and test more Python versions #160

Generalize integration tests to other queue systems (SGE) and test more Python versions #160

Conversation

ml-evs commented Aug 2, 2024 • edited Loading

codecov-commenter commented Aug 3, 2024 • edited Loading

Codecov Report

gpetretto commented Aug 7, 2024

ml-evs commented Aug 22, 2024

gpetretto commented Aug 23, 2024

ml-evs commented Oct 11, 2024

gpetretto left a comment

Choose a reason for hiding this comment

davidwaroquiers commented Nov 8, 2024

ml-evs commented Nov 16, 2024

gpetretto commented Nov 18, 2024

gpetretto commented Nov 22, 2024 • edited Loading

ml-evs commented Nov 22, 2024

ml-evs commented Nov 22, 2024

QuantumChemist commented Nov 22, 2024 • edited Loading

gpetretto commented Nov 22, 2024

ml-evs commented Nov 22, 2024

gpetretto commented Nov 26, 2024

ml-evs commented Nov 26, 2024

gpetretto commented Nov 26, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 27, 2024

gpetretto commented Nov 27, 2024

ml-evs commented Nov 27, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 27, 2024

ml-evs commented Nov 27, 2024

ml-evs commented Nov 27, 2024

davidwaroquiers commented Nov 29, 2024

ml-evs commented Aug 2, 2024 •

edited

Loading

codecov-commenter commented Aug 3, 2024 •

edited

Loading

gpetretto commented Nov 22, 2024 •

edited

Loading

QuantumChemist commented Nov 22, 2024 •

edited

Loading