autosubmit gui bar graph not progressing #31

Open
LuiggiTenorioK opened this issue Sep 6, 2023 · 8 comments
Labels: bug, to do

Comments

@LuiggiTenorioK
Member

In GitLab by @ebergas on Sep 6, 2023, 09:50

Hi, I have several experiments I had to resubmit after this weekend. While the experiments themselves are running smoothly, the progress bar fails to reflect the true status. For instance, it currently displays "142/440 jobs," but more than 200 jobs have actually completed. However, when I check the tree graph, it shows an accurate representation of what is going on.

Screenshot_from_2023-09-06_09-39-08

What could be going wrong? I find it very useful to just check the progress with the preview, and now I cannot do that.

@LuiggiTenorioK
Member Author

In GitLab by @manuel-g-castro on Sep 6, 2023, 12:41

First and foremost: I am really sorry, but we don't have anyone in charge of both the API and GUI since Julián left. The new developer, Luiggi, should arrive on October 1st, according to @mcastril's latest news from HR.

BUT that didn't stop me from trying to see what the issue was (even though, DISCLAIMER, I know very little about web development).

This issue is indeed really, really weird! I saw that more experiments are reporting incorrectly in the progress bar: a6dj, a6e7, a6e8, a6dk, and a6di. But there is one that is not showing this behavior: a6e9, which is the only one that has been running since August 25th. All the other experiments seem to have failed somewhere around September 3rd, right? And you reran them in the afternoon of that same day.

@dbeltrankyl noticed that there are empty databases matching the expids of the problematic experiments. Interestingly, a6e9 is not among those troublesome databases. The issue might be solved by manually deleting these files and rerunning the experiments, if you think this is worth it.

In the meantime, I am transferring this issue to the API, since I believe that the GUI is reading the values properly.

mgimenez@bsces107930 ~ % cd /esarchive/autosubmit/as_metadata/data

mgimenez@bsces107930 ~/Documents/esarchive/autosubmit/as_metadata/data ls -l | grep -e a6dj -e  a6e7 -e a6e8 -e  a6dk -e a6di -e a6e9
-rwxrwxrw- 1 2401 565      7168 ago 14 19:31 job_data_a6di.db
-rw-rw-rw- 1 2401 565      1659 sep  6 10:46 job_data_a6di.sql
-rwxrwxrw- 1 2401 565      7168 ago 17 16:26 job_data_a6dj.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:15 job_data_a6dj.sql
-rwxrwxrw- 1 2401 565      7168 ago 17 16:39 job_data_a6dk.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:22 job_data_a6dk.sql
-rwxrwxrw- 1 2401 565      7168 ago 22 11:31 job_data_a6e7.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:26 job_data_a6e7.sql
-rwxrwxrw- 1 2401 565      7168 ago 22 11:49 job_data_a6e8.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:26 job_data_a6e8.sql
-rwxrwxrw- 1 2401 565    444416 sep  6 10:58 job_data_a6e9.db
-rw-rw-rw- 1 2401 565    360468 sep  6 10:58 job_data_a6e9.sql
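
The size check in the listing above could be scripted. The following is a minimal sketch, assuming that a job_data_{expid}.db file of only a few kilobytes (7168 bytes in the listing) has no job rows in it. The function name find_small_job_dbs and the 8192-byte threshold are hypothetical choices for illustration, not part of Autosubmit.

```python
from pathlib import Path


def find_small_job_dbs(data_dir, max_bytes=8192):
    """Return (expid, size) pairs for job_data_<expid>.db files at or below max_bytes.

    The threshold is an assumption taken from the 7168-byte files in the
    listing; a small file is only a hint that the database may be empty.
    """
    results = []
    for db_path in sorted(Path(data_dir).glob("job_data_*.db")):
        size = db_path.stat().st_size
        if size <= max_bytes:
            # Strip the "job_data_" prefix from the file stem to get the expid.
            expid = db_path.stem[len("job_data_"):]
            results.append((expid, size))
    return results
```

Calling find_small_job_dbs("/esarchive/autosubmit/as_metadata/data") would return the candidates to inspect by hand; it does not open or modify any database.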

@LuiggiTenorioK
Member Author

In GitLab by @manuel-g-castro on Sep 6, 2023, 12:41

moved from autosubmitreact#85

@LuiggiTenorioK
Member Author

Noticed that the completed-jobs indicator in the progress bar is read from job_data_{expid}.db, while the tree view gets it from the .pkl files. This might be an internal problem in the worker process that populates the .db files. We will need to reproduce the bug and debug that worker.

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Oct 9, 2023, 16:46

This is very interesting information, Luiggi.

@LuiggiTenorioK
Member Author

It seems that the populate_queue_run_times.py worker was not updating the number of completed jobs because it was trying to insert new data into a table constrained by its primary keys. So an INSERT OR REPLACE statement had to be added, so that the existing row is updated when the insert hits a PRIMARY KEY constraint.

This was already patched in commit 4e41a40, which will be available in pre-release v4.0.0b2.

Even so, we will need to watch this closely in production to see whether the patch fixes the problem, since we have no details that would let us reproduce the error.
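
The primary-key behaviour described above can be reproduced in isolation with Python's sqlite3 module. This is only a sketch: the table experiment_times, its columns, and the value 205 are hypothetical, since the worker's real schema is not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real worker's schema is not shown in this thread.
conn.execute(
    "CREATE TABLE experiment_times ("
    "exp_id INTEGER PRIMARY KEY, completed INTEGER, total INTEGER)"
)
conn.execute("INSERT INTO experiment_times VALUES (1, 142, 440)")

# A plain INSERT that reuses an existing primary key is rejected,
# so the stale row stays in place.
try:
    conn.execute("INSERT INTO experiment_times VALUES (1, 205, 440)")
    plain_insert_failed = False
except sqlite3.IntegrityError:
    plain_insert_failed = True

stale = conn.execute(
    "SELECT completed FROM experiment_times WHERE exp_id = 1"
).fetchone()[0]

# INSERT OR REPLACE removes the conflicting row and inserts the new one.
conn.execute("INSERT OR REPLACE INTO experiment_times VALUES (1, 205, 440)")
fresh = conn.execute(
    "SELECT completed FROM experiment_times WHERE exp_id = 1"
).fetchone()[0]
conn.close()
```

If the worker swallowed or only logged the IntegrityError, the counter would stay frozen at its first value, which would be consistent with a progress bar stuck at "142/440".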

@LuiggiTenorioK
Member Author

@mcastril @dbeltrankyl The issue we saw today with experiments a6zk and a70a is related to this. In this case, the experiment pkl file and the job_data_{expid}.db are not synchronized, and the database file is also empty.

Initially, I thought that the experiment wasn't running, but I confirmed that it is active by checking that AS_LOGS/20240320_160854_run.log was being continuously updated.
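
The same "is the run log still being written?" check can be done from a script by looking at the file's modification time. This is a minimal sketch; log_is_fresh and the 300-second window are hypothetical choices, not an Autosubmit feature.

```python
import time
from pathlib import Path


def log_is_fresh(log_path, max_age_seconds=300, now=None):
    """Return True if log_path was modified within the last max_age_seconds.

    `now` can be passed explicitly (as a Unix timestamp) to make the
    check reproducible; by default the current time is used.
    """
    now = time.time() if now is None else now
    age = now - Path(log_path).stat().st_mtime
    return age <= max_age_seconds
```

A recent modification time only suggests that the experiment is still running; a long-running job that writes nothing to the log for a while would look idle under this check.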

@LuiggiTenorioK
Member Author

mentioned in issue autosubmit#1262

@LuiggiTenorioK
Member Author

Just tested another buggy experiment we saw yesterday (a6yi), which doesn't show the total or completed jobs. The issue was related to the data types that weren't controlled in the removed as_times.db tables.

With version v4.0.0b5 it works as intended because it uses the distributed databases 🎉
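
As background for the data type remark above: the thread does not say which values were wrong, but one general way types can go uncontrolled in SQLite is its flexible typing, where a column declared INTEGER still accepts text that cannot be converted. The sketch below only illustrates that SQLite behaviour; the table, columns, and values are hypothetical and are not taken from as_times.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names, for illustration only.
conn.execute("CREATE TABLE times (name TEXT PRIMARY KEY, total INTEGER)")

# Text that cannot be converted to a number is stored as text,
# even though the column is declared INTEGER.
conn.execute("INSERT INTO times VALUES ('exp_a', 'n/a')")
# A numeric-looking string is converted by the column's INTEGER affinity.
conn.execute("INSERT INTO times VALUES ('exp_b', '440')")

# typeof() reports the storage class actually used for each value.
stored_types = dict(conn.execute("SELECT name, typeof(total) FROM times"))
conn.close()
```

A reader that expects an integer in every row would then need to validate values itself, or the writer would need to enforce types before inserting.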

@LuiggiTenorioK LuiggiTenorioK self-assigned this Nov 12, 2024