DDBB sync improvements #53

Open
LuiggiTenorioK opened this issue Dec 4, 2023 · 5 comments

Comments

@LuiggiTenorioK
Member

While addressing issue #49 and following up on issue #34, I found that the DDBB tables can be improved to give better results when searching experiments.

To that end, a few issues have to be handled:

  • Avoid deletion: First, in the populate_details background task, the details table could use an INSERT or UPDATE strategy instead of deleting the whole table and repopulating it from scratch. This matters because, if there are many experiments or the process breaks partway through, data may be unavailable for searching until the next call to the background task (4 hours). The same strategy should be applied to the experiment_status table, where the status record is currently deleted after the experiment finishes instead of being updated.

  • Single Source of Truth (SSOT): Each data concept stored in the tables must have a Single Source of Truth. That is, a single function for getting each piece of data should be used everywhere and, if possible, mapped somewhere in the documentation (a data catalog might help). For example, there should be just one way to obtain the user of an experiment, shared by every endpoint and also used to update the DDBB. That information can then be read from the table (as a cached snapshot) or, if it is needed in real time, obtained by calling the same function that was used to update it.

  • Reactive updating: As explained above, every time an SSOT function is called, the data should also be updated in the DDBB (if that is feasible and scalable). The DDBB can then provide more recent data directly without calling the SSOT function (it is assumed that reading from the DDBB is faster than calling the SSOT function).

  • Extend data available in DDBBs: As Autosubmit grows, other data concepts might be included in the details table (e.g. wrapper type, job status counters, etc.) or in additional one-to-many tables (e.g. metadata, as @kinow suggested). This will enrich the search while relying only on data available in the DDBB, which is the ideal case.
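
As a minimal sketch of the "avoid deletion" point, the snippet below uses SQLite's upsert clause (`INSERT ... ON CONFLICT DO UPDATE`, available from SQLite 3.24) through Python's `sqlite3` module. The table layout, the column names (`exp_id`, `user_name`, `hpc`) and the function name `upsert_details` are assumptions made for illustration, not the project's actual schema or code.

```python
import sqlite3


def upsert_details(conn, rows):
    """Insert new rows or update existing ones, never emptying the table."""
    # `with conn` wraps the batch in one transaction: if anything fails,
    # the previous contents stay in place and remain searchable.
    with conn:
        conn.executemany(
            """
            INSERT INTO details (exp_id, user_name, hpc)
            VALUES (:exp_id, :user_name, :hpc)
            ON CONFLICT(exp_id) DO UPDATE SET
                user_name = excluded.user_name,
                hpc = excluded.hpc
            """,
            rows,
        )


# Hypothetical, simplified schema for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE details (exp_id INTEGER PRIMARY KEY, user_name TEXT, hpc TEXT)"
)

# First worker run: one experiment is known.
upsert_details(conn, [{"exp_id": 1, "user_name": "alice", "hpc": "mn4"}])

# Second run: experiment 1 changed platform, experiment 2 is new.
upsert_details(
    conn,
    [
        {"exp_id": 1, "user_name": "alice", "hpc": "mn5"},
        {"exp_id": 2, "user_name": "bob", "hpc": "mn5"},
    ],
)

rows = conn.execute(
    "SELECT exp_id, user_name, hpc FROM details ORDER BY exp_id"
).fetchall()
```

Rows for experiments that no longer exist would still need an explicit cleanup step, which this sketch leaves out.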

As a rough sketch, these improvements could be handled as follows:

  1. Declare SSOT functions for each relevant data concept
  2. Refactor workers (background tasks) to call the SSOT functions and populate tables
  3. Decide which endpoints (/v4) need to call SSOTs or the DDBB, and refactor them
  4. Apply reactive updates to the ones that call SSOTs, if feasible
  5. Extend DDBB data and repeat 1-4 for the new relevant data
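
Steps 1, 2 and 4 could look roughly like the sketch below. Everything in it is hypothetical: a plain dict (`ddbb`) stands in for the details table, `FAKE_FILES` stands in for the experiment files, and the `ssot` decorator and function names are invented for illustration only.

```python
from typing import Callable, Dict

# Hypothetical stand-ins: `ddbb` plays the role of the details table,
# `FAKE_FILES` plays the role of the experiment files on disk.
ddbb: Dict[str, Dict[str, str]] = {}
FAKE_FILES = {"a000": "alice", "a001": "bob"}
SSOT_FUNCTIONS: Dict[str, Callable[[str], str]] = {}


def ssot(concept: str):
    """Register the single function that resolves `concept` (step 1)."""

    def register(fn):
        def wrapper(expid: str) -> str:
            value = fn(expid)
            # Reactive update (step 4): every call refreshes the cached value.
            ddbb.setdefault(expid, {})[concept] = value
            return value

        if concept in SSOT_FUNCTIONS:
            raise ValueError(f"{concept!r} already has an SSOT function")
        SSOT_FUNCTIONS[concept] = wrapper
        return wrapper

    return register


@ssot("user")
def get_experiment_user(expid: str) -> str:
    # Placeholder for the real lookup (file reads, preprocessing, ...).
    return FAKE_FILES[expid]


def populate_details(expids) -> None:
    """Worker (step 2): fill the table only through SSOT functions."""
    for expid in expids:
        for resolve in SSOT_FUNCTIONS.values():
            resolve(expid)


populate_details(["a000", "a001"])
snapshot = {expid: dict(values) for expid, values in ddbb.items()}

FAKE_FILES["a001"] = "carol"  # the source changes after the worker ran
live_user = get_experiment_user("a001")  # an endpoint asks for live data
```

Registering each concept exactly once is what keeps the worker and the endpoints from drifting apart: both go through the same function, and a second registration for the same concept fails loudly.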

@mcastril

@LuiggiTenorioK
Member Author

marked this issue as related to autosubmit#1179

@LuiggiTenorioK
Member Author

mentioned in issue autosubmit#1179

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Jan 4, 2024, 10:00

Hi @LuiggiTenorioK. This is a great initiative.

I think this should be part of the DDBB re-design for AS4 (https://earth.bsc.es/gitlab/es/autosubmit/-/issues/858), which led nowhere. You could have a very active (and leading) role in this re-design.

Then we would "only" have to decide how to keep backward compatibility, especially for the workers.

Can you elaborate more on "Then, DDBB can provide more recent data directly without calling the SSOT function (Is assumed that DDBB data is faster to get than calling the SSOT function)." ?

@LuiggiTenorioK
Member Author

Can you elaborate more on "Then, DDBB can provide more recent data directly without calling the SSOT function (Is assumed that DDBB data is faster to get than calling the SSOT function)." ?

There are cases where some data have to be read from files or be preprocessed to get the final value. Then, the DDBB acts as a cache of these final values.

A case where this is useful is when searching through the experiments. Doing that same process for each experiment can be expensive if there are many. So, in this case, the DDBB is used.

The idea of reactive updating is that these values are refreshed when we request the information for just one experiment, since there is no "many experiments" issue in that case. So, when we do have that issue, the data given by the DDBB is as recent as the last time the experiment was visited (a worker could also visit it periodically).
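
The two read paths described here can be sketched as follows, under stated assumptions: the `details` layout, the helper `read_user_from_files` and both endpoint-like functions are hypothetical, and the expensive lookup is simulated with a counter so that the number of calls is visible.

```python
import sqlite3

calls = {"expensive": 0}


def read_user_from_files(expid):
    """Hypothetical expensive lookup (file reads, preprocessing)."""
    calls["expensive"] += 1
    return {"a000": "alice", "a001": "bob"}[expid]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE details (expid TEXT PRIMARY KEY, user_name TEXT)")
# Snapshot left by an earlier worker run; a001 is out of date on purpose.
conn.executemany(
    "INSERT INTO details (expid, user_name) VALUES (?, ?)",
    [("a000", "alice"), ("a001", "carol")],
)


def search_by_user(user_name):
    """Many experiments: read only the cached table, no expensive calls."""
    rows = conn.execute(
        "SELECT expid FROM details WHERE user_name = ? ORDER BY expid",
        (user_name,),
    )
    return [row[0] for row in rows]


def get_experiment(expid):
    """One experiment: compute the live value and write it back."""
    user_name = read_user_from_files(expid)
    with conn:
        conn.execute(
            """
            INSERT INTO details (expid, user_name) VALUES (?, ?)
            ON CONFLICT(expid) DO UPDATE SET user_name = excluded.user_name
            """,
            (expid, user_name),
        )
    return {"expid": expid, "user": user_name}


stale = search_by_user("bob")  # the snapshot still says "carol"
get_experiment("a001")  # visiting the experiment refreshes its row
fresh = search_by_user("bob")  # the search now sees the refreshed value
```

The search never pays for the expensive lookup; it only becomes more accurate as individual experiments are visited, by users or by a periodic worker.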

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Jan 5, 2024, 13:27

Thank you for the clarification, it's clearer now.

@LuiggiTenorioK LuiggiTenorioK self-assigned this Nov 12, 2024