DDBB sync improvements #53

LuiggiTenorioK · 2023-12-04T16:15:23Z

While addressing issue #49 and following up on issue #34, I found that the DDBB tables can be improved to give better results when searching for experiments.

To that end, a few issues have to be handled:

Avoid deletion: First, in the populate_details background task, the details table could use an INSERT or UPDATE strategy instead of deleting the whole table and populating it from scratch. This is important because, if there are a lot of experiments or the process breaks at some point, the data might be unavailable for searching until the next call to the background task (4 hours). The same strategy must also be applied to the experiment_status table, because the status record is deleted after the experiment finishes instead of being updated.
Single Source of Truth (SSOT): Table data must have a Single Source of Truth for each data concept. That is, a single function for getting each piece of data should be used everywhere and mapped somewhere in the documentation, if possible (having a data catalog might help). For example, there should be just one way to obtain the user of an experiment, shared by every endpoint and used to update the DDBB. That information can then be read from the table (as a cached snapshot) or, if it is needed in real time, obtained by calling the same function that was used to update it.
Reactive updating: As explained above, every time an SSOT function is called, the data should be updated in the DDBB (if that is feasible and scalable). The DDBB can then provide more recent data directly without calling the SSOT function (it is assumed that DDBB data is faster to get than calling the SSOT function).
Extend data available in DDBBs: As Autosubmit grows, other data concepts might be included in the details table (e.g. wrapper type, job status counters, etc.) or in additional one-to-many tables (e.g. metadata, as @kinow suggested). This will enrich the search while relying only on the data available in the DDBB, which is the ideal case.
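To illustrate the "avoid deletion" point, here is a minimal sketch of an INSERT or UPDATE (upsert) pass using Python's sqlite3 module. The column names (exp_id, user, hpc) and the function name upsert_details are assumptions made for the example, not the actual schema or code of the API.

```python
import sqlite3


def upsert_details(conn, rows):
    """Insert new rows or update existing ones, keyed by exp_id.

    The table layout here is hypothetical and only serves the example.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS details ("
        "exp_id INTEGER PRIMARY KEY, user TEXT, hpc TEXT)"
    )
    # One transaction: if anything fails, the changes are rolled back and
    # the previously stored rows stay available for searching.
    with conn:
        conn.executemany(
            "INSERT INTO details (exp_id, user, hpc) VALUES (?, ?, ?) "
            "ON CONFLICT(exp_id) DO UPDATE SET "
            "user = excluded.user, hpc = excluded.hpc",
            rows,
        )


conn = sqlite3.connect(":memory:")
# First pass of the background task.
upsert_details(conn, [(1, "alice", "local"), (2, "bob", "marenostrum")])
# Second pass: experiment 2 changed, experiment 3 is new, experiment 1 is untouched.
upsert_details(conn, [(2, "bob", "lumi"), (3, "carol", "local")])

rows = conn.execute(
    "SELECT exp_id, user, hpc FROM details ORDER BY exp_id"
).fetchall()
print(rows)
```

Because nothing is deleted up front, a crash in the middle of a pass leaves the table in its previous state rather than empty. The ON CONFLICT clause needs SQLite 3.24 or newer.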

As a rough draft, one way to handle these improvements might be the following:

1. Declare SSOT functions for each relevant data concept
2. Refactor workers (background tasks) to call the SSOT functions and populate tables
3. Decide which endpoints (/v4) need to call SSOTs or DDBBs and refactor them
4. Apply reactive update to the ones that call SSOTs if it is feasible
5. Extend DDBB data and repeat 1-4 for the new relevant data
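The first two steps could be sketched roughly as follows. Everything here is hypothetical: the ssot decorator, the SSOT registry, the get_user / get_hpc functions and the FAKE_FILES dictionary (standing in for the experiment files) are illustration names only, and a plain dict plays the role of the details table.

```python
from typing import Callable, Dict

# Step 1: one registered function per data concept.
SSOT: Dict[str, Callable[[str], str]] = {}


def ssot(concept: str):
    """Register the single function allowed to compute `concept`."""
    def register(fn):
        if concept in SSOT:
            raise ValueError(f"duplicate SSOT for {concept!r}")
        SSOT[concept] = fn
        return fn
    return register


# Hypothetical stand-in for data that would be read from experiment files.
FAKE_FILES = {
    "a000": {"owner": "alice", "platform": "local"},
    "a001": {"owner": "bob", "platform": "marenostrum"},
}


@ssot("user")
def get_user(expid: str) -> str:
    return FAKE_FILES[expid]["owner"]


@ssot("hpc")
def get_hpc(expid: str) -> str:
    return FAKE_FILES[expid]["platform"]


def populate_details(expids, table):
    """Step 2: the worker fills the table only through SSOT functions."""
    for expid in expids:
        row = table.setdefault(expid, {})
        for concept, fn in SSOT.items():
            row[concept] = fn(expid)
    return table


table = populate_details(["a000", "a001"], {})
print(table)
```

Registering a second function for the same concept raises an error, which is one simple way to enforce "just one way to obtain" each value. Adding a new data concept (step 5) then means registering one more function, and the worker picks it up without changes.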

@mcastril

LuiggiTenorioK · 2023-12-14T10:33:19Z

marked this issue as related to autosubmit#1179

LuiggiTenorioK · 2023-12-14T10:35:49Z

mentioned in issue autosubmit#1179

LuiggiTenorioK · 2024-01-04T09:00:43Z

In GitLab by @mcastril on Jan 4, 2024, 10:00

Hi @LuiggiTenorioK. This is a great initiative.

I think this should be part of the DDBB re-design for AS4 (https://earth.bsc.es/gitlab/es/autosubmit/-/issues/858), which led nowhere. You could have a very active (and leading) role in this re-design.

Then we would "only" have to decide how to keep backward compatibility, especially for the workers.

Can you elaborate more on "Then, DDBB can provide more recent data directly without calling the SSOT function (Is assumed that DDBB data is faster to get than calling the SSOT function)." ?

LuiggiTenorioK · 2024-01-04T10:37:33Z

Can you elaborate more on "Then, DDBB can provide more recent data directly without calling the SSOT function (Is assumed that DDBB data is faster to get than calling the SSOT function)." ?

There are cases where some data have to be read from files or be preprocessed to get the final value. Then, the DDBB acts as a cache of these final values.

A case where this is useful is when searching through the experiments. Doing that same process for each experiment can be expensive if there are many. So, in this case, the DDBB is used.

The idea behind reactively updating these values is that they are refreshed whenever we request the information of just one experiment, since there is no "many experiments" issue in that case. So, when we do have that issue, the data given by the DDBB is as recent as the last time that experiment was visited (it could also be visited periodically by a worker).
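A small sketch of that behaviour, under the assumption that the DDBB can be modelled as an in-memory dict: the ExperimentCache class, its get_one / search methods and the read_status function are hypothetical names for the example, not existing API code.

```python
import time


class ExperimentCache:
    """Toy model of the DDBB acting as a cache in front of an SSOT function."""

    def __init__(self, compute, clock=time.time):
        self._compute = compute  # the (expensive) SSOT function
        self._clock = clock
        self._rows = {}          # expid -> (value, refreshed_at)

    def get_one(self, expid):
        """Single-experiment request: call the SSOT and refresh the cached row."""
        value = self._compute(expid)
        self._rows[expid] = (value, self._clock())
        return value

    def search(self, predicate):
        """Many-experiment request: read cached rows only, never call the SSOT."""
        return sorted(e for e, (v, _) in self._rows.items() if predicate(v))


calls = []
status = {"a000": "RUNNING", "a001": "RUNNING"}


def read_status(expid):
    # Stand-in for reading files or preprocessing data.
    calls.append(expid)
    return status[expid]


cache = ExperimentCache(read_status)
for expid in status:          # e.g. a periodic worker visiting every experiment
    cache.get_one(expid)

status["a000"] = "COMPLETED"  # the experiment finishes on disk
stale = cache.search(lambda s: s == "COMPLETED")  # cache not refreshed yet
cache.get_one("a000")         # someone opens the experiment page
fresh = cache.search(lambda s: s == "COMPLETED")
print(stale, fresh, len(calls))
```

The search never triggers the expensive function, so its cost does not grow with the per-experiment processing; the price is that a row is only as fresh as its last single-experiment visit.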

LuiggiTenorioK · 2024-01-05T12:27:17Z

In GitLab by @mcastril on Jan 5, 2024, 13:27

Thank you for the clarification, it's clearer now

LuiggiTenorioK self-assigned this Nov 12, 2024
