
Cache PAG serialization #20

Open
SamStudio8 opened this issue May 28, 2020 · 5 comments
@SamStudio8
Owner

After three months of Majora-ing, I think we have discovered an interesting flaw in the process model. It's important that we're able to model the concepts of samples, tubes, boxes, files, directories and the processes that are applied to them: it means we can quickly return information on particular artifacts and more easily model how to create and validate such artifacts through the API. It makes natural sense to send and receive information about these real-world items through the API with structures that try to represent them.

Yet, when it comes to analyses, we most often want to dump our knowledge about these carefully crafted objects into a gigantic unstructured flat file to tabulate, count and plot things of interest. It's not impossible to do this - we can already unroll all the links between artifacts and processes to traverse the process tree model that is central to how Majora records the journey of an artifact.

The two issues with this are:

  • The unravelling is quite slow, likely owing to the suboptimal implementation (given my Django learning curve and time constraints) and the sheer number of models involved
  • The unravelling is quite inflexible. Currently the API supports unravelling Published Artifact Groups and Sequencing Runs and not much else. The serializers for the latter are a special-purpose implementation built specifically to flatten metadata and metrics for the artifacts leading up to a sequencing run.

The first is not hugely problematic, as we request this data from the database infrequently. The second, however, is why I'm writing this issue. I want users to be able to request specific information ("columns") of metadata pertaining to any group of artifacts in the system - ideally in a fast and simple fashion.

This led me to think more about what the PAG really represents: if you think about it, the Published Artifact Group is a brief highlight reel of the journey an artifact has taken through its analysis lifespan (eg. for the Covid work, a PAG shows the sample and its FASTA - skipping everything in-between). We can formalise the idea of binding everything (including that in-between part) by specifically linking all the processes that were performed onto the Published Artifact Group.

I've previously discussed this idea and first thought about collecting all the processes from the start of the process tree to the end (eg. a sample, through to its FASTA) and adding these to a process_set on the Published Artifact Group. One could then ask all the processes in this group to serialize themselves, potentially with some context (eg. "these columns only"). We can formalise this slightly better by adding a concrete idea of a "journal" as a many-to-many relation on the Artifact and Process-related models.

That is, we still maintain the audit linkage of what processes were applied to which artifacts and when. But once the result of such a journey is final and a Published Artifact Group is minted, we can collect all those processes and label them with a specific journal_id. This means we can fetch all the processes related to a PAG/journal and serialise them without traversing the tree.
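
Purely as a sketch of what that journal linkage could look like in Django (the model and field names here are hypothetical, not Majora's real schema):

```python
import uuid

from django.db import models


class Journal(models.Model):
    # Minted alongside a Published Artifact Group once its journey is final
    journal_id = models.UUIDField(default=uuid.uuid4, unique=True)
    pag = models.ForeignKey("PublishedArtifactGroup", on_delete=models.CASCADE)


class MajoraArtifact(models.Model):
    # Real artifacts carry many more fields; only the journal link is shown
    journals = models.ManyToManyField(Journal, related_name="artifacts", blank=True)


class MajoraArtifactProcess(models.Model):
    journals = models.ManyToManyField(Journal, related_name="processes", blank=True)
```

Once a PAG is minted and its processes stamped, serialisation becomes a flat lookup (`journal.processes.all()`) rather than a tree walk.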

If this still doesn't suffice, a post_save signal handler for the PAG could serialize all the information and store it as JSON in PostgreSQL or something.
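
For illustration, a hedged sketch of that fallback, assuming Django 3.1+'s models.JSONField and a placeholder flatten_pag() helper (none of these names come from Majora itself):

```python
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver


def flatten_pag(pag):
    # Placeholder: in practice this would walk the PAG's process tree once
    # and return the flat dict of metadata we usually dump to a table
    return {"published_name": pag.published_name}


class PublishedArtifactGroup(models.Model):
    published_name = models.CharField(max_length=128)
    # Cached flat representation, refreshed whenever the PAG is saved
    serialized_cache = models.JSONField(default=dict, blank=True)


@receiver(post_save, sender=PublishedArtifactGroup)
def cache_pag_serialization(sender, instance, **kwargs):
    payload = flatten_pag(instance)
    # update() rather than save() so we don't fire post_save again
    sender.objects.filter(pk=instance.pk).update(serialized_cache=payload)
```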

@SamStudio8 added the enhancement and post-COG (these changes will probably not be implemented in the COG-UK Majora instance) labels on May 28, 2020
@SamStudio8 added this to the v1.0 milestone on May 28, 2020
@SamStudio8 self-assigned this on May 28, 2020
@SamStudio8
Owner Author

We need a solution to this. I think the current idea that would cause the least collateral damage would be adding a M:M relationship between a PAG and the Processes it encapsulates. We can then write a migration to enumerate every PAG, collect its artifacts, recurse through their process trees and add every process to the PAG (see the sketch after the list below).

I think it is gross but has the following pros:

  • If we don't like it, it will be easy to remove
  • It will touch very little code as the heavy lifting will be done in the migration
  • It will work
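
As a rough idea of what that backfill migration could look like (the majora2 app label, the model names, the migration dependency, and the collect_process_tree helper are all illustrative assumptions, not Majora's real code):

```python
from django.db import migrations


def collect_process_tree(pag):
    # Placeholder: recurse from each of the PAG's artifacts through their
    # process links and return every process object in the tree
    return []


def backfill_pag_processes(apps, schema_editor):
    PublishedArtifactGroup = apps.get_model("majora2", "PublishedArtifactGroup")
    for pag in PublishedArtifactGroup.objects.iterator():
        # Assumes a new M:M field `processes` has already been added to the PAG
        pag.processes.add(*collect_process_tree(pag))


class Migration(migrations.Migration):
    dependencies = [
        ("majora2", "0001_initial"),  # placeholder dependency
    ]
    operations = [
        migrations.RunPython(backfill_pag_processes, migrations.RunPython.noop),
    ]
```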

@SamStudio8 added the next label and removed the post-COG label on Jul 1, 2020
@SamStudio8
Owner Author

I took a different direction, as an experiment and an excuse to continue my battle with DRF. You can send a leaf_cls GET param when listing or fetching PAGs, which will check the PAG for artifacts of that class, pick one, and grab its process tree. If this works, we'll go ahead and write a migration to link those processes into the PAG model proper.
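
Roughly, the shape of that experiment in DRF terms (the class, field and path names below are assumptions for illustration, not Majora's actual code):

```python
from rest_framework import serializers, viewsets

from majora2.models import PublishedArtifactGroup  # model path assumed


def walk_process_tree(artifact):
    # Placeholder for the existing recursive process-tree unrolling
    return []


class PAGSerializer(serializers.ModelSerializer):
    processes = serializers.SerializerMethodField()

    class Meta:
        model = PublishedArtifactGroup
        fields = ["published_name", "processes"]

    def get_processes(self, pag):
        leaf_cls = self.context.get("leaf_cls")
        if not leaf_cls:
            return []
        # Check the PAG for artifacts of the requested class, pick one,
        # and grab its process tree
        leaf = pag.tangled_artifacts.filter(artifact_kind=leaf_cls).first()
        return walk_process_tree(leaf) if leaf else []


class PAGViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = PublishedArtifactGroup.objects.all()
    serializer_class = PAGSerializer

    def get_serializer_context(self):
        context = super().get_serializer_context()
        # e.g. GET /pags/?leaf_cls=digitalresource
        context["leaf_cls"] = self.request.query_params.get("leaf_cls")
        return context
```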

@SamStudio8
Owner Author

Alright, I've taken a new approach that works for now. We get all the artifact IDs in the PAG, look for any ProcessRecord that starts or finishes with one of those artifacts, expand all of its artifacts into the set, and repeat. This gets all the ProcessRecords we need (for now) in scope for serialisation. It's much nicer than the bullshit nominated-artifact approach I came up with yesterday. It works really well and, better still, it seems to involve fewer DB hits too.
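
The expansion is effectively a breadth-first closure over ProcessRecords. A sketch, assuming hypothetical in_artifact/out_artifact fields on the record model and the tangled_artifacts relation on the PAG:

```python
from django.db.models import Q

from majora2.models import MajoraArtifactProcessRecord  # model path assumed


def collect_process_records(pag):
    # Seed with every artifact ID already attached to the PAG
    seen = set(pag.tangled_artifacts.values_list("id", flat=True))
    frontier = set(seen)
    record_ids = set()

    while frontier:
        hits = MajoraArtifactProcessRecord.objects.filter(
            Q(in_artifact_id__in=frontier) | Q(out_artifact_id__in=frontier)
        )
        frontier = set()
        for rec in hits:
            record_ids.add(rec.id)
            # Any artifact touched by a matched record joins the next round
            for aid in (rec.in_artifact_id, rec.out_artifact_id):
                if aid and aid not in seen:
                    seen.add(aid)
                    frontier.add(aid)

    return MajoraArtifactProcessRecord.objects.filter(id__in=record_ids)
```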

@SamStudio8 added the post-COG label and removed the next label on Dec 7, 2020
@SamStudio8
Owner Author

This is shockingly insightful for something written back in May. Indeed, this problem is a core design issue with Majora (and would need considerable thought in any new version, #44). The way I see it, there are two parts to Majora's job:

  • Maintaining a thorough, interconnected history of artifacts and the processes that manipulate them: requiring fast indexes and heavy use of relational keys
  • Dumper-trucking everything we know about a sample into a flat file for analysis (bonus points if that can be filtered for particular rows and fields)

This dual model of storage will need to maintain an SQL-like relational structure for the first part, and I think the ideas I've touched upon in the past about pre-serialising PAGs out to JSON (and perhaps, one day, a separate large key-value database) will solve the second part. In the near future (time permitting) I think I'll experiment with adding JSON to each PAG and using the dynamic part of the v3 DRF API (likely removing the DRF part) to serve the dynamic queries.
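
A sketch of what serving those dynamic "columns" queries from a per-PAG JSON cache might look like, reusing the hypothetical serialized_cache field from the earlier sketch (the view name, URL and field names are assumptions, not an existing endpoint):

```python
from django.http import JsonResponse

from majora2.models import PublishedArtifactGroup  # model path assumed


def pag_metadata(request):
    # e.g. GET /pag_cache/?fields=central_sample_id,collection_date
    wanted = [f for f in request.GET.get("fields", "").split(",") if f]

    rows = []
    for cache in PublishedArtifactGroup.objects.values_list(
        "serialized_cache", flat=True
    ):
        cache = cache or {}
        # Either project out the requested columns, or return the whole cache
        rows.append({k: cache.get(k) for k in wanted} if wanted else cache)
    return JsonResponse({"pags": rows})
```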

@SamStudio8 changed the title from "Add a M:M journal_id FK to Artifacts and Process(Records)" to "Cache PAG serialization" on Jan 12, 2021
@SamStudio8 added the P:STANDARD, perf and next labels and removed the post-COG label on Jan 12, 2021
@SamStudio8
Owner Author

As part of fast prep for mass ENA consensus submissions (COG-UK/dipi-group#11), I've hacked the "original" GET PAG endpoint to allow a mode that overrides the behaviour of the celery task. This works absolutely brilliantly: it's blazing fast AND still satisfies the requirements of the ocarina struct, which means I can just drop it in to work there. However, this is completely against the design ethos of Majora (flexibility and genericism). Delving into this for ENA submissions has also made me realise I never solved the dual-endpoint problem, whereby the GET APIs for sequencing runs and PAGs have a lot of overlap but (currently) no shared code; this has only worsened with the recent need to add highly specific code to speed those two processes up. I think the long-term solution is to deploy the cache idea discussed here, then bring back the dynamic v3 API to handle JSON munging and API responses.

@SamStudio8 added the P:HIGH label and removed the P:STANDARD label on Jan 21, 2021