
Cache PAG serialization #20

Open
SamStudio8 opened this issue May 28, 2020 · 5 comments
@SamStudio8
Owner

After three months of Majora-ing, I think we have discovered an interesting flaw in the process model. It's important that we're able to model the concepts of samples, tubes, boxes, files, directories and the processes that are applied to them: it means we can quickly return information on particular artifacts and more easily model how to create and validate such artifacts through the API. It makes natural sense to send and receive information about these real-world items through the API with structures that try to represent them.

Yet, when it comes to analyses, we most often want to dump our knowledge about these carefully crafted objects into a gigantic unstructured flat file to tabulate, count and plot things of interest. It's not impossible to do this - we can already unroll all the links between artifacts and processes to traverse the process tree model that is central to how Majora records the journey of an artifact.

The two issues with this are:

  • The unravelling is quite slow, likely owing to the suboptimal implementation (given my Django learning curve and time constraints) and the sheer number of models involved
  • The unravelling is quite inflexible. Currently the API supports unravelling Published Artifact Groups and Sequencing Runs and not much else. The serializers for the latter are a special-purpose implementation built specifically to flatten metadata and metrics for the artifacts leading up to a sequencing run.

The first is not hugely problematic, as we request this data from the database infrequently. The second, however, is why I'm writing this issue. I want users to be able to request specific information ("columns") of metadata pertaining to any group of artifacts in the system - ideally in a fast and simple fashion.

This led me to think more about what the PAG really represents: if you think about it, the Published Artifact Group is a brief highlight reel of the journey an artifact has taken through its analysis lifespan (eg. for the Covid work, a PAG shows the sample and its FASTA - skipping everything in-between). We can formalise the idea of binding everything (including that in-between part) by specifically linking all the processes that were performed onto the Published Artifact Group.

I've previously discussed this idea and first thought about collecting all the processes from the start of the process tree to the end (eg. a sample, through to its FASTA) and adding these to a process_set on the Published Artifact Group. One could then ask all the processes in this group to serialize themselves, potentially with some context (eg. "these columns only"). We can formalise this slightly better by adding a concrete idea of a "journal" as a many-to-many relation on the Artifact and Process-related models.

That is, we still maintain the audit linkage of what processes were applied to which artifacts and when. But once the result of such a journey is final and a Published Artifact Group is minted, we can collect all those processes and label them with a specific journal_id. This means we can fetch all the processes related to a PAG/journal and serialise them without traversing the tree.
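
Purely as a sketch of what that journal linkage could look like in Django (the model and field names here are hypothetical, not Majora's real schema):

```python
import uuid

from django.db import models


class Journal(models.Model):
    # Minted alongside a Published Artifact Group once its journey is final
    journal_id = models.UUIDField(default=uuid.uuid4, unique=True)
    pag = models.ForeignKey("PublishedArtifactGroup", on_delete=models.CASCADE)


class MajoraArtifact(models.Model):
    # Real artifacts carry many more fields; only the journal link is shown
    journals = models.ManyToManyField(Journal, related_name="artifacts", blank=True)


class MajoraArtifactProcess(models.Model):
    journals = models.ManyToManyField(Journal, related_name="processes", blank=True)
```

Once a PAG is minted and its processes stamped, serialisation becomes a flat lookup (`journal.processes.all()`) rather than a tree walk.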

If this still doesn't suffice, a post_save signal handler for the PAG could serialize all the information and store it as JSON in PostgreSQL or something.
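
For illustration, a hedged sketch of that fallback, assuming Django 3.1+'s models.JSONField and a placeholder flatten_pag() helper (none of these names come from Majora itself):

```python
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver


def flatten_pag(pag):
    # Placeholder: in practice this would walk the PAG's process tree once
    # and return the flat dict of metadata we usually dump to a table
    return {"published_name": pag.published_name}


class PublishedArtifactGroup(models.Model):
    published_name = models.CharField(max_length=128)
    # Cached flat representation, refreshed whenever the PAG is saved
    serialized_cache = models.JSONField(default=dict, blank=True)


@receiver(post_save, sender=PublishedArtifactGroup)
def cache_pag_serialization(sender, instance, **kwargs):
    payload = flatten_pag(instance)
    # update() rather than save() so we don't fire post_save again
    sender.objects.filter(pk=instance.pk).update(serialized_cache=payload)
```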

@SamStudio8 added the enhancement and post-COG (these changes will probably not be implemented in the COG-UK Majora instance) labels on May 28, 2020
@SamStudio8 added this to the v1.0 milestone on May 28, 2020
@SamStudio8 self-assigned this on May 28, 2020
@SamStudio8
Owner Author

We need a solution to this. I think the current idea that would cause the least collateral damage would be adding a M:M relationship between a PAG and the Processes it encapsulates. We can then write a migration to enumerate every PAG, collect its artifacts, recurse through their process trees and add every process to the PAG (see the sketch after the list below).

I think it is gross but has the following pros:

  • If we don't like it, it will be easy to remove
  • It will touch very little code as the heavy lifting will be done in the migration
  • It will work
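
As a rough idea of what that backfill migration could look like (the majora2 app label, the model names, the migration dependency, and the collect_process_tree helper are all illustrative assumptions, not Majora's real code):

```python
from django.db import migrations


def collect_process_tree(pag):
    # Placeholder: recurse from each of the PAG's artifacts through their
    # process links and return every process object in the tree
    return []


def backfill_pag_processes(apps, schema_editor):
    PublishedArtifactGroup = apps.get_model("majora2", "PublishedArtifactGroup")
    for pag in PublishedArtifactGroup.objects.iterator():
        # Assumes a new M:M field `processes` has already been added to the PAG
        pag.processes.add(*collect_process_tree(pag))


class Migration(migrations.Migration):
    dependencies = [
        ("majora2", "0001_initial"),  # placeholder dependency
    ]
    operations = [
        migrations.RunPython(backfill_pag_processes, migrations.RunPython.noop),
    ]
```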

@SamStudio8 added the next label and removed the post-COG label on Jul 1, 2020
@SamStudio8
Owner Author

I took a different direction, as an experiment and an excuse to continue my battle with DRF. You can send a leaf_cls GET param when listing or fetching PAGs, which will check the PAG for artifacts of that class, pick one, and grab its process tree. If this works, we'll go ahead and write a migration to link those processes into the PAG model proper.
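
Roughly, the shape of that experiment in DRF terms (the class, field and path names below are assumptions for illustration, not Majora's actual code):

```python
from rest_framework import serializers, viewsets

from majora2.models import PublishedArtifactGroup  # model path assumed


def walk_process_tree(artifact):
    # Placeholder for the existing recursive process-tree unrolling
    return []


class PAGSerializer(serializers.ModelSerializer):
    processes = serializers.SerializerMethodField()

    class Meta:
        model = PublishedArtifactGroup
        fields = ["published_name", "processes"]

    def get_processes(self, pag):
        leaf_cls = self.context.get("leaf_cls")
        if not leaf_cls:
            return []
        # Check the PAG for artifacts of the requested class, pick one,
        # and grab its process tree
        leaf = pag.tangled_artifacts.filter(artifact_kind=leaf_cls).first()
        return walk_process_tree(leaf) if leaf else []


class PAGViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = PublishedArtifactGroup.objects.all()
    serializer_class = PAGSerializer

    def get_serializer_context(self):
        context = super().get_serializer_context()
        # e.g. GET /pags/?leaf_cls=digitalresource
        context["leaf_cls"] = self.request.query_params.get("leaf_cls")
        return context
```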

@SamStudio8
Owner Author

Alright, I've taken a new approach that works for now. We get all the artifact IDs in the PAG, look for any ProcessRecord that starts or finishes with one of those artifacts, expand all of its artifacts into the set, and repeat. This gets all the ProcessRecords we need (for now) in scope for serialisation. It's much nicer than the bullshit nominated-artifact approach I came up with yesterday. It works really well and, better still, it seems to involve fewer DB hits too.
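
The expansion is effectively a breadth-first closure over ProcessRecords. A sketch, assuming hypothetical in_artifact/out_artifact fields on the record model and the tangled_artifacts relation on the PAG:

```python
from django.db.models import Q

from majora2.models import MajoraArtifactProcessRecord  # model path assumed


def collect_process_records(pag):
    # Seed with every artifact ID already attached to the PAG
    seen = set(pag.tangled_artifacts.values_list("id", flat=True))
    frontier = set(seen)
    record_ids = set()

    while frontier:
        hits = MajoraArtifactProcessRecord.objects.filter(
            Q(in_artifact_id__in=frontier) | Q(out_artifact_id__in=frontier)
        )
        frontier = set()
        for rec in hits:
            record_ids.add(rec.id)
            # Any artifact touched by a matched record joins the next round
            for aid in (rec.in_artifact_id, rec.out_artifact_id):
                if aid and aid not in seen:
                    seen.add(aid)
                    frontier.add(aid)

    return MajoraArtifactProcessRecord.objects.filter(id__in=record_ids)
```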

@SamStudio8 added the post-COG label and removed the next label on Dec 7, 2020
@SamStudio8
Owner Author

This is shockingly insightful for something written back in May. Indeed, this problem is a core design issue with Majora (and would need considerable thought in any new version, #44). The way I see it, there are two parts to Majora's job:

  • Maintaining a thorough, interconnected history of artifacts and the processes that manipulate them: requiring fast indexes and heavy use of relational keys
  • Dumper-trucking everything we know about a sample into a flat file for analysis (bonus points if that can be filtered for particular rows and fields)

This dual model of storage will need to maintain an SQL-like relational structure for the first part, and I think the ideas I've touched upon in the past about pre-serialising PAGs out to JSON (and perhaps, one day, a separate large key-value database) will solve the second part. In the near future (time permitting) I think I'll experiment with adding JSON to each PAG and using the dynamic part of the v3 DRF API (likely removing the DRF part) to serve the dynamic queries.
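
A sketch of what serving those dynamic "columns" queries from a per-PAG JSON cache might look like, reusing the hypothetical serialized_cache field from the earlier sketch (the view name, URL and field names are assumptions, not an existing endpoint):

```python
from django.http import JsonResponse

from majora2.models import PublishedArtifactGroup  # model path assumed


def pag_metadata(request):
    # e.g. GET /pag_cache/?fields=central_sample_id,collection_date
    wanted = [f for f in request.GET.get("fields", "").split(",") if f]

    rows = []
    for cache in PublishedArtifactGroup.objects.values_list(
        "serialized_cache", flat=True
    ):
        cache = cache or {}
        # Either project out the requested columns, or return the whole cache
        rows.append({k: cache.get(k) for k in wanted} if wanted else cache)
    return JsonResponse({"pags": rows})
```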

@SamStudio8 changed the title from "Add a M:M journal_id FK to Artifacts and Process(Records)" to "Cache PAG serialization" on Jan 12, 2021
@SamStudio8 added the P:STANDARD, perf and next labels and removed the post-COG label on Jan 12, 2021
@SamStudio8
Owner Author

As part of fast prep for mass ENA consensus submissions (COG-UK/dipi-group#11), I've hacked the "original" GET PAG endpoint to allow a mode that overrides the behaviour of the celery task. This works absolutely brilliantly: it's blazing fast AND still satisfies the requirements of the ocarina struct, which means I can just drop it in to work there. However, this is completely against the design ethos of Majora (flexibility and genericism). Delving into this for ENA submissions has also made me realise I never solved the dual-endpoint problem, whereby the GET APIs for sequencing runs and PAGs have a lot of overlap but (currently) no shared code; this has only worsened with the recent need to add highly specific code to speed those two processes up. I think the long-term solution is to deploy the cache idea discussed here, then bring back the dynamic v3 API to handle JSON munging and API responses.

@SamStudio8 added the P:HIGH label and removed the P:STANDARD label on Jan 21, 2021