Cache PAG serialization #20
We need a solution to this. I think the current idea that would cause the least amount of collateral damage would be adding an M:M relationship between a PAG and the Processes it encapsulates. We can then write a migration to enumerate every PAG, collect its artifacts, recurse through their process trees and add every process to the PAG. I think it is gross but has the following pros: …
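For a sense of shape, the backfill migration described above might look roughly like this. It is only a sketch: the app label, model names, the `process_records` M:M field and the `walk_process_tree` helper are assumptions, not Majora's real code.

```python
from django.db import migrations

def backfill_pag_processes(apps, schema_editor):
    # Enumerate every PAG, walk the process tree behind each of its
    # artifacts, and attach every process found to the new M:M field.
    PublishedArtifactGroup = apps.get_model("majora2", "PublishedArtifactGroup")
    for pag in PublishedArtifactGroup.objects.all():
        processes = set()
        for artifact in pag.tagged_artifacts.all():        # assumed related name
            processes.update(walk_process_tree(artifact))  # hypothetical tree-walking helper
        pag.process_records.add(*processes)                # the proposed M:M field

class Migration(migrations.Migration):
    dependencies = [("majora2", "00xx_add_pag_process_m2m")]  # placeholder
    operations = [
        migrations.RunPython(backfill_pag_processes, migrations.RunPython.noop),
    ]
```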
I took a different direction as an experiment, and as an excuse to continue my battle with DRF. You can send a …
Alright, I've taken a new approach that works for now. We get all the artifact IDs in the PAG, look for any ProcessRecord that starts or finishes with one of those artifacts, expand all the artifacts out, and repeat. This gets all the ProcessRecords we need (for now) in scope for serialisation. It's much nicer than the nominated-artifact approach I came up with yesterday. It works really well and, what's better, it seems to involve fewer DB hits too.
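For illustration, the expand-and-repeat loop described above could be sketched like this; the field names (`artifacts`, `in_artifact`, `out_artifact`) and import path are assumptions rather than Majora's actual schema.

```python
from django.db.models import Q

from majora2.models import ProcessRecord  # assumed import path

def collect_process_records(pag):
    """Expand outward from a PAG's artifacts to every ProcessRecord that
    starts or finishes with one of them, repeating until nothing new turns up."""
    seen_artifacts = set(pag.artifacts.values_list("id", flat=True))  # assumed related name
    seen_records = set()
    frontier = set(seen_artifacts)

    while frontier:
        records = ProcessRecord.objects.filter(
            Q(in_artifact_id__in=frontier) | Q(out_artifact_id__in=frontier)
        ).exclude(id__in=seen_records)

        frontier = set()
        for record in records:
            seen_records.add(record.id)
            for artifact_id in (record.in_artifact_id, record.out_artifact_id):
                if artifact_id and artifact_id not in seen_artifacts:
                    seen_artifacts.add(artifact_id)
                    frontier.add(artifact_id)

    return ProcessRecord.objects.filter(id__in=seen_records)
```

Each pass of the loop is one query per level of the tree rather than one per artifact, which is roughly where the reduction in DB hits would come from.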
This is shockingly insightful for something written back in May. Indeed this problem is a core design issue with Majora (and would need considerable thought in any new version #44). The way I see it is there are two parts to Majora's job:
This dual-model of storage will need to maintain an SQL-like structure for the first part, and I think the ideas I've touched upon in the past about pre-serialising PAGs out to JSON (and perhaps one day a separate large key-value database) will solve the second part. In the near future (time permitting) I think I'll experiment with adding JSON to each PAG and using the dynamic part of the v3 DRF API (likely removing the DRF part) to serve the dynamic queries.
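As a rough illustration of that plan, the PAG could carry a JSON field that a lightweight endpoint reads directly, with optional column filtering; the field, view and parameter names below are made up for the sketch.

```python
from django.http import JsonResponse

# Assumed addition to the PAG model (name illustrative):
#   cached_metadata = models.JSONField(default=dict, blank=True)

def pag_metadata(request, published_name):
    """Serve a dynamic query straight from the pre-serialised JSON,
    optionally restricted to the columns the caller asked for."""
    from majora2.models import PublishedArtifactGroup  # assumed import path
    pag = PublishedArtifactGroup.objects.get(published_name=published_name)
    data = pag.cached_metadata
    columns = request.GET.get("columns")
    if columns:
        wanted = set(columns.split(","))
        data = {k: v for k, v in data.items() if k in wanted}
    return JsonResponse(data)
```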
As part of fast prep for mass ENA consensus submissions (COG-UK/dipi-group#11), I've hacked the "original" GET PAG endpoint to allow a …
After three months of Majora-ing I think we have discovered an interesting flaw in the process model. I think it's important that we're able to model the concepts of samples, tubes, boxes, files, directories and the processes that are applied to them. It means we can quickly return information on particular artifacts and more easily model how to create and validate such artifacts through the API. It makes natural sense to send and receive information about these real world items through the API with structures that try to represent them.
Yet, when it comes to analyses, we most often want to dump our knowledge about these carefully crafted objects into a gigantic unstructured flat file to tabulate, count and plot things of interest. It's not impossible to do this - we can already unroll all the links between artifacts and processes to traverse the process tree model that is central to how Majora records the journal of an artifact.
The two issues with this are:
The first is not hugely problematic, as we request this data from the database infrequently. However, the latter is why I'm writing this issue: I want users to be able to request specific information ("columns") of metadata pertaining to any group of artifacts in the system, ideally in a fast and simple fashion.
This led me to think more about what the PAG really represents: if you think about it, the Published Artifact Group is a brief highlight reel of the journey an artifact has taken through its analysis lifespan (e.g. for the COVID work, a PAG shows the sample and its FASTA, skipping everything in-between). We can formalise the idea of binding everything (including that in-between part) by specifically linking all the processes that were performed onto the Published Artifact Group.
I've previously discussed this idea and first thought about collecting all the processes from the start of the process tree to the end (e.g. a sample through to its FASTA) and adding these to a process_set on the Published Artifact Group. One could then ask all the processes in this group to serialise themselves, potentially with some context (e.g. "these columns only"). We can formalise this slightly better by adding a concrete idea of a "journal" as a many-to-many relation on the Artifact and Process-related models.
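As a sketch of how the journal could hang off the models (class and field names are placeholders, not Majora's real ones):

```python
from django.db import models

class MajoraArtifactProcessRecord(models.Model):
    # Stand-in for Majora's process record; the audit linkage between
    # artifacts and processes stays exactly as it is today.
    process_kind = models.CharField(max_length=64)

class PublishedArtifactGroup(models.Model):
    # Stand-in for Majora's PAG model.
    published_name = models.CharField(max_length=128)
    # Once the PAG is minted, every process on the path from the starting
    # artifact to the published result is attached here: the "journal".
    process_records = models.ManyToManyField(
        MajoraArtifactProcessRecord, related_name="journals", blank=True
    )
```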
That is, we still maintain the audit linkage of what processes were applied to which artifacts and when. But once the result of such a journey is final and a Published Artifact Group is minted, we can collect all those processes and label them with a specific journal_id. This means we can fetch all the processes related to a PAG/journal and serialise them without processing the tree.
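Fetching and serialising a journal would then be a flat query over the M:M link rather than a tree traversal; again, field and helper names here are assumptions.

```python
def serialise_journal(pag, columns=None):
    """Serialise every process attached to a PAG's journal, optionally
    keeping only the requested columns."""
    rows = []
    for record in pag.process_records.all():  # one query, no tree walk
        row = {"id": record.pk, "process_kind": record.process_kind}
        if columns:
            row = {k: v for k, v in row.items() if k in columns}
        rows.append(row)
    return rows
```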
If this still doesn't suffice, a post_save hook for the PAG could serialise all the information and store it as JSON in PostgreSQL or something.
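A minimal sketch of that fallback, assuming the PAG has the `cached_metadata` JSONField from the earlier sketch and some `serialise_pag` helper (both hypothetical):

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from majora2.models import PublishedArtifactGroup  # assumed import path

@receiver(post_save, sender=PublishedArtifactGroup)
def cache_pag_serialisation(sender, instance, created, **kwargs):
    # Serialise the PAG's artifacts and processes once, at publish time,
    # so later reads never have to walk the process tree.
    payload = serialise_pag(instance)  # hypothetical serialiser
    # Write with update() so this handler does not re-trigger itself.
    sender.objects.filter(pk=instance.pk).update(cached_metadata=payload)
```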