-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Support DOT inputs (and multigraphs); much improved boundary node splitting in pattern identification; add back frayed rope support; lots of other improvements #244
Open
fedarko
wants to merge
379
commits into
marbl:master
Choose a base branch
from
fedarko:dot
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ci skip]
... as we continue to refactor the data model. I think this is in the right direction
also merged AsmGraph.process() into the __init__() function, to make everything easier I think the current plans satisfy everything specified in marbl#204 at the moment. but the proof will be in the pudding...........
now that function no longer exists, but still nice to check that "reserved" node attrs are now OK
the tests still fail, due to the process() change (when we init an assemblygraph object, this runs the decomposition stuff / etc. which hasn't been updated yet). that's fine -- when these other parts of the assemblygraph code are refactored, these particular tests should be back alive. hopefully. [ci skip] b/c everything is still broken
... to be consistent with other graphs' nodes
gonna be a while til this particular function is ready tho, it's a nice-to-have [ci skip]
in init_graph_objs() -- marbl#204
Moving layout() to be the responsibility of the caller() makes sense, imo -- this way if people want to use the AG API in the future (e.g. for converting to DOT, or just identifying patterns, or whatevs) this is not a bottleneck for huge graphs
this way all of the logging done in main.py is based on just the non-AssemblyGraph stuff. I think consistency makes sense. It would be nice to make the logs say, for example, how many seconds have elapsed since execution has started (like in strainFlye). But that's not needed at all -- just another nice-to-have. [ci skip]
lotta changes
Using the same function to generate IDs (on the first pass, and on later passes) makes sense -- more foolproof
easier from a testing perspective (and less complicated) to just pass the pattern type to update(), not the entire pattern. i'm not thrilled with the need to call update() immediately after PS construction (ideally we'd use an alternate constructor...?) but whatever this works
yay, no caveats :)
also rm old code for Pattern.get_counts(), which is now no longer needed
obviates the inevitable confusion when someone removes components due to having, say, > E edges, but then gets confused as to why components with fewer edges than the ones that got removed have lower "size ranks." actually, would someone ever get confused that way? uhhhhhhh whatever this fixes the issue i made up in my head like a sane person commit message over
this was gonna come up sooner or later, so glad we caught it now lol
this fails, which it should. figure out why & fix it
see code changes -- i think this came about due to chain merging stuff. checking explicitly to see if the node and its counterpart are siblings fixes the problem :)
nuking this test case from orbit so it doesn't get borken again
just found another chr21 bug tho so another similar test coming soon :| nah this is good, good to catch this now rather than muuuch later
PHEW that was tricky.
onto the other bugs
OK, now I just need to fix the bug lol
lja / jumboDBG problem -- should be fixed there rather than here
oh my god why is everything happening
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
(The interactive visualization side of things still needs to be updated, so this isn't ready yet, but the DOT / stats export options both work.)
Changes
MetagenomeScope now accepts DOT files produced by Flye and LJA.
This means that MetagenomeScope can now visualize de Bruijn graphs (DBGs) without having to load them from GFA / FASTG files (meaning that we can actually draw the literal structure of DBGs -- GFA and FASTG files implicitly convert edges to nodes and vice versa).
Nodes in these graphs are drawn as small circles (which matches existing DBG visualizations in the literature, including AGB's).
Added support for parallel edges (closes Need to set up decomposed graph as a MultiDiGraph #202, closes Explicitly handle/test duplicate edges in the decomposed graph #187, closes Add defined behavior for duplicate edges #75), whether they're in the input assembly graph file or in the decomposed graph. (Note that the FASTG and GFA parsers will throw errors if they see parallel edges, so I've opened Support duplicate (aka parallel) edges in GFA / FASTG files? #239 accordingly.)
Since "bubbles" often manifest in DBGs as "bulges" (in which there exist multiple edges between the same pair of nodes), I expanded the validation code to accept bulges as "bubbles".
Refactor the assembly graph code.
Rather than store node and edge metadata (e.g. length, orientation, multiplicity, ...) in the NetworkX graph structures, we instead create
Node
andEdge
objects that we use to store this information. The nodes and edges in the NX graphs contain IDs that point to these objects. (Amusingly, this is actually pretty close to thenodeid2obj
structures that the earliest versions of MetagenomeScope created.) This makes the code much cleaner. Closes Centralize data storage in the Python codebase #204.Overhauled hierarchical pattern identification.
Added back frayed ropes as a "top-level-only" pattern type (closes Disallow frayed ropes from being used in other types of patterns in the decomposition #200).
Support "end-to-start" cyclic bubbles (Identifying cyclic bubbles? #241) and frayed ropes (closes Identify cyclic frayed ropes? #242). (Need to do some more testing and also there's one silly issue if these are in isolated components, so I'm leaving Identifying cyclic bubbles? #241 open.)
Add automatic boundary node splitting for bubbles, chains, cyclic chains,
and frayed ropes by default; unneeded split nodes are then merged back
afterwards. Closes Auto-duplicate bubbles' boundary nodes, then trim "unneeded" duplicates at the end of pattern detection #167, and also addresses the un-checked boxes in Hierarchical decomposition improvements #164.
This makes the pattern decompositions so much nicer than before.
Replace the idea of "duplicate nodes" for boundary node splitting (where
one of them was that gross shade of pink) with the idea of left/right
splitting -- if a given node is split, then its left split node is marked
as a "left split" node and its right split node is marked as a "right split"
node. This means that, rather than adjusting node colors, we can adjust
the shape of nodes accordingly (which is a much more intuitive way of
showing this). Closes Alternate, more intuitive representation of duplicate bubble boundary nodes #206.
Additional output options.
Add the ability to save a DOT file of the assembly graph (representing patterns as clusters, like the ~2016-era version of MetagenomeScope).
Add the ability to save statistics about each connected component of the assembly graph to a TSV file. This helps give users a high-level overview of large graphs -- I think it'll be handy when e.g. working with large graphs on a remote server, where visualizations may be impractical.
Made the original output option (saving an interactive visualization) optional. Now, the user can specify anywhere from zero to all three of these output options. (If you don't select any output options, MetagenomeScope will just identify patterns, output the number of patterns identified, and call it a day.)
Documentation updates.
Other stuff.
--no-patterns
).-maxn
/-maxe
to0
removes the checks).--node-metadata
/--edge-metadata
). Closes Support general node / edge metadata #243.-v
/--version
).Things to address before merging this in
Ignore all that, show me some pretty pictures
These are all produced by running
mgsc -i [graph file] -od [dot output]
, and then visualizing the resulting DOT file with Graphviz. Note that this process doesn't perform backfilling, so these layouts will look a bit different from what's shown in the visualization interface.