Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support DOT inputs (and multigraphs); much improved boundary node splitting in pattern identification; add back frayed rope support; lots of other improvements #244

Open
wants to merge 379 commits into
base: master
Choose a base branch
from

Conversation

fedarko
Copy link
Member

@fedarko fedarko commented May 16, 2023

(The interactive visualization side of things still needs to be updated, so this isn't ready yet, but the DOT / stats export options both work.)

Changes

  • MetagenomeScope now accepts DOT files produced by Flye and LJA.

    • This means that MetagenomeScope can now visualize de Bruijn graphs (DBGs) without having to load them from GFA / FASTG files (meaning that we can actually draw the literal structure of DBGs -- GFA and FASTG files implicitly convert edges to nodes and vice versa).

    • Nodes in these graphs are drawn as small circles (which matches existing DBG visualizations in the literature, including AGB's).

    • Added support for parallel edges (closes Need to set up decomposed graph as a MultiDiGraph #202, closes Explicitly handle/test duplicate edges in the decomposed graph #187, closes Add defined behavior for duplicate edges #75), whether they're in the input assembly graph file or in the decomposed graph. (Note that the FASTG and GFA parsers will throw errors if they see parallel edges, so I've opened Support duplicate (aka parallel) edges in GFA / FASTG files? #239 accordingly.)

    • Since "bubbles" often manifest in DBGs as "bulges" (in which there exist multiple edges between the same pair of nodes), I expanded the validation code to accept bulges as "bubbles".

      • Just for reference: we say there exists a bulge between two nodes X and Y if there multiple edges X → Y, X has no outgoing edges to other nodes, and Y has no incoming edges from other nodes.
  • Refactor the assembly graph code.

  • Additional output options.

    • Add the ability to save a DOT file of the assembly graph (representing patterns as clusters, like the ~2016-era version of MetagenomeScope).

      • I'd like to eventually also support saving XDOT files that use the exact node/edge coordinates used in the visualization interface (so, after applying backfilling), but that will take some extra work.
      • Part of the motivation behind this feature: if users want fancy images of the assembly graph (e.g. SVG) that Cytoscape.js can't export right now, then they can just extract them using DOT (which can create SVGs).
    • Add the ability to save statistics about each connected component of the assembly graph to a TSV file. This helps give users a high-level overview of large graphs -- I think it'll be handy when e.g. working with large graphs on a remote server, where visualizations may be impractical.

    • Made the original output option (saving an interactive visualization) optional. Now, the user can specify anywhere from zero to all three of these output options. (If you don't select any output options, MetagenomeScope will just identify patterns, output the number of patterns identified, and call it a day.)

  • Documentation updates.

    • Added developer documentation.
    • Lots of updates to the README (vignettes, FAQs, expanded filetype table, ...)
  • Other stuff.

    • Test on Python 3.6, 3.7, 3.8, 3.9, and 3.10 in addition to just 3.6.
    • Add an option to disallow pattern identification (--no-patterns).
    • Add explicit options to disable large component removal (setting -maxn / -maxe to 0 removes the checks).
    • [WIP] Add options to allow users to specify arbitrary node/edge metadata in input TSV files (--node-metadata / --edge-metadata). Closes Support general node / edge metadata #243.
    • Add a version option to the CLI (-v / --version).
    • Lots of other changes.

Things to address before merging this in

  • Update the layout code and everything downstream (the interactive viz stuff) to work with these changes.
  • Ideally, more tests.

Ignore all that, show me some pretty pictures

These are all produced by running mgsc -i [graph file] -od [dot output], and then visualizing the resulting DOT file with Graphviz. Note that this process doesn't perform backfilling, so these layouts will look a bit different from what's shown in the visualization interface.

Flye yeast assembly graph from AGB's GitHub repository Simple chain of two bubbles Cyclic chain of 3 bubbles in the Bandage E. coli graph
y bct image

... as we continue to refactor the data model. I think this is in the
right direction
also merged AsmGraph.process() into the __init__() function, to make
everything easier

I think the current plans satisfy everything specified in marbl#204 at the
moment. but the proof will be in the pudding...........
now that function no longer exists, but still nice to check that
"reserved" node attrs are now OK
the tests still fail, due to the process() change (when we init
an assemblygraph object, this runs the decomposition stuff / etc. which
hasn't been updated yet). that's fine -- when these other parts of the
assemblygraph code are refactored, these particular tests should be back
alive. hopefully.

[ci skip] b/c everything is still broken
... to be consistent with other graphs' nodes
gonna be a while til this particular function is ready tho, it's
a nice-to-have

[ci skip]
Moving layout() to be the responsibility of the caller() makes sense,
imo -- this way if people want to use the AG API in the future (e.g.
for converting to DOT, or just identifying patterns, or whatevs) this
is not a bottleneck for huge graphs
this way all of the logging done in main.py is based on just the
non-AssemblyGraph stuff. I think consistency makes sense.

It would be nice to make the logs say, for example, how many seconds
have elapsed since execution has started (like in strainFlye).
But that's not needed at all -- just another nice-to-have.

[ci skip]
Using the same function to generate IDs (on the first pass, and on
later passes) makes sense -- more foolproof
fedarko added 30 commits May 19, 2023 21:07
easier from a testing perspective (and less complicated) to just pass
the pattern type to update(), not the entire pattern.

i'm not thrilled with the need to call update() immediately after PS
construction (ideally we'd use an alternate constructor...?) but
whatever this works
also rm old code for Pattern.get_counts(), which is now no longer
needed
obviates the inevitable confusion when someone removes components
due to having, say, > E edges, but then gets confused as to why
components with fewer edges than the ones that got removed have lower
"size ranks." actually, would someone ever get confused that way?
uhhhhhhh whatever this fixes the issue i made up in my head like a sane
person commit message over
this was gonna come up sooner or later, so glad we caught it now lol
this fails, which it should. figure out why & fix it
see code changes -- i think this came about due to chain merging
stuff. checking explicitly to see if the node and its counterpart
are siblings fixes the problem :)
nuking this test case from orbit so it doesn't get borken again
just found another chr21 bug tho so another similar test coming
soon :|

nah this is good, good to catch this now rather than muuuch later
onto the other bugs
OK, now I just need to fix the bug lol
lja / jumboDBG problem -- should be fixed there rather than here
oh my god why is everything happening
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment