[WIP] Support DOT inputs (and multigraphs); much improved boundary node splitting in pattern identification; add back frayed rope support; lots of other improvements #244

fedarko · 2023-05-16T02:42:41Z

(The interactive visualization side of things still needs to be updated, so this isn't ready yet, but the DOT / stats export options both work.)

Changes

MetagenomeScope now accepts DOT files produced by Flye and LJA.
- This means that MetagenomeScope can now visualize de Bruijn graphs (DBGs) without having to load them from GFA / FASTG files (meaning that we can actually draw the literal structure of DBGs -- GFA and FASTG files implicitly convert edges to nodes and vice versa).
- Nodes in these graphs are drawn as small circles (which matches existing DBG visualizations in the literature, including AGB's).
- Added support for parallel edges (closes Need to set up decomposed graph as a MultiDiGraph #202, closes Explicitly handle/test duplicate edges in the decomposed graph #187, closes Add defined behavior for duplicate edges #75), whether they're in the input assembly graph file or in the decomposed graph. (Note that the FASTG and GFA parsers will throw errors if they see parallel edges, so I've opened Support duplicate (aka parallel) edges in GFA / FASTG files? #239 accordingly.)
- Since "bubbles" often manifest in DBGs as "bulges" (in which there exist multiple edges between the same pair of nodes), I expanded the validation code to accept bulges as "bubbles".
  - Just for reference: we say there exists a bulge between two nodes X and Y if there are multiple edges X → Y, X has no outgoing edges to other nodes, and Y has no incoming edges from other nodes.
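A minimal sketch of the bulge definition above, written against a plain edge list rather than the NetworkX structures the codebase uses. The function name `is_bulge` and the edge-list representation are hypothetical, and rejecting the X = Y case is an assumption (the definition speaks of two nodes).

```python
def is_bulge(edges, x, y):
    """Check whether nodes x and y form a bulge.

    edges is a list of (source, target) tuples; a repeated tuple
    represents a parallel edge. This is an illustrative helper, not
    MetagenomeScope's actual validation code.
    """
    if x == y:
        # Assumption: a bulge needs two distinct nodes.
        return False
    # Number of parallel edges x -> y
    num_xy = sum(1 for s, t in edges if s == x and t == y)
    # Does x have an outgoing edge to any node other than y?
    x_leaves_elsewhere = any(s == x and t != y for s, t in edges)
    # Does y have an incoming edge from any node other than x?
    y_entered_elsewhere = any(t == y and s != x for s, t in edges)
    return num_xy >= 2 and not x_leaves_elsewhere and not y_entered_elsewhere


# Two parallel edges A -> B, then a single edge B -> C
example_edges = [("A", "B"), ("A", "B"), ("B", "C")]
print(is_bulge(example_edges, "A", "B"))
print(is_bulge(example_edges, "B", "C"))
```

Under this definition, Y may still have outgoing edges and X may still have incoming edges; only the "inside" of the bulge is constrained.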
Refactor the assembly graph code.
- Rather than store node and edge metadata (e.g. length, orientation, multiplicity, ...) in the NetworkX graph structures, we instead create Node and Edge objects that we use to store this information. The nodes and edges in the NX graphs contain IDs that point to these objects. (Amusingly, this is actually pretty close to the nodeid2obj structures that the earliest versions of MetagenomeScope created.) This makes the code much cleaner. Closes Centralize data storage in the Python codebase #204.
- Overhauled hierarchical pattern identification.
  - Added back frayed ropes as a "top-level-only" pattern type (closes Disallow frayed ropes from being used in other types of patterns in the decomposition #200).
  - Support "end-to-start" cyclic bubbles (Identifying cyclic bubbles? #241) and frayed ropes (closes Identify cyclic frayed ropes? #242). (Need to do some more testing and also there's one silly issue if these are in isolated components, so I'm leaving Identifying cyclic bubbles? #241 open.)
  - Add automatic boundary node splitting for bubbles, chains, cyclic chains,
    and frayed ropes by default; unneeded split nodes are then merged back
    afterwards. Closes Auto-duplicate bubbles' boundary nodes, then trim "unneeded" duplicates at the end of pattern detection #167, and also addresses the un-checked boxes in Hierarchical decomposition improvements #164.
    This makes the pattern decompositions so much nicer than before.
  - Replace the idea of "duplicate nodes" for boundary node splitting (where
    one of them was that gross shade of pink) with the idea of left/right
    splitting -- if a given node is split, then its left split node is marked
    as a "left split" node and its right split node is marked as a "right split"
    node. This means that, rather than adjusting node colors, we can adjust
    the shape of nodes accordingly (which is a much more intuitive way of
    showing this). Closes Alternate, more intuitive representation of duplicate bubble boundary nodes #206.
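To illustrate the left/right splitting idea, here is a rough sketch on a plain edge list. Everything in it is an assumption made for illustration: the function name `split_node`, the `-L` / `-R` suffixes, and the `is_fake` attribute on the connecting edge are hypothetical and may not match what the code in this PR does.

```python
def split_node(edges, node, left_suffix="-L", right_suffix="-R"):
    """Split node into a left split node and a right split node.

    edges is a list of (source, target, attrs) tuples. Incoming edges
    are rerouted to the left split node, outgoing edges leave from the
    right split node, and one extra edge connects left -> right.
    """
    left = f"{node}{left_suffix}"
    right = f"{node}{right_suffix}"
    new_edges = []
    for src, tgt, attrs in edges:
        # Edges that left the original node now leave the right split
        new_src = right if src == node else src
        # Edges that entered the original node now enter the left split
        new_tgt = left if tgt == node else tgt
        new_edges.append((new_src, new_tgt, dict(attrs)))
    # Hypothetical marker for the edge that joins the two halves
    new_edges.append((left, right, {"is_fake": True}))
    return left, right, new_edges


edges = [("A", "X", {}), ("X", "B", {}), ("X", "C", {})]
left, right, split_edges = split_node(edges, "X")
for edge in split_edges:
    print(edge)
```

Merging an unneeded split back would be the inverse: drop the connecting edge and rename both halves to the original node.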
Additional output options.
- Add the ability to save a DOT file of the assembly graph (representing patterns as clusters, like the ~2016-era version of MetagenomeScope).
  - I'd like to eventually also support saving XDOT files that use the exact node/edge coordinates used in the visualization interface (so, after applying backfilling), but that will take some extra work.
  - Part of the motivation behind this feature: if users want fancy images of the assembly graph (e.g. SVG) that Cytoscape.js can't export right now, then they can just extract them using DOT (which can create SVGs).
- Add the ability to save statistics about each connected component of the assembly graph to a TSV file. This helps give users a high-level overview of large graphs -- I think it'll be handy when e.g. working with large graphs on a remote server, where visualizations may be impractical.
- Made the original output option (saving an interactive visualization) optional. Now, the user can specify anywhere from zero to all three of these output options. (If you don't select any output options, MetagenomeScope will just identify patterns, output the number of patterns identified, and call it a day.)
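As a rough idea of what per-component statistics could look like, the sketch below groups nodes into (weakly) connected components with a small union-find and writes one TSV row per component. The column names, the sort order, and the function names are all hypothetical; the real TSV produced by this PR may contain different fields.

```python
import csv
import io


def component_stats(nodes, edges):
    """Return (node count, edge count) per connected component, largest first."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, compressing the path as we go
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Edge direction is ignored, so components are weakly connected
    for src, tgt in edges:
        parent[find(src)] = find(tgt)

    stats = {}
    for n in nodes:
        stats.setdefault(find(n), [0, 0])[0] += 1
    for src, _ in edges:
        stats[find(src)][1] += 1  # parallel edges are counted separately
    return sorted((tuple(v) for v in stats.values()), reverse=True)


def stats_to_tsv(rows):
    """Format the component statistics as TSV text (hypothetical columns)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["Component", "Nodes", "Edges"])
    for rank, (num_nodes, num_edges) in enumerate(rows, 1):
        writer.writerow([rank, num_nodes, num_edges])
    return buf.getvalue()


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "B"), ("B", "C"), ("D", "E")]
print(stats_to_tsv(component_stats(nodes, edges)))
```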
Documentation updates.
- Added developer documentation.
- Lots of updates to the README (vignettes, FAQs, expanded filetype table, ...)
Other stuff.
- Test on Python 3.6, 3.7, 3.8, 3.9, and 3.10, rather than just 3.6.
- Add an option to disable pattern identification (--no-patterns).
- Add explicit options to disable large component removal (setting -maxn / -maxe to 0 removes the checks).
  - Given the tests that I added for this, I think it's safe to close Detect prohibitively large components and skip (initial) layout/rendering for them #137 now. (Maybe test more specifically that output is formatted nicely...? eh)
- [WIP] Add options to allow users to specify arbitrary node/edge metadata in input TSV files (--node-metadata / --edge-metadata). Closes Support general node / edge metadata #243.
- Add a version option to the CLI (-v / --version).
- Lots of other changes.

Things to address before merging this in

Update the layout code and everything downstream (the interactive viz stuff) to work with these changes.
Ideally, more tests.

Ignore all that, show me some pretty pictures

These are all produced by running mgsc -i [graph file] -od [dot output], and then visualizing the resulting DOT file with Graphviz. Note that this process doesn't perform backfilling, so these layouts will look a bit different from what's shown in the visualization interface.
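For readers unfamiliar with how patterns can be shown as clusters in DOT, the sketch below builds a small DOT string by hand. It is not the output format of the -od option; the function name `to_dot`, the graph name, and the cluster labels are made up for illustration. It relies only on standard DOT features: `subgraph cluster_...` blocks, and the fact that a non-strict `digraph` keeps repeated edges as parallel edges.

```python
def to_dot(edges, patterns):
    """Build DOT text with one cluster per pattern.

    edges: list of (source, target) tuples (repeats = parallel edges).
    patterns: dict mapping a pattern label to the node names it contains.
    """
    lines = ["digraph asm {", "  node [shape=circle];"]
    for i, (label, members) in enumerate(patterns.items()):
        # Graphviz draws a box around subgraphs whose name starts with "cluster"
        lines.append(f"  subgraph cluster_{i} {{")
        lines.append(f'    label="{label}";')
        for node in members:
            lines.append(f'    "{node}";')
        lines.append("  }")
    for src, tgt in edges:
        lines.append(f'  "{src}" -> "{tgt}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_text = to_dot(
    [("A", "B"), ("A", "B"), ("B", "C")],
    {"bubble": ["A", "B"]},
)
print(dot_text)
```

Saving this text to a file and running Graphviz's `dot` with `-Tsvg` on it would produce an SVG, which is the kind of workflow described above.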

Images, left to right: Flye yeast assembly graph from AGB's GitHub repository; simple chain of two bubbles; cyclic chain of 3 bubbles in the Bandage E. coli graph.

the tests still fail, due to the process() change (when we init an assemblygraph object, this runs the decomposition stuff / etc. which hasn't been updated yet). that's fine -- when these other parts of the assemblygraph code are refactored, these particular tests should be back alive. hopefully. [ci skip] b/c everything is still broken

Moving layout() to be the responsibility of the caller() makes sense, imo -- this way if people want to use the AG API in the future (e.g. for converting to DOT, or just identifying patterns, or whatevs) this is not a bottleneck for huge graphs

this way all of the logging done in main.py is based on just the non-AssemblyGraph stuff. I think consistency makes sense. It would be nice to make the logs say, for example, how many seconds have elapsed since execution has started (like in strainFlye). But that's not needed at all -- just another nice-to-have. [ci skip]

easier from a testing perspective (and less complicated) to just pass the pattern type to update(), not the entire pattern. i'm not thrilled with the need to call update() immediately after PS construction (ideally we'd use an alternate constructor...?) but whatever this works

obviates the inevitable confusion when someone removes components due to having, say, > E edges, but then gets confused as to why components with fewer edges than the ones that got removed have lower "size ranks." actually, would someone ever get confused that way? uhhhhhhh whatever this fixes the issue i made up in my head like a sane person commit message over

fedarko added 30 commits April 14, 2023 15:05

MNT/DOC: minor node "name"/"ID" chgs

40d4b95

MNT: start adding node/edge objs

d42eed5

[ci skip]

MNT/DOC: continue setup of node/edge objs marbl#204

6c7a397

... as we continue to refactor the data model. I think this is in the right direction

MNT: more data model refactoring marbl#204

6860ca1

also merged AsmGraph.process() into the __init__() function, to make everything easier. I think the current plans satisfy everything specified in marbl#204 at the moment. but the proof will be in the pudding...........

TST: adjust check_attrs test

149d79c

now that function no longer exists, but still nice to check that "reserved" node attrs are now OK

MNT: continue refactor; update scale_nodes() marbl#204

4e654bf

MNT/DOC: this was located in the wrong folder

e657bdf

MNT: rename nodes in FASTG graphs

f38676f

... to be consistent with other graphs' nodes

MNT: update edge scaling re: refactor

78f2aa0

DOC/MNT: slight AsmGraph updates re: refactor

9f3e103

MNT: move get_edge_weight_field() down

b7972d6

DOC: to_dot() plans

f59cd42

gonna be a while til this particular function is ready tho, it's a nice-to-have [ci skip]

MNT: relabel nodes (and kinda edges) in nx graph

7bdd319

in init_graph_objs() -- marbl#204

MNT: graph __init__

a14c0c7

MNT: hierarchical decomposition plans

fd6f0e7

uhh marbl#204, marbl#167, marbl#201, ... [ci skip]

DOC: link to Nijkamp 2013 at README start [ci skip]

299f8bf

DOC: bungled it [ci skip]

f19b77a

MNT: gracefully fail if layout not done [ci skip]

f7c2e9a

STY: black changed its mind abt ** spacing

e2f5b19

MNT/DOC: submodules in mgsc stuff; layout docs

8e3c5cc

MNT/BUG: store node names; setup edgeid2obj right

d3b787b

ENH: configurable node splitting names

e09aab7

STY: lint [ci skip]

6980a9f

MNT: refactoring; patterns subclass nodes again

50481cb

lotta changes

BUG/MNT: fix (circular) import crap

a05adb3

MNT/DOC: fun ID stuff

e8657ca

Using the same function to generate IDs (on the first pass, and on later passes) makes sense -- more foolproof

DOC: notes on too-large-component removal fn

e369c6d

[ci skip]

fedarko added 30 commits May 19, 2023 21:07

STY: spacing & unused imports

330b70c

DOC: update CLI re: --output-ccstats new fmt

b5303c0

yay, no caveats :)

TST/MNT/DOC: test&doc Pattern.get_descendant_info

89f5950

also rm old code for Pattern.get_counts(), which is now no longer needed

TST: AssemblyGraph.to_tsv() on a multi-cc graph

2010663

TST: start testing pattern utils directly

ec9c196

MNT: just incl make_into_split() for Patterns

08fadb1

silly

TST: more Pattern utils

49bed01

DOC: make the first line of the CLI a bit nicer

a1af78f

MNT: add Component to mgsc.graph imports; repr()

b381a02

BUG: catch "strict"/undirected DOT graphs & report

165e1d6

this was gonna come up sooner or later, so glad we caught it now lol

MNT: AssemblyGraph.__repr__()

0ed8fa0

TST: add chr21 test input & acks

5bb082e

TST: rename chr21mat test file, and basic test

509c3fb

this fails, which it should. figure out why & fix it

BUG: fix chr21 splits-not-being-merged bug

af3cc04

see code changes -- i think this came about due to chain merging stuff. checking explicitly to see if the node and its counterpart are siblings fixes the problem :)

TST: beef up chr21 test

4ca023a

nuking this test case from orbit so it doesn't get borken again

TST: chr21 test - more detail & update re fixed

f9686bf

just found another chr21 bug tho so another similar test coming soon :| nah this is good, good to catch this now rather than muuuch later

TST/BUG: reproduce cool and new chr21 bug

dee7c74

BUG: fix the other chr21 test - dual chain merging

facd434

PHEW that was tricky.

STY

ef1b173

TST: more thorough chr21 test 2

d3296e7

onto the other bugs

STY

0b0b360

TST: better edge obj testing; reproduce chr15 bug

42bb7df

OK, now I just need to fix the bug lol

DOC: better comments in unnec removal func

4df3f97

MNT: explicitly label fake edges in __repr__

e64d940

DOC: remove outdated comment re: FR splitting

a60fd29

DOC: note about node IDs defined on multiple lines

0c03540

lja / jumboDBG problem -- should be fixed there rather than here

MNT: fix typo

3667504

oh my god why is everything happening

DOC: some readme tweaking

be8bd66
