Better support for partial/split GRGs #32
Merged
Splitting a GRG into 1,000 graphs of equal size (by number of mutations)
produces a total file size almost exactly 2x that of the original full GRG. You
can therefore load these GRGs into RAM one at a time, each needing about 1/500th
of the RAM required for the full GRG.
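
The 1/500th figure follows from the two numbers above: a 2x total size spread over 1,000 pieces. A minimal sketch of that arithmetic, assuming in-RAM size scales roughly with file size (the helper name `estimate_piece_fraction` is made up for illustration and is not part of pygrgl):

```python
def estimate_piece_fraction(num_pieces: int, size_overhead: float = 2.0) -> float:
    """Average size of one piece, as a fraction of the full GRG.

    size_overhead is the ratio (total size of all pieces) / (size of full GRG);
    2.0 is the value observed above for a 1,000-way split.
    """
    if num_pieces <= 0:
        raise ValueError("num_pieces must be positive")
    return size_overhead / num_pieces


fraction = estimate_piece_fraction(1000)
print(f"each piece is roughly 1/{round(1 / fraction)} of the full GRG")
```

The overhead factor will likely differ for other split sizes, so treat 2.0 as a data point rather than a constant.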
There are a few reasons why you might want to split up a GRG:
calculation on the smaller GRGs will slow it down overall, but you can more or
less control the RAM usage by choosing the split size.
genome, it is generally much faster to split the graph by windows first, and
then perform the calculation. And obviously you can do this for overlapping
windows as well (at the cost of some disk space). The reasons it is faster:
locality is significantly better, and the number of uninformative edges that
you need to examine at each node is potentially orders of magnitude smaller
(especially at sample nodes).
You can split a GRG with the new `pygrgl.save_subset()` method, which takes
either a list of MutationIDs or a list of sample NodeIDs and serializes a
graph with only those roots/leaves. In addition to splitting a GRG into pieces
for performance reasons, you can also use this API to:
In addition to splitting a GRG after the fact, you can now prevent the merge of
partial GRGs during construction via the `--no-merge` flag to the
`grg construct` command. There is also a new command, `grg split`, which takes
a GRG, a number of windows, and an optional recombination map, and quickly
splits the GRG using multiple threads.
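
To illustrate splitting "equally with respect to number of mutations", here is a minimal sketch that partitions a list of MutationIDs into near-equal chunks. The helper `chunk_mutation_ids` is hypothetical and does not call pygrgl; each resulting chunk is the kind of MutationID list you would hand to `pygrgl.save_subset()`, whose exact arguments are not shown here.

```python
from typing import List, Sequence


def chunk_mutation_ids(mutation_ids: Sequence[int], num_chunks: int) -> List[List[int]]:
    """Partition mutation_ids into num_chunks contiguous, near-equal chunks.

    Chunk sizes differ by at most one; earlier chunks get the extra elements.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(len(mutation_ids), num_chunks)
    chunks: List[List[int]] = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(mutation_ids[start:start + size]))
        start += size
    return chunks


# Example: 10 mutation IDs split three ways.
chunks = chunk_mutation_ids(range(10), 3)
print([len(c) for c in chunks])
```

Keeping the chunks contiguous assumes MutationIDs are ordered by position, which is an assumption of this sketch rather than something stated above.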
In order to better support workflows with partial GRGs, the GRG now stores its
genomic range (in base-pair position). Old GRG files will just use the min/max
position from their Mutations as the range. New GRGs have the option for their
range to be specified, as you might want to indicate what region you
constructed the GRG from (not just the first/last mutation in it).
In Python, you can access these two ranges using `grg.bp_range` (the range as
found in the mutation list) and `grg.specified_bp_range` (the range as
specified during construction of the GRG).
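
The fallback behavior described above (use the specified range if there is one, otherwise the min/max mutation position) can be sketched in plain Python. This is only an illustration of the rule: `effective_bp_range` is a hypothetical helper, it does not touch a real GRG object, and whether the real ranges are inclusive or half-open is not stated here.

```python
from typing import Iterable, Optional, Tuple


def effective_bp_range(
    mutation_positions: Iterable[int],
    specified: Optional[Tuple[int, int]] = None,
) -> Tuple[int, int]:
    """Return the specified range if given, else (min, max) of the positions."""
    if specified is not None:
        return specified
    positions = list(mutation_positions)
    if not positions:
        raise ValueError("no mutations and no specified range")
    return (min(positions), max(positions))


positions = [1200, 350, 9800]
print(effective_bp_range(positions))                      # derived from mutations
print(effective_bp_range(positions, specified=(0, 10000)))  # explicitly specified
```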
The Windowing API (in C++) does support overlapping windows, but this has not
yet been exposed to the `grg split` command or to Python.
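
For readers unfamiliar with overlapping windows, the sketch below generates them over a base-pair range. It is a standalone illustration and does not mirror the C++ Windowing API; the function name, the half-open `(start, stop)` convention, and the `step` parameter are all assumptions made for this example.

```python
from typing import List, Tuple


def overlapping_windows(start: int, end: int, window_size: int, step: int) -> List[Tuple[int, int]]:
    """Half-open windows of width window_size, advancing by step, covering [start, end).

    Windows overlap whenever step < window_size. The last window is clipped to end.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if start >= end:
        raise ValueError("start must be less than end")
    windows: List[Tuple[int, int]] = []
    pos = start
    while True:
        stop = min(pos + window_size, end)
        windows.append((pos, stop))
        if stop >= end:
            break
        pos += step
    return windows


# Windows of 40 bp that advance by 20 bp, so neighbours share 20 bp.
print(overlapping_windows(0, 100, 40, 20))
```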
Also: fix a bug in the formatting script and properly format the internal headers
in `src/`.