Better support for partial/split GRGs #32

dcdehaas · 2024-11-19T17:31:46Z

Splitting a GRG into 1,000 equally sized (w.r.t. number of mutations) graphs
produces a total file size almost exactly 2x that of the original full GRG. Thus
you can load these GRGs into RAM one at a time, each needing on average about
1/500th of the RAM needed for the full GRG.
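
The 1/500th figure follows from simple division. The sketch below spells it out; it assumes that RAM usage scales roughly with file size and that the pieces are about equal in size, and the helper name is purely illustrative.

```python
def per_piece_fraction(num_pieces: int, total_size_ratio: float) -> float:
    """Average size of one piece, relative to the full GRG.

    total_size_ratio is (sum of all piece sizes) / (size of the full GRG).
    """
    if num_pieces <= 0:
        raise ValueError("num_pieces must be positive")
    return total_size_ratio / num_pieces


# 1,000 pieces whose combined size is about 2x the original.
fraction = per_piece_fraction(1000, 2.0)
print(f"each piece is about 1/{round(1 / fraction)} of the full GRG")
```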

There are a few reasons why you might want to split up a GRG:

1. To reduce the RAM usage for a genome-wide calculation. Performing this
   calculation on the smaller GRGs will slow it down overall, but you can more
   or less control the RAM usage by choosing the split size.
2. If you are performing a calculation that uses windows (regions) of the
   genome, it is generally much faster to split the graph by windows first, and
   then perform the calculation. You can also do this for overlapping windows
   (at the cost of some disk space). It is faster for two reasons: locality is
   significantly better, and the number of uninformative edges that you need to
   examine at each node is potentially orders of magnitude smaller (especially
   at sample nodes).

You can split a GRG with the new `pygrgl.save_subset()` method, which takes
either a list of MutationIDs or a list of sample NodeIDs, and serializes a
graph with only those roots/leaves. In addition to splitting a GRG into pieces
for performance reasons, you can also use this API to:

1. Remove samples from a GRG.
2. Filter out mutations from a GRG, e.g. based on frequency.
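
As a rough illustration of how the ID lists for these use cases might be built, here is a pure-Python sketch. The helper names and the toy inputs are hypothetical. Each resulting list would then be handed to `pygrgl.save_subset()`; that call is left out because its exact arguments are not given in this description.

```python
from typing import Dict, List, Sequence


def split_evenly(ids: Sequence[int], num_parts: int) -> List[List[int]]:
    """Split ids into num_parts contiguous chunks whose sizes differ by at most one."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(ids), num_parts)
    chunks: List[List[int]] = []
    start = 0
    for i in range(num_parts):
        # The first `extra` chunks each take one leftover element.
        size = base + (1 if i < extra else 0)
        chunks.append(list(ids[start:start + size]))
        start += size
    return chunks


def filter_by_frequency(
    allele_counts: Dict[int, int], num_haplotypes: int, min_freq: float
) -> List[int]:
    """Return the mutation IDs whose allele frequency is at least min_freq."""
    return sorted(
        mut_id
        for mut_id, count in allele_counts.items()
        if count / num_haplotypes >= min_freq
    )


# Toy inputs (hypothetical): 10 mutation IDs split into 3 pieces.
chunks = split_evenly(range(10), 3)
# Toy allele counts keyed by mutation ID, out of 100 haplotypes.
common = filter_by_frequency({0: 1, 1: 50, 2: 7}, num_haplotypes=100, min_freq=0.05)
```

Contiguous chunks keep neighboring mutations together, which fits the locality argument made above.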

In addition to splitting a GRG after the fact, you can now prevent the merge of
partial GRGs during construction via the `--no-merge` flag to the
`grg construct` command. There is also a new command, `grg split`, which takes
a GRG, a number of windows, and an optional recombination map, and quickly
splits the GRG using multiple threads.
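
To make "a number of windows" concrete, the sketch below cuts a base-pair range into equal-width, non-overlapping windows. This is only an assumption about what such a split could look like: the description does not say how `grg split` places its boundaries, and a recombination map would presumably shift them. The function name and the half-open `[start, end)` convention are illustrative choices.

```python
from typing import List, Tuple


def bp_windows(start: int, end: int, num_windows: int) -> List[Tuple[int, int]]:
    """Cut [start, end) into num_windows half-open windows of near-equal width."""
    if num_windows <= 0:
        raise ValueError("num_windows must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    span = end - start
    # Integer boundaries; the last boundary is exactly `end`.
    bounds = [start + (span * i) // num_windows for i in range(num_windows + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


windows = bp_windows(0, 100, 4)
```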

In order to better support workflows with partial GRGs, the GRG now stores its
genomic range (in base-pair position). Old GRG files will just use the min/max
position from their Mutations as the range. New GRGs have the option for their
range to be specified, as you might want to indicate what region you
constructed the GRG from (not just the first/last mutation in it).
In Python, you can access these two ranges using `grg.bp_range` (range as
found in the mutation list) and `grg.specified_bp_range` (range as specified
during construction of the GRG).
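
The relationship between the two ranges can be sketched in plain Python. This is not the library's implementation: the function names are hypothetical, and treating the range as a closed `(min, max)` pair is an assumption, since the description does not say whether the end position is inclusive.

```python
from typing import Optional, Sequence, Tuple

BpRange = Tuple[int, int]


def mutation_bp_range(positions: Sequence[int]) -> BpRange:
    """Range implied by the mutation list: (min position, max position)."""
    if not positions:
        raise ValueError("no mutation positions")
    return (min(positions), max(positions))


def effective_bp_range(
    positions: Sequence[int], specified: Optional[BpRange] = None
) -> BpRange:
    """Prefer a range given at construction time; otherwise fall back to the mutations."""
    return specified if specified is not None else mutation_bp_range(positions)


positions = [1200, 450, 9800]
old_file_range = effective_bp_range(positions)              # no specified range
new_file_range = effective_bp_range(positions, (0, 10000))  # range given at construction
```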

The Windowing API (in C++) does support overlapping windows, but this has not
yet been exposed to the `grg split` command or to Python.
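
Until then, one possible workaround from Python is to compute overlapping windows yourself, collect the mutation IDs that fall in each window, and save each list with `pygrgl.save_subset()`. The sketch below covers only the window and ID selection. The helper names, the half-open window convention, and the toy `positions` mapping (mutation ID to base-pair position) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def overlapping_windows(start: int, end: int, width: int, step: int) -> List[Tuple[int, int]]:
    """Half-open windows of `width` bp, advancing by `step` bp (step < width overlaps)."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    windows = []
    lo = start
    while lo < end:
        windows.append((lo, min(lo + width, end)))
        if lo + width >= end:
            break  # this window already reaches the end of the range
        lo += step
    return windows


def mutations_in_window(positions: Dict[int, int], window: Tuple[int, int]) -> List[int]:
    """Mutation IDs whose position p satisfies window[0] <= p < window[1]."""
    lo, hi = window
    return sorted(mut_id for mut_id, pos in positions.items() if lo <= pos < hi)


# Toy data (hypothetical): mutation ID -> base-pair position.
positions = {0: 5, 1: 30, 2: 60, 3: 99}
per_window = [mutations_in_window(positions, w) for w in overlapping_windows(0, 100, 50, 25)]
```

A mutation in an overlap region appears in more than one list, which is where the extra disk space mentioned earlier comes from.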

Also: fix a bug in the formatting script and properly format the internal headers
in `src/`.



dcdehaas added 2 commits November 19, 2024 12:25

Properly format internal headers

e240891

The clang-format script was skipping headers in the src/ directory previously. Fixed the script and updated the headers.

dcdehaas merged commit c0a5b37 into main Nov 19, 2024
3 checks passed

dcdehaas deleted the partial_grgs branch November 19, 2024 17:42
