Better support for partial/split GRGs #32

Merged
merged 2 commits into from
Nov 19, 2024


dcdehaas
Collaborator

Splitting a GRG into 1,000 graphs of equal size (measured by number of
mutations) produces a total file size almost exactly 2x that of the original
full GRG. Each piece is therefore about 2/1,000 = 1/500th the size of the full
GRG, so you can load the pieces into RAM one at a time using about 1/500th of
the RAM needed for the full GRG.
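
The arithmetic behind that estimate can be written out as a small sketch. It assumes the pieces are equally sized and that RAM usage scales with file size; the function name is made up for illustration and is not part of pygrgl.

```python
def per_piece_fraction(num_pieces: int, total_size_ratio: float) -> float:
    """Fraction of the full GRG's size taken by a single piece.

    total_size_ratio is (sum of all piece sizes) / (size of the full GRG).
    Assumes equally sized pieces and RAM proportional to file size.
    """
    if num_pieces <= 0:
        raise ValueError("num_pieces must be positive")
    return total_size_ratio / num_pieces


# Numbers from the description above: 1,000 pieces, about 2x total size.
fraction = per_piece_fraction(num_pieces=1000, total_size_ratio=2.0)
print(f"each piece is about 1/{round(1 / fraction)} of the full GRG")
```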

There are a few reasons why you might want to split up a GRG:

  1. To reduce the RAM usage of a genome-wide calculation. Running the
    calculation on the smaller GRGs will be slower overall, but you can more or
    less control the RAM usage by choosing the split size.
  2. If you are performing a calculation that uses windows (regions) of the
    genome, it is generally much faster to split the graph by windows first and
    then perform the calculation. You can do this for overlapping windows as
    well (at the cost of some disk space). It is faster for two reasons:
    locality is significantly better, and the number of uninformative edges
    that you need to examine at each node is potentially orders of magnitude
    smaller (especially at sample nodes).
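
As a rough illustration of "choosing the split size", the sketch below partitions mutation IDs into a chosen number of nearly equal groups. It assumes MutationIDs are the consecutive integers `0..num_mutations-1`, which is an assumption made for this example; `split_mutation_ids` is a hypothetical helper, not a pygrgl function.

```python
from typing import List


def split_mutation_ids(num_mutations: int, num_parts: int) -> List[range]:
    """Partition IDs 0..num_mutations-1 into num_parts contiguous groups
    whose sizes differ by at most one."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(num_mutations, num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` groups absorb the remainder, one ID each.
        size = base + (1 if i < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


parts = split_mutation_ids(num_mutations=10, num_parts=3)
print([list(p) for p in parts])
```

More parts means smaller pieces and lower peak RAM per piece, at the cost of more files and more total work.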

You can split a GRG with the new `pygrgl.save_subset()` method, which takes
either a list of MutationIDs or a list of sample NodeIDs and serializes a
graph containing only those roots/leaves. Besides splitting a GRG into pieces
for performance reasons, you can also use this API to:

  1. Remove samples from a GRG.
  2. Filter out mutations from a GRG, e.g. based on frequency.

Besides splitting a GRG after the fact, you can now prevent the merging of
partial GRGs during construction via the `--no-merge` flag to the `grg
construct` command. There is also a new `grg split` command, which takes a
GRG, a number of windows, and an optional recombination map, and quickly
splits the GRG using multiple threads.
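
To make "a number of windows" concrete, here is a minimal sketch that cuts a base-pair range into equal-width, half-open windows. This is only an illustration: it is not the algorithm used by `grg split`, it ignores the recombination map, and the half-open convention is an assumption.

```python
from typing import List, Tuple


def equal_bp_windows(start: int, end: int, num_windows: int) -> List[Tuple[int, int]]:
    """Cut [start, end) into num_windows contiguous half-open windows of
    (nearly) equal width, using integer base-pair boundaries."""
    if num_windows <= 0:
        raise ValueError("num_windows must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    span = end - start
    # Integer division spreads any remainder across the windows.
    bounds = [start + (span * i) // num_windows for i in range(num_windows + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


windows = equal_bp_windows(start=0, end=1000, num_windows=4)
print(windows)
```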

To better support workflows with partial GRGs, the GRG now stores its
genomic range (in base-pair positions). Old GRG files just use the min/max
position of their Mutations as the range. New GRGs can have their range
specified explicitly, since you might want to record the region the GRG was
constructed from (not just the first/last mutation in it).
In Python, you can access these two ranges via `grg.bp_range` (the range as
found in the mutation list) and `grg.specified_bp_range` (the range as
specified during construction of the GRG).
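
The relationship between the two ranges can be modeled with plain Python. This sketch does not use pygrgl; the function names are hypothetical, and whether the stored range is inclusive or half-open is not stated above, so the example simply uses the min/max positions as described.

```python
from typing import Optional, Sequence, Tuple

BpRange = Tuple[int, int]


def mutation_bp_range(positions: Sequence[int]) -> BpRange:
    """Range implied by the mutations alone: (min position, max position)."""
    if not positions:
        raise ValueError("need at least one mutation position")
    return (min(positions), max(positions))


def effective_bp_range(
    positions: Sequence[int], specified: Optional[BpRange] = None
) -> BpRange:
    """Prefer an explicitly specified range; otherwise fall back to the
    range implied by the mutations, as an old GRG file would."""
    return specified if specified is not None else mutation_bp_range(positions)


positions = [1200, 1850, 4990]
print(effective_bp_range(positions))                  # old-style file
print(effective_bp_range(positions, (1000, 5000)))    # range given at construction
```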

The Windowing API (in C++) does support overlapping windows, but this has not
yet been exposed to the `grg split` command or to Python.
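
For readers unfamiliar with overlapping windows, the idea is sketched below in Python. This is not the C++ Windowing API and does not reflect its interface; `overlapping_windows`, `width`, and `stride` are names chosen for this example only.

```python
from typing import List, Tuple


def overlapping_windows(
    start: int, end: int, width: int, stride: int
) -> List[Tuple[int, int]]:
    """Half-open windows of the given width over [start, end), with
    consecutive windows starting `stride` apart. A stride smaller than
    the width makes neighbouring windows overlap."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    windows = []
    pos = start
    while pos < end:
        windows.append((pos, min(pos + width, end)))
        if pos + width >= end:
            break  # this window already reaches the end of the range
        pos += stride
    return windows


wins = overlapping_windows(start=0, end=1000, width=400, stride=200)
print(wins)
```

Each position covered by more than one window is stored once per window, which is where the extra disk space for overlapping splits comes from.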

Also: fix a bug in the formatting script and properly format the internal
headers in `src/`.

The clang-format script was skipping headers in the src/ directory
previously. Fixed the script and updated the headers.
@dcdehaas dcdehaas merged commit c0a5b37 into main Nov 19, 2024
3 checks passed
@dcdehaas dcdehaas deleted the partial_grgs branch November 19, 2024 17:42