
Feature/mapper chunking #46

Merged
merged 20 commits into develop on Oct 2, 2024

Conversation

japols
Member

@japols japols commented Sep 18, 2024

Describe your changes

This PR adds a chunking strategy for the GraphTransformerMapperBlock to reduce peak memory usage during inference.

During inference, peak memory usage is dominated by large spikes in the encoder and decoder. Chunking the edges and computing the GraphTransformer forward pass sequentially over those chunks flattens these peaks without a significant increase in runtime. The subsequent projection MLPs are also computed sequentially in (node) chunks.
Here are some results for a single-gpu, 240h-leadtime inference run on n320, hidden grid o48:

| num_chunks | Peak memory usage (GB) | Time (s) |
|-----------:|-----------------------:|---------:|
| 1          | 33.45                  | 159.13   |
| 2          | 29.81                  | 159.45   |
| 4          | 24.85                  | 160.77   |
| 8          | 22.91                  | 160.84   |
| 16         | 22.08                  | 164.38   |
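The trade-off in the table above can be illustrated with a minimal sketch of the chunking idea. This is not the actual `GraphTransformerMapperBlock` implementation; the function and argument names here are hypothetical, and `process_chunk` stands in for the expensive attention/MLP computation on one slice of edges:

```python
import math

def chunked_edge_forward(edges, num_chunks, process_chunk):
    """Process `edges` in `num_chunks` sequential slices instead of all at once.

    `edges` is any indexable sequence of edge data; `process_chunk` is a
    stand-in for the expensive per-chunk computation. Peak memory of the
    per-chunk intermediates then scales with len(edges) / num_chunks
    rather than len(edges), at the cost of sequential execution.
    """
    chunk_size = math.ceil(len(edges) / num_chunks)
    out = []
    for start in range(0, len(edges), chunk_size):
        # Only one chunk's worth of intermediate results is alive at a time.
        out.extend(process_chunk(edges[start:start + chunk_size]))
    return out

# Toy usage: "processing" an edge just doubles its weight.
edges = list(range(10))
result = chunked_edge_forward(edges, num_chunks=4,
                              process_chunk=lambda c: [e * 2 for e in c])
assert result == [e * 2 for e in edges]  # identical output to the unchunked pass
```

The key property, reflected in the benchmark numbers, is that the output is bit-for-bit the same as the unchunked pass while peak memory drops roughly in proportion to `num_chunks`, and runtime grows only slightly from the loss of parallelism.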

The number of chunks during inference is controlled by an environment variable, ANEMOI_NUM_CHUNKS_INFERENCE, which seems to be the only practical solution at the moment since we cannot yet pass options to the model during inference. It falls back to no chunking, i.e. num_chunks=1, if not set.
I am happy to hear any suggestions or alternative solutions.
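The fallback behaviour described above could look roughly like the following sketch. The variable name ANEMOI_NUM_CHUNKS_INFERENCE is from this PR; the helper function name and the exact validation logic are illustrative assumptions, not the merged code:

```python
import os

def get_num_chunks() -> int:
    """Read the inference chunk count from the environment.

    Falls back to 1 (no chunking) when the variable is unset or
    cannot be parsed as a positive integer.
    """
    try:
        return max(1, int(os.environ.get("ANEMOI_NUM_CHUNKS_INFERENCE", "1")))
    except ValueError:
        return 1

# Usage: export ANEMOI_NUM_CHUNKS_INFERENCE=4 before running inference.
os.environ["ANEMOI_NUM_CHUNKS_INFERENCE"] = "4"
print(get_num_chunks())  # → 4
```

Defaulting to 1 keeps the change non-breaking: existing inference runs that never set the variable behave exactly as before.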

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update*

*I am not sure how/where to document the usage of the environment variable; the goal would be to find a better long-term solution at some point.

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model*
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

*technically the environment variable is a new dependency in inference

Tag possible reviewers

@ssmmnn11 @JesperDramsch @theissenhelen @gmertes


📚 Documentation preview 📚: https://anemoi-models--46.org.readthedocs.build/en/46/

@japols added the "enhancement" label on Sep 18, 2024
@FussyDuck

FussyDuck commented Sep 18, 2024

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.84%. Comparing base (57ea969) to head (b6e34b9).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #46   +/-   ##
========================================
  Coverage    99.84%   99.84%           
========================================
  Files           23       23           
  Lines         1277     1301   +24     
========================================
+ Hits          1275     1299   +24     
  Misses           2        2           

☔ View full report in Codecov by Sentry.

Member

@gmertes gmertes left a comment


Regarding the environment variable:

We will come up with an interface to pass options to the model during inference. In the meantime, I suggested to Jan we work with an environment variable because this does not require a code change across the anemois.

When we have this options interface, we can load inference settings from a predefined place (the user config, command line, etc). While the environment variable is not a long-term solution, we can keep it as "one of the sources to look for options", lowest in the priority list.

We will delegate the options handling to anemoi-inference, which then passes the option values to the model. So down the line this env var will probably move to anemoi-inference, into a general inference options handler. But I suggest we deal with that later, so we don't block Jan's work.
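The precedence scheme proposed above (env var kept as the lowest-priority source of options) can be sketched as follows. This is a hypothetical illustration of the idea, not an existing anemoi-inference API; the function name and signature are assumptions:

```python
import os

def resolve_option(name, cli=None, user_config=None, env_prefix="ANEMOI_"):
    """Resolve an inference option from the highest-priority source available.

    Hypothetical precedence, matching the suggestion in this thread:
    command line > user config > environment variable > None.
    """
    if cli is not None:
        return cli
    if user_config and name in user_config:
        return user_config[name]
    # Environment variable is deliberately the lowest-priority source,
    # so the interim ANEMOI_NUM_CHUNKS_INFERENCE variable keeps working.
    return os.environ.get(env_prefix + name.upper())

# Usage: a CLI value wins over both config and environment.
os.environ["ANEMOI_NUM_CHUNKS_INFERENCE"] = "4"
print(resolve_option("num_chunks_inference", cli=16))  # → 16
```

Keeping the env var as a fallback source means the interim mechanism from this PR would continue to work unchanged once the options interface lands.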

src/anemoi/models/layers/block.py: review comment (outdated, resolved)
ssmmnn11
ssmmnn11 previously approved these changes Sep 23, 2024
Member

@ssmmnn11 ssmmnn11 left a comment


Great work!

japols and others added 16 commits September 26, 2024 09:17
updates:
- [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0)
- [github.com/astral-sh/ruff-pre-commit: v0.4.6 → v0.6.2](astral-sh/ruff-pre-commit@v0.4.6...v0.6.2)
- [github.com/tox-dev/pyproject-fmt: 2.1.3 → 2.2.1](tox-dev/pyproject-fmt@2.1.3...2.2.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* ci: add changelog release updater

* docs: update changelog
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.6.2 → v0.6.3](astral-sh/ruff-pre-commit@v0.6.2...v0.6.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: make model instantiateable

* docs: instantiation explained in changelog
* feat: remapper and change to data indices when mapping one variable to several

* tests: update imputer and normalizer tests

* feat: include remapper as a preprocessor. update init for all preprocessors. add tests for the remapper.

* tests: update tests for index collection

* documentation and changelog

* feat: enable remapper for forcing variables

* tests: include remapping forcing variable and do not test with remapping variables at the end

* comment and warning about using in_place=True in remapper as this is not possible

* comments: incorporate changes/documentation requested by jesper

* change order of function inputs preprocessors, documentation for data indices and remapper

* style: dict in config files for defining the variables to be remapped. structure and additional assert in index collection.

* args in preprocessors
* fix: data indices import. typo documentation

* fix: data indices import. typo documentation

* typo

* tests: remapped variables need to be given as dictionary not list
* fix: make config object optional

* chore: add .envrc to gitignore

* docs: changelog
* [create-pull-request] automated change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CHANGELOG.md

---------

Co-authored-by: JesperDramsch <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@japols japols requested a review from gmertes October 1, 2024 14:06
@japols japols merged commit f96bcf9 into develop Oct 2, 2024
115 checks passed
@japols japols deleted the feature/mapper_chunking branch November 12, 2024 12:19
9 participants