
Feature/mapper chunking #46

Merged
merged 20 commits into develop on Oct 2, 2024

Conversation

japols
Member

@japols japols commented Sep 18, 2024

Describe your changes

This PR adds a chunking strategy for the GraphTransformerMapperBlock to reduce peak memory usage during inference.

During inference, peak memory usage is dominated by large spikes in the encoder and decoder. Chunking the edges and computing the GraphTransformer forward pass sequentially over those chunks flattens these peaks without a significant increase in runtime. The subsequent projection MLPs are also computed sequentially in (node) chunks.
Here are some results for a single-gpu, 240h-leadtime inference run on n320, hidden grid o48:

| num_chunks | Peak memory usage (GB) | Time (s) |
|-----------:|-----------------------:|---------:|
| 1          | 33.45                  | 159.13   |
| 2          | 29.81                  | 159.45   |
| 4          | 24.85                  | 160.77   |
| 8          | 22.91                  | 160.84   |
| 16         | 22.08                  | 164.38   |
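The trade-off in the table above can be illustrated with a minimal sketch of the chunking idea. This is not the actual `GraphTransformerMapperBlock` implementation; the function and argument names here are hypothetical, and `process_chunk` stands in for the expensive attention/MLP computation on one slice of edges:

```python
import math

def chunked_edge_forward(edges, num_chunks, process_chunk):
    """Process `edges` in `num_chunks` sequential slices instead of all at once.

    `edges` is any indexable sequence of edge data; `process_chunk` is a
    stand-in for the expensive per-chunk computation. Peak memory of the
    per-chunk intermediates then scales with len(edges) / num_chunks
    rather than len(edges), at the cost of sequential execution.
    """
    chunk_size = math.ceil(len(edges) / num_chunks)
    out = []
    for start in range(0, len(edges), chunk_size):
        # Only one chunk's worth of intermediate results is alive at a time.
        out.extend(process_chunk(edges[start:start + chunk_size]))
    return out

# Toy usage: "processing" an edge just doubles its weight.
edges = list(range(10))
result = chunked_edge_forward(edges, num_chunks=4,
                              process_chunk=lambda c: [e * 2 for e in c])
assert result == [e * 2 for e in edges]  # identical output to the unchunked pass
```

The key property, reflected in the benchmark numbers, is that the output is bit-for-bit the same as the unchunked pass while peak memory drops roughly in proportion to `num_chunks`, and runtime grows only slightly from the loss of parallelism.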

The number of chunks during inference is controlled by an environment variable, ANEMOI_NUM_CHUNKS_INFERENCE, which seems to be the only practical solution at the moment since we cannot yet pass options to the model during inference. It falls back to no chunking, i.e. num_chunks=1, if not set.
I am happy to hear any suggestions or alternative solutions.
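The fallback behaviour described above could look roughly like the following sketch. The variable name ANEMOI_NUM_CHUNKS_INFERENCE is from this PR; the helper function name and the exact validation logic are illustrative assumptions, not the merged code:

```python
import os

def get_num_chunks() -> int:
    """Read the inference chunk count from the environment.

    Falls back to 1 (no chunking) when the variable is unset or
    cannot be parsed as a positive integer.
    """
    try:
        return max(1, int(os.environ.get("ANEMOI_NUM_CHUNKS_INFERENCE", "1")))
    except ValueError:
        return 1

# Usage: export ANEMOI_NUM_CHUNKS_INFERENCE=4 before running inference.
os.environ["ANEMOI_NUM_CHUNKS_INFERENCE"] = "4"
print(get_num_chunks())  # → 4
```

Defaulting to 1 keeps the change non-breaking: existing inference runs that never set the variable behave exactly as before.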

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update*

*I am not sure how/where to document the usage of the environment variable; the goal would be to find a better long-term solution at some point.

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model*
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

*technically the environment variable is a new dependency in inference

Tag possible reviewers

@ssmmnn11 @JesperDramsch @theissenhelen @gmertes


📚 Documentation preview 📚: https://anemoi-models--46.org.readthedocs.build/en/46/

@japols added the "enhancement" label on Sep 18, 2024
@FussyDuck

FussyDuck commented Sep 18, 2024

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.84%. Comparing base (57ea969) to head (b6e34b9).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #46   +/-   ##
========================================
  Coverage    99.84%   99.84%           
========================================
  Files           23       23           
  Lines         1277     1301   +24     
========================================
+ Hits          1275     1299   +24     
  Misses           2        2           

☔ View full report in Codecov by Sentry.

Member

@gmertes gmertes left a comment


Regarding the environment variable:

We will come up with an interface to pass options to the model during inference. In the meantime, I suggested to Jan we work with an environment variable because this does not require a code change across the anemois.

When we have this options interface, we can load inference settings from a predefined place (the user config, command line, etc). While the environment variable is not a long-term solution, we can keep it as "one of the sources to look for options", lowest in the priority list.

We will delegate the options handling to anemoi-inference, which then passes the option values to the model. So down the line this env var will probably move to anemoi-inference, into a general inference options handler. But I suggest we deal with that later, so we don't block Jan's work.
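The precedence scheme proposed above (env var kept as the lowest-priority source of options) can be sketched as follows. This is a hypothetical illustration of the idea, not an existing anemoi-inference API; the function name and signature are assumptions:

```python
import os

def resolve_option(name, cli=None, user_config=None, env_prefix="ANEMOI_"):
    """Resolve an inference option from the highest-priority source available.

    Hypothetical precedence, matching the suggestion in this thread:
    command line > user config > environment variable > None.
    """
    if cli is not None:
        return cli
    if user_config and name in user_config:
        return user_config[name]
    # Environment variable is deliberately the lowest-priority source,
    # so the interim ANEMOI_NUM_CHUNKS_INFERENCE variable keeps working.
    return os.environ.get(env_prefix + name.upper())

# Usage: a CLI value wins over both config and environment.
os.environ["ANEMOI_NUM_CHUNKS_INFERENCE"] = "4"
print(resolve_option("num_chunks_inference", cli=16))  # → 16
```

Keeping the env var as a fallback source means the interim mechanism from this PR would continue to work unchanged once the options interface lands.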

src/anemoi/models/layers/block.py: review comment (outdated, resolved)
ssmmnn11
ssmmnn11 previously approved these changes Sep 23, 2024
Member

@ssmmnn11 ssmmnn11 left a comment


Great work!

japols and others added 16 commits September 26, 2024 09:17
updates:
- [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0)
- [github.com/astral-sh/ruff-pre-commit: v0.4.6 → v0.6.2](astral-sh/ruff-pre-commit@v0.4.6...v0.6.2)
- [github.com/tox-dev/pyproject-fmt: 2.1.3 → 2.2.1](tox-dev/pyproject-fmt@2.1.3...2.2.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* ci: add changelog release updater

* docs: update changelog
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.6.2 → v0.6.3](astral-sh/ruff-pre-commit@v0.6.2...v0.6.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: make model instantiateable

* docs: instantiation explained in changelog
* feat: remapper and change to data indices when mapping one variable to several

* tests: update imputer and normalizer tests

* feat: include remapper as a preprocessor. update init for all preprocessors. add tests for the remapper.

* tests: update tests for index collection

* documentation and changelog

* feat: enable remapper for forcing variables

* tests: include remapping forcing variable and do not test with remapping variables at the end

* comment and warning about using in_place=True in remapper as this is not possible

* comments: incorporate changes/documentation requested by jesper

* change order of function inputs preprocessors, documentation for data indices and remapper

* style: dict in config files for defining the variables to be remapped. structure and additional assert in index collection.

* args in preprocessors
* fix: data indices import. typo documentation

* fix: data indices import. typo documentation

* typo

* tests: remapped variables need to be given as dictionary not list
* fix: make config object optional

* chore: add .envrc to gitignore

* docs: changelog
* [create-pull-request] automated change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CHANGELOG.md

---------

Co-authored-by: JesperDramsch <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@japols japols requested a review from gmertes October 1, 2024 14:06
@japols japols merged commit f96bcf9 into develop Oct 2, 2024
115 checks passed
@japols japols deleted the feature/mapper_chunking branch November 12, 2024 12:19
9 participants