Memory issues and Einsum #748
-
When using large configurations (~900 atoms) I run out of memory on an A100 GPU.
Could this be solved by doing the einsum in batches, perhaps using an opt-einsum module? I saw this being used in other MLIP codes (but maybe you have already considered it and there is a reason it is not possible). In any case it is possible to avoid running out of memory by using e.g. an H100 GPU with more memory, or by choosing a smaller cutoff, but it would be nice to avoid the issue if someone really needs a larger cutoff and has to run on a smaller GPU. edit: I already set
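For illustration, here is a minimal sketch of what "doing the einsum in batches" could look like: splitting the contraction over the leading (per-atom) axis in chunks so the peak intermediate tensor stays small. The helper `chunked_einsum`, the chunk size, and the toy shapes are all hypothetical, not part of MACE, and shown with NumPy for simplicity; the same idea applies to torch tensors.

```python
import numpy as np

def chunked_einsum(subscripts, a, b, chunk_size=256):
    """Hypothetical helper: evaluate an einsum over the leading axis of `a`
    in chunks to cap peak memory, then concatenate the partial results.
    Assumes the leading axis of `a` is a pure batch axis of the output."""
    parts = []
    for start in range(0, a.shape[0], chunk_size):
        # each partial contraction only materialises chunk_size rows at a time
        parts.append(np.einsum(subscripts, a[start:start + chunk_size], b))
    return np.concatenate(parts, axis=0)

# toy check: the chunked result matches the one-shot contraction
a = np.random.rand(1000, 8, 8)
b = np.random.rand(8, 8)
full = np.einsum("nij,jk->nik", a, b)
chunked = chunked_einsum("nij,jk->nik", a, b, chunk_size=300)
assert np.allclose(full, chunked)
```

Note this only trades memory for a loop over kernel launches; whether it helps in practice depends on where the large intermediate actually appears in the contraction.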
Replies: 1 comment 15 replies
-
Hey @lucasdekam, we already use optimised einsums. Did you try cueq? It should enable you to load more atoms on your GPU. Are you training or evaluating? Also, what is the size of your model (channels, L, rmax)?
You can look at the docs: https://mace-docs.readthedocs.io/en/latest/guide/cuda_acceleration.html. These are CUDA kernels developed by NVIDIA to accelerate MACE. You can train with them.
Two notes: