
Switch layer kernel implementation in the config #35

Draft
wants to merge 17 commits into develop

Conversation


@cathalobrien cathalobrien commented Sep 11, 2024

Describe your changes

This PR makes it possible to switch the implementation of Linear and LayerNorm kernels in the config.

At the moment we use the torch.nn implementation for many layers in the Anemoi model, e.g. torch.nn.LayerNorm and torch.nn.Linear. This has the advantage of being available out of the box with torch and portable to many systems (CPU, AMD and Nvidia GPUs). However, other layer implementations might be more efficient for certain hardware, or use different algorithms we want to explore (RMSNorm is an alternate implementation of LayerNorm, for instance).
These might only run on certain systems (e.g. Nvidia's transformer_engine.pytorch provides layer implementations optimized for their GPUs).
Therefore, we'd like to be able to flexibly take advantage of these faster kernels when they're available, without losing the ability to fall back to torch.nn for resiliency.

This PR adds the following block to config/model/.yaml:

  layer_kernels:
    LayerNorm:
      #_target_: "transformer_engine.pytorch.LayerNorm"
      _target_: "liger_kernel.transformers.rms_norm.LigerRMSNorm"
      #_target_: "torch.nn.LayerNorm" #the default PyTorch implementation
      _partial_: True
      #Any arguments to your chosen function go here e.g.
      #bias: False
    Linear:
      #_target_: "transformer_engine.pytorch.Linear"
      _target_: "torch.nn.Linear"
      _partial_: True

In the future, this syntax could be extended to replace other layers (e.g. mlp) if required.

The calls to torch.nn are then replaced with

- self.layer_norm1 = nn.LayerNorm(num_channels)
+ LayerNorm=layer_kernels['LayerNorm']
+ self.layer_norm1 = LayerNorm(num_channels)
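
For anyone unfamiliar with Hydra's `_partial_` mechanism, here is a minimal, self-contained sketch (using a stand-in config dict rather than the actual Anemoi config) of how an entry like the one above resolves into something you can call like a layer class:

    import torch
    from hydra.utils import instantiate
    from omegaconf import OmegaConf

    # Stand-in for the layer_kernels block shown above (illustrative only, not the real config)
    layer_kernels_cfg = OmegaConf.create({
        "LayerNorm": {"_target_": "torch.nn.LayerNorm", "_partial_": True},
        "Linear": {"_target_": "torch.nn.Linear", "_partial_": True, "bias": False},
    })

    # With _partial_: True, instantiate() returns a functools.partial around the target class,
    # with any extra config keys (e.g. bias: False) pre-bound as keyword arguments.
    layer_kernels = {name: instantiate(entry) for name, entry in layer_kernels_cfg.items()}

    LayerNorm = layer_kernels["LayerNorm"]  # callable like a class
    Linear = layer_kernels["Linear"]

    layer_norm1 = LayerNorm(64)   # -> torch.nn.LayerNorm(64)
    proj = Linear(64, 128)        # -> torch.nn.Linear(64, 128, bias=False)

    x = torch.randn(2, 64)
    print(layer_norm1(x).shape, proj(x).shape)  # torch.Size([2, 64]) torch.Size([2, 128])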

You can pass any parameters to your chosen kernel in the config file, after "_partial_: True". Hydra tries to load the desired kernel in "models/encoder_processor_decoder.py". If the desired library isn't available, it currently falls back to torch.nn.

        for kernel in self.layer_kernels:
            kernel_entry = self.layer_kernels[kernel]
            try:
                instantiate(kernel_entry)
            except InstantiationException:
                LOGGER.info(f"{kernel_entry['_target_']} not available! Falling back to torch.nn.{kernel}")
                # Replace the entry, to remove any args passed to the original kernel
                self.layer_kernels[kernel] = DotDict({"_target_": f"torch.nn.{kernel}", "_partial_": True})
        LOGGER.debug(f"{self.layer_kernels=}")

I am in two minds about this. On the one hand, it helps ensure you can still run if you are missing a library (which might be necessary when doing inference in a different environment to the one where the model was trained). But on the other hand, maybe it would be better to fail loudly when the user requests an uninstalled library. Otherwise, a user could be under the impression they are using an optimised kernel when they are not. It also feels like poor programming practice to branch on exceptions like I am doing here. Open to suggestions on this (one possible alternative is sketched below).
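
For reference, a minimal sketch of the "fail loudly" alternative (not what this PR currently does), assuming the same imports and self.layer_kernels structure as the snippet above:

    for kernel, kernel_entry in self.layer_kernels.items():
        try:
            instantiate(kernel_entry)
        except InstantiationException as err:
            # No silent fallback: surface the misconfiguration immediately.
            raise RuntimeError(
                f"Layer kernel '{kernel_entry['_target_']}' requested for '{kernel}' could not be "
                f"instantiated. Install the required library or set the config back to torch.nn.{kernel}."
            ) from err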

This feature makes it easy to try out new kernels in an end-to-end machine learning run, rather than simply doing a standalone kernel benchmark. Using this feature I was able to trial the RMSNorm implementation of LayerNorm from Liger Kernel in two lines (pip install liger_kernel; vim config/model/transformer.yaml), and I saw a ~10% speedup.

LayerNorm                                       | run_training_batch time (s) | training_avg_throughput (iter/s)
torch.nn.LayerNorm                              | 0.93532                     | 0.88574
liger_kernel.transformers.rms_norm.LigerRMSNorm | 0.82982                     | 0.99186

Type of change


New feature (non-breaking change which adds functionality)

This change requires a documentation update

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model
  • I have run this on a single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

@FussyDuck

FussyDuck commented Sep 11, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ cathalobrien
❌ Cathal Liam O Brien


Cathal Liam O Brien seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@clessig

clessig commented Sep 11, 2024

I think silently switching to a different implementation is dangerous. As we see with the attention, there might be differences in the implementation that the user should be aware of and think about when selecting a different backend.

@cathalobrien
Author

LOGGER.info(f"{kernel_entry['_target_']} not available! Falling back to torch.nn.{kernel}")

I agree, good analogy with attention. At the moment there's a warning, but stdout is easily missed.

However, we have to make sure inference is still possible without any additional libraries that might have been used during training. This could be done by resetting the 'layer_kernels' config entry to "torch.nn" during inference, but at the moment there's apparently no easy way to tell if you're in inference or training, so this requires some thought.

src/anemoi/models/layers/attention.py (outdated review thread, resolved)
src/anemoi/models/layers/block.py (outdated review thread, resolved)
      mlp1.append(act_func())
-     mlp1.append(nn.Linear(hidden_dim, out_features))
+     mlp1.append(Linear(hidden_dim, out_features))

      if final_activation:
          mlp1.append(act_func())

      if layer_norm:
          mlp1.append(AutocastLayerNorm(out_features))
Member

do we still want the AutocastLayerNorm here? should this be replaced by a LayerNorm?

Author

Is that where we landed from the Nvidia meeting way back? I can do some tests

src/anemoi/models/models/encoder_processor_decoder.py (outdated review threads, resolved)
@@ -69,6 +70,31 @@ def __init__(

self.num_channels = config.model.num_channels

# If self.layer_kernels entry is missing from the config, use torch.nn by default
Member

Where is the config? I'm not seeing it in this PR.

Author

The configs are part of anemoi-training now, so I guess I'll just update the docs for anemoi-training to say this feature exists and show some examples. Otherwise I can make a small PR to anemoi-training with an updated config.

src/anemoi/models/models/encoder_processor_decoder.py (outdated review thread, resolved)
@cathalobrien
Author


I'll make a PR to ai_models, where I reset layer_kernels to torch.nn if they've been set. Seems like the most straightforward way to handle inference.
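
A minimal sketch of what such a reset could look like (the helper name and config layout are illustrative assumptions, not part of this PR):

    from omegaconf import DictConfig, OmegaConf

    def reset_layer_kernels_to_torch(config: DictConfig) -> DictConfig:
        """Hypothetical helper for inference: force every layer_kernels entry back to torch.nn."""
        if "layer_kernels" in config.get("model", {}):
            for kernel in config.model.layer_kernels:
                config.model.layer_kernels[kernel] = OmegaConf.create(
                    {"_target_": f"torch.nn.{kernel}", "_partial_": True}
                )
        return config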

@JesperDramsch
Member

Open to a discussion, but I think we shouldn't have "default fallbacks" for these types of things, if the code fails to instantiate from the provided config.

When someone submits their job with sbatch for a specific experiment, it should probably be obvious that the config wasn't valid, but with these fallbacks it just "runs through" anyway and you have to monitor your logs to see whether anything went wrong with your experiment due to a misconfiguration or a wrong environment.

@cathalobrien
Author

cathalobrien commented Oct 9, 2024


The current behaviour is (a sketch follows below):

  • if the layer_kernels entry is missing from the config -> use torch.nn by default. This maintains backwards compatibility with older config files.

  • if the config entry exists but the library it describes can't be loaded -> throw an error.
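
A minimal sketch of that behaviour, assuming the same imports (instantiate, InstantiationException, DotDict) as the earlier snippet; names are illustrative, not the exact PR code:

    # Missing entry -> default to torch.nn for backwards compatibility with older configs.
    if "layer_kernels" not in config.model:
        layer_kernels = DotDict(
            {name: {"_target_": f"torch.nn.{name}", "_partial_": True} for name in ("LayerNorm", "Linear")}
        )
    else:
        layer_kernels = config.model.layer_kernels
        # Entry present but the library cannot be loaded -> raise instead of silently falling back.
        for kernel_entry in layer_kernels.values():
            instantiate(kernel_entry)  # raises InstantiationException if the target is unavailable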
