
Share Inception v3 network between FID, KID, Inception Score, and MiFID #2099

Closed
gavrieladler opened this issue Sep 21, 2023 · 3 comments · Fixed by #2120
gavrieladler commented Sep 21, 2023

🚀 Feature

A single copy of Inception v3 loaded on a GPU and shared between all the metrics which use it.

Motivation

FID, KID, Inception Score, and MiFID all use the same Inception v3 network, which takes 304MB of GPU memory when loaded. If you want to calculate all four scores while training, that's roughly 1.2GB of GPU memory per GPU. It would be ideal if the metrics could share a single copy of the network to minimize GPU memory usage.

Pitch

Torchmetrics should know when a metric that uses Inception v3 has already been loaded and share that copy between metrics, so that creating a second metric does not create a second copy of the network.

Alternatives

  1. Document / officially support the currently "unofficial, unsupported" way to do this -- pass a reference to the network from one metric into the __init__ of another metric. For example, today you could do:
my_fid = torchmetrics.image.fid.FrechetInceptionDistance()
my_is = torchmetrics.image.inception.InceptionScore(feature=my_fid.inception)

However, I am hesitant to rely on this, as it depends on internal details of the metrics which are subject to change from release to release; I would not know from the release notes whether it had broken.

  2. A "parent metric" which creates one copy of Inception v3 together with the metrics you selected to use it with, and whose calls to update / reset / compute are passed down to each individual metric (see the sketch below). This could get a bit messy as the __init__ APIs do not exactly line up.
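
A rough sketch of what such a parent metric might look like (the SharedInceptionMetrics class is purely hypothetical and just reuses the feature= trick from option 1; the fid.inception attribute it shares is an internal detail):

import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image.kid import KernelInceptionDistance


class SharedInceptionMetrics(torch.nn.Module):
    """Hypothetical parent wrapper holding one Inception v3 copy shared by FID, IS, and KID."""

    def __init__(self):
        super().__init__()
        self.fid = FrechetInceptionDistance()
        # Reuse the network already loaded by FID instead of loading new copies.
        self.inception_score = InceptionScore(feature=self.fid.inception)
        self.kid = KernelInceptionDistance(feature=self.fid.inception)

    def update(self, imgs, real):
        self.fid.update(imgs, real=real)
        self.kid.update(imgs, real=real)
        if not real:  # InceptionScore only evaluates generated images
            self.inception_score.update(imgs)

    def compute(self):
        return {
            "fid": self.fid.compute(),
            "inception_score": self.inception_score.compute(),
            "kid": self.kid.compute(),
        }

    def reset(self):
        for metric in (self.fid, self.inception_score, self.kid):
            metric.reset()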

Additional context

It would be doubly awesome if the solution allowed sharing the activations as well as the network itself, as currently I am computing the same activations separately on each copy of the network.

Thanks so much!

@gavrieladler added the enhancement (New feature or request) label on Sep 21, 2023
github-actions (bot) commented

Hi! Thanks for your contribution, great first issue!

SkafteNicki (Member) commented

Hi @gavrieladler, thanks for raising this issue.
I really think what you are proposing is a good idea, as it is a bit silly that the same network is initialized multiple times and that the activations are not reused either. Reusing internal computations is already part of MetricCollection, but not in this way.

The current solution I came up with uses a caching system and a specialized MetricCollection. This collection, called FeatureShare (naming is work in progress), essentially has two features:

  • Make sure that a single copy of the network is used across all metrics. This is simply done by passing the network of the first metric in the collection to the rest, essentially overwriting theirs.
  • Secondly, this shared network gets wrapped in an lru_cache, meaning that input-output pairs will be cached. If we set the max size of the cache to the number of metrics in the collection (or a bit above), it should not impact memory that much (I hope), and the same computation should be reused when the metrics are updated one after another.

Here is some work in progress code

from functools import lru_cache
from typing import Dict, Sequence, Union

from torch.nn import Module
from torchmetrics import Metric, MetricCollection


class NetworkCache(Module):
    """Wrap a network so that repeated calls with the same input reuse the cached output."""

    def __init__(self, network, max_size=100):
        super().__init__()
        self.max_size = max_size
        self.network = lru_cache(maxsize=self.max_size)(network)

    def forward(self, *args, **kwargs):
        return self.network(*args, **kwargs)


class FeatureShare(MetricCollection):
    def __init__(
        self,
        metrics: Union[Metric, Sequence[Metric], Dict[str, Metric]],
        *additional_metrics: Metric,
        network_names: Union[str, Sequence[str]],
    ) -> None:
        super().__init__(metrics, *additional_metrics)

        if isinstance(network_names, str):
            network_names = [network_names] * len(self)
        elif len(network_names) != len(self):
            raise ValueError("The number of network names should be equal to the number of metrics.")

        # Take the network from the first metric, wrap it in a cache, and
        # overwrite the corresponding attribute on every metric in the collection.
        shared_net = getattr(getattr(self, list(self.keys())[0]), network_names[0])
        cached_net = NetworkCache(shared_net)
        for (_, metric), network_name in zip(self.items(), network_names):
            setattr(metric, network_name, cached_net)

So, a very basic analysis:

import time

import torch
from torchmetrics import MetricCollection
from torchmetrics.image import FrechetInceptionDistance, InceptionScore, KernelInceptionDistance

# FeatureShare as defined above, compared against a plain MetricCollection
fs = FeatureShare([FrechetInceptionDistance(), InceptionScore(), KernelInceptionDistance()], network_names=['inception', 'inception', 'inception'])
mc = MetricCollection([FrechetInceptionDistance(), InceptionScore(), KernelInceptionDistance()])

start = time.time()
for _ in range(10):
    x = torch.randint(255, (1, 3, 64, 64), dtype=torch.uint8)
    mc.update(x, real=True)
end = time.time()
print(end - start)

start = time.time()
for _ in range(10):
    x = torch.randint(255, (1, 3, 64, 64), dtype=torch.uint8)
    fs.update(x, real=True)
end = time.time()
print(end - start)

gives me 5.57s and 1.92s respectively for MetricCollection and FeatureShare, meaning that sharing the features is approximately 2.9x faster, which is very much what I would expect when we have three metrics that can reuse the same computation.
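
The speedup comes from cache hits: within a single update call every metric in the collection receives the exact same tensor object, and since tensors are (by default) hashed by object identity, the lru_cache can return the stored output instead of running the network again. A tiny standalone illustration (fake_network is just a stand-in, not the real Inception wrapper):

from functools import lru_cache

import torch

calls = 0

def fake_network(x):
    # stand-in for the Inception forward pass
    global calls
    calls += 1
    return x.float().mean()

cached = lru_cache(maxsize=8)(fake_network)
x = torch.randint(255, (1, 3, 64, 64), dtype=torch.uint8)
cached(x)
cached(x)
cached(x)  # three metrics updating with the same batch
print(calls)  # 1 -- the underlying network only ran once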

What do you think about this?

@SkafteNicki added this to the v1.3.0 milestone on Sep 25, 2023
gavrieladler (Author) commented Sep 26, 2023

Hi @SkafteNicki,

Thanks for the detailed reply! I think this looks good and is absolutely usable as presented.

One suggestion, and I could go either way: I do think there is an argument to be made that, because inception is an attribute name set by the library, the line

fs = FeatureShare([FrechetInceptionDistance(), InceptionScore(), KernelInceptionDistance()], network_names=['inception', 'inception', 'inception'])

may be unnecessary for the user to specify. Torchmetrics only supports a finite number of metrics and is able to "know" which of the selected metrics should share a network. Only requiring the user to specify

mc = MetricCollection([FrechetInceptionDistance(), InceptionScore(), KernelInceptionDistance()])

without detailed knowledge of the inner workings of each metric is a cleaner API. Especially since inception is an internal name not exposed by the API, asking the user to specify network_names forces them to dig through the source code to find out what the network names are. Requiring network_names also creates the risk that these internal attribute names change between releases, breaking user code.

When any of FID, IS, and KID are created, the code could automatically add their inception networks to a cache (a rough sketch of what this might look like is below).
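
For example (purely hypothetical -- the _DEFAULT_NETWORK_ATTRS mapping and infer_network_names helper do not exist in torchmetrics), FeatureShare could look the attribute names up itself:

from torchmetrics.image import FrechetInceptionDistance, InceptionScore, KernelInceptionDistance

# Hypothetical built-in mapping from metric class to the attribute holding its backbone.
_DEFAULT_NETWORK_ATTRS = {
    FrechetInceptionDistance: "inception",
    InceptionScore: "inception",
    KernelInceptionDistance: "inception",
}

def infer_network_names(metrics):
    """Look up the backbone attribute for each metric so users never pass network_names."""
    return [_DEFAULT_NETWORK_ATTRS[type(m)] for m in metrics]

# Uses the work-in-progress FeatureShare class from earlier in this thread.
metrics = [FrechetInceptionDistance(), InceptionScore(), KernelInceptionDistance()]
fs = FeatureShare(metrics, network_names=infer_network_names(metrics))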

Thoughts? Like I said, the solution presented is definitely usable as is.
