diff --git a/docs/.buildinfo b/docs/.buildinfo index 62cc9b6..7f70517 100644 --- a/docs/.buildinfo +++ b/docs/.buildinfo @@ -1,4 +1,4 @@ # Sphinx build info version 1 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. -config: ba51abc8dad17399953f2a24e939f1ec +config: ff98f6fae0a75c232c4a4aa789f50b6c tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle index c93c240..772ffe6 100644 Binary files a/docs/.doctrees/environment.pickle and b/docs/.doctrees/environment.pickle differ diff --git a/docs/.doctrees/recipes/profile_with_itt.doctree b/docs/.doctrees/recipes/profile_with_itt.doctree index c4b99b0..66c4eb8 100644 Binary files a/docs/.doctrees/recipes/profile_with_itt.doctree and b/docs/.doctrees/recipes/profile_with_itt.doctree differ diff --git a/docs/.doctrees/recipes/recipes/Captum_Recipe.doctree b/docs/.doctrees/recipes/recipes/Captum_Recipe.doctree index d1bfe85..4fbe499 100644 Binary files a/docs/.doctrees/recipes/recipes/Captum_Recipe.doctree and b/docs/.doctrees/recipes/recipes/Captum_Recipe.doctree differ diff --git a/docs/.doctrees/recipes/recipes/benchmark.doctree b/docs/.doctrees/recipes/recipes/benchmark.doctree index 13c9d12..b6cb7cd 100644 Binary files a/docs/.doctrees/recipes/recipes/benchmark.doctree and b/docs/.doctrees/recipes/recipes/benchmark.doctree differ diff --git a/docs/.doctrees/recipes/recipes/dynamic_quantization.doctree b/docs/.doctrees/recipes/recipes/dynamic_quantization.doctree index 8a2004a..fffd58b 100644 Binary files a/docs/.doctrees/recipes/recipes/dynamic_quantization.doctree and b/docs/.doctrees/recipes/recipes/dynamic_quantization.doctree differ diff --git a/docs/.doctrees/recipes/recipes/index.doctree b/docs/.doctrees/recipes/recipes/index.doctree index 3825d1e..981e441 100644 Binary files a/docs/.doctrees/recipes/recipes/index.doctree and b/docs/.doctrees/recipes/recipes/index.doctree differ diff --git a/docs/.doctrees/recipes/recipes/module_load_state_dict_tips.doctree b/docs/.doctrees/recipes/recipes/module_load_state_dict_tips.doctree index d154643..155a823 100644 Binary files a/docs/.doctrees/recipes/recipes/module_load_state_dict_tips.doctree and b/docs/.doctrees/recipes/recipes/module_load_state_dict_tips.doctree differ diff --git a/docs/.doctrees/recipes/recipes/reasoning_about_shapes.doctree b/docs/.doctrees/recipes/recipes/reasoning_about_shapes.doctree index 5710d30..3ca9ce3 100644 Binary files a/docs/.doctrees/recipes/recipes/reasoning_about_shapes.doctree and b/docs/.doctrees/recipes/recipes/reasoning_about_shapes.doctree differ diff --git a/docs/.doctrees/recipes/recipes/swap_tensors.doctree b/docs/.doctrees/recipes/recipes/swap_tensors.doctree index c4e30f8..b3c51c3 100644 Binary files a/docs/.doctrees/recipes/recipes/swap_tensors.doctree and b/docs/.doctrees/recipes/recipes/swap_tensors.doctree differ diff --git a/docs/.doctrees/recipes/recipes/tensorboard_with_pytorch.doctree b/docs/.doctrees/recipes/recipes/tensorboard_with_pytorch.doctree index ae19a5d..245d33d 100644 Binary files a/docs/.doctrees/recipes/recipes/tensorboard_with_pytorch.doctree and b/docs/.doctrees/recipes/recipes/tensorboard_with_pytorch.doctree differ diff --git a/docs/.doctrees/recipes/recipes_index.doctree b/docs/.doctrees/recipes/recipes_index.doctree index 5ec4763..dc7de89 100644 Binary files a/docs/.doctrees/recipes/recipes_index.doctree and b/docs/.doctrees/recipes/recipes_index.doctree differ diff --git 
a/docs/.doctrees/recipes/torch_compile_backend_ipex.doctree b/docs/.doctrees/recipes/torch_compile_backend_ipex.doctree index c072160..bf4b1da 100644 Binary files a/docs/.doctrees/recipes/torch_compile_backend_ipex.doctree and b/docs/.doctrees/recipes/torch_compile_backend_ipex.doctree differ diff --git a/docs/.doctrees/recipes/torch_logs.doctree b/docs/.doctrees/recipes/torch_logs.doctree index cf7fe08..963bf6f 100644 Binary files a/docs/.doctrees/recipes/torch_logs.doctree and b/docs/.doctrees/recipes/torch_logs.doctree differ diff --git a/docs/.doctrees/recipes/torchscript_inference.doctree b/docs/.doctrees/recipes/torchscript_inference.doctree index 94f49bd..8ae9b5b 100644 Binary files a/docs/.doctrees/recipes/torchscript_inference.doctree and b/docs/.doctrees/recipes/torchscript_inference.doctree differ diff --git a/docs/_downloads/1bba1c0153db192997cdb32f9c312b2c/reasoning_about_shapes.ipynb b/docs/_downloads/1bba1c0153db192997cdb32f9c312b2c/reasoning_about_shapes.ipynb index 430b711..845367e 100644 --- a/docs/_downloads/1bba1c0153db192997cdb32f9c312b2c/reasoning_about_shapes.ipynb +++ b/docs/_downloads/1bba1c0153db192997cdb32f9c312b2c/reasoning_about_shapes.ipynb @@ -15,7 +15,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# Reasoning about Shapes in PyTorch\n\nWhen writing models with PyTorch, it is commonly the case that the parameters\nto a given layer depend on the shape of the output of the previous layer. For\nexample, the ``in_features`` of an ``nn.Linear`` layer must match the\n``size(-1)`` of the input. For some layers, the shape computation involves\ncomplex equations, for example convolution operations.\n\nOne way around this is to run the forward pass with random inputs, but this is\nwasteful in terms of memory and compute.\n\nInstead, we can make use of the ``meta`` device to determine the output shapes\nof a layer without materializing any data.\n" + "\n# \u5728PyTorch\u4e2d\u63a8\u7406\u5f62\u72b6\n\n\u5728\u4f7f\u7528PyTorch\u7f16\u5199\u6a21\u578b\u65f6,\u901a\u5e38\u4f1a\u9047\u5230\u67d0\u4e00\u5c42\u7684\u53c2\u6570\u53d6\u51b3\u4e8e\u524d\u4e00\u5c42\u8f93\u51fa\u7684\u5f62\u72b6\u7684\u60c5\u51b5\u3002\u4f8b\u5982,\n``nn.Linear``\u5c42\u7684``in_features``\u5fc5\u987b\u4e0e\u8f93\u5165\u7684``size(-1)``\u76f8\u5339\u914d\u3002\u5bf9\u4e8e\u67d0\u4e9b\u5c42,\u5f62\u72b6\u8ba1\u7b97\u6d89\u53ca\u590d\u6742\u7684\u7b49\u5f0f,\u4f8b\u5982\u5377\u79ef\u8fd0\u7b97\u3002\n\n\u4e00\u79cd\u89e3\u51b3\u65b9\u6cd5\u662f\u4f7f\u7528\u968f\u673a\u8f93\u5165\u8fdb\u884c\u524d\u5411\u4f20\u64ad,\u4f46\u8fd9\u5728\u5185\u5b58\u548c\u8ba1\u7b97\u65b9\u9762\u662f\u6d6a\u8d39\u7684\u3002\n\n\u76f8\u53cd,\u6211\u4eec\u53ef\u4ee5\u4f7f\u7528``meta``\u8bbe\u5907\u6765\u786e\u5b9a\u5c42\u7684\u8f93\u51fa\u5f62\u72b6,\u800c\u65e0\u9700\u5b9e\u9645\u5316\u4efb\u4f55\u6570\u636e\u3002\n" ] }, { @@ -26,14 +26,14 @@ }, "outputs": [], "source": [ - "import torch\nimport timeit\n\nt = torch.rand(2, 3, 10, 10, device=\"meta\")\nconv = torch.nn.Conv2d(3, 5, 2, device=\"meta\")\nstart = timeit.default_timer()\nout = conv(t)\nend = timeit.default_timer()\n\nprint(out)\nprint(f\"Time taken: {end-start}\")" + "import timeit\n\nimport torch\n\nt = torch.rand(2, 3, 10, 10, device=\"meta\")\nconv = torch.nn.Conv2d(3, 5, 2, device=\"meta\")\nstart = timeit.default_timer()\nout = conv(t)\nend = timeit.default_timer()\n\nprint(out)\nprint(f\"\u6240\u9700\u65f6\u95f4: {end-start}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Observe that since data is not 
materialized, passing arbitrarily large\ninputs will not significantly alter the time taken for shape computation.\n\n" + "\u89c2\u5bdf\u5230,\u7531\u4e8e\u6ca1\u6709\u5b9e\u9645\u5316\u6570\u636e,\u5373\u4f7f\u4f20\u5165\u4efb\u610f\u5927\u7684\u8f93\u5165,\u7528\u4e8e\u5f62\u72b6\u8ba1\u7b97\u7684\u65f6\u95f4\u4e5f\u4e0d\u4f1a\u663e\u8457\u6539\u53d8\u3002\n\n" ] }, { @@ -44,14 +44,14 @@ }, "outputs": [], "source": [ - "t_large = torch.rand(2**10, 3, 2**16, 2**16, device=\"meta\")\nstart = timeit.default_timer()\nout = conv(t_large)\nend = timeit.default_timer()\n\nprint(out)\nprint(f\"Time taken: {end-start}\")" + "t_large = torch.rand(2**10, 3, 2**16, 2**16, device=\"meta\")\nstart = timeit.default_timer()\nout = conv(t_large)\nend = timeit.default_timer()\n\nprint(out)\nprint(f\"\u6240\u9700\u65f6\u95f4: {end-start}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Consider an arbitrary network such as the following:\n\n" + "\u8003\u8651\u4ee5\u4e0b\u4efb\u610f\u7f51\u7edc:\n\n" ] }, { @@ -62,14 +62,14 @@ }, "outputs": [], "source": [ - "import torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass Net(nn.Module):\n def __init__(self):\n super().__init__()\n self.conv1 = nn.Conv2d(3, 6, 5)\n self.pool = nn.MaxPool2d(2, 2)\n self.conv2 = nn.Conv2d(6, 16, 5)\n self.fc1 = nn.Linear(16 * 5 * 5, 120)\n self.fc2 = nn.Linear(120, 84)\n self.fc3 = nn.Linear(84, 10)\n\n def forward(self, x):\n x = self.pool(F.relu(self.conv1(x)))\n x = self.pool(F.relu(self.conv2(x)))\n x = torch.flatten(x, 1) # flatten all dimensions except batch\n x = F.relu(self.fc1(x))\n x = F.relu(self.fc2(x))\n x = self.fc3(x)\n return x" + "import torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass Net(nn.Module):\n def __init__(self):\n super().__init__()\n self.conv1 = nn.Conv2d(3, 6, 5)\n self.pool = nn.MaxPool2d(2, 2)\n self.conv2 = nn.Conv2d(6, 16, 5)\n self.fc1 = nn.Linear(16 * 5 * 5, 120)\n self.fc2 = nn.Linear(120, 84)\n self.fc3 = nn.Linear(84, 10)\n\n def forward(self, x):\n x = self.pool(F.relu(self.conv1(x)))\n x = self.pool(F.relu(self.conv2(x)))\n x = torch.flatten(x, 1) # \u5c55\u5e73\u9664\u6279\u6b21\u7ef4\u5ea6\u5916\u7684\u6240\u6709\u7ef4\u5ea6\n x = F.relu(self.fc1(x))\n x = F.relu(self.fc2(x))\n x = self.fc3(x)\n return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can view the intermediate shapes within an entire network by registering a\nforward hook to each layer that prints the shape of the output.\n\n" + "\u6211\u4eec\u53ef\u4ee5\u901a\u8fc7\u4e3a\u6bcf\u4e00\u5c42\u6ce8\u518c\u4e00\u4e2a\u524d\u5411\u94a9\u5b50\u6765\u6253\u5370\u8f93\u51fa\u7684\u5f62\u72b6,\u4ece\u800c\u67e5\u770b\u6574\u4e2a\u7f51\u7edc\u4e2d\u95f4\u5c42\u7684\u5f62\u72b6\u3002\n\n" ] }, { @@ -80,7 +80,7 @@ }, "outputs": [], "source": [ - "def fw_hook(module, input, output):\n print(f\"Shape of output to {module} is {output.shape}.\")\n\n\n# Any tensor created within this torch.device context manager will be\n# on the meta device.\nwith torch.device(\"meta\"):\n net = Net()\n inp = torch.randn((1024, 3, 32, 32))\n\nfor name, layer in net.named_modules():\n layer.register_forward_hook(fw_hook)\n\nout = net(inp)" + "def fw_hook(module, input, output):\n print(f\"{module}\u7684\u8f93\u51fa\u5f62\u72b6\u4e3a{output.shape}\u3002\")\n\n\n# \u5728\u6b64torch.device\u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\u4e2d\u521b\u5efa\u7684\u4efb\u4f55\u5f20\u91cf\u90fd\u5c06\u5728meta\u8bbe\u5907\u4e0a\u3002\nwith torch.device(\"meta\"):\n net = Net()\n inp = torch.randn((1024, 3, 32, 
32))\n\nfor name, layer in net.named_modules():\n layer.register_forward_hook(fw_hook)\n\nout = net(inp)" ] } ], diff --git a/docs/_downloads/41526f38c5c72d94f024660d73cef185/torch_logs.ipynb b/docs/_downloads/41526f38c5c72d94f024660d73cef185/torch_logs.ipynb index b0cf651..bc9dd66 100644 --- a/docs/_downloads/41526f38c5c72d94f024660d73cef185/torch_logs.ipynb +++ b/docs/_downloads/41526f38c5c72d94f024660d73cef185/torch_logs.ipynb @@ -15,7 +15,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# (beta) Using TORCH_LOGS python API with torch.compile\n**Author:** [Michael Lazos](https://github.com/mlazos)\n" + "\n# (Beta) \u4f7f\u7528 TORCH_LOGS python API \u4e0e torch.compile\n**\u4f5c\u8005:** [Michael Lazos](https://github.com/mlazos)\n" ] }, { @@ -33,14 +33,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This tutorial introduces the ``TORCH_LOGS`` environment variable, as well as the Python API, and\ndemonstrates how to apply it to observe the phases of ``torch.compile``.\n\n

<div class=\"alert alert-info\"><h4>Note</h4><p>This tutorial requires PyTorch 2.2.0 or later.</p></div>
\n\n\n\n" + "\u672c\u6559\u7a0b\u4ecb\u7ecd\u4e86 ``TORCH_LOGS`` \u73af\u5883\u53d8\u91cf\u4ee5\u53ca Python API,\u5e76\u6f14\u793a\u4e86\u5982\u4f55\u5c06\u5176\u5e94\u7528\u4e8e\u89c2\u5bdf ``torch.compile`` \u7684\u5404\u4e2a\u9636\u6bb5\u3002\n\n

<div class=\"alert alert-info\"><h4>Note</h4><p>\u672c\u6559\u7a0b\u9700\u8981 PyTorch 2.2.0 \u6216\u66f4\u9ad8\u7248\u672c\u3002</p></div>
\n\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Setup\nIn this example, we'll set up a simple Python function which performs an elementwise\nadd and observe the compilation process with ``TORCH_LOGS`` Python API.\n\n
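As a minimal sketch of the two equivalent routes mentioned in the Setup text above (assuming PyTorch 2.2 or later and a platform where ``torch.compile`` is supported; ``my_script.py`` is a placeholder name, not a file from this diff):

.. code-block:: python

    # Shell route, set before launching the interpreter (my_script.py is a placeholder):
    #   TORCH_LOGS="+dynamo,graph" python my_script.py
    #
    # Equivalent Python API route:
    import logging

    import torch

    torch._logging.set_logs(dynamo=logging.DEBUG, graph=True)

    @torch.compile
    def add_one(x):
        return x + 1

    add_one(torch.ones(4))  # logs are emitted during the first (compiling) call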

<div class=\"alert alert-info\"><h4>Note</h4><p>There is also an environment variable ``TORCH_LOGS``, which can be used to\n change logging settings at the command line. The equivalent environment\n variable setting is shown for each example.</p></div>
\n\n" + "## \u8bbe\u7f6e\n\u5728\u8fd9\u4e2a\u4f8b\u5b50\u4e2d,\u6211\u4eec\u5c06\u8bbe\u7f6e\u4e00\u4e2a\u7b80\u5355\u7684 Python \u51fd\u6570,\u6267\u884c\u5143\u7d20\u7ea7\u52a0\u6cd5,\u5e76\u4f7f\u7528 ``TORCH_LOGS`` Python API \u89c2\u5bdf\u7f16\u8bd1\u8fc7\u7a0b\u3002\n\n

<div class=\"alert alert-info\"><h4>Note</h4><p>\u8fd8\u6709\u4e00\u4e2a\u73af\u5883\u53d8\u91cf ``TORCH_LOGS``,\u53ef\u7528\u4e8e\u5728\u547d\u4ee4\u884c\u4e2d\u66f4\u6539\u65e5\u5fd7\u8bbe\u7f6e\u3002\u6bcf\u4e2a\u793a\u4f8b\u90fd\u663e\u793a\u4e86\u7b49\u6548\u7684\u73af\u5883\u53d8\u91cf\u8bbe\u7f6e\u3002</p></div>
\n\n" ] }, { @@ -51,14 +51,14 @@ }, "outputs": [], "source": [ - "import torch\n\n# exit cleanly if we are on a device that doesn't support torch.compile\nif torch.cuda.get_device_capability() < (7, 0):\n print(\"Skipping because torch.compile is not supported on this device.\")\nelse:\n @torch.compile()\n def fn(x, y):\n z = x + y\n return z + 2\n\n\n inputs = (torch.ones(2, 2, device=\"cuda\"), torch.zeros(2, 2, device=\"cuda\"))\n\n\n# print separator and reset dynamo\n# between each example\n def separator(name):\n print(f\"==================={name}=========================\")\n torch._dynamo.reset()\n\n\n separator(\"Dynamo Tracing\")\n# View dynamo tracing\n# TORCH_LOGS=\"+dynamo\"\n torch._logging.set_logs(dynamo=logging.DEBUG)\n fn(*inputs)\n\n separator(\"Traced Graph\")\n# View traced graph\n# TORCH_LOGS=\"graph\"\n torch._logging.set_logs(graph=True)\n fn(*inputs)\n\n separator(\"Fusion Decisions\")\n# View fusion decisions\n# TORCH_LOGS=\"fusion\"\n torch._logging.set_logs(fusion=True)\n fn(*inputs)\n\n separator(\"Output Code\")\n# View output code generated by inductor\n# TORCH_LOGS=\"output_code\"\n torch._logging.set_logs(output_code=True)\n fn(*inputs)\n\n separator(\"\")" + "import torch\n\n# \u5982\u679c\u8bbe\u5907\u4e0d\u652f\u6301 torch.compile,\u5219\u5e72\u51c0\u5730\u9000\u51fa\nif torch.cuda.get_device_capability() < (7, 0):\n print(\"\u8df3\u8fc7,\u56e0\u4e3a\u6b64\u8bbe\u5907\u4e0d\u652f\u6301 torch.compile\u3002\")\nelse:\n\n @torch.compile()\n def fn(x, y):\n z = x + y\n return z + 2\n\n inputs = (torch.ones(2, 2, device=\"cuda\"), torch.zeros(2, 2, device=\"cuda\"))\n\n # \u5728\u6bcf\u4e2a\u793a\u4f8b\u4e4b\u95f4\u6253\u5370\u5206\u9694\u7b26\u5e76\u91cd\u7f6e dynamo\n def separator(name):\n print(f\"==================={name}=========================\")\n torch._dynamo.reset()\n\n separator(\"Dynamo \u8ddf\u8e2a\")\n # \u67e5\u770b dynamo \u8ddf\u8e2a\n # TORCH_LOGS=\"+dynamo\"\n torch._logging.set_logs(dynamo=logging.DEBUG)\n fn(*inputs)\n\n separator(\"\u8ddf\u8e2a\u7684\u56fe\u5f62\")\n # \u67e5\u770b\u8ddf\u8e2a\u7684\u56fe\u5f62\n # TORCH_LOGS=\"graph\"\n torch._logging.set_logs(graph=True)\n fn(*inputs)\n\n separator(\"\u878d\u5408\u51b3\u7b56\")\n # \u67e5\u770b\u878d\u5408\u51b3\u7b56\n # TORCH_LOGS=\"fusion\"\n torch._logging.set_logs(fusion=True)\n fn(*inputs)\n\n separator(\"\u8f93\u51fa\u4ee3\u7801\")\n # \u67e5\u770b inductor \u751f\u6210\u7684\u8f93\u51fa\u4ee3\u7801\n # TORCH_LOGS=\"output_code\"\n torch._logging.set_logs(output_code=True)\n fn(*inputs)\n\n separator(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Conclusion\n\nIn this tutorial we introduced the TORCH_LOGS environment variable and python API\nby experimenting with a small number of the available logging options.\nTo view descriptions of all available options, run any python script\nwhich imports torch and set TORCH_LOGS to \"help\".\n\nAlternatively, you can view the `torch._logging documentation`_ to see\ndescriptions of all available logging options.\n\nFor more information on torch.compile, see the `torch.compile tutorial`_.\n\n\n" + "## \u7ed3\u8bba\n\n\u5728\u672c\u6559\u7a0b\u4e2d,\u6211\u4eec\u4ecb\u7ecd\u4e86 TORCH_LOGS \u73af\u5883\u53d8\u91cf\u548c python API,\u5e76\u901a\u8fc7\u5b9e\u9a8c\u4e86\u4e00\u5c0f\u90e8\u5206\u53ef\u7528\u7684\u65e5\u5fd7\u9009\u9879\u3002\n\u8981\u67e5\u770b\u6240\u6709\u53ef\u7528\u9009\u9879\u7684\u63cf\u8ff0,\u8bf7\u8fd0\u884c\u4efb\u4f55\u5bfc\u5165 torch \u7684 python \u811a\u672c,\u5e76\u5c06 TORCH_LOGS 
\u8bbe\u7f6e\u4e3a \"help\"\u3002\n\n\u6216\u8005,\u60a8\u53ef\u4ee5\u67e5\u770b `torch._logging \u6587\u6863`_ \u4ee5\u67e5\u770b\u6240\u6709\u53ef\u7528\u65e5\u5fd7\u9009\u9879\u7684\u63cf\u8ff0\u3002\n\n\u6709\u5173 torch.compile \u7684\u66f4\u591a\u4fe1\u606f,\u8bf7\u53c2\u9605 `torch.compile \u6559\u7a0b`_\u3002\n\n\n" ] } ], diff --git a/docs/_downloads/46064f5dec95799fe5460a89db85ffdd/dynamic_quantization.py b/docs/_downloads/46064f5dec95799fe5460a89db85ffdd/dynamic_quantization.py index eb9605d..2777025 100644 --- a/docs/_downloads/46064f5dec95799fe5460a89db85ffdd/dynamic_quantization.py +++ b/docs/_downloads/46064f5dec95799fe5460a89db85ffdd/dynamic_quantization.py @@ -1,243 +1,179 @@ """ -Dynamic Quantization +动态量化 ==================== -In this recipe you will see how to take advantage of Dynamic -Quantization to accelerate inference on an LSTM-style recurrent neural -network. This reduces the size of the model weights and speeds up model -execution. +在这个示例中,您将看到如何利用动态量化来加速 LSTM 风格的循环神经网络的推理。这可以减小模型权重的大小,并加快模型执行速度。 -Introduction +介绍 ------------- -There are a number of trade-offs that can be made when designing neural -networks. During model development and training you can alter the -number of layers and number of parameters in a recurrent neural network -and trade-off accuracy against model size and/or model latency or -throughput. Such changes can take lot of time and compute resources -because you are iterating over the model training. Quantization gives -you a way to make a similar trade off between performance and model -accuracy with a known model after training is completed. +在设计神经网络时,可以做出多种权衡。在模型开发和训练期间,您可以改变循环神经网络中的层数和参数数量,在模型大小和/或模型延迟或吞吐量与精度之间进行权衡。由于您需要重复模型训练过程,因此这种改变需要大量的时间和计算资源。量化为您提供了一种在已知模型上在性能和模型精度之间进行权衡的方式,而无需重新训练模型。 -You can give it a try in a single session and you will certainly reduce -your model size significantly and may get a significant latency -reduction without losing a lot of accuracy. +您可以在单个会话中尝试一下,您肯定会显著减小模型大小,并可能在不会损失太多精度的情况下获得显著的延迟减少。 -What is dynamic quantization? +什么是动态量化? ----------------------------- -Quantizing a network means converting it to use a reduced precision -integer representation for the weights and/or activations. This saves on -model size and allows the use of higher throughput math operations on -your CPU or GPU. +量化网络意味着将其转换为使用较低精度的整数表示形式来表示权重和/或激活。这可以减小模型大小,并允许在 CPU 或 GPU 上使用更高吞吐量的数学运算。 -When converting from floating point to integer values you are -essentially multiplying the floating point value by some scale factor -and rounding the result to a whole number. The various quantization -approaches differ in the way they approach determining that scale -factor. +从浮点数转换为整数值时,您实际上是将浮点数乘以某个比例因子,然后将结果舍入为整数。不同的量化方法在确定该比例因子的方式上有所不同。 -The key idea with dynamic quantization as described here is that we are -going to determine the scale factor for activations dynamically based on -the data range observed at runtime. This ensures that the scale factor -is "tuned" so that as much signal as possible about each observed -dataset is preserved. +这里介绍的动态量化的关键思想是,我们将根据运行时观察到的数据范围动态确定激活的比例因子。这可确保比例因子被"调整"为尽可能保留每个观察到的数据集的信号。 -The model parameters on the other hand are known during model conversion -and they are converted ahead of time and stored in INT8 form. +另一方面,模型参数在模型转换期间是已知的,它们会提前转换并以 INT8 形式存储。 -Arithmetic in the quantized model is done using vectorized INT8 -instructions. Accumulation is typically done with INT16 or INT32 to -avoid overflow. 
This higher precision value is scaled back to INT8 if -the next layer is quantized or converted to FP32 for output. +量化模型中的算术运算使用矢量化的 INT8 指令完成。累加通常使用 INT16 或 INT32 来避免溢出。如果下一层是量化的,则将此较高精度值缩放回 INT8;如果是输出,则将其转换为 FP32。 -Dynamic quantization is relatively free of tuning parameters which makes -it well suited to be added into production pipelines as a standard part -of converting LSTM models to deployment. +动态量化相对来说没有太多需要调整的参数,因此非常适合作为将 LSTM 模型转换为部署的标准部分添加到生产管道中。 +.. note:: + 本示例中采用的方法的局限性 + 本示例提供了对 PyTorch 中动态量化功能的快速介绍,以及使用它的工作流程。我们的重点是解释用于转换模型的特定函数。为了简洁和清晰,我们做出了一些重大简化,包括: -.. note:: - Limitations on the approach taken here - - - This recipe provides a quick introduction to the dynamic quantization - features in PyTorch and the workflow for using it. Our focus is on - explaining the specific functions used to convert the model. We will - make a number of significant simplifications in the interest of brevity - and clarity - - -1. You will start with a minimal LSTM network -2. You are simply going to initialize the network with a random hidden - state -3. You are going to test the network with random inputs -4. You are not going to train the network in this tutorial -5. You will see that the quantized form of this network is smaller and - runs faster than the floating point network we started with -6. You will see that the output values are generally in the same - ballpark as the output of the FP32 network, but we are not - demonstrating here the expected accuracy loss on a real trained - network - -You will see how dynamic quantization is done and be able to see -suggestive reductions in memory use and latency times. Providing a -demonstration that the technique can preserve high levels of model -accuracy on a trained LSTM is left to a more advanced tutorial. If you -want to move right away to that more rigorous treatment please proceed -to the `advanced dynamic quantization -tutorial `__. - -Steps -------------- +1. 您将从一个最小的 LSTM 网络开始 +2. 您只需用随机隐藏状态初始化网络 +3. 您将使用随机输入来测试网络 +4. 您不会在本教程中训练网络 +5. 您将看到,与我们开始时的浮点网络相比,量化后的网络更小且运行速度更快 +6. 您将看到,量化网络产生的输出张量值与 FP32 网络输出的值在同一数量级,但我们并未在这里展示该技术在经过训练的 LSTM 上能够保留较高模型精度的情况 -This recipe has 5 steps. +您将了解如何进行动态量化,并能够看到内存使用和延迟时间的潜在减小。关于该技术在经过训练的 LSTM 上能够保留较高模型精度的演示,将留待更高级的教程。如果您想直接进入更严格的处理,请继续学习 `高级动态量化教程 `__。 -1. Set Up - Here you define a very simple LSTM, import modules, and establish - some random input tensors. +步骤 +------------- -2. Do the Quantization - Here you instantiate a floating point model and then create quantized - version of it. +本示例包含 5 个步骤。 -3. Look at Model Size - Here you show that the model size gets smaller. +1. 设置 - 在这里,您定义一个非常简单的 LSTM,导入模块,并建立一些随机输入张量。 -4. Look at Latency - Here you run the two models and compare model runtime (latency). +2. 执行量化 - 在这里,您实例化一个浮点模型,然后创建其量化版本。 -5. Look at Accuracy - Here you run the two models and compare outputs. +3. 查看模型大小 - 在这里,您显示模型大小变小了。 +4. 查看延迟 - 在这里,您运行两个模型并比较模型运行时间(延迟)。 -1: Set Up -~~~~~~~~~~~~~~~ -This is a straightforward bit of code to set up for the rest of the -recipe. +5. 查看精度 - 在这里,您运行两个模型并比较输出。 -The unique module we are importing here is torch.quantization which -includes PyTorch's quantized operators and conversion functions. We also -define a very simple LSTM model and set up some inputs. 
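The scale-factor idea described in the introduction above can be made concrete in a few lines. This is only an illustrative sketch of symmetric per-tensor quantization added for this write-up, not code from the recipe:

.. code-block:: python

    import torch

    w = torch.randn(4, 4)                 # FP32 values, e.g. a weight matrix
    scale = w.abs().max() / 127.0         # one scale factor for the whole tensor
    w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)
    w_back = w_int8.float() * scale       # dequantize to compare against the original
    print("max abs error:", (w - w_back).abs().max().item())

Dynamic quantization performs the analogous computation for activations at runtime, using the value ranges it actually observes.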
+1: 设置 +~~~~~~~~~~~~~~~ +这是一段直接的代码,用于为本示例的其余部分做准备。 +我们在这里导入的唯一模块是 torch.quantization,它包含了 PyTorch 的量化算子和转换函数。我们还定义了一个非常简单的 LSTM 模型,并设置了一些输入。 """ -# import the modules used here in this recipe -import torch -import torch.quantization -import torch.nn as nn +# 导入本示例中使用的模块 import copy import os import time -# define a very, very simple LSTM for demonstration purposes -# in this case, we are wrapping ``nn.LSTM``, one layer, no preprocessing or postprocessing -# inspired by -# `Sequence Models and Long Short-Term Memory Networks tutorial `__. +import torch +import torch.nn as nn +import torch.quantization + + +# 为演示目的定义一个非常简单的 LSTM +# 在这种情况下,我们只是包装了 ``nn.LSTM``、一层,没有预处理或后处理 +# 受到以下教程的启发: +# `序列模型和长短期记忆网络教程 `_, 作者 Robert Guthrie +# 和 `动态量化教程 `__。 class lstm_for_demonstration(nn.Module): - """Elementary Long Short Term Memory style model which simply wraps ``nn.LSTM`` - Not to be used for anything other than demonstration. - """ - def __init__(self,in_dim,out_dim,depth): - super(lstm_for_demonstration,self).__init__() - self.lstm = nn.LSTM(in_dim,out_dim,depth) + """基本的长短期记忆风格模型,只是包装了 ``nn.LSTM`` + 不应用于除演示之外的任何其他用途。 + """ + + def __init__(self, in_dim, out_dim, depth): + super(lstm_for_demonstration, self).__init__() + self.lstm = nn.LSTM(in_dim, out_dim, depth) - def forward(self,inputs,hidden): - out,hidden = self.lstm(inputs,hidden) - return out, hidden + def forward(self, inputs, hidden): + out, hidden = self.lstm(inputs, hidden) + return out, hidden -torch.manual_seed(29592) # set the seed for reproducibility +torch.manual_seed(29592) # 设置种子以获得可重复结果 -#shape parameters -model_dimension=8 -sequence_length=20 -batch_size=1 -lstm_depth=1 +# 形状参数 +model_dimension = 8 +sequence_length = 20 +batch_size = 1 +lstm_depth = 1 -# random data for input -inputs = torch.randn(sequence_length,batch_size,model_dimension) -# hidden is actually is a tuple of the initial hidden state and the initial cell state -hidden = (torch.randn(lstm_depth,batch_size,model_dimension), torch.randn(lstm_depth,batch_size,model_dimension)) +# 随机输入数据 +inputs = torch.randn(sequence_length, batch_size, model_dimension) +# hidden 实际上是初始隐藏状态和初始细胞状态的元组 +hidden = ( + torch.randn(lstm_depth, batch_size, model_dimension), + torch.randn(lstm_depth, batch_size, model_dimension), +) ###################################################################### -# 2: Do the Quantization +# 2: 执行量化 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# Now we get to the fun part. First we create an instance of the model -# called ``float\_lstm`` then we are going to quantize it. We're going to use -# the `torch.quantization.quantize_dynamic `__ function, which takes the model, then a list of the submodules -# which we want to -# have quantized if they appear, then the datatype we are targeting. This -# function returns a quantized version of the original model as a new -# module. +# 现在我们来执行有趣的部分。首先,我们创建一个名为 ``float_lstm`` 的模型实例,然后我们将对其进行量化。我们将使用 `torch.quantization.quantize_dynamic `__ 函数,它接受模型、我们希望量化的子模块列表(如果存在)以及目标数据类型。此函数返回原始模型的量化版本,作为一个新模块。 # -# That's all it takes. 
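As a quick aside for this write-up (not part of the recipe), the same one-call workflow applies to any module that contains the listed layer types. A minimal sketch with a plain ``nn.Linear`` stack, assuming a CPU build with a quantized engine available:

.. code-block:: python

    import torch
    import torch.nn as nn

    toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
    toy_q = torch.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)
    print(toy_q)  # the Linear layers are swapped for dynamically quantized versions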
+# 就这么简单。 # - # here is our floating point instance -float_lstm = lstm_for_demonstration(model_dimension, model_dimension,lstm_depth) +# 这是我们的浮点实例 +float_lstm = lstm_for_demonstration(model_dimension, model_dimension, lstm_depth) -# this is the call that does the work +# 这是执行量化的调用 quantized_lstm = torch.quantization.quantize_dynamic( float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8 ) -# show the changes that were made -print('Here is the floating point version of this module:') +# 显示所做的更改 +print("这是该模块的浮点版本:") print(float_lstm) -print('') -print('and now the quantized version:') +print("") +print("现在是量化版本:") print(quantized_lstm) ###################################################################### -# 3. Look at Model Size +# 3. 查看模型大小 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# We've quantized the model. What does that get us? Well the first -# benefit is that we've replaced the FP32 model parameters with INT8 -# values (and some recorded scale factors). This means about 75% less data -# to store and move around. With the default values the reduction shown -# below will be less than 75% but if you increase the model size above -# (for example you can set model dimension to something like 80) this will -# converge towards 4x smaller as the stored model size dominated more and -# more by the parameter values. +# 我们已经量化了模型。这给我们带来了什么好处?好处之一是我们用 INT8 值(和一些记录的比例因子)替换了 FP32 模型参数。这意味着存储和移动数据的大小减小了约 75%。使用默认值时,下面显示的减小量将小于 75%,但如果您将模型大小增加到更大值(例如将 model_dimension 设置为 80),随着存储的模型大小越来越多地由参数值主导,减小量将趋近于 4 倍。 # + def print_size_of_model(model, label=""): torch.save(model.state_dict(), "temp.p") - size=os.path.getsize("temp.p") - print("model: ",label,' \t','Size (KB):', size/1e3) - os.remove('temp.p') + size = os.path.getsize("temp.p") + print("模型: ", label, " \t", "大小 (KB):", size / 1e3) + os.remove("temp.p") return size -# compare the sizes -f=print_size_of_model(float_lstm,"fp32") -q=print_size_of_model(quantized_lstm,"int8") -print("{0:.2f} times smaller".format(f/q)) + +# 比较大小 +f = print_size_of_model(float_lstm, "fp32") +q = print_size_of_model(quantized_lstm, "int8") +print("{0:.2f} 倍更小".format(f / q)) ###################################################################### -# 4. Look at Latency +# 4. 查看延迟 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# The second benefit is that the quantized model will typically run -# faster. This is due to a combinations of effects including at least: +# 第二个好处是量化模型通常会运行得更快。这是由于多种效果的组合,至少包括: # -# 1. Less time spent moving parameter data in -# 2. Faster INT8 operations +# 1. 减少了移动参数数据所花费的时间 +# 2. INT8 操作更快 # -# As you will see the quantized version of this super-simple network runs -# faster. This will generally be true of more complex networks but as they -# say "your mileage may vary" depending on a number of factors including -# the structure of the model and the hardware you are running on. +# 如您所见,这个超级简单的网络的量化版本运行速度更快。对于更复杂的网络通常也是如此,但正如他们所说,"您的里程可能会有所不同",这取决于许多因素,包括模型的结构和您运行的硬件。 # -# compare the performance -print("Floating point FP32") +# 比较性能 +print("浮点 FP32") ##################################################################### # .. code-block:: python # # %timeit float_lstm.forward(inputs, hidden) -print("Quantized INT8") +print("量化 INT8") ###################################################################### # .. 
code-block:: python @@ -246,49 +182,45 @@ def print_size_of_model(model, label=""): ###################################################################### -# 5: Look at Accuracy +# 5: 查看精度 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# We are not going to do a careful look at accuracy here because we are -# working with a randomly initialized network rather than a properly -# trained one. However, I think it is worth quickly showing that the -# quantized network does produce output tensors that are "in the same -# ballpark" as the original one. +# 我们不会在这里仔细查看精度,因为我们使用的是随机初始化的网络,而不是经过正确训练的网络。但是,我认为值得快速展示一下量化网络确实产生了与原始网络"同一数量级"的输出张量值。 # -# For a more detailed analysis please see the more advanced tutorials -# referenced at the end of this recipe. +# 有关更详细的分析,请参阅本示例结尾处引用的更高级教程。 # -# run the float model +# 运行浮点模型 out1, hidden1 = float_lstm(inputs, hidden) mag1 = torch.mean(abs(out1)).item() -print('mean absolute value of output tensor values in the FP32 model is {0:.5f} '.format(mag1)) +print("FP32 模型中输出张量值的绝对值均值为 {0:.5f} ".format(mag1)) -# run the quantized model +# 运行量化模型 out2, hidden2 = quantized_lstm(inputs, hidden) mag2 = torch.mean(abs(out2)).item() -print('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2)) - -# compare them -mag3 = torch.mean(abs(out1-out2)).item() -print('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3,mag3/mag1*100)) +print("INT8 模型中输出张量值的绝对值均值为 {0:.5f}".format(mag2)) + +# 比较它们 +mag3 = torch.mean(abs(out1 - out2)).item() +print( + "输出张量之间差值的绝对值均值为 {0:.5f},或占 {1:.2f} 百分比".format( + mag3, mag3 / mag1 * 100 + ) +) ###################################################################### -# Learn More +# 了解更多 # ------------ -# We've explained what dynamic quantization is, what benefits it brings, -# and you have used the ``torch.quantization.quantize_dynamic()`` function -# to quickly quantize a simple LSTM model. +# 我们已经解释了什么是动态量化,它带来了什么好处,您已经使用 ``torch.quantization.quantize_dynamic()`` 函数快速量化了一个简单的 LSTM 模型。 # -# This was a fast and high level treatment of this material; for more -# detail please continue learning with `(beta) Dynamic Quantization on an LSTM Word Language Model Tutorial `_. +# 这是对该材料的快速和高级处理;要了解更多详细信息,请继续学习 `(beta) 动态量化 LSTM 词语言模型教程 `_。 # # -# Additional Resources +# 其他资源 # -------------------- # -# * `Quantization API Documentaion `_ -# * `(beta) Dynamic Quantization on BERT `_ -# * `(beta) Dynamic Quantization on an LSTM Word Language Model `_ -# * `Introduction to Quantization on PyTorch `_ +# * `量化 API 文档 `_ +# * `(beta) 动态量化 BERT `_ +# * `(beta) 动态量化 LSTM 词语言模型 `_ +# * `PyTorch 量化介绍 `_ # diff --git a/docs/_downloads/54db51700fabe094cbf7f11f5195d2bd/benchmark.ipynb b/docs/_downloads/54db51700fabe094cbf7f11f5195d2bd/benchmark.ipynb index c40ea1a..d81085b 100644 --- a/docs/_downloads/54db51700fabe094cbf7f11f5195d2bd/benchmark.ipynb +++ b/docs/_downloads/54db51700fabe094cbf7f11f5195d2bd/benchmark.ipynb @@ -15,14 +15,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# PyTorch Benchmark\nThis recipe provides a quick-start guide to using PyTorch\n``benchmark`` module to measure and compare code performance.\n\n## Introduction\nBenchmarking is an important step in writing code. It helps\nus validate that our code meets performance expectations,\ncompare different approaches to solving the same problem and\nprevent performance regressions.\n\nThere are many options when it comes to benchmarking PyTorch code\nincluding the Python builtin ``timeit`` module. 
However, benchmarking\nPyTorch code has many caveats that can be easily overlooked such as\nmanaging the number of threads and synchronizing CUDA devices. Moreover,\ngenerating Tensor inputs for benchmarking can be quite tedious.\n\nThis recipe demonstrates how to use PyTorch ``benchmark`` module to avoid\ncommon mistakes while making it easier to compare performance of\ndifferent code, generate input for benchmarking and more.\n\n## Setup\nBefore we begin, install ``torch`` if it isn\u2019t already available.\n\n::\n\n pip install torch\n" + "\n# PyTorch Benchmark\n\u672c\u6559\u7a0b\u63d0\u4f9b\u4e86\u4f7f\u7528 PyTorch ``benchmark`` \u6a21\u5757\u6765\u6d4b\u91cf\u548c\u6bd4\u8f83\u4ee3\u7801\u6027\u80fd\u7684\u5feb\u901f\u5165\u95e8\u6307\u5357\u3002\n\n## \u4ecb\u7ecd\n\u57fa\u51c6\u6d4b\u8bd5\u662f\u7f16\u5199\u4ee3\u7801\u65f6\u7684\u4e00\u4e2a\u91cd\u8981\u6b65\u9aa4\u3002\u5b83\u5e2e\u52a9\u6211\u4eec\u9a8c\u8bc1\u4ee3\u7801\u662f\u5426\u6ee1\u8db3\u6027\u80fd\u9884\u671f,\u6bd4\u8f83\u89e3\u51b3\u540c\u4e00\u95ee\u9898\u7684\u4e0d\u540c\u65b9\u6cd5,\u5e76\u9632\u6b62\u6027\u80fd\u88c2\u5316\u3002\n\n\u5bf9\u4e8e\u57fa\u51c6\u6d4b\u8bd5 PyTorch \u4ee3\u7801\u6709\u8bb8\u591a\u9009\u62e9,\u5305\u62ec Python \u5185\u7f6e\u7684 ``timeit`` \u6a21\u5757\u3002\n\u7136\u800c,\u57fa\u51c6\u6d4b\u8bd5 PyTorch \u4ee3\u7801\u6709\u8bb8\u591a\u5bb9\u6613\u88ab\u5ffd\u89c6\u7684\u6ce8\u610f\u4e8b\u9879,\u4f8b\u5982\u7ba1\u7406\u7ebf\u7a0b\u6570\u91cf\u548c\u540c\u6b65 CUDA \u8bbe\u5907\u3002\n\u6b64\u5916,\u4e3a\u57fa\u51c6\u6d4b\u8bd5\u751f\u6210\u5f20\u91cf\u8f93\u5165\u53ef\u80fd\u76f8\u5f53\u7e41\u7410\u3002\n\n\u672c\u6559\u7a0b\u6f14\u793a\u4e86\u5982\u4f55\u4f7f\u7528 PyTorch ``benchmark`` \u6a21\u5757\u6765\u907f\u514d\u5e38\u89c1\u9519\u8bef,\u540c\u65f6\u66f4\u5bb9\u6613\u6bd4\u8f83\u4e0d\u540c\u4ee3\u7801\u7684\u6027\u80fd\u3001\u4e3a\u57fa\u51c6\u6d4b\u8bd5\u751f\u6210\u8f93\u5165\u7b49\u3002\n\n## \u8bbe\u7f6e\n\u5728\u5f00\u59cb\u4e4b\u524d,\u5982\u679c\u5c1a\u672a\u5b89\u88c5 ``torch``,\u8bf7\u5148\u5b89\u88c5\u3002\n\n::\n\n pip install torch\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Steps\n\n1. Defining functions to benchmark\n2. Benchmarking with ``timeit.Timer``\n3. Benchmarking with ``torch.utils.benchmark.Timer``\n4. Benchmarking with ``Blocked Autorange``\n5. Comparing benchmark results\n6. Saving/Loading benchmark results\n7. Generating inputs with ``Fuzzed Parameters``\n8. Collecting instruction counts with ``Callgrind``\n\n### 1. Defining functions to benchmark\n\nAs of the time of this writing, [torch.dot](https://pytorch.org/docs/stable/generated/torch.dot.html?highlight=dot#torch.dot)_\ndoes not support batched mode, so we will compare two approaches to\nimplementing it using existing ``torch`` operators: one approach uses a\ncombination of ``mul`` and ``sum`` while the other reduces the problem to ``bmm``.\n\n\n" + "## \u5177\u4f53\u6b65\u9aa4\n\n1. \u5b9a\u4e49\u8981\u57fa\u51c6\u6d4b\u8bd5\u7684\u51fd\u6570\n2. \u4f7f\u7528 ``timeit.Timer`` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\n3. \u4f7f\u7528 ``torch.utils.benchmark.Timer`` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\n4. \u4f7f\u7528 ``Blocked Autorange`` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\n5. \u6bd4\u8f83\u57fa\u51c6\u6d4b\u8bd5\u7ed3\u679c\n6. \u4fdd\u5b58/\u52a0\u8f7d\u57fa\u51c6\u6d4b\u8bd5\u7ed3\u679c\n7. \u4f7f\u7528 ``Fuzzed Parameters`` \u751f\u6210\u8f93\u5165\n8. \u4f7f\u7528 ``Callgrind`` \u6536\u96c6\u6307\u4ee4\u8ba1\u6570\n\n### 1. 
\u5b9a\u4e49\u8981\u57fa\u51c6\u6d4b\u8bd5\u7684\u51fd\u6570\n\n\u5728\u64b0\u5199\u672c\u6587\u65f6, [torch.dot](https://pytorch.org/docs/stable/generated/torch.dot.html?highlight=dot#torch.dot)_\n\u4e0d\u652f\u6301\u6279\u91cf\u6a21\u5f0f,\u56e0\u6b64\u6211\u4eec\u5c06\u6bd4\u8f83\u4f7f\u7528\u73b0\u6709 ``torch`` \u8fd0\u7b97\u7b26\u5b9e\u73b0\u5b83\u7684\u4e24\u79cd\u65b9\u6cd5:\u4e00\u79cd\u65b9\u6cd5\u4f7f\u7528 ``mul`` \u548c ``sum`` \u7684\u7ec4\u5408,\u53e6\u4e00\u79cd\u65b9\u6cd5\u4f7f\u7528 ``bmm``\u3002\n\n\n" ] }, { @@ -33,14 +33,14 @@ }, "outputs": [], "source": [ - "import torch\n\n\ndef batched_dot_mul_sum(a, b):\n '''Computes batched dot by multiplying and summing'''\n return a.mul(b).sum(-1)\n\n\ndef batched_dot_bmm(a, b):\n '''Computes batched dot by reducing to ``bmm``'''\n a = a.reshape(-1, 1, a.shape[-1])\n b = b.reshape(-1, b.shape[-1], 1)\n return torch.bmm(a, b).flatten(-3)\n\n\n# Input for benchmarking\nx = torch.randn(10000, 64)\n\n# Ensure that both functions compute the same output\nassert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))" + "import torch\n\n\ndef batched_dot_mul_sum(a, b):\n \"\"\"Computes batched dot by multiplying and summing\"\"\"\n return a.mul(b).sum(-1)\n\n\ndef batched_dot_bmm(a, b):\n \"\"\"Computes batched dot by reducing to ``bmm``\"\"\"\n a = a.reshape(-1, 1, a.shape[-1])\n b = b.reshape(-1, b.shape[-1], 1)\n return torch.bmm(a, b).flatten(-3)\n\n\n# Input for benchmarking\nx = torch.randn(10000, 64)\n\n# Ensure that both functions compute the same output\nassert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 2. Benchmarking with ``timeit.Timer``\n\nFirst, let's benchmark the code using Python's builtin ``timeit`` module.\nWe keep the benchmark code simple here so we can compare the defaults\nof ``timeit`` and ``torch.utils.benchmark``.\n\n\n" + "### 2. \u4f7f\u7528 ``timeit.Timer`` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\n\u9996\u5148,\u8ba9\u6211\u4eec\u4f7f\u7528 Python \u5185\u7f6e\u7684 ``timeit`` \u6a21\u5757\u5bf9\u4ee3\u7801\u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\u3002\n\u6211\u4eec\u5728\u8fd9\u91cc\u4fdd\u6301\u57fa\u51c6\u6d4b\u8bd5\u4ee3\u7801\u7b80\u5355,\u4ee5\u4fbf\u6211\u4eec\u53ef\u4ee5\u6bd4\u8f83 ``timeit`` \u548c ``torch.utils.benchmark`` \u7684\u9ed8\u8ba4\u8bbe\u7f6e\u3002\n\n\n" ] }, { @@ -51,7 +51,7 @@ }, "outputs": [], "source": [ - "import timeit\n\nt0 = timeit.Timer(\n stmt='batched_dot_mul_sum(x, x)', \n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x})\n\nt1 = timeit.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x})\n\nprint(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')\nprint(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')" + "import timeit\n\nt0 = timeit.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n)\n\nt1 = timeit.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n)\n\nprint(f\"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us\")\nprint(f\"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us\")" ] }, { @@ -65,7 +65,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 3. Benchmarking with ``torch.utils.benchmark.Timer``\n\nPyTorch ``benchmark`` module was designed to be familiar to those who\nhave used the ``timeit`` module before. 
However, its defaults make it\neasier and safer to use for benchmarking PyTorch code. Let's first\ncompare the same basic API as above.\n\n\n" + "### 3. \u4f7f\u7528 ``torch.utils.benchmark.Timer`` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\nPyTorch ``benchmark``\u6a21\u5757\u7684\u8bbe\u8ba1\u4f7f\u5f97\u5bf9\u4e8e\u90a3\u4e9b\u66fe\u7ecf\u4f7f\u7528\u8fc7 ``timeit`` \u6a21\u5757\u7684\u4eba\u6765\u8bf4,\u5b83\u770b\u8d77\u6765\u5f88\u719f\u6089\u3002\n\u7136\u800c,\u5b83\u7684\u9ed8\u8ba4\u8bbe\u7f6e\u4f7f\u5f97\u5b83\u66f4\u5bb9\u6613\u4e14\u66f4\u5b89\u5168\u5730\u7528\u4e8e\u5bf9 PyTorch \u4ee3\u7801\u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\u3002\n\u9996\u5148\u8ba9\u6211\u4eec\u5bf9\u6bd4\u4e00\u4e0b\u57fa\u672cAPI\u7684\u4f7f\u7528\u3002\n\n" ] }, { @@ -76,7 +76,7 @@ }, "outputs": [], "source": [ - "import torch.utils.benchmark as benchmark\n\nt0 = benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)', \n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x})\n\nt1 = benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x})\n\nprint(t0.timeit(100))\nprint(t1.timeit(100))" + "import torch.utils.benchmark as benchmark\n\nt0 = benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n)\n\nt1 = benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n)\n\nprint(t0.timeit(100))\nprint(t1.timeit(100))" ] }, { @@ -90,7 +90,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Even though the APIs are the same for the basic functionality, there\nare some important differences. ``benchmark.Timer.timeit()`` returns the\ntime per run as opposed to the total runtime like ``timeit.Timer.timeit()``\ndoes. 
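To make the per-run semantics concrete, here is a small editorial sketch of inspecting the ``Measurement`` object that ``benchmark.Timer.timeit()`` returns, reusing ``t0`` from the cell above; the printed values are illustrative, not output from the tutorial:

.. code-block:: python

    m = t0.timeit(100)         # a torch.utils.benchmark.Measurement
    print(m.mean, m.median)    # per-run statistics, in seconds
    print(m.times)             # underlying raw timing samples (a list of seconds)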
PyTorch ``benchmark`` module also provides formatted string\nrepresentations for printing the results.\n\nAnother important difference, and the reason why the results diverge\nis that PyTorch benchmark module runs in a single thread by default.\nWe can change the number of threads with the ``num_threads`` argument.\n\n``torch.utils.benchmark.Timer`` takes several additional arguments\nincluding: ``label``, ``sub_label``, ``description`` and ``env`` which change\nthe __repr__ of the measurement object returned and are used for\ngrouping the results (more on this later).\n\n\n" + "\u867d\u7136\u57fa\u672c\u529f\u80fd\u7684API\u662f\u76f8\u540c\u7684,\u4f46\u662f\u8fd8\u662f\u6709\u4e00\u4e9b\u91cd\u8981\u7684\u533a\u522b\u3002\n``benchmark.Timer.timeit()``\u8fd4\u56de\u7684\u662f\u6bcf\u6b21\u8fd0\u884c\u7684\u65f6\u95f4,\u800c\u4e0d\u662f ``timeit.Timer.timeit()`` \u8fd4\u56de\u7684\u603b\u8fd0\u884c\u65f6\u95f4\u3002\nPyTorch ``benchmark``\u6a21\u5757\u8fd8\u63d0\u4f9b\u4e86\u683c\u5f0f\u5316\u7684\u5b57\u7b26\u4e32\u8868\u793a,\u7528\u4e8e\u6253\u5370\u7ed3\u679c\u3002\n\n\u53e6\u4e00\u4e2a\u91cd\u8981\u7684\u533a\u522b,\u4e5f\u662f\u7ed3\u679c\u4e0d\u540c\u7684\u539f\u56e0,\u662fPyTorch\u57fa\u51c6\u6d4b\u8bd5\u6a21\u5757\u9ed8\u8ba4\u5728\u5355\u7ebf\u7a0b\u4e2d\u8fd0\u884c\u3002\n\u6211\u4eec\u53ef\u4ee5\u4f7f\u7528``num_threads``\u53c2\u6570\u6765\u66f4\u6539\u7ebf\u7a0b\u6570\u91cf\u3002\n\n``torch.utils.benchmark.Timer``\u63a5\u53d7\u51e0\u4e2a\u989d\u5916\u7684\u53c2\u6570,\u5305\u62ec: ``label``\u3001``sub_label``\u3001``description``\u548c``env``,\n\u8fd9\u4e9b\u53c2\u6570\u4f1a\u6539\u53d8\u8fd4\u56de\u7684\u6d4b\u91cf\u5bf9\u8c61\u7684__repr__,\u5e76\u7528\u4e8e\u5bf9\u7ed3\u679c\u8fdb\u884c\u5206\u7ec4(\u7a0d\u540e\u4f1a\u8be6\u7ec6\u4ecb\u7ecd)\u3002\n\n\n" ] }, { @@ -101,7 +101,7 @@ }, "outputs": [], "source": [ - "num_threads = torch.get_num_threads()\nprint(f'Benchmarking on {num_threads} threads')\n\nt0 = benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)', \n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x},\n num_threads=num_threads,\n label='Multithreaded batch dot',\n sub_label='Implemented using mul and sum')\n\nt1 = benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x},\n num_threads=num_threads,\n label='Multithreaded batch dot',\n sub_label='Implemented using bmm')\n\nprint(t0.timeit(100))\nprint(t1.timeit(100))" + "num_threads = torch.get_num_threads()\nprint(f\"Benchmarking on {num_threads} threads\")\n\nt0 = benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n num_threads=num_threads,\n label=\"Multithreaded batch dot\",\n sub_label=\"Implemented using mul and sum\",\n)\n\nt1 = benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n num_threads=num_threads,\n label=\"Multithreaded batch dot\",\n sub_label=\"Implemented using bmm\",\n)\n\nprint(t0.timeit(100))\nprint(t1.timeit(100))" ] }, { @@ -115,7 +115,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Running ``benchmark`` with all threads available gives similar results\nas the ``timeit`` module. More importantly, which version is faster\ndepends on how many threads we run the code with. This is why it's\nimportant to benchmark the code with thread settings that are\nrepresentative of real use cases. 
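A small sketch of what benchmarking with representative thread settings can look like in practice, reusing the functions and input defined earlier; the particular thread counts are arbitrary choices for this note:

.. code-block:: python

    for n_threads in (1, 4, torch.get_num_threads()):
        m = benchmark.Timer(
            stmt="batched_dot_mul_sum(x, x)",
            setup="from __main__ import batched_dot_mul_sum",
            globals={"x": x},
            num_threads=n_threads,
        ).timeit(100)
        print(f"{n_threads:>2} threads: {m.mean * 1e6:>6.1f} us")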
Another important thing to remember\nis to synchronize CPU and CUDA when benchmarking on the GPU. Let's run\nthe above benchmarks again on a CUDA tensor and see what happens.\n\n\n" + "\u4f7f\u7528\u6240\u6709\u53ef\u7528\u7ebf\u7a0b\u8fd0\u884c ``benchmark`` \u4f1a\u5f97\u5230\u4e0e ``timeit`` \u6a21\u5757\u7c7b\u4f3c\u7684\u7ed3\u679c\u3002\n\u66f4\u91cd\u8981\u7684\u662f,\u54ea\u4e2a\u7248\u672c\u66f4\u5feb\u53d6\u51b3\u4e8e\u6211\u4eec\u4f7f\u7528\u591a\u5c11\u7ebf\u7a0b\u8fd0\u884c\u4ee3\u7801\u3002\n\u8fd9\u5c31\u662f\u4e3a\u4ec0\u4e48\u5728\u57fa\u51c6\u6d4b\u8bd5\u65f6,\u4f7f\u7528\u4e0e\u5b9e\u9645\u7528\u4f8b\u76f8\u7b26\u7684\u7ebf\u7a0b\u8bbe\u7f6e\u975e\u5e38\u91cd\u8981\u3002\n\u53e6\u4e00\u4e2a\u9700\u8981\u8bb0\u4f4f\u7684\u91cd\u8981\u4e8b\u60c5\u662f,\u5728 GPU \u4e0a\u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\u65f6,\u8981\u540c\u6b65CPU\u548cCUDA\u3002\n\u8ba9\u6211\u4eec\u518d\u6b21\u5728CUDA\u5f20\u91cf\u4e0a\u8fd0\u884c\u4e0a\u9762\u7684\u57fa\u51c6\u6d4b\u8bd5,\u770b\u770b\u4f1a\u53d1\u751f\u4ec0\u4e48\u3002\n\n\n" ] }, { @@ -126,7 +126,7 @@ }, "outputs": [], "source": [ - "x = torch.randn(10000, 1024, device='cuda')\n\nt0 = timeit.Timer(\n stmt='batched_dot_mul_sum(x, x)', \n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x})\n\nt1 = timeit.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x})\n\n# Ran each twice to show difference before/after warm-up\nprint(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')\nprint(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')\nprint(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')\nprint(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')" + "x = torch.randn(10000, 1024, device=\"cuda\")\n\nt0 = timeit.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n)\n\nt1 = timeit.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n)\n\n# Ran each twice to show difference before/after warm-up\nprint(f\"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us\")\nprint(f\"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us\")\nprint(f\"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us\")\nprint(f\"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us\")" ] }, { @@ -144,7 +144,7 @@ }, "outputs": [], "source": [ - "t0 = benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)', \n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x})\n\nt1 = benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x})\n\n# Run only once since benchmark module does warm-up for us\nprint(t0.timeit(100))\nprint(t1.timeit(100))" + "t0 = benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n)\n\nt1 = benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n)\n\n# Run only once since benchmark module does warm-up for us\nprint(t0.timeit(100))\nprint(t1.timeit(100))" ] }, { @@ -158,14 +158,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The results reveal something interesting. The first run of the ``bmm``\nversion using the ``timeit`` module takes much longer than the second\nrun. This is because ``bmm`` calls into `cuBLAS` which needs to be\nloaded the first time it's called which takes some time. 
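For comparison, a hedged sketch of what one would have to do by hand to get meaningful GPU numbers out of raw ``timeit``: an explicit warm-up call plus ``torch.cuda.synchronize()`` so the timer waits for the kernels to finish (assuming a CUDA device is available and ``x`` is the CUDA tensor defined above):

.. code-block:: python

    # warm-up: trigger cuBLAS loading and any lazy initialization first
    batched_dot_bmm(x, x)
    torch.cuda.synchronize()

    t1_sync = timeit.Timer(
        stmt="batched_dot_bmm(x, x); torch.cuda.synchronize()",
        setup="from __main__ import batched_dot_bmm",
        globals={"x": x, "torch": torch},
    )
    print(f"bmm(x, x): {t1_sync.timeit(100) / 100 * 1e6:>5.1f} us")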
This is why\nit's important to do a warm-up run before benchmarking, luckily for\nus, PyTorch's ``benchmark`` module takes care of that.\n\nThe difference in the results between ``timeit`` and ``benchmark`` modules\nis because the `timeit` module is not synchronizing CUDA and is thus only\ntiming the time to launch the kernel. PyTorch's ``benchmark`` module does\nthe synchronization for us.\n\n" + "\u7ed3\u679c\u63ed\u793a\u4e86\u4e00\u4e9b\u6709\u8da3\u7684\u4e8b\u60c5\u3002\u4f7f\u7528 `timeit` \u6a21\u5757\u8fd0\u884c `bmm` \u7248\u672c\u7684\u7b2c\u4e00\u6b21\u8fd0\u884c\u6bd4\u7b2c\u4e8c\u6b21\u8fd0\u884c\u6162\u5f88\u591a\u3002\n\u8fd9\u662f\u56e0\u4e3a `bmm` \u9700\u8981\u8c03\u7528 `cuBLAS`,\u7b2c\u4e00\u6b21\u8c03\u7528\u65f6\u9700\u8981\u52a0\u8f7d\u5b83,\u8fd9\u9700\u8981\u4e00\u4e9b\u65f6\u95f4\u3002\n\u8fd9\u5c31\u662f\u4e3a\u4ec0\u4e48\u5728\u57fa\u51c6\u6d4b\u8bd5\u4e4b\u524d\u505a\u4e00\u6b21\u9884\u70ed\u8fd0\u884c\u5f88\u91cd\u8981,\u5e78\u8fd0\u7684\u662f, PyTorch \u7684 `benchmark` \u6a21\u5757\u4e3a\u6211\u4eec\u5904\u7406\u4e86\u8fd9\u4e2a\u95ee\u9898\u3002\n\n`timeit` \u6a21\u5757\u548c `benchmark` \u6a21\u5757\u4e4b\u95f4\u7ed3\u679c\u7684\u5dee\u5f02\u662f\u56e0\u4e3a `timeit` \u6a21\u5757\u6ca1\u6709\u540c\u6b65 CUDA,\u56e0\u6b64\u53ea\u8ba1\u65f6\u4e86\u542f\u52a8\u5185\u6838\u7684\u65f6\u95f4\u3002\nPyTorch \u7684 `benchmark` \u6a21\u5757\u4e3a\u6211\u4eec\u505a\u4e86\u540c\u6b65\u3002\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 4. Benchmarking with `Blocked Autorange`\n\nWhile ``timeit.Timer.autorange`` takes a single continuous measurement\nof at least 0.2 seconds, `torch.utils.benchmark.blocked_autorange`\ntakes many measurements whose times total at least 0.2 seconds (which\ncan be changed by the `min_run_time` parameter) subject to the constraint\nthat timing overhead is a small fraction of the overall measurement.\nThis is accomplished by first running with an increasing number of runs\nper loop until the runtime is much larger than measurement overhead\n(which also serves as a warm up), and then taking measurements until\nthe target time is reached. This has the useful properties that it wastes\nless data and allows us to compute statistics to estimate the reliability\nof the measurements.\n\n\n" + "### 4. 
\u4f7f\u7528 `Blocked Autorange` \u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\n\n\u867d\u7136 `timeit.Timer.autorange` \u91c7\u53d6\u81f3\u5c11 0.2 \u79d2\u7684\u5355\u6b21\u8fde\u7eed\u6d4b\u91cf,\n\u4f46 `torch.utils.benchmark.blocked_autorange` \u91c7\u53d6\u591a\u6b21\u6d4b\u91cf,\u5176\u603b\u65f6\u95f4\u81f3\u5c11\u4e3a 0.2 \u79d2(\u53ef\u901a\u8fc7 `min_run_time` \u53c2\u6570\u66f4\u6539),\n\u5e76\u4e14\u6d4b\u91cf\u5f00\u9500\u53ea\u5360\u603b\u4f53\u6d4b\u91cf\u7684\u4e00\u5c0f\u90e8\u5206\u3002\n\u8fd9\u662f\u901a\u8fc7\u9996\u5148\u4ee5\u9012\u589e\u7684\u5faa\u73af\u6b21\u6570\u8fd0\u884c,\u76f4\u5230\u8fd0\u884c\u65f6\u95f4\u8fdc\u5927\u4e8e\u6d4b\u91cf\u5f00\u9500(\u8fd9\u4e5f\u8d77\u5230\u4e86\u70ed\u8eab\u7684\u4f5c\u7528),\n\u7136\u540e\u8fdb\u884c\u6d4b\u91cf\u76f4\u5230\u8fbe\u5230\u76ee\u6807\u65f6\u95f4\u3002\u8fd9\u6709\u4e00\u4e2a\u6709\u7528\u7684\u7279\u6027,\u5373\u5b83\u6d6a\u8d39\u7684\u6570\u636e\u66f4\u5c11,\u5e76\u4e14\u5141\u8bb8\u6211\u4eec\u8ba1\u7b97\u7edf\u8ba1\u6570\u636e\u6765\u4f30\u8ba1\u6d4b\u91cf\u7684\u53ef\u9760\u6027\u3002\n\n\n" ] }, { @@ -190,7 +190,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can also inspect the individual statistics from the returned\nmeasurements object.\n\n" + "\u6211\u4eec\u8fd8\u53ef\u4ee5\u67e5\u770b\u8fd4\u56de\u7684\u6d4b\u91cf\u5bf9\u8c61\u4e2d\u83b7\u5f97\u7684\u5404\u4e2a\u7edf\u8ba1\u6570\u636e\u3002\n\n" ] }, { @@ -215,7 +215,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 5. Comparing benchmark results\n\nSo far we've been comparing our two versions of batched dot against a\nsingle input. In practice, we want to try a combination of inputs as\nwell as different number of threads. The ``Compare`` class helps display\nthe results of many measurements in a formatted table. It uses the\nannotations described above (`label`, `sub_label`, `num_threads`, etc.) as\nwell as `description` to group and organize the table. Let's use\n``Compare`` to see how our functions perform for different input sizes\nand number of threads.\n\n\n" + "### 5. 
\u6bd4\u8f83\u57fa\u51c6\u6d4b\u8bd5\u7ed3\u679c\n\n\u5230\u76ee\u524d\u4e3a\u6b62,\u6211\u4eec\u4e00\u76f4\u5728\u6bd4\u8f83\u6211\u4eec\u7684\u4e24\u4e2a\u6279\u91cf\u70b9\u79ef\u7248\u672c\u5bf9\u540c\u4e00\u8f93\u5165\u7684\u8868\u73b0\u3002\n\u5728\u5b9e\u8df5\u4e2d,\u6211\u4eec\u5e0c\u671b\u5c1d\u8bd5\u4e0d\u540c\u7684\u8f93\u5165\u7ec4\u5408\u4ee5\u53ca\u4e0d\u540c\u7684\u7ebf\u7a0b\u6570\u91cf\u3002\n`Compare` \u7c7b\u5e2e\u52a9\u6211\u4eec\u4ee5\u683c\u5f0f\u5316\u8868\u683c\u7684\u5f62\u5f0f\u663e\u793a\u591a\u4e2a\u6d4b\u91cf\u7ed3\u679c\u3002\n\u5b83\u4f7f\u7528\u4e0a\u9762\u63cf\u8ff0\u7684\u6ce8\u91ca( `label`\u3001 `sub_label`\u3001 `num_threads` \u7b49)\u4ee5\u53ca `description` \u6765\u5bf9\u8868\u683c\u8fdb\u884c\u5206\u7ec4\u548c\u7ec4\u7ec7\u3002\n\u8ba9\u6211\u4eec\u4f7f\u7528 `Compare` \u6765\u770b\u770b\u6211\u4eec\u7684\u51fd\u6570\u5728\u4e0d\u540c\u7684\u8f93\u5165\u5927\u5c0f\u548c\u7ebf\u7a0b\u6570\u91cf\u4e0b\u7684\u8868\u73b0\u5982\u4f55\u3002\n\n\n" ] }, { @@ -226,21 +226,21 @@ }, "outputs": [], "source": [ - "from itertools import product\n\n# Compare takes a list of measurements which we'll save in results.\nresults = []\n\nsizes = [1, 64, 1024, 10000]\nfor b, n in product(sizes, sizes):\n # label and sub_label are the rows\n # description is the column\n label = 'Batched dot'\n sub_label = f'[{b}, {n}]'\n x = torch.ones((b, n))\n for num_threads in [1, 4, 16, 32]:\n results.append(benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)',\n setup='from __main__ import batched_dot_mul_sum',\n globals={'x': x},\n num_threads=num_threads,\n label=label,\n sub_label=sub_label,\n description='mul/sum',\n ).blocked_autorange(min_run_time=1))\n results.append(benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals={'x': x},\n num_threads=num_threads,\n label=label,\n sub_label=sub_label,\n description='bmm',\n ).blocked_autorange(min_run_time=1))\n\ncompare = benchmark.Compare(results)\ncompare.print()" + "from itertools import product\n\n# Compare takes a list of measurements which we'll save in results.\nresults = []\n\nsizes = [1, 64, 1024, 10000]\nfor b, n in product(sizes, sizes):\n # label and sub_label are the rows\n # description is the column\n label = \"Batched dot\"\n sub_label = f\"[{b}, {n}]\"\n x = torch.ones((b, n))\n for num_threads in [1, 4, 16, 32]:\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals={\"x\": x},\n num_threads=num_threads,\n label=label,\n sub_label=sub_label,\n description=\"mul/sum\",\n ).blocked_autorange(min_run_time=1)\n )\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals={\"x\": x},\n num_threads=num_threads,\n label=label,\n sub_label=sub_label,\n description=\"bmm\",\n ).blocked_autorange(min_run_time=1)\n )\n\ncompare = benchmark.Compare(results)\ncompare.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - ".. 
code-block:: none\n :caption: Output\n\n [--------------- Batched dot ----------------]\n | mul/sum | bmm \n 1 threads: -----------------------------------\n [1, 1] | 5.9 | 11.2\n [1, 64] | 6.4 | 11.4\n [1, 1024] | 6.7 | 14.2\n [1, 10000] | 10.2 | 23.7\n [64, 1] | 6.3 | 11.5\n [64, 64] | 8.6 | 15.4\n [64, 1024] | 39.4 | 204.4\n [64, 10000] | 274.9 | 748.5\n [1024, 1] | 7.7 | 17.8\n [1024, 64] | 40.3 | 76.4\n [1024, 1024] | 432.4 | 2795.9\n [1024, 10000] | 22657.3 | 11899.5\n [10000, 1] | 16.9 | 74.8\n [10000, 64] | 300.3 | 609.4\n [10000, 1024] | 23098.6 | 27246.1\n [10000, 10000] | 267073.7 | 118823.7\n 4 threads: -----------------------------------\n [1, 1] | 6.0 | 11.5\n [1, 64] | 6.2 | 11.2\n [1, 1024] | 6.8 | 14.3\n [1, 10000] | 10.2 | 23.7\n [64, 1] | 6.3 | 16.2\n [64, 64] | 8.8 | 18.2\n [64, 1024] | 41.5 | 189.1\n [64, 10000] | 91.7 | 849.1\n [1024, 1] | 7.6 | 17.4\n [1024, 64] | 43.5 | 33.5\n [1024, 1024] | 135.4 | 2782.3\n [1024, 10000] | 7471.1 | 11874.0\n [10000, 1] | 16.8 | 33.9\n [10000, 64] | 118.7 | 173.2\n [10000, 1024] | 7264.6 | 27824.7\n [10000, 10000] | 100060.9 | 121499.0\n 16 threads: ----------------------------------\n [1, 1] | 6.0 | 11.3\n [1, 64] | 6.2 | 11.2\n [1, 1024] | 6.9 | 14.2\n [1, 10000] | 10.3 | 23.8\n [64, 1] | 6.4 | 24.1\n [64, 64] | 9.0 | 23.8\n [64, 1024] | 54.1 | 188.5\n [64, 10000] | 49.9 | 748.0\n [1024, 1] | 7.6 | 23.4\n [1024, 64] | 55.5 | 28.2\n [1024, 1024] | 66.9 | 2773.9\n [1024, 10000] | 6111.5 | 12833.7\n [10000, 1] | 16.9 | 27.5\n [10000, 64] | 59.5 | 73.7\n [10000, 1024] | 6295.9 | 27062.0\n [10000, 10000] | 71804.5 | 120365.8\n 32 threads: ----------------------------------\n [1, 1] | 5.9 | 11.3\n [1, 64] | 6.2 | 11.3\n [1, 1024] | 6.7 | 14.2\n [1, 10000] | 10.5 | 23.8\n [64, 1] | 6.3 | 31.7\n [64, 64] | 9.1 | 30.4\n [64, 1024] | 72.0 | 190.4\n [64, 10000] | 103.1 | 746.9\n [1024, 1] | 7.6 | 28.4\n [1024, 64] | 70.5 | 31.9\n [1024, 1024] | 65.6 | 2804.6\n [1024, 10000] | 6764.0 | 11871.4\n [10000, 1] | 17.8 | 31.8\n [10000, 64] | 110.3 | 56.0\n [10000, 1024] | 6640.2 | 27592.2\n [10000, 10000] | 73003.4 | 120083.2\n\n Times are in microseconds (us).\n\n\n" + ".. 
code-block:: none\n :caption: Output\n\n [--------------- Batched dot ----------------]\n | mul/sum | bmm\n 1 threads: -----------------------------------\n [1, 1] | 5.9 | 11.2\n [1, 64] | 6.4 | 11.4\n [1, 1024] | 6.7 | 14.2\n [1, 10000] | 10.2 | 23.7\n [64, 1] | 6.3 | 11.5\n [64, 64] | 8.6 | 15.4\n [64, 1024] | 39.4 | 204.4\n [64, 10000] | 274.9 | 748.5\n [1024, 1] | 7.7 | 17.8\n [1024, 64] | 40.3 | 76.4\n [1024, 1024] | 432.4 | 2795.9\n [1024, 10000] | 22657.3 | 11899.5\n [10000, 1] | 16.9 | 74.8\n [10000, 64] | 300.3 | 609.4\n [10000, 1024] | 23098.6 | 27246.1\n [10000, 10000] | 267073.7 | 118823.7\n 4 threads: -----------------------------------\n [1, 1] | 6.0 | 11.5\n [1, 64] | 6.2 | 11.2\n [1, 1024] | 6.8 | 14.3\n [1, 10000] | 10.2 | 23.7\n [64, 1] | 6.3 | 16.2\n [64, 64] | 8.8 | 18.2\n [64, 1024] | 41.5 | 189.1\n [64, 10000] | 91.7 | 849.1\n [1024, 1] | 7.6 | 17.4\n [1024, 64] | 43.5 | 33.5\n [1024, 1024] | 135.4 | 2782.3\n [1024, 10000] | 7471.1 | 11874.0\n [10000, 1] | 16.8 | 33.9\n [10000, 64] | 118.7 | 173.2\n [10000, 1024] | 7264.6 | 27824.7\n [10000, 10000] | 100060.9 | 121499.0\n 16 threads: ----------------------------------\n [1, 1] | 6.0 | 11.3\n [1, 64] | 6.2 | 11.2\n [1, 1024] | 6.9 | 14.2\n [1, 10000] | 10.3 | 23.8\n [64, 1] | 6.4 | 24.1\n [64, 64] | 9.0 | 23.8\n [64, 1024] | 54.1 | 188.5\n [64, 10000] | 49.9 | 748.0\n [1024, 1] | 7.6 | 23.4\n [1024, 64] | 55.5 | 28.2\n [1024, 1024] | 66.9 | 2773.9\n [1024, 10000] | 6111.5 | 12833.7\n [10000, 1] | 16.9 | 27.5\n [10000, 64] | 59.5 | 73.7\n [10000, 1024] | 6295.9 | 27062.0\n [10000, 10000] | 71804.5 | 120365.8\n 32 threads: ----------------------------------\n [1, 1] | 5.9 | 11.3\n [1, 64] | 6.2 | 11.3\n [1, 1024] | 6.7 | 14.2\n [1, 10000] | 10.5 | 23.8\n [64, 1] | 6.3 | 31.7\n [64, 64] | 9.1 | 30.4\n [64, 1024] | 72.0 | 190.4\n [64, 10000] | 103.1 | 746.9\n [1024, 1] | 7.6 | 28.4\n [1024, 64] | 70.5 | 31.9\n [1024, 1024] | 65.6 | 2804.6\n [1024, 10000] | 6764.0 | 11871.4\n [10000, 1] | 17.8 | 31.8\n [10000, 64] | 110.3 | 56.0\n [10000, 1024] | 6640.2 | 27592.2\n [10000, 10000] | 73003.4 | 120083.2\n\n Times are in microseconds (us).\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The results above indicate that the version which reduces to ``bmm``\nis better for larger tensors running on multiple threads, while for\nsmaller and/or single thread code, the other version is better.\n\n``Compare`` also provides functions for changing the table format\n\n\n" + "\u4e0a\u9762\u7684\u7ed3\u679c\u8868\u660e,\u5bf9\u4e8e\u5728\u591a\u7ebf\u7a0b\u4e0a\u8fd0\u884c\u7684\u8f83\u5927\u5f20\u91cf, `bmm` \u7684\u7248\u672c\u6548\u679c\u66f4\u597d,\n\u800c\u5bf9\u4e8e\u8f83\u5c0f\u548c/\u6216\u5355\u7ebf\u7a0b\u4ee3\u7801,\u53e6\u4e00\u4e2a\u7248\u672c\u6548\u679c\u66f4\u597d\u3002\n\n`Compare` \u8fd8\u63d0\u4f9b\u4e86\u7528\u4e8e\u66f4\u6539\u8868\u683c\u683c\u5f0f\u7684\u51fd\u6570\n\n" ] }, { @@ -258,7 +258,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 6. Saving/Loading benchmark results\n\n`Measurements` (and ``CallgrindStats`` which are described in section 8)\ncan be serialized by the ``pickle`` module. This makes A/B testing easy, as you can collect\nmeasurements from two separate environments, pickle them, and then\nload both in a single environment. 
Timer even takes an `env`\nconstructor argument so that such A/B testing works seamlessly.\n\nLet's imagine that rather than two Python functions, the add/sum\nand ``bmm`` approaches were in two different builds of PyTorch.\nThe example below demonstrates how one might A/B test them. For\nsimplicity, we only use a subset of shapes, and simply round trip\nresults through pickle rather than actually using multiple environments\nand writing results to disk.\n\n\n" + "### 6. \u4fdd\u5b58/\u52a0\u8f7d\u57fa\u51c6\u6d4b\u8bd5\u7ed3\u679c\n\n`Measurements` (\u548c\u7b2c8\u8282\u4e2d\u63cf\u8ff0\u7684 `CallgrindStats` )\u53ef\u4ee5\u901a\u8fc7 `pickle` \u6a21\u5757\u5e8f\u5217\u5316\u3002\n\u8fd9\u4f7f\u5f97A/B\u6d4b\u8bd5\u53d8\u5f97\u5f88\u5bb9\u6613,\u56e0\u4e3a\u60a8\u53ef\u4ee5\u4ece\u4e24\u4e2a\u72ec\u7acb\u7684\u73af\u5883\u4e2d\u6536\u96c6\u6d4b\u91cf\u7ed3\u679c,\n\u5c06\u5b83\u4eec\u5e8f\u5217\u5316,\u7136\u540e\u5728\u5355\u4e2a\u73af\u5883\u4e2d\u52a0\u8f7d\u4e24\u8005\u3002Timer\u751a\u81f3\u63a5\u53d7\u4e00\u4e2a `env`\n\u6784\u9020\u51fd\u6570\u53c2\u6570,\u4ee5\u4fbf\u8fd9\u79cdA/B\u6d4b\u8bd5\u53ef\u4ee5\u65e0\u7f1d\u8854\u63a5\u3002\n\n\u5047\u8bbe add/sum \u548c `bmm` \u65b9\u6cd5\u4e0d\u662f\u4e24\u4e2aPython\u51fd\u6570,\u800c\u662f PyTorch \u7684\u4e24\u4e2a\u4e0d\u540c\u7248\u672c\u3002\n\u4e0b\u9762\u7684\u793a\u4f8b\u6f14\u793a\u4e86\u5982\u4f55\u8fdb\u884cA/B\u6d4b\u8bd5\u3002\u4e3a\u4e86\u7b80\u5355\u8d77\u89c1,\u6211\u4eec\u53ea\u4f7f\u7528\u4e86\u4e00\u90e8\u5206\u6570\u636e,\n\u5e76\u7b80\u5355\u5730\u901a\u8fc7pickle\u6765\u56de\u4f20\u7ed3\u679c,\u800c\u4e0d\u662f\u5b9e\u9645\u4f7f\u7528\u591a\u4e2a\u73af\u5883\u5e76\u5c06\u7ed3\u679c\u5199\u5165\u78c1\u76d8\u3002\n\n\n" ] }, { @@ -269,7 +269,7 @@ }, "outputs": [], "source": [ - "import pickle\n\nab_test_results = []\nfor env in ('environment A: mul/sum', 'environment B: bmm'):\n for b, n in ((1, 1), (1024, 10000), (10000, 1)):\n x = torch.ones((b, n))\n dot_fn = (batched_dot_mul_sum if env == 'environment A: mul/sum' else batched_dot_bmm)\n m = benchmark.Timer(\n stmt='batched_dot(x, x)',\n globals={'x': x, 'batched_dot': dot_fn},\n num_threads=1,\n label='Batched dot',\n description=f'[{b}, {n}]',\n env=env,\n ).blocked_autorange(min_run_time=1)\n ab_test_results.append(pickle.dumps(m))\n\nab_results = [pickle.loads(i) for i in ab_test_results]\ncompare = benchmark.Compare(ab_results)\ncompare.trim_significant_figures()\ncompare.colorize()\ncompare.print()" + "import pickle\n\nab_test_results = []\nfor env in (\"environment A: mul/sum\", \"environment B: bmm\"):\n for b, n in ((1, 1), (1024, 10000), (10000, 1)):\n x = torch.ones((b, n))\n dot_fn = (\n batched_dot_mul_sum if env == \"environment A: mul/sum\" else batched_dot_bmm\n )\n m = benchmark.Timer(\n stmt=\"batched_dot(x, x)\",\n globals={\"x\": x, \"batched_dot\": dot_fn},\n num_threads=1,\n label=\"Batched dot\",\n description=f\"[{b}, {n}]\",\n env=env,\n ).blocked_autorange(min_run_time=1)\n ab_test_results.append(pickle.dumps(m))\n\nab_results = [pickle.loads(i) for i in ab_test_results]\ncompare = benchmark.Compare(ab_results)\ncompare.trim_significant_figures()\ncompare.colorize()\ncompare.print()" ] }, { @@ -287,14 +287,14 @@ }, "outputs": [], "source": [ - "# And just to show that we can round trip all of the results from earlier:\nround_tripped_results = pickle.loads(pickle.dumps(results))\nassert(str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results)))" + "# 
\u4ec5\u4e3a\u5c55\u793a\u53ef\u4ee5\u5c06\u4e4b\u524d\u6240\u6709\u7684\u7ed3\u679c\u901a\u8fc7 pickle \u8fdb\u884c\u56de\u4f20:\nround_tripped_results = pickle.loads(pickle.dumps(results))\nassert str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 7. Generating inputs with `Fuzzed Parameters`\n\nAs we've seen in the previous section, there can be some stark\nperformance differences depending on the input tensors. Hence, it\nis a good idea to run benchmarks on a number of different inputs.\nHowever, creating all these input tensors can be tedious which is\nwhere ``torch.utils.benchmark.Fuzzer`` and related classes come in.\nLet's take a look at how we can use the ``Fuzzer`` to create some test\ncases for the benchmark.\n\n\n" + "### 7. \u4f7f\u7528 `Fuzzed Parameters` \u751f\u6210\u8f93\u5165\n\n\u6b63\u5982\u6211\u4eec\u5728\u4e0a\u4e00\u8282\u4e2d\u770b\u5230\u7684,\u6839\u636e\u8f93\u5165\u5f20\u91cf\u7684\u4e0d\u540c,\u6027\u80fd\u5dee\u5f02\u53ef\u80fd\u4f1a\u5f88\u5927\u3002\n\u56e0\u6b64,\u5728\u591a\u4e2a\u4e0d\u540c\u7684\u8f93\u5165\u4e0a\u8fd0\u884c\u57fa\u51c6\u6d4b\u8bd5\u662f\u4e00\u4e2a\u597d\u4e3b\u610f\u3002\n\u4f46\u662f,\u521b\u5efa\u6240\u6709\u8fd9\u4e9b\u8f93\u5165\u5f20\u91cf\u53ef\u80fd\u4f1a\u5f88\u9ebb\u70e6,\u8fd9\u5c31\u662f `torch.utils.benchmark.Fuzzer`\n\u548c\u76f8\u5173\u7c7b\u7684\u7528\u6b66\u4e4b\u5730\u3002\u8ba9\u6211\u4eec\u770b\u770b\u5982\u4f55\u4f7f\u7528 `Fuzzer` \u6765\u521b\u5efa\u4e00\u4e9b\u7528\u4e8e\u57fa\u51c6\u6d4b\u8bd5\u7684\u6d4b\u8bd5\u7528\u4f8b\u3002\n\n\n" ] }, { @@ -305,21 +305,21 @@ }, "outputs": [], "source": [ - "from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias\n\n# Generates random tensors with 128 to 10000000 elements and sizes k0 and k1 chosen from a\n# ``loguniform`` distribution in [1, 10000], 40% of which will be discontiguous on average.\nexample_fuzzer = Fuzzer(\n parameters = [\n FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'),\n FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'),\n ],\n tensors = [\n FuzzedTensor('x', size=('k0', 'k1'), min_elements=128, max_elements=10000000, probability_contiguous=0.6)\n ],\n seed=0,\n)\n\nresults = []\nfor tensors, tensor_params, params in example_fuzzer.take(10):\n # description is the column label\n sub_label=f\"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}\"\n results.append(benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)',\n setup='from __main__ import batched_dot_mul_sum',\n globals=tensors,\n label='Batched dot',\n sub_label=sub_label,\n description='mul/sum',\n ).blocked_autorange(min_run_time=1))\n results.append(benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals=tensors,\n label='Batched dot',\n sub_label=sub_label,\n description='bmm',\n ).blocked_autorange(min_run_time=1))\n\ncompare = benchmark.Compare(results)\ncompare.trim_significant_figures()\ncompare.print()" + "from torch.utils.benchmark import FuzzedParameter, FuzzedTensor, Fuzzer, ParameterAlias\n\n# \u751f\u6210\u968f\u673a\u5f20\u91cf,\u5143\u7d20\u6570\u91cf\u5728 128 \u5230 10000000 \u4e4b\u95f4,\u5927\u5c0f k0 \u548c k1 \u4ece [1, 10000] \u7684 `loguniform` \u5206\u5e03\u4e2d\u9009\u62e9,\n# \u5176\u4e2d\u5e73\u5747 40% \u5c06\u662f\u4e0d\u8fde\u7eed\u7684\u3002\nexample_fuzzer = Fuzzer(\n parameters=[\n 
FuzzedParameter(\"k0\", minval=1, maxval=10000, distribution=\"loguniform\"),\n FuzzedParameter(\"k1\", minval=1, maxval=10000, distribution=\"loguniform\"),\n ],\n tensors=[\n FuzzedTensor(\n \"x\",\n size=(\"k0\", \"k1\"),\n min_elements=128,\n max_elements=10000000,\n probability_contiguous=0.6,\n )\n ],\n seed=0,\n)\n\nresults = []\nfor tensors, tensor_params, params in example_fuzzer.take(10):\n # description is the column label\n sub_label = f\"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}\"\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals=tensors,\n label=\"Batched dot\",\n sub_label=sub_label,\n description=\"mul/sum\",\n ).blocked_autorange(min_run_time=1)\n )\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals=tensors,\n label=\"Batched dot\",\n sub_label=sub_label,\n description=\"bmm\",\n ).blocked_autorange(min_run_time=1)\n )\n\ncompare = benchmark.Compare(results)\ncompare.trim_significant_figures()\ncompare.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - ".. code-block:: none\n :caption: Output\n\n [--------------------- Batched dot ---------------------]\n | mul/sum | bmm \n 1 threads: ----------------------------------------------\n 725 x 257 | 87 | 180\n 49 x 383 | 15 | 30\n 34 x 1468 | 30 | 118\n 187 x 5039 | 400 | 1200\n 2140 x 1296 (discontiguous) | 2000 | 41000\n 78 x 1598 | 74 | 310\n 519 x 763 | 190 | 1500\n 141 x 1082 | 87 | 500\n 78 x 5 (discontiguous) | 9 | 20\n 187 x 1 | 12 | 10\n\n Times are in microseconds (us). \n\n\n" + ".. code-block:: none\n :caption: Output\n\n [--------------------- Batched dot ---------------------]\n | mul/sum | bmm\n 1 threads: ----------------------------------------------\n 725 x 257 | 87 | 180\n 49 x 383 | 15 | 30\n 34 x 1468 | 30 | 118\n 187 x 5039 | 400 | 1200\n 2140 x 1296 (discontiguous) | 2000 | 41000\n 78 x 1598 | 74 | 310\n 519 x 763 | 190 | 1500\n 141 x 1082 | 87 | 500\n 78 x 5 (discontiguous) | 9 | 20\n 187 x 1 | 12 | 10\n\n Times are in microseconds (us).\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "There is a lot of flexibility for defining your own ``fuzzers`` which\nis great for creating a powerful set of inputs to benchmark. But to\nmake things even simpler, PyTorch benchmark module comes with some\nbuilt-in ``fuzzers`` for common benchmarking needs. 
Let's take a look at\nhow we can use one of these built-in ``fuzzers``.\n\n\n" + "\u5b9a\u4e49\u81ea\u5df1\u7684 `fuzzers` \u6709\u5f88\u5927\u7684\u7075\u6d3b\u6027,\u8fd9\u5bf9\u4e8e\u521b\u5efa\u5f3a\u5927\u7684\u8f93\u5165\u96c6\u8fdb\u884c\u57fa\u51c6\u6d4b\u8bd5\u975e\u5e38\u6709\u7528\u3002\n\u4f46\u4e3a\u4e86\u8ba9\u4e8b\u60c5\u53d8\u5f97\u66f4\u7b80\u5355, PyTorch \u57fa\u51c6\u6d4b\u8bd5\u6a21\u5757\u4e3a\u5e38\u89c1\u7684\u57fa\u51c6\u6d4b\u8bd5\u9700\u6c42\u63d0\u4f9b\u4e86\u4e00\u4e9b\u5185\u7f6e\u7684 `fuzzers`\u3002\n\u8ba9\u6211\u4eec\u770b\u770b\u5982\u4f55\u4f7f\u7528\u5176\u4e2d\u4e00\u4e2a\u5185\u7f6e\u7684 `fuzzers` \u3002\n\n\n" ] }, { @@ -330,21 +330,21 @@ }, "outputs": [], "source": [ - "from torch.utils.benchmark.op_fuzzers import binary\n\nresults = []\nfor tensors, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10):\n sub_label=f\"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}\"\n results.append(benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)',\n setup='from __main__ import batched_dot_mul_sum',\n globals=tensors,\n label='Batched dot',\n sub_label=sub_label,\n description='mul/sum',\n ).blocked_autorange(min_run_time=1))\n results.append(benchmark.Timer(\n stmt='batched_dot_bmm(x, x)',\n setup='from __main__ import batched_dot_bmm',\n globals=tensors,\n label='Batched dot',\n sub_label=sub_label,\n description='bmm',\n ).blocked_autorange(min_run_time=1))\n\ncompare = benchmark.Compare(results)\ncompare.trim_significant_figures()\ncompare.colorize(rowwise=True)\ncompare.print()" + "from torch.utils.benchmark.op_fuzzers import binary\n\nresults = []\nfor tensors, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10):\n sub_label = f\"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}\"\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"from __main__ import batched_dot_mul_sum\",\n globals=tensors,\n label=\"Batched dot\",\n sub_label=sub_label,\n description=\"mul/sum\",\n ).blocked_autorange(min_run_time=1)\n )\n results.append(\n benchmark.Timer(\n stmt=\"batched_dot_bmm(x, x)\",\n setup=\"from __main__ import batched_dot_bmm\",\n globals=tensors,\n label=\"Batched dot\",\n sub_label=sub_label,\n description=\"bmm\",\n ).blocked_autorange(min_run_time=1)\n )\n\ncompare = benchmark.Compare(results)\ncompare.trim_significant_figures()\ncompare.colorize(rowwise=True)\ncompare.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - ".. code-block:: none\n :caption: Output\n\n [----------------------- Batched dot ------------------------]\n | mul/sum | bmm \n 1 threads: ---------------------------------------------------\n 64 x 473 (discontiguous) | 10000 | 40000\n 16384 x 12642115 (discontiguous) | 31 | 78\n 8192 x 892 | 4800 | 20400\n 512 x 64 (discontiguous) | 110000 | 400000\n 493 x 27 (discontiguous) | 1100 | 2440\n 118 x 32 (discontiguous) | 870 | 2030\n 16 x 495 (discontiguous) | 23600 | 24000\n 488 x 62374 | 90000 | 100000\n 240372 x 69 | 40000 | 16000\n 40156 x 32 (discontiguous) | 2670 | 5000\n\n Times are in microseconds (us).\n\n\n" + ".. 
code-block:: none\n :caption: Output\n\n [----------------------- Batched dot ------------------------]\n | mul/sum | bmm\n 1 threads: ---------------------------------------------------\n 64 x 473 (discontiguous) | 10000 | 40000\n 16384 x 12642115 (discontiguous) | 31 | 78\n 8192 x 892 | 4800 | 20400\n 512 x 64 (discontiguous) | 110000 | 400000\n 493 x 27 (discontiguous) | 1100 | 2440\n 118 x 32 (discontiguous) | 870 | 2030\n 16 x 495 (discontiguous) | 23600 | 24000\n 488 x 62374 | 90000 | 100000\n 240372 x 69 | 40000 | 16000\n 40156 x 32 (discontiguous) | 2670 | 5000\n\n Times are in microseconds (us).\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 8. Collecting instruction counts with ``Callgrind``\n\nOne of the challenges of optimizing code is the variation and opacity of\nwall time. There are many sources of non-determinism, from adaptive clock\nspeeds to resource contention with other processes. Furthermore, end-to-end\ntime gives no insight into where time is being spent, which is really what\nwe're interested in when optimizing code.\n\nA complementary approach is to also collect instruction counts. These counts\nare a proxy metric and do not capture all aspects of performance\n(e.g. memory or I/O bound tasks), however they do have several useful\nproperties. Instruction counts are reproducible, insensitive to environmental\nvariation, and offer fine grained insight into where a program is spending\ncycles.\n\nTo see the utility of instruction counts, let us look at how we might\nreduce the overhead of `batched_dot_mul_sum`. The obvious solution is to\nmove it to C++, so we avoid going between Python and C++ multiple times.\n\nFortunately, the source is nearly identical. One question that we have to ask\nin C++ is whether we should take arguments by value or reference.\n\n\n" + "### 8. 
\u4f7f\u7528 `Callgrind` \u6536\u96c6\u6307\u4ee4\u8ba1\u6570\n\n\u4f18\u5316\u4ee3\u7801\u7684\u4e00\u4e2a\u6311\u6218\u662f\u65f6\u95f4\u7684\u53d8\u5316\u548c\u4e0d\u900f\u660e\u6027\u3002\u6709\u8bb8\u591a\u4e0d\u786e\u5b9a\u6027\u7684\u6765\u6e90,\n\u4ece\u81ea\u9002\u5e94\u65f6\u949f\u901f\u5ea6\u5230\u4e0e\u5176\u4ed6\u8fdb\u7a0b\u7684\u8d44\u6e90\u4e89\u7528\u3002\u6b64\u5916,\u7aef\u5230\u7aef\u65f6\u95f4\u5e76\u4e0d\u80fd\u63ed\u793a\u65f6\u95f4\u82b1\u8d39\u5728\u54ea\u91cc,\n\u800c\u8fd9\u6b63\u662f\u6211\u4eec\u5728\u4f18\u5316\u4ee3\u7801\u65f6\u611f\u5174\u8da3\u7684\u3002\n\n\u4e00\u79cd\u8865\u5145\u65b9\u6cd5\u662f\u4e5f\u6536\u96c6\u6307\u4ee4\u8ba1\u6570\u3002\u8fd9\u4e9b\u8ba1\u6570\u662f\u4e00\u79cd\u4ee3\u7406\u6307\u6807,\u5e76\u4e0d\u80fd\u6355\u83b7\u6027\u80fd\u7684\u6240\u6709\u65b9\u9762\n(\u4f8b\u5982\u5185\u5b58\u6216I/O\u7ed1\u5b9a\u4efb\u52a1),\u4f46\u5b83\u4eec\u786e\u5b9e\u5177\u6709\u4e00\u4e9b\u6709\u7528\u7684\u7279\u6027\u3002\u6307\u4ee4\u8ba1\u6570\u662f\u53ef\u91cd\u590d\u7684,\n\u4e0d\u53d7\u73af\u5883\u53d8\u5316\u7684\u5f71\u54cd,\u5e76\u4e14\u53ef\u4ee5\u63d0\u4f9b\u5bf9\u7a0b\u5e8f\u5728\u54ea\u91cc\u82b1\u8d39\u5468\u671f\u7684\u7ec6\u7c92\u5ea6\u6d1e\u5bdf\u3002\n\n\u4e3a\u4e86\u770b\u5230\u6307\u4ee4\u8ba1\u6570\u7684\u5b9e\u7528\u6027,\u8ba9\u6211\u4eec\u770b\u770b\u5982\u4f55\u51cf\u5c11 `batched_dot_mul_sum` \u7684\u5f00\u9500\u3002\n\u663e\u800c\u6613\u89c1\u7684\u89e3\u51b3\u65b9\u6848\u662f\u5c06\u5176\u79fb\u81f3 C++ ,\u8fd9\u6837\u6211\u4eec\u5c31\u53ef\u4ee5\u907f\u514d\u5728 Python \u548c C++ \u4e4b\u95f4\u591a\u6b21\u6765\u56de\u5207\u6362\u3002\n\n\u5e78\u8fd0\u7684\u662f,\u6e90\u4ee3\u7801\u51e0\u4e4e\u662f\u76f8\u540c\u7684\u3002\u5728 C++ \u4e2d\u6211\u4eec\u5fc5\u987b\u95ee\u7684\u4e00\u4e2a\u95ee\u9898\u662f,\n\u6211\u4eec\u662f\u901a\u8fc7\u503c\u8fd8\u662f\u5f15\u7528\u6765\u4f20\u9012\u53c2\u6570\u3002\n\n\n" ] }, { @@ -355,7 +355,7 @@ }, "outputs": [], "source": [ - "batched_dot_src = \"\"\"\\\n/* ---- Python ---- */\n// def batched_dot_mul_sum(a, b):\n// return a.mul(b).sum(-1)\n\ntorch::Tensor batched_dot_mul_sum_v0(\n const torch::Tensor a,\n const torch::Tensor b) {\n return a.mul(b).sum(-1);\n}\n\ntorch::Tensor batched_dot_mul_sum_v1(\n const torch::Tensor& a,\n const torch::Tensor& b) {\n return a.mul(b).sum(-1);\n}\n\"\"\"\n\n\n# PyTorch makes it easy to test our C++ implementations by providing a utility\n# to JIT compile C++ source into Python extensions:\nimport os\nfrom torch.utils import cpp_extension\ncpp_lib = cpp_extension.load_inline(\n name='cpp_lib',\n cpp_sources=batched_dot_src,\n extra_cflags=['-O3'],\n extra_include_paths=[\n # `load_inline` needs to know where to find ``pybind11`` headers.\n os.path.join(os.getenv('CONDA_PREFIX'), 'include')\n ],\n functions=['batched_dot_mul_sum_v0', 'batched_dot_mul_sum_v1']\n)\n\n# `load_inline` will create a shared object that is loaded into Python. When we collect\n# instruction counts Timer will create a subprocess, so we need to re-import it. 
The\n# import process is slightly more complicated for C extensions, but that's all we're\n# doing here.\nmodule_import_str = f\"\"\"\\\n# https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path\nimport importlib.util\nspec = importlib.util.spec_from_file_location(\"cpp_lib\", {repr(cpp_lib.__file__)})\ncpp_lib = importlib.util.module_from_spec(spec)\nspec.loader.exec_module(cpp_lib)\"\"\"\n\nimport textwrap\ndef pretty_print(result):\n \"\"\"Import machinery for ``cpp_lib.so`` can get repetitive to look at.\"\"\"\n print(repr(result).replace(textwrap.indent(module_import_str, \" \"), \" import cpp_lib\"))\n\n\nt_baseline = benchmark.Timer(\n stmt='batched_dot_mul_sum(x, x)',\n setup='''\\\nfrom __main__ import batched_dot_mul_sum\nx = torch.randn(2, 2)''')\n\nt0 = benchmark.Timer(\n stmt='cpp_lib.batched_dot_mul_sum_v0(x, x)',\n setup=f'''\\\n{module_import_str}\nx = torch.randn(2, 2)''')\n\nt1 = benchmark.Timer(\n stmt='cpp_lib.batched_dot_mul_sum_v1(x, x)',\n setup=f'''\\\n{module_import_str}\nx = torch.randn(2, 2)''')\n\n# Moving to C++ did indeed reduce overhead, but it's hard to tell which\n# calling convention is more efficient. v1 (call with references) seems to\n# be a bit faster, but it's within measurement error.\npretty_print(t_baseline.blocked_autorange())\npretty_print(t0.blocked_autorange())\npretty_print(t1.blocked_autorange())" + "batched_dot_src = \"\"\"\\\n/* ---- Python ---- */\n// def batched_dot_mul_sum(a, b):\n// return a.mul(b).sum(-1)\n\ntorch::Tensor batched_dot_mul_sum_v0(\n const torch::Tensor a,\n const torch::Tensor b) {\n return a.mul(b).sum(-1);\n}\n\ntorch::Tensor batched_dot_mul_sum_v1(\n const torch::Tensor& a,\n const torch::Tensor& b) {\n return a.mul(b).sum(-1);\n}\n\"\"\"\n\n\n# PyTorch \u63d0\u4f9b\u4e00\u4e2a\u5b9e\u7528\u7a0b\u5e8f\u6765 JIT \u7f16\u8bd1 C++ \u6e90\u4ee3\u7801\u4e3a Python \u6269\u5c55,\n# \u4f7f\u5f97\u6d4b\u8bd5\u6211\u4eec\u7684 C++ \u5b9e\u73b0\u53d8\u5f97\u5f88\u5bb9\u6613:\nimport os\n\nfrom torch.utils import cpp_extension\n\ncpp_lib = cpp_extension.load_inline(\n name=\"cpp_lib\",\n cpp_sources=batched_dot_src,\n extra_cflags=[\"-O3\"],\n extra_include_paths=[\n # `load_inline`\u9700\u8981\u77e5\u9053`pybind11`\u5934\u6587\u4ef6\u7684\u4f4d\u7f6e\u3002\n os.path.join(os.getenv(\"CONDA_PREFIX\"), \"include\")\n ],\n functions=[\"batched_dot_mul_sum_v0\", \"batched_dot_mul_sum_v1\"],\n)\n\n# `load_inline` \u5c06\u521b\u5efa\u4e00\u4e2a\u5171\u4eab\u5bf9\u8c61,\u5e76\u52a0\u8f7d\u5230Python\u4e2d\u3002\u5f53\u6211\u4eec\u6536\u96c6\u6307\u4ee4\u8ba1\u6570\u65f6,\n# Timer\u5c06\u521b\u5efa\u4e00\u4e2a\u5b50\u8fdb\u7a0b,\u56e0\u6b64\u6211\u4eec\u9700\u8981\u91cd\u65b0\u5bfc\u5165\u5b83\u3002\u5bf9\u4e8eC\u6269\u5c55,\u5bfc\u5165\u8fc7\u7a0b\u7565\u6709\u4e0d\u540c,\n# \u4f46\u8fd9\u5c31\u662f\u6211\u4eec\u5728\u8fd9\u91cc\u6240\u505a\u7684\u3002\nmodule_import_str = f\"\"\"\\\n# https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path\nimport importlib.util\nspec = importlib.util.spec_from_file_location(\"cpp_lib\", {repr(cpp_lib.__file__)})\ncpp_lib = importlib.util.module_from_spec(spec)\nspec.loader.exec_module(cpp_lib)\"\"\"\n\nimport textwrap\n\n\ndef pretty_print(result):\n \"\"\"Import machinery for ``cpp_lib.so`` can get repetitive to look at.\"\"\"\n print(\n repr(result).replace(\n textwrap.indent(module_import_str, \" \"), \" import cpp_lib\"\n )\n )\n\n\nt_baseline = benchmark.Timer(\n stmt=\"batched_dot_mul_sum(x, x)\",\n setup=\"\"\"\\\nfrom __main__ import 
batched_dot_mul_sum\nx = torch.randn(2, 2)\"\"\",\n)\n\nt0 = benchmark.Timer(\n stmt=\"cpp_lib.batched_dot_mul_sum_v0(x, x)\",\n setup=f\"\"\"\\\n{module_import_str}\nx = torch.randn(2, 2)\"\"\",\n)\n\nt1 = benchmark.Timer(\n stmt=\"cpp_lib.batched_dot_mul_sum_v1(x, x)\",\n setup=f\"\"\"\\\n{module_import_str}\nx = torch.randn(2, 2)\"\"\",\n)\n\n# \u8f6c\u79fb\u5230 C++ \u786e\u5b9e\u51cf\u5c11\u4e86\u5f00\u9500,\u4f46\u5f88\u96be\u5224\u65ad\u54ea\u79cd\u8c03\u7528\u7ea6\u5b9a\u66f4\u6709\u6548\u3002v1(\u4f7f\u7528\u5f15\u7528\u8c03\u7528)\u4f3c\u4e4e\u7a0d\u5feb\u4e00\u4e9b,\u4f46\u5728\u6d4b\u91cf\u8bef\u5dee\u8303\u56f4\u5185\u3002\npretty_print(t_baseline.blocked_autorange())\npretty_print(t0.blocked_autorange())\npretty_print(t1.blocked_autorange())" ] }, { @@ -373,7 +373,7 @@ }, "outputs": [], "source": [ - "# Let's use ``Callgrind`` to determine which is better.\nstats_v0 = t0.collect_callgrind()\nstats_v1 = t1.collect_callgrind()\n\npretty_print(stats_v0)\npretty_print(stats_v1)\n\n# `.as_standardized` removes file names and some path prefixes, and makes\n# it easier to read the function symbols.\nstats_v0 = stats_v0.as_standardized()\nstats_v1 = stats_v1.as_standardized()\n\n# `.delta` diffs the instruction counts, and `.denoise` removes several\n# functions in the Python interpreter that are known to have significant\n# jitter.\ndelta = stats_v1.delta(stats_v0).denoise()\n\n# `.transform` is a convenience API for transforming function names. It is\n# useful for increasing cancelation when ``diff-ing`` instructions, as well as\n# just generally improving readability.\nreplacements = (\n (\"???:void pybind11\", \"pybind11\"),\n (\"batched_dot_mul_sum_v0\", \"batched_dot_mul_sum_v1\"),\n (\"at::Tensor, at::Tensor\", \"...\"),\n (\"at::Tensor const&, at::Tensor const&\", \"...\"),\n (\"auto torch::detail::wrap_pybind_function_impl_\", \"wrap_pybind_function_impl_\"),\n)\nfor before, after in replacements:\n delta = delta.transform(lambda l: l.replace(before, after))\n\n# We can use print options to control how much of the function to display.\ntorch.set_printoptions(linewidth=160)\n\n# Once parsed, the instruction counts make clear that passing `a` and `b`\n# by reference is more efficient as it skips some ``c10::TensorImpl`` bookkeeping\n# for the intermediate Tensors, and is also works better with ``pybind11``. 
This\n# is consistent with our noisy wall time observations.\nprint(delta)" + "# \u8ba9\u6211\u4eec\u4f7f\u7528 ``Callgrind`` \u6765\u786e\u5b9a\u54ea\u79cd\u65b9\u5f0f\u66f4\u597d\u3002\nstats_v0 = t0.collect_callgrind()\nstats_v1 = t1.collect_callgrind()\n\npretty_print(stats_v0)\npretty_print(stats_v1)\n\n# `.as_standardized` \u79fb\u9664\u4e86\u6587\u4ef6\u540d\u548c\u67d0\u4e9b\u8def\u5f84\u524d\u7f00,\u4f7f\u51fd\u6570\u7b26\u53f7\u66f4\u6613\u8bfb\u3002\nstats_v0 = stats_v0.as_standardized()\nstats_v1 = stats_v1.as_standardized()\n\n# `.delta` \u5bf9\u6307\u4ee4\u8ba1\u6570\u8fdb\u884c\u5dee\u5206, `.denoise` \u5219\u79fb\u9664\u4e86 Python \u89e3\u91ca\u5668\u4e2d\u5df2\u77e5\u5b58\u5728\u663e\u8457\u6296\u52a8\u7684\u51e0\u4e2a\u51fd\u6570\u3002\ndelta = stats_v1.delta(stats_v0).denoise()\n\n# `.transform` \u662f\u4e00\u4e2a\u8f6c\u6362\u51fd\u6570\u540d\u7684\u4fbf\u5229 API\u3002\u5b83\u5728\u8fdb\u884c ``diff-ing`` \u65f6\u5f88\u6709\u7528,\u56e0\u4e3a\u53ef\u4ee5\u589e\u52a0\u62b5\u6d88,\u540c\u65f6\u4e5f\u80fd\u63d0\u9ad8\u53ef\u8bfb\u6027\u3002\nreplacements = (\n (\"???:void pybind11\", \"pybind11\"),\n (\"batched_dot_mul_sum_v0\", \"batched_dot_mul_sum_v1\"),\n (\"at::Tensor, at::Tensor\", \"...\"),\n (\"at::Tensor const&, at::Tensor const&\", \"...\"),\n (\"auto torch::detail::wrap_pybind_function_impl_\", \"wrap_pybind_function_impl_\"),\n)\nfor before, after in replacements:\n delta = delta.transform(lambda l: l.replace(before, after))\n\n# \u6211\u4eec\u53ef\u4ee5\u4f7f\u7528\u6253\u5370\u9009\u9879\u6765\u63a7\u5236\u663e\u793a\u51fd\u6570\u7684\u591a\u5c11\u5185\u5bb9\u3002\ntorch.set_printoptions(linewidth=160)\n\n# \u89e3\u6790\u540e,\u6307\u4ee4\u8ba1\u6570\u6e05\u695a\u5730\u8868\u660e,\u901a\u8fc7\u5f15\u7528\u4f20\u9012 `a` \u548c `b` \u66f4\u6709\u6548,\n# \u56e0\u4e3a\u5b83\u8df3\u8fc7\u4e86\u4e00\u4e9b `c10::TensorImpl` \u4e2d\u95f4\u5f20\u91cf\u7684\u7c3f\u8bb0\u64cd\u4f5c,\u5e76\u4e14\u4e0e `pybind11` \u4e5f\u66f4\u517c\u5bb9\u3002\n# \u8fd9\u4e0e\u6211\u4eec\u6709\u566a\u58f0\u65f6\u95f4\u89c2\u5bdf\u7ed3\u679c\u4e00\u81f4\u3002\nprint(delta)" ] }, { @@ -387,7 +387,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Learn More\n\nTake a look at these other recipes to continue your learning:\n\n- [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler.html)\n\n\n" + "## \u5b66\u4e60\u66f4\u591a\n\n\u67e5\u770b\u5176\u4ed6\u6559\u7a0b\u7ee7\u7eed\u5b66\u4e60:\n\n- [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler.html)\n\n\n" ] } ], diff --git a/docs/_downloads/642248c95070825e7ac912504a919140/Captum_Recipe.ipynb b/docs/_downloads/642248c95070825e7ac912504a919140/Captum_Recipe.ipynb index 5a7365e..d19050f 100644 --- a/docs/_downloads/642248c95070825e7ac912504a919140/Captum_Recipe.ipynb +++ b/docs/_downloads/642248c95070825e7ac912504a919140/Captum_Recipe.ipynb @@ -15,35 +15,35 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# Model Interpretability using Captum\n" + "\n# \u4f7f\u7528 Captum \u8fdb\u884c\u6a21\u578b\u53ef\u89e3\u91ca\u6027\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Captum helps you understand how the data features impact your model\npredictions or neuron activations, shedding light on how your model\noperates.\n\nUsing Captum, you can apply a wide range of state-of-the-art feature\nattribution algorithms such as \\ ``Guided GradCam``\\ and\n\\ ``Integrated Gradients``\\ in a unified way.\n\nIn this recipe you will learn how to use Captum to: \n\n- 
Attribute the predictions of an image classifier to their corresponding image features. \n- Visualize the attribution results.\n\n\n" + "Captum \u53ef\u4ee5\u5e2e\u52a9\u60a8\u4e86\u89e3\u6570\u636e\u7279\u5f81\u5982\u4f55\u5f71\u54cd\u6a21\u578b\u7684\u9884\u6d4b\u6216\u795e\u7ecf\u5143\u6fc0\u6d3b,\u4ece\u800c\u63ed\u793a\u6a21\u578b\u7684\u5de5\u4f5c\u539f\u7406\u3002\n\n\u4f7f\u7528 Captum,\u60a8\u53ef\u4ee5\u7edf\u4e00\u5730\u5e94\u7528\u5e7f\u6cdb\u7684\u6700\u5148\u8fdb\u7684\u7279\u5f81\u5f52\u56e0\u7b97\u6cd5,\u5982 ``Guided GradCam`` \u548c ``Integrated Gradients``\u3002\n\n\u5728\u672c\u6559\u7a0b\u4e2d,\u60a8\u5c06\u5b66\u4e60\u5982\u4f55\u4f7f\u7528 Captum:\n\n- \u5c06\u56fe\u50cf\u5206\u7c7b\u5668\u7684\u9884\u6d4b\u5f52\u56e0\u4e8e\u76f8\u5e94\u7684\u56fe\u50cf\u7279\u5f81\u3002\n- \u53ef\u89c6\u5316\u5f52\u56e0\u7ed3\u679c\u3002\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Before you begin\n\n\n" + "## \u5f00\u59cb\u4e4b\u524d\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Make sure Captum is installed in your active Python environment. Captum\nis available both on GitHub, as a ``pip`` package, or as a ``conda``\npackage. For detailed instructions, consult the installation guide at\nhttps://captum.ai/\n\n\n" + "\u786e\u4fdd\u5728\u60a8\u7684\u6d3b\u8dc3 Python \u73af\u5883\u4e2d\u5b89\u88c5\u4e86 Captum\u3002Captum \u53ef\u4ee5\u5728 GitHub \u4e0a\u83b7\u53d6,\u4e5f\u53ef\u4ee5\u4f5c\u4e3a ``pip`` \u5305\u6216 ``conda`` \u5305\u83b7\u53d6\u3002\n\u6709\u5173\u8be6\u7ec6\u8bf4\u660e,\u8bf7\u67e5\u9605\u5b89\u88c5\u6307\u5357 https://captum.ai/\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "For a model, we use a built-in image classifier in PyTorch. Captum can\nreveal which parts of a sample image support certain predictions made by\nthe model.\n\n\n" + "\u5bf9\u4e8e\u6a21\u578b,\u6211\u4eec\u4f7f\u7528 PyTorch \u4e2d\u7684\u5185\u7f6e\u56fe\u50cf\u5206\u7c7b\u5668\u3002Captum \u53ef\u4ee5\u63ed\u793a\u6837\u672c\u56fe\u50cf\u7684\u54ea\u4e9b\u90e8\u5206\u652f\u6301\u4e86\u6a21\u578b\u505a\u51fa\u7684\u67d0\u4e9b\u9884\u6d4b\u3002\n\n\n" ] }, { @@ -54,21 +54,21 @@ }, "outputs": [], "source": [ - "import torchvision\nfrom torchvision import models, transforms\nfrom PIL import Image\nimport requests\nfrom io import BytesIO\n\nmodel = torchvision.models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()\n\nresponse = requests.get(\"https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg\")\nimg = Image.open(BytesIO(response.content))\n\ncenter_crop = transforms.Compose([\n transforms.Resize(256),\n transforms.CenterCrop(224),\n])\n\nnormalize = transforms.Compose([\n transforms.ToTensor(), # converts the image to a tensor with values between 0 and 1\n transforms.Normalize( # normalize to follow 0-centered imagenet pixel RGB distribution\n mean=[0.485, 0.456, 0.406],\n std=[0.229, 0.224, 0.225]\n )\n])\ninput_img = normalize(center_crop(img)).unsqueeze(0)" + "from io import BytesIO\nimport requests\nimport torchvision\nfrom PIL import Image\nfrom torchvision import models, transforms\n\nmodel = torchvision.models.resnet18(\n weights=models.ResNet18_Weights.IMAGENET1K_V1\n).eval()\n\nresponse = requests.get(\n \"https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg\"\n)\nimg = Image.open(BytesIO(response.content))\n\ncenter_crop = transforms.Compose(\n [\n transforms.Resize(256),\n transforms.CenterCrop(224),\n ]\n)\n\nnormalize = 
transforms.Compose(\n [\n transforms.ToTensor(), # \u5c06\u56fe\u50cf\u8f6c\u6362\u4e3a\u503c\u5728 0 \u5230 1 \u4e4b\u95f4\u7684\u5f20\u91cf\n transforms.Normalize( # \u5f52\u4e00\u5316\u4ee5\u9075\u5faa 0 \u5747\u503c\u7684 ImageNet \u50cf\u7d20 RGB \u5206\u5e03\n mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]\n ),\n ]\n)\ninput_img = normalize(center_crop(img)).unsqueeze(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Computing Attribution\n\n\n" + "## \u8ba1\u7b97\u5f52\u56e0\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Among the top-3 predictions of the models are classes 208 and 283 which\ncorrespond to dog and cat.\n\nLet us attribute each of these predictions to the corresponding part of\nthe input, using Captum\u2019s \\ ``Occlusion``\\ algorithm.\n\n\n" + "\u5728\u6a21\u578b\u7684\u524d 3 \u4e2a\u9884\u6d4b\u4e2d,\u7c7b\u522b 208 \u548c 283 \u5206\u522b\u5bf9\u5e94\u4e8e\u72d7\u548c\u732b\u3002\n\n\u8ba9\u6211\u4eec\u4f7f\u7528 Captum \u7684 ``Occlusion`` \u7b97\u6cd5\u5c06\u8fd9\u4e9b\u9884\u6d4b\u5f52\u56e0\u4e8e\u8f93\u5165\u7684\u76f8\u5e94\u90e8\u5206\u3002\n\n\n" ] }, { @@ -79,28 +79,28 @@ }, "outputs": [], "source": [ - "from captum.attr import Occlusion \n\nocclusion = Occlusion(model)\n\nstrides = (3, 9, 9) # smaller = more fine-grained attribution but slower\ntarget=208, # Labrador index in ImageNet \nsliding_window_shapes=(3,45, 45) # choose size enough to change object appearance\nbaselines = 0 # values to occlude the image with. 0 corresponds to gray\n\nattribution_dog = occlusion.attribute(input_img,\n strides = strides,\n target=target,\n sliding_window_shapes=sliding_window_shapes,\n baselines=baselines)\n\n\ntarget=283, # Persian cat index in ImageNet \nattribution_cat = occlusion.attribute(input_img,\n strides = strides,\n target=target,\n sliding_window_shapes=sliding_window_shapes,\n baselines=0)" + "from captum.attr import Occlusion\n\nocclusion = Occlusion(model)\n\nstrides = (3, 9, 9) # \u6b65\u957f\u8d8a\u5c0f,\u5f52\u56e0\u8d8a\u7ec6\u7c92\u5ea6,\u4f46\u901f\u5ea6\u8d8a\u6162\ntarget = (208,) # ImageNet \u4e2d\u7684\u62c9\u5e03\u62c9\u591a\u7d22\u5f15\nsliding_window_shapes = (3, 45, 45) # \u9009\u62e9\u8db3\u4ee5\u6539\u53d8\u5bf9\u8c61\u5916\u89c2\u7684\u5927\u5c0f\nbaselines = 0 # \u7528\u4e8e\u906e\u6321\u56fe\u50cf\u7684\u503c\u30020 \u5bf9\u5e94\u7070\u8272\n\nattribution_dog = occlusion.attribute(\n input_img,\n strides=strides,\n target=target,\n sliding_window_shapes=sliding_window_shapes,\n baselines=baselines,\n)\n\n\ntarget = (283,) # ImageNet \u4e2d\u7684\u6ce2\u65af\u732b\u7d22\u5f15\nattribution_cat = occlusion.attribute(\n input_img,\n strides=strides,\n target=target,\n sliding_window_shapes=sliding_window_shapes,\n baselines=0,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Besides ``Occlusion``, Captum features many algorithms such as\n\\ ``Integrated Gradients``\\ , \\ ``Deconvolution``\\ ,\n\\ ``GuidedBackprop``\\ , \\ ``Guided GradCam``\\ , \\ ``DeepLift``\\ , and\n\\ ``GradientShap``\\ . 
All of these algorithms are subclasses of\n``Attribution`` which expects your model as a callable ``forward_func``\nupon initialization and has an ``attribute(...)`` method which returns\nthe attribution result in a unified format.\n\nLet us visualize the computed attribution results in case of images.\n\n\n" + "\u9664\u4e86 ``Occlusion`` \u4e4b\u5916,Captum \u8fd8\u63d0\u4f9b\u4e86\u8bb8\u591a\u7b97\u6cd5,\u5982 ``Integrated Gradients``\u3001``Deconvolution``\u3001\n``GuidedBackprop``\u3001``Guided GradCam``\u3001``DeepLift`` \u548c ``GradientShap``\u3002\u6240\u6709\u8fd9\u4e9b\u7b97\u6cd5\u90fd\u662f ``Attribution`` \u7684\u5b50\u7c7b,\n\u5728\u521d\u59cb\u5316\u65f6\u9700\u8981\u5c06\u60a8\u7684\u6a21\u578b\u4f5c\u4e3a\u53ef\u8c03\u7528\u7684 ``forward_func``\u4f20\u5165,\u5e76\u5177\u6709 ``attribute(...)`` \u65b9\u6cd5,\u8be5\u65b9\u6cd5\u4ee5\u7edf\u4e00\u7684\u683c\u5f0f\u8fd4\u56de\u5f52\u56e0\u7ed3\u679c\u3002\n\n\u8ba9\u6211\u4eec\u53ef\u89c6\u5316\u8ba1\u7b97\u51fa\u7684\u56fe\u50cf\u5f52\u56e0\u7ed3\u679c\u3002\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Visualizing the Results\n\n\n" + "## \u53ef\u89c6\u5316\u7ed3\u679c\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Captum\u2019s \\ ``visualization``\\ utility provides out-of-the-box methods\nto visualize attribution results both for pictorial and for textual\ninputs.\n\n\n" + "Captum \u7684 ``visualization`` \u5b9e\u7528\u7a0b\u5e8f\u63d0\u4f9b\u4e86\u5f00\u7bb1\u5373\u7528\u7684\u65b9\u6cd5,\u7528\u4e8e\u53ef\u89c6\u5316\u56fe\u50cf\u548c\u6587\u672c\u8f93\u5165\u7684\u5f52\u56e0\u7ed3\u679c\u3002\n\n\n" ] }, { @@ -111,28 +111,28 @@ }, "outputs": [], "source": [ - "import numpy as np\nfrom captum.attr import visualization as viz\n\n# Convert the compute attribution tensor into an image-like numpy array\nattribution_dog = np.transpose(attribution_dog.squeeze().cpu().detach().numpy(), (1,2,0))\n\nvis_types = [\"heat_map\", \"original_image\"]\nvis_signs = [\"all\", \"all\"] # \"positive\", \"negative\", or \"all\" to show both\n# positive attribution indicates that the presence of the area increases the prediction score\n# negative attribution indicates distractor areas whose absence increases the score\n\n_ = viz.visualize_image_attr_multiple(attribution_dog,\n np.array(center_crop(img)),\n vis_types,\n vis_signs,\n [\"attribution for dog\", \"image\"],\n show_colorbar = True\n )\n\n\nattribution_cat = np.transpose(attribution_cat.squeeze().cpu().detach().numpy(), (1,2,0))\n\n_ = viz.visualize_image_attr_multiple(attribution_cat,\n np.array(center_crop(img)),\n [\"heat_map\", \"original_image\"], \n [\"all\", \"all\"], # positive/negative attribution or all\n [\"attribution for cat\", \"image\"],\n show_colorbar = True\n )" + "import numpy as np\nfrom captum.attr import visualization as viz\n\n# \u5c06\u8ba1\u7b97\u51fa\u7684\u5f52\u56e0\u5f20\u91cf\u8f6c\u6362\u4e3a\u7c7b\u4f3c\u56fe\u50cf\u7684 numpy \u6570\u7ec4\nattribution_dog = np.transpose(\n attribution_dog.squeeze().cpu().detach().numpy(), (1, 2, 0)\n)\n\nvis_types = [\"heat_map\", \"original_image\"]\nvis_signs = [\"all\", \"all\"] # \"positive\"\u3001\"negative\" \u6216 \"all\" \u4ee5\u663e\u793a\u4e24\u8005\n# \u6b63\u5f52\u56e0\u8868\u793a\u8be5\u533a\u57df\u7684\u5b58\u5728\u4f1a\u589e\u52a0\u9884\u6d4b\u5206\u6570\n# \u8d1f\u5f52\u56e0\u8868\u793a\u8be5\u533a\u57df\u7684\u7f3a\u5931\u4f1a\u589e\u52a0\u9884\u6d4b\u5206\u6570\n\n_ = viz.visualize_image_attr_multiple(\n attribution_dog,\n 
np.array(center_crop(img)),\n vis_types,\n vis_signs,\n [\"attribution for dog\", \"image\"],\n show_colorbar=True,\n)\n\n\nattribution_cat = np.transpose(\n attribution_cat.squeeze().cpu().detach().numpy(), (1, 2, 0)\n)\n\n_ = viz.visualize_image_attr_multiple(\n attribution_cat,\n np.array(center_crop(img)),\n [\"heat_map\", \"original_image\"],\n [\"all\", \"all\"], # \u6b63/\u8d1f\u5f52\u56e0\u6216\u5168\u90e8\n [\"attribution for cat\", \"image\"],\n show_colorbar=True,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "If your data is textual, ``visualization.visualize_text()`` offers a\ndedicated view to explore attribution on top of the input text. Find out\nmore at http://captum.ai/tutorials/IMDB_TorchText_Interpret\n\n\n" + "\u5982\u679c\u60a8\u7684\u6570\u636e\u662f\u6587\u672c,``visualization.visualize_text()`` \u63d0\u4f9b\u4e86\u4e00\u4e2a\u4e13\u7528\u89c6\u56fe,\u7528\u4e8e\u63a2\u7d22\u8f93\u5165\u6587\u672c\u7684\u5f52\u56e0\u3002\n\u66f4\u591a\u4fe1\u606f\u8bf7\u8bbf\u95ee http://captum.ai/tutorials/IMDB_TorchText_Interpret\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Final Notes\n\n\n" + "## \u6700\u540e\u6ce8\u610f\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Captum can handle most model types in PyTorch across modalities\nincluding vision, text, and more. With Captum you can: \\* Attribute a\nspecific output to the model input as illustrated above. \\* Attribute a\nspecific output to a hidden-layer neuron (see Captum API reference). \\*\nAttribute a hidden-layer neuron response to the model input (see Captum\nAPI reference).\n\nFor complete API of the supported methods and a list of tutorials,\nconsult our website http://captum.ai\n\nAnother useful post by Gilbert Tanner:\nhttps://gilberttanner.com/blog/interpreting-pytorch-models-with-captum\n\n\n" + "Captum \u53ef\u4ee5\u5904\u7406 PyTorch \u4e2d\u5305\u62ec\u89c6\u89c9\u3001\u6587\u672c\u7b49\u5404\u79cd\u6a21\u6001\u7684\u5927\u591a\u6570\u6a21\u578b\u7c7b\u578b\u3002\u4f7f\u7528 Captum \u60a8\u53ef\u4ee5:\n* \u5c06\u7279\u5b9a\u8f93\u51fa\u5f52\u56e0\u4e8e\u6a21\u578b\u8f93\u5165,\u5982\u4e0a\u6240\u793a\u3002\n* \u5c06\u7279\u5b9a\u8f93\u51fa\u5f52\u56e0\u4e8e\u9690\u85cf\u5c42\u795e\u7ecf\u5143(\u53c2\u89c1 Captum API \u53c2\u8003)\u3002\n* \u5c06\u9690\u85cf\u5c42\u795e\u7ecf\u5143\u54cd\u5e94\u5f52\u56e0\u4e8e\u6a21\u578b\u8f93\u5165(\u53c2\u89c1 Captum API \u53c2\u8003)\u3002\n\n\u6709\u5173\u652f\u6301\u65b9\u6cd5\u7684\u5b8c\u6574 API \u548c\u6559\u7a0b\u5217\u8868,\u8bf7\u67e5\u9605\u6211\u4eec\u7684\u7f51\u7ad9 http://captum.ai\n\nGilbert Tanner \u7684\u53e6\u4e00\u7bc7\u6709\u7528\u6587\u7ae0:\nhttps://gilberttanner.com/blog/interpreting-pytorch-models-with-captum\n\n\n" ] } ], diff --git a/docs/_downloads/72c2f17ac50228049705f9a4d76c7815/benchmark.py b/docs/_downloads/72c2f17ac50228049705f9a4d76c7815/benchmark.py index d02157a..075c835 100644 --- a/docs/_downloads/72c2f17ac50228049705f9a4d76c7815/benchmark.py +++ b/docs/_downloads/72c2f17ac50228049705f9a4d76c7815/benchmark.py @@ -1,29 +1,21 @@ """ PyTorch Benchmark ==================================== -This recipe provides a quick-start guide to using PyTorch -``benchmark`` module to measure and compare code performance. +本教程提供了使用 PyTorch ``benchmark`` 模块来测量和比较代码性能的快速入门指南。 -Introduction +介绍 ------------ -Benchmarking is an important step in writing code. 
It helps -us validate that our code meets performance expectations, -compare different approaches to solving the same problem and -prevent performance regressions. - -There are many options when it comes to benchmarking PyTorch code -including the Python builtin ``timeit`` module. However, benchmarking -PyTorch code has many caveats that can be easily overlooked such as -managing the number of threads and synchronizing CUDA devices. Moreover, -generating Tensor inputs for benchmarking can be quite tedious. - -This recipe demonstrates how to use PyTorch ``benchmark`` module to avoid -common mistakes while making it easier to compare performance of -different code, generate input for benchmarking and more. - -Setup +基准测试是编写代码时的一个重要步骤。它帮助我们验证代码是否满足性能预期,比较解决同一问题的不同方法,并防止性能裂化。 + +对于基准测试 PyTorch 代码有许多选择,包括 Python 内置的 ``timeit`` 模块。 +然而,基准测试 PyTorch 代码有许多容易被忽视的注意事项,例如管理线程数量和同步 CUDA 设备。 +此外,为基准测试生成张量输入可能相当繁琐。 + +本教程演示了如何使用 PyTorch ``benchmark`` 模块来避免常见错误,同时更容易比较不同代码的性能、为基准测试生成输入等。 + +设置 ----- -Before we begin, install ``torch`` if it isn’t already available. +在开始之前,如果尚未安装 ``torch``,请先安装。 :: @@ -31,39 +23,36 @@ """ - ###################################################################### -# Steps +# 具体步骤 # ----- # -# 1. Defining functions to benchmark -# 2. Benchmarking with ``timeit.Timer`` -# 3. Benchmarking with ``torch.utils.benchmark.Timer`` -# 4. Benchmarking with ``Blocked Autorange`` -# 5. Comparing benchmark results -# 6. Saving/Loading benchmark results -# 7. Generating inputs with ``Fuzzed Parameters`` -# 8. Collecting instruction counts with ``Callgrind`` +# 1. 定义要基准测试的函数 +# 2. 使用 ``timeit.Timer`` 进行基准测试 +# 3. 使用 ``torch.utils.benchmark.Timer`` 进行基准测试 +# 4. 使用 ``Blocked Autorange`` 进行基准测试 +# 5. 比较基准测试结果 +# 6. 保存/加载基准测试结果 +# 7. 使用 ``Fuzzed Parameters`` 生成输入 +# 8. 使用 ``Callgrind`` 收集指令计数 # -# 1. Defining functions to benchmark +# 1. 定义要基准测试的函数 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# As of the time of this writing, `torch.dot `__ -# does not support batched mode, so we will compare two approaches to -# implementing it using existing ``torch`` operators: one approach uses a -# combination of ``mul`` and ``sum`` while the other reduces the problem to ``bmm``. +# 在撰写本文时, `torch.dot `__ +# 不支持批量模式,因此我们将比较使用现有 ``torch`` 运算符实现它的两种方法:一种方法使用 ``mul`` 和 ``sum`` 的组合,另一种方法使用 ``bmm``。 # import torch def batched_dot_mul_sum(a, b): - '''Computes batched dot by multiplying and summing''' + """Computes batched dot by multiplying and summing""" return a.mul(b).sum(-1) def batched_dot_bmm(a, b): - '''Computes batched dot by reducing to ``bmm``''' + """Computes batched dot by reducing to ``bmm``""" a = a.reshape(-1, 1, a.shape[-1]) b = b.reshape(-1, b.shape[-1], 1) return torch.bmm(a, b).flatten(-3) @@ -77,28 +66,28 @@ def batched_dot_bmm(a, b): ###################################################################### -# 2. Benchmarking with ``timeit.Timer`` +# 2. 使用 ``timeit.Timer`` 进行基准测试 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# -# First, let's benchmark the code using Python's builtin ``timeit`` module. -# We keep the benchmark code simple here so we can compare the defaults -# of ``timeit`` and ``torch.utils.benchmark``. 
+# 首先,让我们使用 Python 内置的 ``timeit`` 模块对代码进行基准测试。 +# 我们在这里保持基准测试代码简单,以便我们可以比较 ``timeit`` 和 ``torch.utils.benchmark`` 的默认设置。 # import timeit t0 = timeit.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, +) t1 = timeit.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, +) -print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') -print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') +print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") +print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") ###################################################################### # .. code-block:: none @@ -110,26 +99,25 @@ def batched_dot_bmm(a, b): ###################################################################### -# 3. Benchmarking with ``torch.utils.benchmark.Timer`` +# 3. 使用 ``torch.utils.benchmark.Timer`` 进行基准测试 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# -# PyTorch ``benchmark`` module was designed to be familiar to those who -# have used the ``timeit`` module before. However, its defaults make it -# easier and safer to use for benchmarking PyTorch code. Let's first -# compare the same basic API as above. -# +# PyTorch ``benchmark``模块的设计使得对于那些曾经使用过 ``timeit`` 模块的人来说,它看起来很熟悉。 +# 然而,它的默认设置使得它更容易且更安全地用于对 PyTorch 代码进行基准测试。 +# 首先让我们对比一下基本API的使用。 import torch.utils.benchmark as benchmark t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, +) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, +) print(t0.timeit(100)) print(t1.timeit(100)) @@ -151,40 +139,37 @@ def batched_dot_bmm(a, b): # ###################################################################### -# Even though the APIs are the same for the basic functionality, there -# are some important differences. ``benchmark.Timer.timeit()`` returns the -# time per run as opposed to the total runtime like ``timeit.Timer.timeit()`` -# does. PyTorch ``benchmark`` module also provides formatted string -# representations for printing the results. +# 虽然基本功能的API是相同的,但是还是有一些重要的区别。 +# ``benchmark.Timer.timeit()``返回的是每次运行的时间,而不是 ``timeit.Timer.timeit()`` 返回的总运行时间。 +# PyTorch ``benchmark``模块还提供了格式化的字符串表示,用于打印结果。 # -# Another important difference, and the reason why the results diverge -# is that PyTorch benchmark module runs in a single thread by default. -# We can change the number of threads with the ``num_threads`` argument. +# 另一个重要的区别,也是结果不同的原因,是PyTorch基准测试模块默认在单线程中运行。 +# 我们可以使用``num_threads``参数来更改线程数量。 # -# ``torch.utils.benchmark.Timer`` takes several additional arguments -# including: ``label``, ``sub_label``, ``description`` and ``env`` which change -# the __repr__ of the measurement object returned and are used for -# grouping the results (more on this later). 
+# ``torch.utils.benchmark.Timer``接受几个额外的参数,包括: ``label``、``sub_label``、``description``和``env``, +# 这些参数会改变返回的测量对象的__repr__,并用于对结果进行分组(稍后会详细介绍)。 # num_threads = torch.get_num_threads() -print(f'Benchmarking on {num_threads} threads') +print(f"Benchmarking on {num_threads} threads") t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}, + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, num_threads=num_threads, - label='Multithreaded batch dot', - sub_label='Implemented using mul and sum') + label="Multithreaded batch dot", + sub_label="Implemented using mul and sum", +) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}, + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, num_threads=num_threads, - label='Multithreaded batch dot', - sub_label='Implemented using bmm') + label="Multithreaded batch dot", + sub_label="Implemented using bmm", +) print(t0.timeit(100)) print(t1.timeit(100)) @@ -206,32 +191,32 @@ def batched_dot_bmm(a, b): # 1 measurement, 100 runs , 40 threads ###################################################################### -# Running ``benchmark`` with all threads available gives similar results -# as the ``timeit`` module. More importantly, which version is faster -# depends on how many threads we run the code with. This is why it's -# important to benchmark the code with thread settings that are -# representative of real use cases. Another important thing to remember -# is to synchronize CPU and CUDA when benchmarking on the GPU. Let's run -# the above benchmarks again on a CUDA tensor and see what happens. +# 使用所有可用线程运行 ``benchmark`` 会得到与 ``timeit`` 模块类似的结果。 +# 更重要的是,哪个版本更快取决于我们使用多少线程运行代码。 +# 这就是为什么在基准测试时,使用与实际用例相符的线程设置非常重要。 +# 另一个需要记住的重要事情是,在 GPU 上进行基准测试时,要同步CPU和CUDA。 +# 让我们再次在CUDA张量上运行上面的基准测试,看看会发生什么。 # -x = torch.randn(10000, 1024, device='cuda') +x = torch.randn(10000, 1024, device="cuda") t0 = timeit.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, +) t1 = timeit.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, +) # Ran each twice to show difference before/after warm-up -print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') -print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') -print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') -print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') +print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") +print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") +print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") +print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") ###################################################################### # .. 
code-block:: none @@ -244,14 +229,16 @@ def batched_dot_bmm(a, b): # t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, +) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, +) # Run only once since benchmark module does warm-up for us print(t0.timeit(100)) @@ -274,34 +261,23 @@ def batched_dot_bmm(a, b): # ###################################################################### -# The results reveal something interesting. The first run of the ``bmm`` -# version using the ``timeit`` module takes much longer than the second -# run. This is because ``bmm`` calls into `cuBLAS` which needs to be -# loaded the first time it's called which takes some time. This is why -# it's important to do a warm-up run before benchmarking, luckily for -# us, PyTorch's ``benchmark`` module takes care of that. +# 结果揭示了一些有趣的事情。使用 `timeit` 模块运行 `bmm` 版本的第一次运行比第二次运行慢很多。 +# 这是因为 `bmm` 需要调用 `cuBLAS`,第一次调用时需要加载它,这需要一些时间。 +# 这就是为什么在基准测试之前做一次预热运行很重要,幸运的是, PyTorch 的 `benchmark` 模块为我们处理了这个问题。 # -# The difference in the results between ``timeit`` and ``benchmark`` modules -# is because the `timeit` module is not synchronizing CUDA and is thus only -# timing the time to launch the kernel. PyTorch's ``benchmark`` module does -# the synchronization for us. +# `timeit` 模块和 `benchmark` 模块之间结果的差异是因为 `timeit` 模块没有同步 CUDA,因此只计时了启动内核的时间。 +# PyTorch 的 `benchmark` 模块为我们做了同步。 ###################################################################### -# 4. Benchmarking with `Blocked Autorange` +# 4. 使用 `Blocked Autorange` 进行基准测试 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# While ``timeit.Timer.autorange`` takes a single continuous measurement -# of at least 0.2 seconds, `torch.utils.benchmark.blocked_autorange` -# takes many measurements whose times total at least 0.2 seconds (which -# can be changed by the `min_run_time` parameter) subject to the constraint -# that timing overhead is a small fraction of the overall measurement. -# This is accomplished by first running with an increasing number of runs -# per loop until the runtime is much larger than measurement overhead -# (which also serves as a warm up), and then taking measurements until -# the target time is reached. This has the useful properties that it wastes -# less data and allows us to compute statistics to estimate the reliability -# of the measurements. +# 虽然 `timeit.Timer.autorange` 采取至少 0.2 秒的单次连续测量, +# 但 `torch.utils.benchmark.blocked_autorange` 采取多次测量,其总时间至少为 0.2 秒(可通过 `min_run_time` 参数更改), +# 并且测量开销只占总体测量的一小部分。 +# 这是通过首先以递增的循环次数运行,直到运行时间远大于测量开销(这也起到了热身的作用), +# 然后进行测量直到达到目标时间。这有一个有用的特性,即它浪费的数据更少,并且允许我们计算统计数据来估计测量的可靠性。 # m0 = t0.blocked_autorange() @@ -327,8 +303,7 @@ def batched_dot_bmm(a, b): # ###################################################################### -# We can also inspect the individual statistics from the returned -# measurements object. +# 我们还可以查看返回的测量对象中获得的各个统计数据。 print(f"Mean: {m0.mean * 1e6:6.2f} us") print(f"Median: {m0.median * 1e6:6.2f} us") @@ -342,17 +317,14 @@ def batched_dot_bmm(a, b): # ###################################################################### -# 5. Comparing benchmark results +# 5. 
比较基准测试结果 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# So far we've been comparing our two versions of batched dot against a -# single input. In practice, we want to try a combination of inputs as -# well as different number of threads. The ``Compare`` class helps display -# the results of many measurements in a formatted table. It uses the -# annotations described above (`label`, `sub_label`, `num_threads`, etc.) as -# well as `description` to group and organize the table. Let's use -# ``Compare`` to see how our functions perform for different input sizes -# and number of threads. +# 到目前为止,我们一直在比较我们的两个批量点积版本对同一输入的表现。 +# 在实践中,我们希望尝试不同的输入组合以及不同的线程数量。 +# `Compare` 类帮助我们以格式化表格的形式显示多个测量结果。 +# 它使用上面描述的注释( `label`、 `sub_label`、 `num_threads` 等)以及 `description` 来对表格进行分组和组织。 +# 让我们使用 `Compare` 来看看我们的函数在不同的输入大小和线程数量下的表现如何。 # from itertools import product @@ -364,28 +336,32 @@ def batched_dot_bmm(a, b): for b, n in product(sizes, sizes): # label and sub_label are the rows # description is the column - label = 'Batched dot' - sub_label = f'[{b}, {n}]' + label = "Batched dot" + sub_label = f"[{b}, {n}]" x = torch.ones((b, n)) for num_threads in [1, 4, 16, 32]: - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}, - num_threads=num_threads, - label=label, - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}, - num_threads=num_threads, - label=label, - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + num_threads=num_threads, + label=label, + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + num_threads=num_threads, + label=label, + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.print() @@ -395,7 +371,7 @@ def batched_dot_bmm(a, b): # :caption: Output # # [--------------- Batched dot ----------------] -# | mul/sum | bmm +# | mul/sum | bmm # 1 threads: ----------------------------------- # [1, 1] | 5.9 | 11.2 # [1, 64] | 6.4 | 11.4 @@ -469,12 +445,10 @@ def batched_dot_bmm(a, b): # ###################################################################### -# The results above indicate that the version which reduces to ``bmm`` -# is better for larger tensors running on multiple threads, while for -# smaller and/or single thread code, the other version is better. -# -# ``Compare`` also provides functions for changing the table format +# 上面的结果表明,对于在多线程上运行的较大张量, `bmm` 的版本效果更好, +# 而对于较小和/或单线程代码,另一个版本效果更好。 # +# `Compare` 还提供了用于更改表格格式的函数 compare.trim_significant_figures() compare.colorize() @@ -482,36 +456,34 @@ def batched_dot_bmm(a, b): ###################################################################### -# 6. Saving/Loading benchmark results +# 6. 保存/加载基准测试结果 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# `Measurements` (and ``CallgrindStats`` which are described in section 8) -# can be serialized by the ``pickle`` module. 
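Beyond the mean and median printed above, a ``Measurement`` also records how spread out its samples are, which helps when judging whether a result is trustworthy. A minimal, hedged sketch (it builds its own small Timer rather than reusing ``m0``, so it runs standalone):

import torch
import torch.utils.benchmark as benchmark

x = torch.ones((64, 256))
m = benchmark.Timer(
    stmt="x.mul(x).sum(-1)",
    globals={"x": x},
).blocked_autorange(min_run_time=0.5)

# The interquartile range and the number of significant figures the
# measurement supports give a quick read on its reliability.
print(f"median: {m.median * 1e6:6.2f} us")
print(f"iqr:    {m.iqr * 1e6:6.2f} us")
print(f"significant figures: {m.significant_figures}")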
This makes A/B testing easy, as you can collect -# measurements from two separate environments, pickle them, and then -# load both in a single environment. Timer even takes an `env` -# constructor argument so that such A/B testing works seamlessly. +# `Measurements` (和第8节中描述的 `CallgrindStats` )可以通过 `pickle` 模块序列化。 +# 这使得A/B测试变得很容易,因为您可以从两个独立的环境中收集测量结果, +# 将它们序列化,然后在单个环境中加载两者。Timer甚至接受一个 `env` +# 构造函数参数,以便这种A/B测试可以无缝衔接。 # -# Let's imagine that rather than two Python functions, the add/sum -# and ``bmm`` approaches were in two different builds of PyTorch. -# The example below demonstrates how one might A/B test them. For -# simplicity, we only use a subset of shapes, and simply round trip -# results through pickle rather than actually using multiple environments -# and writing results to disk. +# 假设 add/sum 和 `bmm` 方法不是两个Python函数,而是 PyTorch 的两个不同版本。 +# 下面的示例演示了如何进行A/B测试。为了简单起见,我们只使用了一部分数据, +# 并简单地通过pickle来回传结果,而不是实际使用多个环境并将结果写入磁盘。 # import pickle ab_test_results = [] -for env in ('environment A: mul/sum', 'environment B: bmm'): +for env in ("environment A: mul/sum", "environment B: bmm"): for b, n in ((1, 1), (1024, 10000), (10000, 1)): x = torch.ones((b, n)) - dot_fn = (batched_dot_mul_sum if env == 'environment A: mul/sum' else batched_dot_bmm) + dot_fn = ( + batched_dot_mul_sum if env == "environment A: mul/sum" else batched_dot_bmm + ) m = benchmark.Timer( - stmt='batched_dot(x, x)', - globals={'x': x, 'batched_dot': dot_fn}, + stmt="batched_dot(x, x)", + globals={"x": x, "batched_dot": dot_fn}, num_threads=1, - label='Batched dot', - description=f'[{b}, {n}]', + label="Batched dot", + description=f"[{b}, {n}]", env=env, ).blocked_autorange(min_run_time=1) ab_test_results.append(pickle.dumps(m)) @@ -535,35 +507,38 @@ def batched_dot_bmm(a, b): # Times are in microseconds (us). # -# And just to show that we can round trip all of the results from earlier: +# 仅为展示可以将之前所有的结果通过 pickle 进行回传: round_tripped_results = pickle.loads(pickle.dumps(results)) -assert(str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results))) +assert str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results)) ###################################################################### -# 7. Generating inputs with `Fuzzed Parameters` +# 7. 使用 `Fuzzed Parameters` 生成输入 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# As we've seen in the previous section, there can be some stark -# performance differences depending on the input tensors. Hence, it -# is a good idea to run benchmarks on a number of different inputs. -# However, creating all these input tensors can be tedious which is -# where ``torch.utils.benchmark.Fuzzer`` and related classes come in. -# Let's take a look at how we can use the ``Fuzzer`` to create some test -# cases for the benchmark. +# 正如我们在上一节中看到的,根据输入张量的不同,性能差异可能会很大。 +# 因此,在多个不同的输入上运行基准测试是一个好主意。 +# 但是,创建所有这些输入张量可能会很麻烦,这就是 `torch.utils.benchmark.Fuzzer` +# 和相关类的用武之地。让我们看看如何使用 `Fuzzer` 来创建一些用于基准测试的测试用例。 # -from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias +from torch.utils.benchmark import FuzzedParameter, FuzzedTensor, Fuzzer, ParameterAlias -# Generates random tensors with 128 to 10000000 elements and sizes k0 and k1 chosen from a -# ``loguniform`` distribution in [1, 10000], 40% of which will be discontiguous on average. 
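The A/B loop above keeps the pickled measurements in memory for brevity. In a real A/B setup, each environment would typically write its measurements to disk and a separate analysis step would load and compare them. A hedged sketch of that pattern (the file name ``measurements_envA.pkl`` and the single timed statement are illustrative choices, not part of the original recipe):

import pickle

import torch
import torch.utils.benchmark as benchmark


def batched_dot_mul_sum(a, b):
    """Batched dot implemented with mul and sum, as defined earlier."""
    return a.mul(b).sum(-1)


# --- In environment A: measure and write to disk. ---
x = torch.ones((1024, 1024))
m = benchmark.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    globals={"x": x, "batched_dot_mul_sum": batched_dot_mul_sum},
    label="Batched dot",
    description="[1024, 1024]",
    env="environment A: mul/sum",
).blocked_autorange(min_run_time=0.5)

with open("measurements_envA.pkl", "wb") as f:
    pickle.dump([m], f)

# --- Later, in the analysis environment: load and compare. ---
with open("measurements_envA.pkl", "rb") as f:
    loaded = pickle.load(f)
benchmark.Compare(loaded).print()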
+# 生成随机张量,元素数量在 128 到 10000000 之间,大小 k0 和 k1 从 [1, 10000] 的 `loguniform` 分布中选择, +# 其中平均 40% 将是不连续的。 example_fuzzer = Fuzzer( - parameters = [ - FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'), - FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'), + parameters=[ + FuzzedParameter("k0", minval=1, maxval=10000, distribution="loguniform"), + FuzzedParameter("k1", minval=1, maxval=10000, distribution="loguniform"), ], - tensors = [ - FuzzedTensor('x', size=('k0', 'k1'), min_elements=128, max_elements=10000000, probability_contiguous=0.6) + tensors=[ + FuzzedTensor( + "x", + size=("k0", "k1"), + min_elements=128, + max_elements=10000000, + probability_contiguous=0.6, + ) ], seed=0, ) @@ -571,23 +546,27 @@ def batched_dot_bmm(a, b): results = [] for tensors, tensor_params, params in example_fuzzer.take(10): # description is the column label - sub_label=f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.trim_significant_figures() @@ -598,7 +577,7 @@ def batched_dot_bmm(a, b): # :caption: Output # # [--------------------- Batched dot ---------------------] -# | mul/sum | bmm +# | mul/sum | bmm # 1 threads: ---------------------------------------------- # 725 x 257 | 87 | 180 # 49 x 383 | 15 | 30 @@ -611,38 +590,40 @@ def batched_dot_bmm(a, b): # 78 x 5 (discontiguous) | 9 | 20 # 187 x 1 | 12 | 10 # -# Times are in microseconds (us). +# Times are in microseconds (us). # ###################################################################### -# There is a lot of flexibility for defining your own ``fuzzers`` which -# is great for creating a powerful set of inputs to benchmark. But to -# make things even simpler, PyTorch benchmark module comes with some -# built-in ``fuzzers`` for common benchmarking needs. Let's take a look at -# how we can use one of these built-in ``fuzzers``. 
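It can also help to look directly at what ``Fuzzer.take`` yields before wiring it into ``Timer``. A small hedged sketch (it rebuilds a fuzzer equivalent to ``example_fuzzer`` above so that it runs on its own):

from torch.utils.benchmark import FuzzedParameter, FuzzedTensor, Fuzzer

fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter("k0", minval=1, maxval=10000, distribution="loguniform"),
        FuzzedParameter("k1", minval=1, maxval=10000, distribution="loguniform"),
    ],
    tensors=[
        FuzzedTensor(
            "x",
            size=("k0", "k1"),
            min_elements=128,
            max_elements=10000000,
            probability_contiguous=0.6,
        )
    ],
    seed=0,
)

# Each draw yields the generated tensors (ready to use as Timer ``globals``),
# per-tensor metadata, and the sampled parameter values.
for tensors, tensor_params, params in fuzzer.take(3):
    x = tensors["x"]
    print(
        f"k0={params['k0']:<6} k1={params['k1']:<6} "
        f"shape={tuple(x.shape)} contiguous={tensor_params['x']['is_contiguous']}"
    )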
+# 定义自己的 `fuzzers` 有很大的灵活性,这对于创建强大的输入集进行基准测试非常有用。 +# 但为了让事情变得更简单, PyTorch 基准测试模块为常见的基准测试需求提供了一些内置的 `fuzzers`。 +# 让我们看看如何使用其中一个内置的 `fuzzers` 。 # from torch.utils.benchmark.op_fuzzers import binary results = [] for tensors, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10): - sub_label=f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.trim_significant_figures() @@ -654,7 +635,7 @@ def batched_dot_bmm(a, b): # :caption: Output # # [----------------------- Batched dot ------------------------] -# | mul/sum | bmm +# | mul/sum | bmm # 1 threads: --------------------------------------------------- # 64 x 473 (discontiguous) | 10000 | 40000 # 16384 x 12642115 (discontiguous) | 31 | 78 @@ -666,33 +647,27 @@ def batched_dot_bmm(a, b): # 488 x 62374 | 90000 | 100000 # 240372 x 69 | 40000 | 16000 # 40156 x 32 (discontiguous) | 2670 | 5000 -# +# # Times are in microseconds (us). # ###################################################################### -# 8. Collecting instruction counts with ``Callgrind`` +# 8. 使用 `Callgrind` 收集指令计数 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# One of the challenges of optimizing code is the variation and opacity of -# wall time. There are many sources of non-determinism, from adaptive clock -# speeds to resource contention with other processes. Furthermore, end-to-end -# time gives no insight into where time is being spent, which is really what -# we're interested in when optimizing code. +# 优化代码的一个挑战是时间的变化和不透明性。有许多不确定性的来源, +# 从自适应时钟速度到与其他进程的资源争用。此外,端到端时间并不能揭示时间花费在哪里, +# 而这正是我们在优化代码时感兴趣的。 # -# A complementary approach is to also collect instruction counts. These counts -# are a proxy metric and do not capture all aspects of performance -# (e.g. memory or I/O bound tasks), however they do have several useful -# properties. Instruction counts are reproducible, insensitive to environmental -# variation, and offer fine grained insight into where a program is spending -# cycles. +# 一种补充方法是也收集指令计数。这些计数是一种代理指标,并不能捕获性能的所有方面 +# (例如内存或I/O绑定任务),但它们确实具有一些有用的特性。指令计数是可重复的, +# 不受环境变化的影响,并且可以提供对程序在哪里花费周期的细粒度洞察。 # -# To see the utility of instruction counts, let us look at how we might -# reduce the overhead of `batched_dot_mul_sum`. The obvious solution is to -# move it to C++, so we avoid going between Python and C++ multiple times. 
+# 为了看到指令计数的实用性,让我们看看如何减少 `batched_dot_mul_sum` 的开销。 +# 显而易见的解决方案是将其移至 C++ ,这样我们就可以避免在 Python 和 C++ 之间多次来回切换。 # -# Fortunately, the source is nearly identical. One question that we have to ask -# in C++ is whether we should take arguments by value or reference. +# 幸运的是,源代码几乎是相同的。在 C++ 中我们必须问的一个问题是, +# 我们是通过值还是引用来传递参数。 # batched_dot_src = """\ @@ -714,25 +689,26 @@ def batched_dot_bmm(a, b): """ -# PyTorch makes it easy to test our C++ implementations by providing a utility -# to JIT compile C++ source into Python extensions: +# PyTorch 提供一个实用程序来 JIT 编译 C++ 源代码为 Python 扩展, +# 使得测试我们的 C++ 实现变得很容易: import os + from torch.utils import cpp_extension + cpp_lib = cpp_extension.load_inline( - name='cpp_lib', + name="cpp_lib", cpp_sources=batched_dot_src, - extra_cflags=['-O3'], + extra_cflags=["-O3"], extra_include_paths=[ - # `load_inline` needs to know where to find ``pybind11`` headers. - os.path.join(os.getenv('CONDA_PREFIX'), 'include') + # `load_inline`需要知道`pybind11`头文件的位置。 + os.path.join(os.getenv("CONDA_PREFIX"), "include") ], - functions=['batched_dot_mul_sum_v0', 'batched_dot_mul_sum_v1'] + functions=["batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"], ) -# `load_inline` will create a shared object that is loaded into Python. When we collect -# instruction counts Timer will create a subprocess, so we need to re-import it. The -# import process is slightly more complicated for C extensions, but that's all we're -# doing here. +# `load_inline` 将创建一个共享对象,并加载到Python中。当我们收集指令计数时, +# Timer将创建一个子进程,因此我们需要重新导入它。对于C扩展,导入过程略有不同, +# 但这就是我们在这里所做的。 module_import_str = f"""\ # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path import importlib.util @@ -741,32 +717,39 @@ def batched_dot_bmm(a, b): spec.loader.exec_module(cpp_lib)""" import textwrap + + def pretty_print(result): """Import machinery for ``cpp_lib.so`` can get repetitive to look at.""" - print(repr(result).replace(textwrap.indent(module_import_str, " "), " import cpp_lib")) + print( + repr(result).replace( + textwrap.indent(module_import_str, " "), " import cpp_lib" + ) + ) t_baseline = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='''\ + stmt="batched_dot_mul_sum(x, x)", + setup="""\ from __main__ import batched_dot_mul_sum -x = torch.randn(2, 2)''') +x = torch.randn(2, 2)""", +) t0 = benchmark.Timer( - stmt='cpp_lib.batched_dot_mul_sum_v0(x, x)', - setup=f'''\ + stmt="cpp_lib.batched_dot_mul_sum_v0(x, x)", + setup=f"""\ {module_import_str} -x = torch.randn(2, 2)''') +x = torch.randn(2, 2)""", +) t1 = benchmark.Timer( - stmt='cpp_lib.batched_dot_mul_sum_v1(x, x)', - setup=f'''\ + stmt="cpp_lib.batched_dot_mul_sum_v1(x, x)", + setup=f"""\ {module_import_str} -x = torch.randn(2, 2)''') +x = torch.randn(2, 2)""", +) -# Moving to C++ did indeed reduce overhead, but it's hard to tell which -# calling convention is more efficient. v1 (call with references) seems to -# be a bit faster, but it's within measurement error. 
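Before timing the compiled kernels, it is worth a quick check that they agree with the Python baseline. A hedged sketch, assuming ``cpp_lib`` was built with ``load_inline`` as above and that ``batched_dot_mul_sum`` from earlier is still in scope:

import torch

x = torch.randn(16, 32)

# Both C++ variants should be numerically identical to the Python reference.
expected = batched_dot_mul_sum(x, x)
assert torch.allclose(cpp_lib.batched_dot_mul_sum_v0(x, x), expected)
assert torch.allclose(cpp_lib.batched_dot_mul_sum_v1(x, x), expected)
print("C++ implementations match the Python baseline.")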
+# 转移到 C++ 确实减少了开销,但很难判断哪种调用约定更有效。v1(使用引用调用)似乎稍快一些,但在测量误差范围内。 pretty_print(t_baseline.blocked_autorange()) pretty_print(t0.blocked_autorange()) pretty_print(t1.blocked_autorange()) @@ -780,7 +763,7 @@ def pretty_print(result): # setup: # from __main__ import batched_dot_mul_sum # x = torch.randn(2, 2) -# +# # 6.92 us # 1 measurement, 100000 runs , 1 thread # @@ -788,7 +771,7 @@ def pretty_print(result): # setup: # import cpp_lib # x = torch.randn(2, 2) -# +# # 5.29 us # 1 measurement, 100000 runs , 1 thread # @@ -796,31 +779,26 @@ def pretty_print(result): # setup: # import cpp_lib # x = torch.randn(2, 2) -# +# # 5.22 us # 1 measurement, 100000 runs , 1 thread # -# Let's use ``Callgrind`` to determine which is better. +# 让我们使用 ``Callgrind`` 来确定哪种方式更好。 stats_v0 = t0.collect_callgrind() stats_v1 = t1.collect_callgrind() pretty_print(stats_v0) pretty_print(stats_v1) -# `.as_standardized` removes file names and some path prefixes, and makes -# it easier to read the function symbols. +# `.as_standardized` 移除了文件名和某些路径前缀,使函数符号更易读。 stats_v0 = stats_v0.as_standardized() stats_v1 = stats_v1.as_standardized() -# `.delta` diffs the instruction counts, and `.denoise` removes several -# functions in the Python interpreter that are known to have significant -# jitter. +# `.delta` 对指令计数进行差分, `.denoise` 则移除了 Python 解释器中已知存在显著抖动的几个函数。 delta = stats_v1.delta(stats_v0).denoise() -# `.transform` is a convenience API for transforming function names. It is -# useful for increasing cancelation when ``diff-ing`` instructions, as well as -# just generally improving readability. +# `.transform` 是一个转换函数名的便利 API。它在进行 ``diff-ing`` 时很有用,因为可以增加抵消,同时也能提高可读性。 replacements = ( ("???:void pybind11", "pybind11"), ("batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"), @@ -831,13 +809,12 @@ def pretty_print(result): for before, after in replacements: delta = delta.transform(lambda l: l.replace(before, after)) -# We can use print options to control how much of the function to display. +# 我们可以使用打印选项来控制显示函数的多少内容。 torch.set_printoptions(linewidth=160) -# Once parsed, the instruction counts make clear that passing `a` and `b` -# by reference is more efficient as it skips some ``c10::TensorImpl`` bookkeeping -# for the intermediate Tensors, and is also works better with ``pybind11``. This -# is consistent with our noisy wall time observations. 
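For a coarser summary than the full instruction delta, ``CallgrindStats`` can also be reduced to a single total. A hedged sketch, assuming ``stats_v0`` and ``stats_v1`` were collected as above (``collect_callgrind`` requires Valgrind to be installed):

# ``denoise=True`` drops Python interpreter functions that are known to be
# noisy before summing the instruction counts.
counts_v0 = stats_v0.counts(denoise=True)
counts_v1 = stats_v1.counts(denoise=True)
print(f"v0 (pass by value):     {counts_v0}")
print(f"v1 (pass by reference): {counts_v1}")
print(f"v1 executes {counts_v0 - counts_v1} fewer instructions for the same workload.")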
+# 解析后,指令计数清楚地表明,通过引用传递 `a` 和 `b` 更有效, +# 因为它跳过了一些 `c10::TensorImpl` 中间张量的簿记操作,并且与 `pybind11` 也更兼容。 +# 这与我们有噪声时间观察结果一致。 print(delta) ###################################################################### @@ -879,10 +856,10 @@ def pretty_print(result): ###################################################################### -# Learn More +# 学习更多 # ---------- # -# Take a look at these other recipes to continue your learning: +# 查看其他教程继续学习: # # - `PyTorch Profiler `_ # diff --git a/docs/_downloads/8ec147fe4546ad23cb0cefdb015f3352/swap_tensors.ipynb b/docs/_downloads/8ec147fe4546ad23cb0cefdb015f3352/swap_tensors.ipynb index 7d9ea29..425f272 100644 --- a/docs/_downloads/8ec147fe4546ad23cb0cefdb015f3352/swap_tensors.ipynb +++ b/docs/_downloads/8ec147fe4546ad23cb0cefdb015f3352/swap_tensors.ipynb @@ -15,14 +15,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# Extension points in ``nn.Module`` for ``load_state_dict`` and tensor subclasses\n**Author:** [Mikayla Gawarecki](https://github.com/mikaylagawarecki)\n\nThis recipe introduces a new utility function ``torch.utils.swap_tensors``\nas well as two new extension points where it has been integrated in\n``nn.Module``:\n\n* ``nn.Module.to()`` and related methods\n* ``nn.Module.load_state_dict()``\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>This recipe requires PyTorch 2.3.0 or later.</p></div>
\n" + "\n# \u5728 ``nn.Module`` \u4e2d\u4e3a ``load_state_dict`` \u548c\u5f20\u91cf\u5b50\u7c7b\u63d0\u4f9b\u6269\u5c55\u70b9\n**\u4f5c\u8005:** [Mikayla Gawarecki](https://github.com/mikaylagawarecki)\n\n\u672c\u6559\u7a0b\u4ecb\u7ecd\u4e86\u4e00\u4e2a\u65b0\u7684\u5b9e\u7528\u51fd\u6570 ``torch.utils.swap_tensors``\uff0c\n\u4ee5\u53ca\u5728 ``nn.Module`` \u4e2d\u96c6\u6210\u5b83\u7684\u4e24\u4e2a\u65b0\u6269\u5c55\u70b9:\n\n* ``nn.Module.to()`` \u548c\u76f8\u5173\u65b9\u6cd5\n* ``nn.Module.load_state_dict()``\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>\u672c\u6559\u7a0b\u9700\u8981 PyTorch 2.3.0 \u6216\u66f4\u9ad8\u7248\u672c\u3002</p></div>
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## ``torch.utils.swap_tensors``\n``torch.utils.swap_tensors`` (hereafter referred to as ``swap_tensors``) is a\nutility function that takes in two Python tensors and swaps them.\n\n" + "## ``torch.utils.swap_tensors``\n``torch.utils.swap_tensors``(\u4ee5\u4e0b\u7b80\u79f0\u4e3a ``swap_tensors``) \u662f\u4e00\u4e2a\n\u5b9e\u7528\u51fd\u6570,\u5b83\u63a5\u53d7\u4e24\u4e2a Python \u5f20\u91cf\u5e76\u4ea4\u6362\u5b83\u4eec\u3002\n\n" ] }, { @@ -33,14 +33,14 @@ }, "outputs": [], "source": [ - "import torch\nimport torch.nn as nn\nt1 = torch.arange(2)\nt2 = torch.arange(3)\nprint(f\"Before swapping, t1: {t1}, t2: {t2}\")\ntorch.utils.swap_tensors(t1, t2)\nprint(f\"After swapping, t1: {t1}, t2: {t2}\")" + "import torch\nimport torch.nn as nn\n\nt1 = torch.arange(2)\nt2 = torch.arange(3)\nprint(f\"\u4ea4\u6362\u524d, t1: {t1}, t2: {t2}\")\ntorch.utils.swap_tensors(t1, t2)\nprint(f\"\u4ea4\u6362\u540e, t1: {t1}, t2: {t2}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "More specifically, ``swap_tensors`` swaps the Python ``__class__``, ``__dict__``\nand ``__slots__`` of the two tensors, as well as their associated ``at::Tensor``.\n\n\n## Application to ``nn.Module``\nThis utility is pertinent to ``nn.Module`` when a Python object outside\nof the module holds a reference to parameters of the module. If an ``nn.Module``\nmodifies any of its parameters out of place, the object holding references to\nthe parameters will not see the change. A classic example of this is the\noptimizer, which holds a reference to the parameters of the ``nn.Module``.\nThis leads to a silent correctness issue where the ``optimizer.step()`` will\nrun without error but the weights of the ``nn.Module`` will not be updated.\n\n" + "\u66f4\u5177\u4f53\u5730\u8bf4,``swap_tensors`` \u4ea4\u6362\u4e86\u4e24\u4e2a\u5f20\u91cf\u7684 Python ``__class__``\u3001``__dict__``\n\u548c ``__slots__``,\u4ee5\u53ca\u5b83\u4eec\u76f8\u5173\u7684 ``at::Tensor``\u3002\n\n\n## \u5e94\u7528\u4e8e ``nn.Module``\n\u5f53 ``nn.Module`` \u4e4b\u5916\u7684 Python \u5bf9\u8c61\u6301\u6709\u8be5\u6a21\u5757\u53c2\u6570\u7684\u5f15\u7528\u65f6,\u6b64\u5b9e\u7528\u51fd\u6570\u5c31\u5f88\u6709\u7528\u3002\n\u5982\u679c ``nn.Module`` \u5c31\u5730\u4fee\u6539\u4e86\u4efb\u4f55\u53c2\u6570,\u6301\u6709\u8fd9\u4e9b\u53c2\u6570\u5f15\u7528\u7684\u5bf9\u8c61\u5c06\u65e0\u6cd5\u770b\u5230\u66f4\u6539\u3002\n\u4e00\u4e2a\u5178\u578b\u7684\u4f8b\u5b50\u662f\u4f18\u5316\u5668,\u5b83\u6301\u6709 ``nn.Module`` \u53c2\u6570\u7684\u5f15\u7528\u3002\n\u8fd9\u4f1a\u5bfc\u81f4\u4e00\u4e2a\u6f5c\u5728\u7684\u6b63\u786e\u6027\u95ee\u9898,\u5373 ``optimizer.step()`` \u4f1a\u65e0\u9519\u8bef\u8fd0\u884c,\n\u4f46 ``nn.Module`` \u7684\u6743\u91cd\u4e0d\u4f1a\u88ab\u66f4\u65b0\u3002\n\n" ] }, { @@ -51,14 +51,14 @@ }, "outputs": [], "source": [ - "mod = torch.nn.Linear(1, 2, bias=False)\noptimizer = torch.optim.SGD(mod.parameters())\nprint(f\"weight in mod: {mod.weight}\")\nprint(f\"weight in optimizer: {optimizer.param_groups[0]['params']}\")\nmod.weight = torch.nn.Parameter(2 * mod.weight)\nprint(f\"weight in mod: {mod.weight}\")\nprint(f\"weight in optimizer: {optimizer.param_groups[0]['params']}\")" + "mod = torch.nn.Linear(1, 2, bias=False)\noptimizer = torch.optim.SGD(mod.parameters())\nprint(f\"mod \u4e2d\u7684\u6743\u91cd: {mod.weight}\")\nprint(f\"\u4f18\u5316\u5668\u4e2d\u7684\u6743\u91cd: {optimizer.param_groups[0]['params']}\")\nmod.weight = torch.nn.Parameter(2 * 
mod.weight)\nprint(f\"mod \u4e2d\u7684\u6743\u91cd: {mod.weight}\")\nprint(f\"\u4f18\u5316\u5668\u4e2d\u7684\u6743\u91cd: {optimizer.param_groups[0]['params']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## ``nn.Module.to()`` and related methods\nThis includes methods that change the device of the module (such as ``nn.Module.cpu()``),\nmethods that change the ``dtype`` of the module (such as ``nn.Module.float()``)\nas well as methods that allow the module to be materialized\n(such as ``nn.Module.to_empty()``).\n\nAt first glance, it might be non-intuitive that these methods are able to\nmodify the parameters of the module in-place. The existing approach has been\nto use a nasty hack dating back from the first days of PyTorch.\n\nNotably, the existing approach does not work in these cases:\n\n* when using ``__torch_dispatch__`` subclasses\n* when ``param`` and ``new_param`` do not have the same Python ``type()``\n* For tensors with special C++ representations (such as sparse tensors and ``XLA`` tensors)\n\nIn the following part of this recipe, we will define a toy ``__torch_dispatch__``\nsubclass ``MyQuantizedLinearWeight`` that represents quantized linear weights.\nThis subclass will be used for illustration purposes throughout the rest of\nthe tutorial. For brevity, we omit most of the ``__torch_dispatch__``\nimplementation.\n\n" + "## ``nn.Module.to()`` \u548c\u76f8\u5173\u65b9\u6cd5\n\u8fd9\u5305\u62ec\u6539\u53d8\u6a21\u5757\u8bbe\u5907\u7684\u65b9\u6cd5(\u5982 ``nn.Module.cpu()``)\u3001\n\u6539\u53d8\u6a21\u5757 ``dtype`` \u7684\u65b9\u6cd5(\u5982 ``nn.Module.float()``)\u3001\n\u4ee5\u53ca\u5141\u8bb8\u6a21\u5757\u5b9e\u4f8b\u5316\u7684\u65b9\u6cd5(\u5982 ``nn.Module.to_empty()``)\u3002\n\n\u4e4d\u4e00\u770b,\u8fd9\u4e9b\u65b9\u6cd5\u80fd\u591f\u5c31\u5730\u4fee\u6539\u6a21\u5757\u7684\u53c2\u6570\u53ef\u80fd\u770b\u8d77\u6765\u4e0d\u592a\u76f4\u89c2\u3002\n\u73b0\u6709\u7684\u65b9\u6cd5\u662f\u4f7f\u7528\u4e00\u79cd\u8ffd\u6eaf\u5230 PyTorch \u6700\u521d\u51e0\u5929\u7684\u4e11\u964b\u9ed1\u5ba2\u624b\u6bb5\u3002\n\n\u503c\u5f97\u6ce8\u610f\u7684\u662f,\u73b0\u6709\u65b9\u6cd5\u5728\u4ee5\u4e0b\u60c5\u51b5\u4e0b\u65e0\u6cd5\u5de5\u4f5c:\n\n* \u4f7f\u7528 ``__torch_dispatch__`` \u5b50\u7c7b\n* ``param`` \u548c ``new_param`` \u7684 Python ``type()`` \u4e0d\u540c\n* \u5bf9\u4e8e\u5177\u6709\u7279\u6b8a C++ \u8868\u793a\u7684\u5f20\u91cf(\u5982\u7a00\u758f\u5f20\u91cf\u548c ``XLA`` \u5f20\u91cf)\n\n\u5728\u672c\u6559\u7a0b\u7684\u4e0b\u4e00\u90e8\u5206,\u6211\u4eec\u5c06\u5b9a\u4e49\u4e00\u4e2a\u73a9\u5177 ``__torch_dispatch__`` \u5b50\u7c7b ``MyQuantizedLinearWeight``\n\u6765\u8868\u793a\u91cf\u5316\u7684\u7ebf\u6027\u6743\u91cd\u3002\u5728\u672c\u6559\u7a0b\u7684\u5269\u4f59\u90e8\u5206,\u6211\u4eec\u5c06\u4f7f\u7528\u8fd9\u4e2a\u5b50\u7c7b\u8fdb\u884c\u8bf4\u660e\u3002\n\u4e3a\u7b80\u6d01\u8d77\u89c1,\u6211\u4eec\u7701\u7565\u4e86\u5927\u90e8\u5206 ``__torch_dispatch__`` \u5b9e\u73b0\u3002\n\n" ] }, { @@ -69,14 +69,14 @@ }, "outputs": [], "source": [ - "aten = torch.ops.aten\n\nclass MyQuantizedLinearWeight(torch.Tensor):\n @staticmethod\n def __new__(cls, elem, scale):\n return torch.Tensor._make_wrapper_subclass(\n cls,\n elem.shape,\n dtype=elem.dtype,\n layout=elem.layout,\n device=elem.device,\n strides=elem.stride(),\n storage_offset=elem.storage_offset())\n\n def __init__(self, elem: torch.Tensor, scale: float):\n self.elem = elem\n self.scale = scale\n\n def __repr__(self):\n return f\"MyQuantizedLinearWeight({self.elem}, scale={self.scale})\"\n\n 
@classmethod\n def __torch_dispatch__(cls, func, types, args, kwargs):\n if func in (aten.detach.default, aten._to_copy.default):\n new_elem = func(args[0].elem, *args[1:], **kwargs)\n return cls(new_elem, args[0].scale)\n # Implementations for certain ops would be added to ``OP_TABLE``.\n # We omit this for brevity.\n OP_TABLE = dict()\n if func in OP_TABLE:\n return OP_TABLE[func](func, args, kwargs)\n raise NotImplementedError(f\"Unsupported function {func}\")" + "aten = torch.ops.aten\n\n\nclass MyQuantizedLinearWeight(torch.Tensor):\n @staticmethod\n def __new__(cls, elem, scale):\n return torch.Tensor._make_wrapper_subclass(\n cls,\n elem.shape,\n dtype=elem.dtype,\n layout=elem.layout,\n device=elem.device,\n strides=elem.stride(),\n storage_offset=elem.storage_offset(),\n )\n\n def __init__(self, elem: torch.Tensor, scale: float):\n self.elem = elem\n self.scale = scale\n\n def __repr__(self):\n return f\"MyQuantizedLinearWeight({self.elem}, scale={self.scale})\"\n\n @classmethod\n def __torch_dispatch__(cls, func, types, args, kwargs):\n if func in (aten.detach.default, aten._to_copy.default):\n new_elem = func(args[0].elem, *args[1:], **kwargs)\n return cls(new_elem, args[0].scale)\n # \u67d0\u4e9b\u64cd\u4f5c\u7684\u5b9e\u73b0\u5c06\u6dfb\u52a0\u5230 ``OP_TABLE``\u3002\n # \u4e3a\u7b80\u6d01\u8d77\u89c1,\u6211\u4eec\u5728\u6b64\u7701\u7565\u3002\n OP_TABLE = dict()\n if func in OP_TABLE:\n return OP_TABLE[func](func, args, kwargs)\n raise NotImplementedError(f\"\u4e0d\u652f\u6301\u7684\u51fd\u6570 {func}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let us create an ``nn.Linear`` layer of ``dtype`` ``torch.float32`` where the weight is\na ``MyQuantizedLinearWeight`` and try to convert it to ``torch.bfloat16``.\nObserve that the weight's ``dtype`` changes as expected. 
However, the ``dtype``\nof the subclass' payload (``elem``) does not change.\n\n" + "\u8ba9\u6211\u4eec\u521b\u5efa\u4e00\u4e2a ``dtype`` \u4e3a ``torch.float32`` \u7684 ``nn.Linear`` \u5c42,\n\u5176\u6743\u91cd\u662f ``MyQuantizedLinearWeight``\u3002\u7136\u540e\u5c1d\u8bd5\u5c06\u5176\u8f6c\u6362\u4e3a ``torch.bfloat16``\u3002\n\u89c2\u5bdf\u5230\u6743\u91cd\u7684 ``dtype`` \u5982\u9884\u671f\u822c\u6539\u53d8\u4e86\u3002\u4f46\u662f\u5b50\u7c7b\u7684\u6709\u6548\u8f7d\u8377(``elem``)\u7684 ``dtype`` \u6ca1\u6709\u6539\u53d8\u3002\n\n" ] }, { @@ -87,14 +87,14 @@ }, "outputs": [], "source": [ - "m = nn.Linear(3, 5, dtype=torch.float32)\nm.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5))\nprint(f\"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nm.bfloat16()\nprint(f\"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nprint(f\"m.weight.dtype: {m.weight.dtype}\")\nprint(f\"m.weight.elem.dtype: {m.weight.elem.dtype}\")\nprint(f\"m.bias.dtype: {m.bias.dtype}\")" + "m = nn.Linear(3, 5, dtype=torch.float32)\nm.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5))\nprint(f\"\u4e4b\u524d: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nm.bfloat16()\nprint(f\"\u4e4b\u540e: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nprint(f\"m.weight.dtype: {m.weight.dtype}\")\nprint(f\"m.weight.elem.dtype: {m.weight.elem.dtype}\")\nprint(f\"m.bias.dtype: {m.bias.dtype}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "To this end, we introduce a global config\n``torch.__future__.set_swap_module_params_on_conversion`` that will use\n``swap_tensors`` to swap the parameters of the module while preserving\nreferences in place of ``.data`` setting. When this config is set,\n``swap_tensors`` will be used during the conversion, which ensures that\nthe ``dtype`` of the payload is properly converted.\n\n" + "\u4e3a\u6b64,\u6211\u4eec\u5f15\u5165\u4e86\u4e00\u4e2a\u5168\u5c40\u914d\u7f6e ``torch.__future__.set_swap_module_params_on_conversion``\n\u5b83\u5c06\u4f7f\u7528 ``swap_tensors`` \u4ea4\u6362\u6a21\u5757\u7684\u53c2\u6570,\u540c\u65f6\u4fdd\u7559 ``.data`` \u8bbe\u7f6e\u4e2d\u7684\u5f15\u7528\u3002\n\u8bbe\u7f6e\u6b64\u914d\u7f6e\u540e,\u5728\u8f6c\u6362\u671f\u95f4\u5c06\u4f7f\u7528 ``swap_tensors``,\u4ece\u800c\u786e\u4fdd\u6709\u6548\u8f7d\u8377\u7684 ``dtype`` \u6b63\u786e\u8f6c\u6362\u3002\n\n" ] }, { @@ -105,14 +105,14 @@ }, "outputs": [], "source": [ - "torch.__future__.set_swap_module_params_on_conversion(True)\nm = nn.Linear(3, 5, dtype=torch.float32)\nm.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5))\nprint(f\"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nm.bfloat16()\nprint(f\"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nprint(f\"m.weight.dtype: {m.weight.dtype}\")\nprint(f\"m.weight.elem.dtype: {m.weight.elem.dtype}\")\nprint(f\"m.bias.dtype: {m.bias.dtype}\")\ntorch.__future__.set_swap_module_params_on_conversion(False)" + "torch.__future__.set_swap_module_params_on_conversion(True)\nm = nn.Linear(3, 5, dtype=torch.float32)\nm.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5))\nprint(f\"\u4e4b\u524d: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nm.bfloat16()\nprint(f\"\u4e4b\u540e: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}\")\nprint(f\"m.weight.dtype: {m.weight.dtype}\")\nprint(f\"m.weight.elem.dtype: {m.weight.elem.dtype}\")\nprint(f\"m.bias.dtype: 
{m.bias.dtype}\")\ntorch.__future__.set_swap_module_params_on_conversion(False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## ``nn.Module.load_state_dict()``\nDepending on the value of the ``assign`` keyword argument passed\nto ``load_state_dict()``, there are two ways to load the ``state_dict``:\n\n* ``assign=False``: preserves the properties of ``module.param`` and only takes the values\n from ``state_dict['param_name']``\n* ``assign=True``: preserves the properties and values of ``state_dict['param_name']``.\n\n\nPreviously, these were implemented with in-place ``copy_`` and ``__setattr__`` respectively.\nWith the existing implementation, each approach had its own limitations -- ``assign=False``\nimposes the constraint that the type of the parameter in the ``state_dict`` must\nbe the same as the type of the parameter in the module while ``assign=True`` imposes\nthe constraint that anything that holds references to the module's parameters must\nbe initialized after ``nn.Module.load_state_dict()``.\n\nNow, we address both constraints by adding a ``swap_tensors`` path to ``load_state_dict()``\nand introducing a new extension point ``torch.Tensor.module_load(self, other, assign=False)``.\nWhen the ``swap_tensors`` path is enabled via the ``__future__`` mentioned above,\nwe can use a ``__torch_function__`` handler for ``module_load`` to apply a\ncustom transformation to the value in the ``state_dict``. The result of this\ntransformation will be swapped with the parameter in the module.\n\nIn the following example, we will use the ``MyQuantizedLinearWeight`` subclass\ndefined above to illustrate how we can use these features to apply a\ncustom quantization scheme to the weights of a linear layer when\nloading the ``state_dict``.\n\nRecall that the ``__torch_function__`` handler for ``module_load`` will be\ninvoked if either ``self`` or ``other`` (in this case ``param`` or\n``state_dict[param_key]``) are ``MyQuantizedLinearWeight`` subclasses.\n\nAssume that we expect the ``state_dict`` to contain plain tensors and the\nmodule to contain ``MyQuantizedLinearWeight`` parameters where we want the\ntensors in the ``state_dict`` to be transformed into the subclass. 
Then we\ncan define a ``__torch_function__`` handler for ``torch.Tensor.module_load``\nas such:\n\n" + "## ``nn.Module.load_state_dict()``\n\u6839\u636e\u4f20\u9012\u7ed9 ``load_state_dict()`` \u7684 ``assign`` \u5173\u952e\u5b57\u53c2\u6570\u7684\u503c,\n\u6709\u4e24\u79cd\u65b9\u5f0f\u52a0\u8f7d ``state_dict``\uff1a\n\n* ``assign=False``: \u4fdd\u7559 ``module.param`` \u7684\u5c5e\u6027,\u53ea\u4ece ``state_dict['param_name']`` \u4e2d\u83b7\u53d6\u503c\n* ``assign=True``: \u4fdd\u7559 ``state_dict['param_name']`` \u7684\u5c5e\u6027\u548c\u503c\u3002\n\n\n\u4e4b\u524d,\u8fd9\u4e9b\u5206\u522b\u662f\u901a\u8fc7\u5c31\u5730 ``copy_`` \u548c ``__setattr__`` \u5b9e\u73b0\u7684\u3002\n\u5728\u73b0\u6709\u5b9e\u73b0\u4e2d,\u6bcf\u79cd\u65b9\u6cd5\u90fd\u6709\u81ea\u5df1\u7684\u9650\u5236 - ``assign=False`` \u8981\u6c42 ``state_dict`` \u4e2d\u7684\u53c2\u6570\u7c7b\u578b\n\u5fc5\u987b\u4e0e\u6a21\u5757\u4e2d\u7684\u53c2\u6570\u7c7b\u578b\u76f8\u540c,\u800c ``assign=True`` \u8981\u6c42\u5728 ``nn.Module.load_state_dict()`` \u4e4b\u540e\n\u521d\u59cb\u5316\u4efb\u4f55\u6301\u6709\u6a21\u5757\u53c2\u6570\u5f15\u7528\u7684\u5bf9\u8c61\u3002\n\n\u73b0\u5728,\u6211\u4eec\u901a\u8fc7\u5728 ``load_state_dict()`` \u4e2d\u6dfb\u52a0 ``swap_tensors`` \u8def\u5f84\u5e76\u5f15\u5165\u65b0\u7684\u6269\u5c55\u70b9\n``torch.Tensor.module_load(self, other, assign=False)`` \u6765\u89e3\u51b3\u8fd9\u4e24\u4e2a\u9650\u5236\u3002\n\u5f53\u542f\u7528\u4e0a\u8ff0 ``__future__`` \u65f6,\u6211\u4eec\u53ef\u4ee5\u4f7f\u7528 ``module_load`` \u7684 ``__torch_function__`` \u5904\u7406\u7a0b\u5e8f\n\u5bf9 ``state_dict`` \u4e2d\u7684\u503c\u5e94\u7528\u81ea\u5b9a\u4e49\u8f6c\u6362\u3002\u8f6c\u6362\u7684\u7ed3\u679c\u5c06\u4e0e\u6a21\u5757\u4e2d\u7684\u53c2\u6570\u4ea4\u6362\u3002\n\n\u5728\u4e0b\u9762\u7684\u793a\u4f8b\u4e2d,\u6211\u4eec\u5c06\u4f7f\u7528\u4e0a\u9762\u5b9a\u4e49\u7684 ``MyQuantizedLinearWeight`` \u5b50\u7c7b\n\u6765\u8bf4\u660e\u5982\u4f55\u4f7f\u7528\u8fd9\u4e9b\u529f\u80fd\u5728\u52a0\u8f7d ``state_dict`` \u65f6\u5bf9\u7ebf\u6027\u5c42\u7684\u6743\u91cd\u5e94\u7528\u81ea\u5b9a\u4e49\u91cf\u5316\u65b9\u6848\u3002\n\n\u56de\u987e\u4e00\u4e0b,\u5982\u679c ``self`` \u6216 ``other``(\u5728\u672c\u4f8b\u4e2d\u662f ``param`` \u6216 ``state_dict[param_key]``)\n\u662f ``MyQuantizedLinearWeight`` \u5b50\u7c7b,\u5219\u4f1a\u8c03\u7528 ``module_load`` \u7684 ``__torch_function__`` \u5904\u7406\u7a0b\u5e8f\u3002\n\n\u5047\u8bbe\u6211\u4eec\u671f\u671b ``state_dict`` \u5305\u542b\u666e\u901a\u5f20\u91cf,\u800c\u6a21\u5757\u5305\u542b ``MyQuantizedLinearWeight`` \u53c2\u6570,\n\u6211\u4eec\u5e0c\u671b\u5c06 ``state_dict`` \u4e2d\u7684\u5f20\u91cf\u8f6c\u6362\u4e3a\u5b50\u7c7b\u3002\u90a3\u4e48\u6211\u4eec\u53ef\u4ee5\u4e3a ``torch.Tensor.module_load`` \u5b9a\u4e49\n\u4e00\u4e2a ``__torch_function__`` \u5904\u7406\u7a0b\u5e8f,\u5982\u4e0b\u6240\u793a:\n\n" ] }, { @@ -123,14 +123,14 @@ }, "outputs": [], "source": [ - "@classmethod\ndef custom_torch_function(cls, func, types, args=(), kwargs=None):\n kwargs = {} if kwargs is None else kwargs\n\n if func is torch.Tensor.module_load:\n dest, src = args[0], args[1]\n assert type(dest) == cls and type(src) == torch.Tensor\n return MyQuantizedLinearWeight(src, dest.scale)\n else:\n with torch._C.DisableTorchFunctionSubclass():\n return func(*args, **kwargs)\n\nMyQuantizedLinearWeight.__torch_function__ = custom_torch_function" + "@classmethod\ndef custom_torch_function(cls, func, types, args=(), kwargs=None):\n kwargs = {} if kwargs is None else kwargs\n\n if func is 
torch.Tensor.module_load:\n dest, src = args[0], args[1]\n assert type(dest) == cls and type(src) == torch.Tensor\n return MyQuantizedLinearWeight(src, dest.scale)\n else:\n with torch._C.DisableTorchFunctionSubclass():\n return func(*args, **kwargs)\n\n\nMyQuantizedLinearWeight.__torch_function__ = custom_torch_function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "First, let us create a skeleton of a model on the meta device to avoid\nmaterializing storages. We convert all weights in the modules to\n``MyQuantizedLinearWeight`` subclasses while leaving biases intact.\n\n" + "\u9996\u5148,\u8ba9\u6211\u4eec\u5728 meta \u8bbe\u5907\u4e0a\u521b\u5efa\u4e00\u4e2a\u6a21\u578b\u6846\u67b6,\u4ee5\u907f\u514d\u5b9e\u4f8b\u5316\u5b58\u50a8\u3002\n\u6211\u4eec\u5c06\u6a21\u5757\u4e2d\u7684\u6240\u6709\u6743\u91cd\u8f6c\u6362\u4e3a ``MyQuantizedLinearWeight`` \u5b50\u7c7b,\u540c\u65f6\u4fdd\u7559\u504f\u7f6e\u4e0d\u53d8\u3002\n\n" ] }, { @@ -141,14 +141,14 @@ }, "outputs": [], "source": [ - "def fn(m):\n if isinstance(m, nn.Linear):\n requires_grad = m.weight.requires_grad\n m.weight = torch.nn.Parameter(\n MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad\n )\n\nwith torch.device(\"meta\"):\n m = nn.Linear(3, 5)\n m.apply(fn)" + "def fn(m):\n if isinstance(m, nn.Linear):\n requires_grad = m.weight.requires_grad\n m.weight = torch.nn.Parameter(\n MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad\n )\n\n\nwith torch.device(\"meta\"):\n m = nn.Linear(3, 5)\n m.apply(fn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can then load the ``state_dict``. Observe that we use ``assign=True`` because\nfor biases, we want to preserve the properties of the tensor in the ``state_dict``\n(for example, we do not want the bias to be on the ``meta`` device after loading).\n\n" + "\u7136\u540e\u6211\u4eec\u53ef\u4ee5\u52a0\u8f7d ``state_dict``\u3002\u6ce8\u610f\u6211\u4eec\u4f7f\u7528 ``assign=True``\uff0c\u56e0\u4e3a\u5bf9\u4e8e\u504f\u7f6e,\n\u6211\u4eec\u5e0c\u671b\u4fdd\u7559 ``state_dict`` \u4e2d\u5f20\u91cf\u7684\u5c5e\u6027(\u4f8b\u5982,\u6211\u4eec\u4e0d\u5e0c\u671b\u504f\u7f6e\u5728\u52a0\u8f7d\u540e\u4f4d\u4e8e ``meta`` \u8bbe\u5907\u4e0a)\u3002\n\n" ] }, { @@ -159,14 +159,14 @@ }, "outputs": [], "source": [ - "torch.__future__.set_swap_module_params_on_conversion(True)\nprint(f\"Before: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}\")\nprint(f\"m.state_dict() before load_state_dict():\\n {m.state_dict()}\")\nstate_dict = nn.Linear(3, 5).state_dict()\nprint(f\"state_dict:\\n {state_dict}\")\nm.load_state_dict(state_dict, assign=True)\nprint(f\"After: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}\")\nprint(f\"m.state_dict() after load_state_dict():\\n {m.state_dict()}\")" + "torch.__future__.set_swap_module_params_on_conversion(True)\nprint(f\"\u4e4b\u524d: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}\")\nprint(f\"load_state_dict() \u4e4b\u524d\u7684 m.state_dict():\\n {m.state_dict()}\")\nstate_dict = nn.Linear(3, 5).state_dict()\nprint(f\"state_dict:\\n {state_dict}\")\nm.load_state_dict(state_dict, assign=True)\nprint(f\"\u4e4b\u540e: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}\")\nprint(f\"load_state_dict() \u4e4b\u540e\u7684 m.state_dict():\\n {m.state_dict()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The above is a toy example of how we can use the new extension point in\n``nn.Module.load_state_dict()``. 
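Tying this back to the optimizer example at the start of the recipe: because the swap path exchanges tensor contents rather than rebinding attributes, objects that already hold references to the parameters keep seeing the loaded values. A hedged sketch (PyTorch 2.3+, as the recipe notes; the learning rate is an arbitrary choice):

import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

mod = nn.Linear(3, 5)
opt = torch.optim.SGD(mod.parameters(), lr=0.1)

new_state = nn.Linear(3, 5).state_dict()
mod.load_state_dict(new_state, assign=True)

# The optimizer still points at the live parameter object, and that object
# now carries the loaded values (both prints are expected to be True).
print(opt.param_groups[0]["params"][0] is mod.weight)
print(torch.equal(mod.weight, new_state["weight"]))

torch.__future__.set_swap_module_params_on_conversion(False)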
One can also imagine alternate scenarios such\nas when we have tensor subclasses in the ``state_dict`` and plain ``nn.Parameters``/\ntensors in the module or when both are tensor subclasses. Based on the use\ncase, we can define the ``__torch_function__`` handler for ``module_load``\nto apply the transforms as needed.\n\n## Conclusion\nIn this recipe, we learned about ``swap_tensors``, the importance\nof preserving references for parameters in ``nn.Module`` as well as how to\nuse the two new extension points that are gated by\n``torch.__future__.set_swap_module_params_on_conversion``.\n\n" + "\u4e0a\u9762\u662f\u4e00\u4e2a\u5982\u4f55\u4f7f\u7528 ``nn.Module.load_state_dict()`` \u4e2d\u7684\u65b0\u6269\u5c55\u70b9\u7684\u73a9\u5177\u793a\u4f8b\u3002\n\u6211\u4eec\u8fd8\u53ef\u4ee5\u60f3\u8c61\u5176\u4ed6\u573a\u666f,\u4f8b\u5982\u5f53 ``state_dict`` \u4e2d\u6709\u5f20\u91cf\u5b50\u7c7b\u800c\u6a21\u5757\u4e2d\u6709\u666e\u901a ``nn.Parameters``/\u5f20\u91cf\u65f6,\n\u6216\u8005\u4e24\u8005\u90fd\u662f\u5f20\u91cf\u5b50\u7c7b\u65f6\u3002\u6839\u636e\u4f7f\u7528\u573a\u666f,\u6211\u4eec\u53ef\u4ee5\u5b9a\u4e49 ``module_load`` \u7684 ``__torch_function__`` \u5904\u7406\u7a0b\u5e8f\n\u6765\u5e94\u7528\u6240\u9700\u7684\u8f6c\u6362\u3002\n\n## \u7ed3\u8bba\n\u5728\u672c\u6559\u7a0b\u4e2d,\u6211\u4eec\u5b66\u4e60\u4e86 ``swap_tensors``\u3001\u5728 ``nn.Module`` \u4e2d\u4fdd\u7559\u53c2\u6570\u5f15\u7528\u7684\u91cd\u8981\u6027,\n\u4ee5\u53ca\u5982\u4f55\u4f7f\u7528\u7531 ``torch.__future__.set_swap_module_params_on_conversion`` \u63a7\u5236\u7684\u4e24\u4e2a\u65b0\u6269\u5c55\u70b9\u3002\n\n" ] } ], diff --git a/docs/_downloads/9d3fdce6265a4c437c6242553a2aa24d/tensorboard_with_pytorch.py b/docs/_downloads/9d3fdce6265a4c437c6242553a2aa24d/tensorboard_with_pytorch.py index 4bceda8..d1a7bf9 100644 --- a/docs/_downloads/9d3fdce6265a4c437c6242553a2aa24d/tensorboard_with_pytorch.py +++ b/docs/_downloads/9d3fdce6265a4c437c6242553a2aa24d/tensorboard_with_pytorch.py @@ -1,24 +1,23 @@ """ -How to use TensorBoard with PyTorch +如何在PyTorch中使用TensorBoard =================================== -TensorBoard is a visualization toolkit for machine learning experimentation. -TensorBoard allows tracking and visualizing metrics such as loss and accuracy, -visualizing the model graph, viewing histograms, displaying images and much more. -In this tutorial we are going to cover TensorBoard installation, -basic usage with PyTorch, and how to visualize data you logged in TensorBoard UI. +TensorBoard是一个用于机器学习实验的可视化工具包。 +TensorBoard允许跟踪和可视化指标,如损失和准确率, +可视化模型图,查看直方图,显示图像等。 +在本教程中,我们将介绍TensorBoard的安装、 +在PyTorch中的基本用法,以及如何在TensorBoard UI中可视化您记录的数据。 -Installation +安装 ---------------------- -PyTorch should be installed to log models and metrics into TensorBoard log -directory. The following command will install PyTorch 1.4+ via -Anaconda (recommended): +应安装PyTorch以将模型和指标记录到TensorBoard日志 +目录。以下命令将通过Anaconda(推荐)安装PyTorch 1.4+: .. code-block:: sh - $ conda install pytorch torchvision -c pytorch + $ conda install pytorch torchvision -c pytorch -or pip +或者使用pip: .. code-block:: sh @@ -27,34 +26,34 @@ """ ###################################################################### -# Using TensorBoard in PyTorch +# 在PyTorch中使用TensorBoard # ----------------------------- -# -# Let’s now try using TensorBoard with PyTorch! Before logging anything, -# we need to create a ``SummaryWriter`` instance. 
-# +# +# 现在让我们尝试在PyTorch中使用TensorBoard!在记录任何内容之前, +# 我们需要创建一个 ``SummaryWriter`` 实例。 +# import torch from torch.utils.tensorboard import SummaryWriter + writer = SummaryWriter() ###################################################################### -# Writer will output to ``./runs/`` directory by default. -# +# 写入器默认将输出到 ``./runs/`` 目录。 +# ###################################################################### -# Log scalars +# 记录标量 # ----------- -# -# In machine learning, it’s important to understand key metrics such as -# loss and how they change during training. Scalar helps to save -# the loss value of each training step, or the accuracy after each epoch. # -# To log a scalar value, use -# ``add_scalar(tag, scalar_value, global_step=None, walltime=None)``. -# For example, lets create a simple linear regression training, and -# log loss value using ``add_scalar`` +# 在机器学习中,了解关键指标(如损失)及其在训练期间的变化非常重要。 +# 标量可用于保存每个训练步骤的损失值或每个epoch的准确率。 +# +# 要记录标量值,请使用 +# ``add_scalar(tag, scalar_value, global_step=None, walltime=None)``。 +# 例如,让我们创建一个简单的线性回归训练,并 +# 使用 ``add_scalar`` 记录损失值 # x = torch.arange(-5, 5, 0.1).view(-1, 1) @@ -62,7 +61,8 @@ model = torch.nn.Linear(1, 1) criterion = torch.nn.MSELoss() -optimizer = torch.optim.SGD(model.parameters(), lr = 0.1) +optimizer = torch.optim.SGD(model.parameters(), lr=0.1) + def train_model(iter): for epoch in range(iter): @@ -72,59 +72,58 @@ def train_model(iter): optimizer.zero_grad() loss.backward() optimizer.step() - + + train_model(10) writer.flush() -###################################################################### -# Call ``flush()`` method to make sure that all pending events -# have been written to disk. -# -# See `torch.utils.tensorboard tutorials `_ -# to find more TensorBoard visualization types you can log. -# -# If you do not need the summary writer anymore, call ``close()`` method. +###################################################################### +# 调用 ``flush()`` 方法以确保所有待处理事件 +# 已写入磁盘。 +# +# 请参阅 `torch.utils.tensorboard 教程 `_ +# 以了解您可以记录的更多TensorBoard可视化类型。 +# +# 如果您不再需要摘要写入器,请调用 ``close()`` 方法。 # writer.close() ###################################################################### -# Run TensorBoard +# 运行TensorBoard # ---------------- -# -# Install TensorBoard through the command line to visualize data you logged +# +# 通过命令行安装TensorBoard以可视化您记录的数据 # # .. code-block:: sh # # pip install tensorboard # # -# Now, start TensorBoard, specifying the root log directory you used above. -# Argument ``logdir`` points to directory where TensorBoard will look to find -# event files that it can display. TensorBoard will recursively walk -# the directory structure rooted at ``logdir``, looking for ``.*tfevents.*`` files. +# 现在,启动TensorBoard,指定您之前使用的根日志目录。 +# 参数 ``logdir`` 指向TensorBoard将查找可显示的事件文件的目录。 +# TensorBoard将递归遍历 ``logdir`` 根目录下的目录结构,寻找 ``.*tfevents.*`` 文件。 # # .. code-block:: sh # # tensorboard --logdir=runs # -# Go to the URL it provides OR to `http://localhost:6006/ `_ +# 转到它提供的URL或 `http://localhost:6006/ `_ # # .. image:: ../../_static/img/thumbnails/tensorboard_scalars.png # :scale: 40 % # -# This dashboard shows how the loss and accuracy change with every epoch. -# You can use it to also track training speed, learning rate, and other -# scalar values. It’s helpful to compare these metrics across different -# training runs to improve your model. 
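Comparing runs is easiest when each run writes to its own subdirectory under the same root, so TensorBoard can overlay the curves. A hedged sketch (the two learning rates and the synthetic data are arbitrary illustrative choices):

import torch
from torch.utils.tensorboard import SummaryWriter

x = torch.arange(-5, 5, 0.1).view(-1, 1)
y = -5 * x + 0.1 * torch.randn(x.size())

for lr in (0.1, 0.01):
    # One writer (and one log directory) per run.
    writer = SummaryWriter(log_dir=f"runs/lr_{lr}")
    model = torch.nn.Linear(1, 1)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(10):
        loss = criterion(model(x), y)
        writer.add_scalar("Loss/train", loss.item(), epoch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    writer.close()
# ``tensorboard --logdir=runs`` then shows both runs side by side.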
+# 此仪表板显示了损失和准确率如何随着每个epoch而变化。 +# 您可以使用它来跟踪训练速度、学习率和其他标量值。 +# 比较不同训练运行的这些指标有助于改进您的模型。 # ######################################################################## -# Learn More +# 了解更多 # ---------------------------- -# -# - `torch.utils.tensorboard `_ docs -# - `Visualizing models, data, and training with TensorBoard `_ tutorial +# +# - `torch.utils.tensorboard `_ 文档 +# - `使用TensorBoard可视化模型、数据和训练 `_ 教程 # diff --git a/docs/_downloads/9f4fb47ef3d58524029d86df50e90a08/reasoning_about_shapes.py b/docs/_downloads/9f4fb47ef3d58524029d86df50e90a08/reasoning_about_shapes.py index 12c85dc..5ee8a2a 100644 --- a/docs/_downloads/9f4fb47ef3d58524029d86df50e90a08/reasoning_about_shapes.py +++ b/docs/_downloads/9f4fb47ef3d58524029d86df50e90a08/reasoning_about_shapes.py @@ -1,23 +1,19 @@ """ -Reasoning about Shapes in PyTorch +在PyTorch中推理形状 ================================= -When writing models with PyTorch, it is commonly the case that the parameters -to a given layer depend on the shape of the output of the previous layer. For -example, the ``in_features`` of an ``nn.Linear`` layer must match the -``size(-1)`` of the input. For some layers, the shape computation involves -complex equations, for example convolution operations. +在使用PyTorch编写模型时,通常会遇到某一层的参数取决于前一层输出的形状的情况。例如, +``nn.Linear``层的``in_features``必须与输入的``size(-1)``相匹配。对于某些层,形状计算涉及复杂的等式,例如卷积运算。 -One way around this is to run the forward pass with random inputs, but this is -wasteful in terms of memory and compute. +一种解决方法是使用随机输入进行前向传播,但这在内存和计算方面是浪费的。 -Instead, we can make use of the ``meta`` device to determine the output shapes -of a layer without materializing any data. +相反,我们可以使用``meta``设备来确定层的输出形状,而无需实际化任何数据。 """ -import torch import timeit +import torch + t = torch.rand(2, 3, 10, 10, device="meta") conv = torch.nn.Conv2d(3, 5, 2, device="meta") start = timeit.default_timer() @@ -25,12 +21,11 @@ end = timeit.default_timer() print(out) -print(f"Time taken: {end-start}") +print(f"所需时间: {end-start}") ########################################################################## -# Observe that since data is not materialized, passing arbitrarily large -# inputs will not significantly alter the time taken for shape computation. +# 观察到,由于没有实际化数据,即使传入任意大的输入,用于形状计算的时间也不会显著改变。 t_large = torch.rand(2**10, 3, 2**16, 2**16, device="meta") start = timeit.default_timer() @@ -38,11 +33,11 @@ end = timeit.default_timer() print(out) -print(f"Time taken: {end-start}") +print(f"所需时间: {end-start}") ###################################################### -# Consider an arbitrary network such as the following: +# 考虑以下任意网络: import torch.nn as nn import torch.nn.functional as F @@ -61,7 +56,7 @@ def __init__(self): def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) - x = torch.flatten(x, 1) # flatten all dimensions except batch + x = torch.flatten(x, 1) # 展平除批次维度外的所有维度 x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) @@ -69,15 +64,14 @@ def forward(self, x): ############################################################################### -# We can view the intermediate shapes within an entire network by registering a -# forward hook to each layer that prints the shape of the output. +# 我们可以通过为每一层注册一个前向钩子来打印输出的形状,从而查看整个网络中间层的形状。 + def fw_hook(module, input, output): - print(f"Shape of output to {module} is {output.shape}.") + print(f"{module}的输出形状为{output.shape}。") -# Any tensor created within this torch.device context manager will be -# on the meta device. 
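One practical use of the ``meta`` device, sketched here under the assumption of a conv stack shaped like the ``Net`` above (ReLUs are omitted since they do not affect shapes): run a fake input through the feature extractor and use the resulting shape to size the classifier head.

import torch
import torch.nn as nn

# Build only the feature extractor on the meta device and trace a fake input
# through it; no data is materialized at any point.
with torch.device("meta"):
    features = nn.Sequential(
        nn.Conv2d(3, 6, 5),
        nn.MaxPool2d(2, 2),
        nn.Conv2d(6, 16, 5),
        nn.MaxPool2d(2, 2),
    )
    out = features(torch.randn(1, 3, 32, 32))

in_features = out.flatten(1).shape[-1]
head = nn.Linear(in_features, 10)  # constructed on a real device
print(in_features, head)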
+# 在此torch.device上下文管理器中创建的任何张量都将在meta设备上。 with torch.device("meta"): net = Net() inp = torch.randn((1024, 3, 32, 32)) diff --git a/docs/_downloads/aa116673383c7eeeacfb92b8c9beb97a/torch_logs.py b/docs/_downloads/aa116673383c7eeeacfb92b8c9beb97a/torch_logs.py index b5c3f0b..2dfa426 100644 --- a/docs/_downloads/aa116673383c7eeeacfb92b8c9beb97a/torch_logs.py +++ b/docs/_downloads/aa116673383c7eeeacfb92b8c9beb97a/torch_logs.py @@ -1,96 +1,86 @@ """ -(beta) Using TORCH_LOGS python API with torch.compile +(Beta) 使用 TORCH_LOGS python API 与 torch.compile ========================================================================================== -**Author:** `Michael Lazos `_ +**作者:** `Michael Lazos `_ """ import logging ###################################################################### # -# This tutorial introduces the ``TORCH_LOGS`` environment variable, as well as the Python API, and -# demonstrates how to apply it to observe the phases of ``torch.compile``. +# 本教程介绍了 ``TORCH_LOGS`` 环境变量以及 Python API,并演示了如何将其应用于观察 ``torch.compile`` 的各个阶段。 # # .. note:: # -# This tutorial requires PyTorch 2.2.0 or later. +# 本教程需要 PyTorch 2.2.0 或更高版本。 # # ###################################################################### -# Setup +# 设置 # ~~~~~~~~~~~~~~~~~~~~~ -# In this example, we'll set up a simple Python function which performs an elementwise -# add and observe the compilation process with ``TORCH_LOGS`` Python API. +# 在这个例子中,我们将设置一个简单的 Python 函数,执行元素级加法,并使用 ``TORCH_LOGS`` Python API 观察编译过程。 # # .. note:: # -# There is also an environment variable ``TORCH_LOGS``, which can be used to -# change logging settings at the command line. The equivalent environment -# variable setting is shown for each example. +# 还有一个环境变量 ``TORCH_LOGS``,可用于在命令行中更改日志设置。每个示例都显示了等效的环境变量设置。 import torch -# exit cleanly if we are on a device that doesn't support torch.compile +# 如果设备不支持 torch.compile,则干净地退出 if torch.cuda.get_device_capability() < (7, 0): - print("Skipping because torch.compile is not supported on this device.") + print("跳过,因为此设备不支持 torch.compile。") else: + @torch.compile() def fn(x, y): z = x + y return z + 2 - inputs = (torch.ones(2, 2, device="cuda"), torch.zeros(2, 2, device="cuda")) - -# print separator and reset dynamo -# between each example + # 在每个示例之间打印分隔符并重置 dynamo def separator(name): print(f"==================={name}=========================") torch._dynamo.reset() - - separator("Dynamo Tracing") -# View dynamo tracing -# TORCH_LOGS="+dynamo" + separator("Dynamo 跟踪") + # 查看 dynamo 跟踪 + # TORCH_LOGS="+dynamo" torch._logging.set_logs(dynamo=logging.DEBUG) fn(*inputs) - separator("Traced Graph") -# View traced graph -# TORCH_LOGS="graph" + separator("跟踪的图形") + # 查看跟踪的图形 + # TORCH_LOGS="graph" torch._logging.set_logs(graph=True) fn(*inputs) - separator("Fusion Decisions") -# View fusion decisions -# TORCH_LOGS="fusion" + separator("融合决策") + # 查看融合决策 + # TORCH_LOGS="fusion" torch._logging.set_logs(fusion=True) fn(*inputs) - separator("Output Code") -# View output code generated by inductor -# TORCH_LOGS="output_code" + separator("输出代码") + # 查看 inductor 生成的输出代码 + # TORCH_LOGS="output_code" torch._logging.set_logs(output_code=True) fn(*inputs) separator("") ###################################################################### -# Conclusion +# 结论 # ~~~~~~~~~~ # -# In this tutorial we introduced the TORCH_LOGS environment variable and python API -# by experimenting with a small number of the available logging options. 
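Two other artifacts that are often useful in practice are graph-break and recompile logging. A hedged sketch of the Python API (availability of individual options can vary slightly between PyTorch versions; the equivalent environment variable would be TORCH_LOGS="graph_breaks,recompiles"):

import torch


@torch.compile()
def fn(x):
    return x + 1


torch._logging.set_logs(graph_breaks=True, recompiles=True)
fn(torch.ones(2, 2))
fn(torch.ones(3, 3))  # a shape change may show up as a recompile

# Calling ``set_logs`` with no arguments restores the default settings.
torch._logging.set_logs()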
-# To view descriptions of all available options, run any python script -# which imports torch and set TORCH_LOGS to "help". +# 在本教程中,我们介绍了 TORCH_LOGS 环境变量和 python API,并通过实验了一小部分可用的日志选项。 +# 要查看所有可用选项的描述,请运行任何导入 torch 的 python 脚本,并将 TORCH_LOGS 设置为 "help"。 # -# Alternatively, you can view the `torch._logging documentation`_ to see -# descriptions of all available logging options. +# 或者,您可以查看 `torch._logging 文档`_ 以查看所有可用日志选项的描述。 # -# For more information on torch.compile, see the `torch.compile tutorial`_. +# 有关 torch.compile 的更多信息,请参阅 `torch.compile 教程`_。 # -# .. _torch._logging documentation: https://pytorch.org/docs/main/logging.html -# .. _torch.compile tutorial: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html +# .. _torch._logging 文档: https://pytorch.org/docs/main/logging.html +# .. _torch.compile 教程: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html diff --git a/docs/_downloads/b94668e8c8d352e04672e8eaad180ae6/Captum_Recipe.py b/docs/_downloads/b94668e8c8d352e04672e8eaad180ae6/Captum_Recipe.py index 11fdc24..99b0814 100644 --- a/docs/_downloads/b94668e8c8d352e04672e8eaad180ae6/Captum_Recipe.py +++ b/docs/_downloads/b94668e8c8d352e04672e8eaad180ae6/Captum_Recipe.py @@ -1,190 +1,174 @@ """ -Model Interpretability using Captum +使用 Captum 进行模型可解释性 =================================== - """ - ###################################################################### -# Captum helps you understand how the data features impact your model -# predictions or neuron activations, shedding light on how your model -# operates. -# -# Using Captum, you can apply a wide range of state-of-the-art feature -# attribution algorithms such as \ ``Guided GradCam``\ and -# \ ``Integrated Gradients``\ in a unified way. -# -# In this recipe you will learn how to use Captum to: +# Captum 可以帮助您了解数据特征如何影响模型的预测或神经元激活,从而揭示模型的工作原理。 +# +# 使用 Captum,您可以统一地应用广泛的最先进的特征归因算法,如 ``Guided GradCam`` 和 ``Integrated Gradients``。 +# +# 在本教程中,您将学习如何使用 Captum: +# +# - 将图像分类器的预测归因于相应的图像特征。 +# - 可视化归因结果。 # -# - Attribute the predictions of an image classifier to their corresponding image features. -# - Visualize the attribution results. -# - ###################################################################### -# Before you begin +# 开始之前 # ---------------- -# - +# ###################################################################### -# Make sure Captum is installed in your active Python environment. Captum -# is available both on GitHub, as a ``pip`` package, or as a ``conda`` -# package. For detailed instructions, consult the installation guide at -# https://captum.ai/ -# - +# 确保在您的活跃 Python 环境中安装了 Captum。Captum 可以在 GitHub 上获取,也可以作为 ``pip`` 包或 ``conda`` 包获取。 +# 有关详细说明,请查阅安装指南 https://captum.ai/ +# ###################################################################### -# For a model, we use a built-in image classifier in PyTorch. Captum can -# reveal which parts of a sample image support certain predictions made by -# the model. 
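The discussion below refers to the model's top-3 predictions. As a hedged sketch of how those can be obtained (it assumes ``model`` and ``input_img`` as prepared in the code that follows):

import torch

with torch.no_grad():
    probs = torch.softmax(model(input_img), dim=1)
top_probs, top_classes = probs.topk(3, dim=1)
for p, c in zip(top_probs[0], top_classes[0]):
    print(f"class {c.item():4d}: probability {p.item():.3f}")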
-# +# 对于模型,我们使用 PyTorch 中的内置图像分类器。Captum 可以揭示样本图像的哪些部分支持了模型做出的某些预测。 +# +from io import BytesIO +import requests import torchvision -from torchvision import models, transforms from PIL import Image -import requests -from io import BytesIO +from torchvision import models, transforms -model = torchvision.models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval() +model = torchvision.models.resnet18( + weights=models.ResNet18_Weights.IMAGENET1K_V1 +).eval() -response = requests.get("https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg") +response = requests.get( + "https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg" +) img = Image.open(BytesIO(response.content)) -center_crop = transforms.Compose([ - transforms.Resize(256), - transforms.CenterCrop(224), -]) - -normalize = transforms.Compose([ - transforms.ToTensor(), # converts the image to a tensor with values between 0 and 1 - transforms.Normalize( # normalize to follow 0-centered imagenet pixel RGB distribution - mean=[0.485, 0.456, 0.406], - std=[0.229, 0.224, 0.225] - ) -]) +center_crop = transforms.Compose( + [ + transforms.Resize(256), + transforms.CenterCrop(224), + ] +) + +normalize = transforms.Compose( + [ + transforms.ToTensor(), # 将图像转换为值在 0 到 1 之间的张量 + transforms.Normalize( # 归一化以遵循 0 均值的 ImageNet 像素 RGB 分布 + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] + ), + ] +) input_img = normalize(center_crop(img)).unsqueeze(0) - ###################################################################### -# Computing Attribution +# 计算归因 # --------------------- -# - +# ###################################################################### -# Among the top-3 predictions of the models are classes 208 and 283 which -# correspond to dog and cat. -# -# Let us attribute each of these predictions to the corresponding part of -# the input, using Captum’s \ ``Occlusion``\ algorithm. -# +# 在模型的前 3 个预测中,类别 208 和 283 分别对应于狗和猫。 +# +# 让我们使用 Captum 的 ``Occlusion`` 算法将这些预测归因于输入的相应部分。 +# -from captum.attr import Occlusion +from captum.attr import Occlusion occlusion = Occlusion(model) -strides = (3, 9, 9) # smaller = more fine-grained attribution but slower -target=208, # Labrador index in ImageNet -sliding_window_shapes=(3,45, 45) # choose size enough to change object appearance -baselines = 0 # values to occlude the image with. 
0 corresponds to gray - -attribution_dog = occlusion.attribute(input_img, - strides = strides, - target=target, - sliding_window_shapes=sliding_window_shapes, - baselines=baselines) - - -target=283, # Persian cat index in ImageNet -attribution_cat = occlusion.attribute(input_img, - strides = strides, - target=target, - sliding_window_shapes=sliding_window_shapes, - baselines=0) - +strides = (3, 9, 9) # 步长越小,归因越细粒度,但速度越慢 +target = (208,) # ImageNet 中的拉布拉多索引 +sliding_window_shapes = (3, 45, 45) # 选择足以改变对象外观的大小 +baselines = 0 # 用于遮挡图像的值。0 对应灰色 + +attribution_dog = occlusion.attribute( + input_img, + strides=strides, + target=target, + sliding_window_shapes=sliding_window_shapes, + baselines=baselines, +) + + +target = (283,) # ImageNet 中的波斯猫索引 +attribution_cat = occlusion.attribute( + input_img, + strides=strides, + target=target, + sliding_window_shapes=sliding_window_shapes, + baselines=0, +) ###################################################################### -# Besides ``Occlusion``, Captum features many algorithms such as -# \ ``Integrated Gradients``\ , \ ``Deconvolution``\ , -# \ ``GuidedBackprop``\ , \ ``Guided GradCam``\ , \ ``DeepLift``\ , and -# \ ``GradientShap``\ . All of these algorithms are subclasses of -# ``Attribution`` which expects your model as a callable ``forward_func`` -# upon initialization and has an ``attribute(...)`` method which returns -# the attribution result in a unified format. -# -# Let us visualize the computed attribution results in case of images. -# - +# 除了 ``Occlusion`` 之外,Captum 还提供了许多算法,如 ``Integrated Gradients``、``Deconvolution``、 +# ``GuidedBackprop``、``Guided GradCam``、``DeepLift`` 和 ``GradientShap``。所有这些算法都是 ``Attribution`` 的子类, +# 在初始化时需要将您的模型作为可调用的 ``forward_func``传入,并具有 ``attribute(...)`` 方法,该方法以统一的格式返回归因结果。 +# +# 让我们可视化计算出的图像归因结果。 +# ###################################################################### -# Visualizing the Results +# 可视化结果 # ----------------------- -# - +# ###################################################################### -# Captum’s \ ``visualization``\ utility provides out-of-the-box methods -# to visualize attribution results both for pictorial and for textual -# inputs. 
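###############################################################################
# A brief hedged sketch (not in the original recipe) of the unified
# ``Attribution`` interface described above, using ``IntegratedGradients`` in
# place of ``Occlusion``; it assumes ``model`` and ``input_img`` from earlier.

from captum.attr import IntegratedGradients

integrated_gradients = IntegratedGradients(model)
attribution_dog_ig = integrated_gradients.attribute(input_img, target=208, n_steps=50)
print(attribution_dog_ig.shape)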
-# +# Captum 的 ``visualization`` 实用程序提供了开箱即用的方法,用于可视化图像和文本输入的归因结果。 +# import numpy as np from captum.attr import visualization as viz -# Convert the compute attribution tensor into an image-like numpy array -attribution_dog = np.transpose(attribution_dog.squeeze().cpu().detach().numpy(), (1,2,0)) +# 将计算出的归因张量转换为类似图像的 numpy 数组 +attribution_dog = np.transpose( + attribution_dog.squeeze().cpu().detach().numpy(), (1, 2, 0) +) vis_types = ["heat_map", "original_image"] -vis_signs = ["all", "all"] # "positive", "negative", or "all" to show both -# positive attribution indicates that the presence of the area increases the prediction score -# negative attribution indicates distractor areas whose absence increases the score - -_ = viz.visualize_image_attr_multiple(attribution_dog, - np.array(center_crop(img)), - vis_types, - vis_signs, - ["attribution for dog", "image"], - show_colorbar = True - ) - - -attribution_cat = np.transpose(attribution_cat.squeeze().cpu().detach().numpy(), (1,2,0)) - -_ = viz.visualize_image_attr_multiple(attribution_cat, - np.array(center_crop(img)), - ["heat_map", "original_image"], - ["all", "all"], # positive/negative attribution or all - ["attribution for cat", "image"], - show_colorbar = True - ) - +vis_signs = ["all", "all"] # "positive"、"negative" 或 "all" 以显示两者 +# 正归因表示该区域的存在会增加预测分数 +# 负归因表示该区域的缺失会增加预测分数 + +_ = viz.visualize_image_attr_multiple( + attribution_dog, + np.array(center_crop(img)), + vis_types, + vis_signs, + ["attribution for dog", "image"], + show_colorbar=True, +) + + +attribution_cat = np.transpose( + attribution_cat.squeeze().cpu().detach().numpy(), (1, 2, 0) +) + +_ = viz.visualize_image_attr_multiple( + attribution_cat, + np.array(center_crop(img)), + ["heat_map", "original_image"], + ["all", "all"], # 正/负归因或全部 + ["attribution for cat", "image"], + show_colorbar=True, +) ###################################################################### -# If your data is textual, ``visualization.visualize_text()`` offers a -# dedicated view to explore attribution on top of the input text. Find out -# more at http://captum.ai/tutorials/IMDB_TorchText_Interpret -# - +# 如果您的数据是文本,``visualization.visualize_text()`` 提供了一个专用视图,用于探索输入文本的归因。 +# 更多信息请访问 http://captum.ai/tutorials/IMDB_TorchText_Interpret +# ###################################################################### -# Final Notes +# 最后注意 # ----------- -# - +# ###################################################################### -# Captum can handle most model types in PyTorch across modalities -# including vision, text, and more. With Captum you can: \* Attribute a -# specific output to the model input as illustrated above. \* Attribute a -# specific output to a hidden-layer neuron (see Captum API reference). \* -# Attribute a hidden-layer neuron response to the model input (see Captum -# API reference). 
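###############################################################################
# A hedged sketch (not part of the original recipe) of the hidden-layer
# attribution mentioned above, using ``LayerGradCam`` on the last residual
# block of the ResNet-18 loaded earlier; the layer and target chosen here are
# illustrative only.

from captum.attr import LayerGradCam

layer_gradcam = LayerGradCam(model, model.layer4)
layer_attribution = layer_gradcam.attribute(input_img, target=208)
print(layer_attribution.shape)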
-# -# For complete API of the supported methods and a list of tutorials, -# consult our website http://captum.ai -# -# Another useful post by Gilbert Tanner: +# Captum 可以处理 PyTorch 中包括视觉、文本等各种模态的大多数模型类型。使用 Captum 您可以: +# * 将特定输出归因于模型输入,如上所示。 +# * 将特定输出归因于隐藏层神经元(参见 Captum API 参考)。 +# * 将隐藏层神经元响应归因于模型输入(参见 Captum API 参考)。 +# +# 有关支持方法的完整 API 和教程列表,请查阅我们的网站 http://captum.ai +# +# Gilbert Tanner 的另一篇有用文章: # https://gilberttanner.com/blog/interpreting-pytorch-models-with-captum -# +# diff --git a/docs/_downloads/d493dae89f8804b07cdf678f7d0c2dc6/tensorboard_with_pytorch.ipynb b/docs/_downloads/d493dae89f8804b07cdf678f7d0c2dc6/tensorboard_with_pytorch.ipynb index 86352e0..1e3b81f 100644 --- a/docs/_downloads/d493dae89f8804b07cdf678f7d0c2dc6/tensorboard_with_pytorch.ipynb +++ b/docs/_downloads/d493dae89f8804b07cdf678f7d0c2dc6/tensorboard_with_pytorch.ipynb @@ -15,14 +15,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# How to use TensorBoard with PyTorch\nTensorBoard is a visualization toolkit for machine learning experimentation. \nTensorBoard allows tracking and visualizing metrics such as loss and accuracy, \nvisualizing the model graph, viewing histograms, displaying images and much more. \nIn this tutorial we are going to cover TensorBoard installation, \nbasic usage with PyTorch, and how to visualize data you logged in TensorBoard UI.\n\n## Installation\nPyTorch should be installed to log models and metrics into TensorBoard log \ndirectory. The following command will install PyTorch 1.4+ via \nAnaconda (recommended):\n\n```sh\n$ conda install pytorch torchvision -c pytorch\n```\nor pip\n\n```sh\n$ pip install torch torchvision\n```\n" + "\n# \u5982\u4f55\u5728PyTorch\u4e2d\u4f7f\u7528TensorBoard\nTensorBoard\u662f\u4e00\u4e2a\u7528\u4e8e\u673a\u5668\u5b66\u4e60\u5b9e\u9a8c\u7684\u53ef\u89c6\u5316\u5de5\u5177\u5305\u3002\nTensorBoard\u5141\u8bb8\u8ddf\u8e2a\u548c\u53ef\u89c6\u5316\u6307\u6807,\u5982\u635f\u5931\u548c\u51c6\u786e\u7387,\n\u53ef\u89c6\u5316\u6a21\u578b\u56fe,\u67e5\u770b\u76f4\u65b9\u56fe,\u663e\u793a\u56fe\u50cf\u7b49\u3002\n\u5728\u672c\u6559\u7a0b\u4e2d,\u6211\u4eec\u5c06\u4ecb\u7ecdTensorBoard\u7684\u5b89\u88c5\u3001\n\u5728PyTorch\u4e2d\u7684\u57fa\u672c\u7528\u6cd5,\u4ee5\u53ca\u5982\u4f55\u5728TensorBoard UI\u4e2d\u53ef\u89c6\u5316\u60a8\u8bb0\u5f55\u7684\u6570\u636e\u3002\n\n## \u5b89\u88c5\n\u5e94\u5b89\u88c5PyTorch\u4ee5\u5c06\u6a21\u578b\u548c\u6307\u6807\u8bb0\u5f55\u5230TensorBoard\u65e5\u5fd7\n\u76ee\u5f55\u3002\u4ee5\u4e0b\u547d\u4ee4\u5c06\u901a\u8fc7Anaconda(\u63a8\u8350)\u5b89\u88c5PyTorch 1.4+:\n\n```sh\n$ conda install pytorch torchvision -c pytorch\n```\n\u6216\u8005\u4f7f\u7528pip:\n\n```sh\n$ pip install torch torchvision\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Using TensorBoard in PyTorch\n\nLet\u2019s now try using TensorBoard with PyTorch! 
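###############################################################################
# A small self-contained sketch (not part of the original tutorial) of two
# other logging calls ``SummaryWriter`` offers besides the scalar logging shown
# in this tutorial; the tags and ``log_dir`` are illustrative only.

import torch
from torch.utils.tensorboard import SummaryWriter

demo_writer = SummaryWriter(log_dir="runs/extra_visualizations")
demo_writer.add_histogram("example/weights", torch.randn(1000), global_step=0)
demo_writer.add_image("example/noise_image", torch.rand(3, 64, 64), global_step=0)
demo_writer.close()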
Before logging anything, \nwe need to create a ``SummaryWriter`` instance.\n\n\n" + "## \u5728PyTorch\u4e2d\u4f7f\u7528TensorBoard\n\n\u73b0\u5728\u8ba9\u6211\u4eec\u5c1d\u8bd5\u5728PyTorch\u4e2d\u4f7f\u7528TensorBoard!\u5728\u8bb0\u5f55\u4efb\u4f55\u5185\u5bb9\u4e4b\u524d,\n\u6211\u4eec\u9700\u8981\u521b\u5efa\u4e00\u4e2a ``SummaryWriter`` \u5b9e\u4f8b\u3002\n\n\n" ] }, { @@ -33,21 +33,21 @@ }, "outputs": [], "source": [ - "import torch\nfrom torch.utils.tensorboard import SummaryWriter\nwriter = SummaryWriter()" + "import torch\nfrom torch.utils.tensorboard import SummaryWriter\n\nwriter = SummaryWriter()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Writer will output to ``./runs/`` directory by default.\n\n\n" + "\u5199\u5165\u5668\u9ed8\u8ba4\u5c06\u8f93\u51fa\u5230 ``./runs/`` \u76ee\u5f55\u3002\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Log scalars\n\nIn machine learning, it\u2019s important to understand key metrics such as \nloss and how they change during training. Scalar helps to save \nthe loss value of each training step, or the accuracy after each epoch. \n\nTo log a scalar value, use \n``add_scalar(tag, scalar_value, global_step=None, walltime=None)``. \nFor example, lets create a simple linear regression training, and \nlog loss value using ``add_scalar``\n\n\n" + "## \u8bb0\u5f55\u6807\u91cf\n\n\u5728\u673a\u5668\u5b66\u4e60\u4e2d,\u4e86\u89e3\u5173\u952e\u6307\u6807(\u5982\u635f\u5931)\u53ca\u5176\u5728\u8bad\u7ec3\u671f\u95f4\u7684\u53d8\u5316\u975e\u5e38\u91cd\u8981\u3002\n\u6807\u91cf\u53ef\u7528\u4e8e\u4fdd\u5b58\u6bcf\u4e2a\u8bad\u7ec3\u6b65\u9aa4\u7684\u635f\u5931\u503c\u6216\u6bcf\u4e2aepoch\u7684\u51c6\u786e\u7387\u3002\n\n\u8981\u8bb0\u5f55\u6807\u91cf\u503c,\u8bf7\u4f7f\u7528\n``add_scalar(tag, scalar_value, global_step=None, walltime=None)``\u3002\n\u4f8b\u5982,\u8ba9\u6211\u4eec\u521b\u5efa\u4e00\u4e2a\u7b80\u5355\u7684\u7ebf\u6027\u56de\u5f52\u8bad\u7ec3,\u5e76\n\u4f7f\u7528 ``add_scalar`` \u8bb0\u5f55\u635f\u5931\u503c\n\n\n" ] }, { @@ -58,14 +58,14 @@ }, "outputs": [], "source": [ - "x = torch.arange(-5, 5, 0.1).view(-1, 1)\ny = -5 * x + 0.1 * torch.randn(x.size())\n\nmodel = torch.nn.Linear(1, 1)\ncriterion = torch.nn.MSELoss()\noptimizer = torch.optim.SGD(model.parameters(), lr = 0.1)\n\ndef train_model(iter):\n for epoch in range(iter):\n y1 = model(x)\n loss = criterion(y1, y)\n writer.add_scalar(\"Loss/train\", loss, epoch)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n \ntrain_model(10)\nwriter.flush()" + "x = torch.arange(-5, 5, 0.1).view(-1, 1)\ny = -5 * x + 0.1 * torch.randn(x.size())\n\nmodel = torch.nn.Linear(1, 1)\ncriterion = torch.nn.MSELoss()\noptimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n\n\ndef train_model(iter):\n for epoch in range(iter):\n y1 = model(x)\n loss = criterion(y1, y)\n writer.add_scalar(\"Loss/train\", loss, epoch)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n\ntrain_model(10)\nwriter.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Call ``flush()`` method to make sure that all pending events \nhave been written to disk.\n\nSee [torch.utils.tensorboard tutorials](https://pytorch.org/docs/stable/tensorboard.html) \nto find more TensorBoard visualization types you can log.\n\nIf you do not need the summary writer anymore, call ``close()`` method.\n\n\n" + "\u8c03\u7528 ``flush()`` \u65b9\u6cd5\u4ee5\u786e\u4fdd\u6240\u6709\u5f85\u5904\u7406\u4e8b\u4ef6\n\u5df2\u5199\u5165\u78c1\u76d8\u3002\n\n\u8bf7\u53c2\u9605 
[torch.utils.tensorboard \u6559\u7a0b](https://pytorch.org/docs/stable/tensorboard.html)\n\u4ee5\u4e86\u89e3\u60a8\u53ef\u4ee5\u8bb0\u5f55\u7684\u66f4\u591aTensorBoard\u53ef\u89c6\u5316\u7c7b\u578b\u3002\n\n\u5982\u679c\u60a8\u4e0d\u518d\u9700\u8981\u6458\u8981\u5199\u5165\u5668,\u8bf7\u8c03\u7528 ``close()`` \u65b9\u6cd5\u3002\n\n\n" ] }, { @@ -83,14 +83,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Run TensorBoard\n\nInstall TensorBoard through the command line to visualize data you logged\n\n```sh\npip install tensorboard\n```\nNow, start TensorBoard, specifying the root log directory you used above. \nArgument ``logdir`` points to directory where TensorBoard will look to find \nevent files that it can display. TensorBoard will recursively walk \nthe directory structure rooted at ``logdir``, looking for ``.*tfevents.*`` files.\n\n```sh\ntensorboard --logdir=runs\n```\nGo to the URL it provides OR to [http://localhost:6006/](http://localhost:6006/)\n\n\n\nThis dashboard shows how the loss and accuracy change with every epoch. \nYou can use it to also track training speed, learning rate, and other \nscalar values. It\u2019s helpful to compare these metrics across different \ntraining runs to improve your model.\n\n\n" + "## \u8fd0\u884cTensorBoard\n\n\u901a\u8fc7\u547d\u4ee4\u884c\u5b89\u88c5TensorBoard\u4ee5\u53ef\u89c6\u5316\u60a8\u8bb0\u5f55\u7684\u6570\u636e\n\n```sh\npip install tensorboard\n```\n\u73b0\u5728,\u542f\u52a8TensorBoard,\u6307\u5b9a\u60a8\u4e4b\u524d\u4f7f\u7528\u7684\u6839\u65e5\u5fd7\u76ee\u5f55\u3002\n\u53c2\u6570 ``logdir`` \u6307\u5411TensorBoard\u5c06\u67e5\u627e\u53ef\u663e\u793a\u7684\u4e8b\u4ef6\u6587\u4ef6\u7684\u76ee\u5f55\u3002\nTensorBoard\u5c06\u9012\u5f52\u904d\u5386 ``logdir`` \u6839\u76ee\u5f55\u4e0b\u7684\u76ee\u5f55\u7ed3\u6784,\u5bfb\u627e ``.*tfevents.*`` \u6587\u4ef6\u3002\n\n```sh\ntensorboard --logdir=runs\n```\n\u8f6c\u5230\u5b83\u63d0\u4f9b\u7684URL\u6216 [http://localhost:6006/](http://localhost:6006/)\n\n\n\n\u6b64\u4eea\u8868\u677f\u663e\u793a\u4e86\u635f\u5931\u548c\u51c6\u786e\u7387\u5982\u4f55\u968f\u7740\u6bcf\u4e2aepoch\u800c\u53d8\u5316\u3002\n\u60a8\u53ef\u4ee5\u4f7f\u7528\u5b83\u6765\u8ddf\u8e2a\u8bad\u7ec3\u901f\u5ea6\u3001\u5b66\u4e60\u7387\u548c\u5176\u4ed6\u6807\u91cf\u503c\u3002\n\u6bd4\u8f83\u4e0d\u540c\u8bad\u7ec3\u8fd0\u884c\u7684\u8fd9\u4e9b\u6307\u6807\u6709\u52a9\u4e8e\u6539\u8fdb\u60a8\u7684\u6a21\u578b\u3002\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Learn More\n\n- [torch.utils.tensorboard](https://pytorch.org/docs/stable/tensorboard.html) docs\n- [Visualizing models, data, and training with TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html) tutorial\n\n\n" + "## \u4e86\u89e3\u66f4\u591a\n\n- [torch.utils.tensorboard](https://pytorch.org/docs/stable/tensorboard.html) \u6587\u6863\n- [\u4f7f\u7528TensorBoard\u53ef\u89c6\u5316\u6a21\u578b\u3001\u6570\u636e\u548c\u8bad\u7ec3](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html) \u6559\u7a0b\n\n\n" ] } ], diff --git a/docs/_downloads/db0de66558b1ca13e18495862bf4b024/swap_tensors.py b/docs/_downloads/db0de66558b1ca13e18495862bf4b024/swap_tensors.py index d3b90c6..d0f3007 100644 --- a/docs/_downloads/db0de66558b1ca13e18495862bf4b024/swap_tensors.py +++ b/docs/_downloads/db0de66558b1ca13e18495862bf4b024/swap_tensors.py @@ -1,81 +1,77 @@ """ -Extension points in ``nn.Module`` for ``load_state_dict`` and tensor subclasses +在 ``nn.Module`` 中为 ``load_state_dict`` 和张量子类提供扩展点 
=============================================================================== -**Author:** `Mikayla Gawarecki `_ +**作者:** `Mikayla Gawarecki `_ -This recipe introduces a new utility function ``torch.utils.swap_tensors`` -as well as two new extension points where it has been integrated in -``nn.Module``: +本教程介绍了一个新的实用函数 ``torch.utils.swap_tensors``, +以及在 ``nn.Module`` 中集成它的两个新扩展点: -* ``nn.Module.to()`` and related methods +* ``nn.Module.to()`` 和相关方法 * ``nn.Module.load_state_dict()`` .. note:: - This recipe requires PyTorch 2.3.0 or later. + 本教程需要 PyTorch 2.3.0 或更高版本。 """ ############################################################################### # ``torch.utils.swap_tensors`` # ---------------------------- -# ``torch.utils.swap_tensors`` (hereafter referred to as ``swap_tensors``) is a -# utility function that takes in two Python tensors and swaps them. +# ``torch.utils.swap_tensors``(以下简称为 ``swap_tensors``) 是一个 +# 实用函数,它接受两个 Python 张量并交换它们。 import torch import torch.nn as nn + t1 = torch.arange(2) t2 = torch.arange(3) -print(f"Before swapping, t1: {t1}, t2: {t2}") +print(f"交换前, t1: {t1}, t2: {t2}") torch.utils.swap_tensors(t1, t2) -print(f"After swapping, t1: {t1}, t2: {t2}") +print(f"交换后, t1: {t1}, t2: {t2}") ################################################################################ -# More specifically, ``swap_tensors`` swaps the Python ``__class__``, ``__dict__`` -# and ``__slots__`` of the two tensors, as well as their associated ``at::Tensor``. +# 更具体地说,``swap_tensors`` 交换了两个张量的 Python ``__class__``、``__dict__`` +# 和 ``__slots__``,以及它们相关的 ``at::Tensor``。 # # -# Application to ``nn.Module`` +# 应用于 ``nn.Module`` # ---------------------------- -# This utility is pertinent to ``nn.Module`` when a Python object outside -# of the module holds a reference to parameters of the module. If an ``nn.Module`` -# modifies any of its parameters out of place, the object holding references to -# the parameters will not see the change. A classic example of this is the -# optimizer, which holds a reference to the parameters of the ``nn.Module``. -# This leads to a silent correctness issue where the ``optimizer.step()`` will -# run without error but the weights of the ``nn.Module`` will not be updated. +# 当 ``nn.Module`` 之外的 Python 对象持有该模块参数的引用时,此实用函数就很有用。 +# 如果 ``nn.Module`` 就地修改了任何参数,持有这些参数引用的对象将无法看到更改。 +# 一个典型的例子是优化器,它持有 ``nn.Module`` 参数的引用。 +# 这会导致一个潜在的正确性问题,即 ``optimizer.step()`` 会无错误运行, +# 但 ``nn.Module`` 的权重不会被更新。 mod = torch.nn.Linear(1, 2, bias=False) optimizer = torch.optim.SGD(mod.parameters()) -print(f"weight in mod: {mod.weight}") -print(f"weight in optimizer: {optimizer.param_groups[0]['params']}") +print(f"mod 中的权重: {mod.weight}") +print(f"优化器中的权重: {optimizer.param_groups[0]['params']}") mod.weight = torch.nn.Parameter(2 * mod.weight) -print(f"weight in mod: {mod.weight}") -print(f"weight in optimizer: {optimizer.param_groups[0]['params']}") +print(f"mod 中的权重: {mod.weight}") +print(f"优化器中的权重: {optimizer.param_groups[0]['params']}") ################################################################################ -# ``nn.Module.to()`` and related methods +# ``nn.Module.to()`` 和相关方法 # -------------------------------------- -# This includes methods that change the device of the module (such as ``nn.Module.cpu()``), -# methods that change the ``dtype`` of the module (such as ``nn.Module.float()``) -# as well as methods that allow the module to be materialized -# (such as ``nn.Module.to_empty()``). 
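###############################################################################
# A short illustrative sketch (not part of the original recipe) of the
# materialization case mentioned above: ``to_empty()`` gives a module created
# on the ``meta`` device real (uninitialized) storage on a concrete device.

import torch
import torch.nn as nn

with torch.device("meta"):
    skeleton = nn.Linear(3, 5)
materialized = skeleton.to_empty(device="cpu")
print(materialized.weight.device)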
+# 这包括改变模块设备的方法(如 ``nn.Module.cpu()``)、 +# 改变模块 ``dtype`` 的方法(如 ``nn.Module.float()``)、 +# 以及允许模块实例化的方法(如 ``nn.Module.to_empty()``)。 # -# At first glance, it might be non-intuitive that these methods are able to -# modify the parameters of the module in-place. The existing approach has been -# to use a nasty hack dating back from the first days of PyTorch. +# 乍一看,这些方法能够就地修改模块的参数可能看起来不太直观。 +# 现有的方法是使用一种追溯到 PyTorch 最初几天的丑陋黑客手段。 # -# Notably, the existing approach does not work in these cases: +# 值得注意的是,现有方法在以下情况下无法工作: # -# * when using ``__torch_dispatch__`` subclasses -# * when ``param`` and ``new_param`` do not have the same Python ``type()`` -# * For tensors with special C++ representations (such as sparse tensors and ``XLA`` tensors) +# * 使用 ``__torch_dispatch__`` 子类 +# * ``param`` 和 ``new_param`` 的 Python ``type()`` 不同 +# * 对于具有特殊 C++ 表示的张量(如稀疏张量和 ``XLA`` 张量) # -# In the following part of this recipe, we will define a toy ``__torch_dispatch__`` -# subclass ``MyQuantizedLinearWeight`` that represents quantized linear weights. -# This subclass will be used for illustration purposes throughout the rest of -# the tutorial. For brevity, we omit most of the ``__torch_dispatch__`` -# implementation. +# 在本教程的下一部分,我们将定义一个玩具 ``__torch_dispatch__`` 子类 ``MyQuantizedLinearWeight`` +# 来表示量化的线性权重。在本教程的剩余部分,我们将使用这个子类进行说明。 +# 为简洁起见,我们省略了大部分 ``__torch_dispatch__`` 实现。 + aten = torch.ops.aten + class MyQuantizedLinearWeight(torch.Tensor): @staticmethod def __new__(cls, elem, scale): @@ -86,7 +82,8 @@ def __new__(cls, elem, scale): layout=elem.layout, device=elem.device, strides=elem.stride(), - storage_offset=elem.storage_offset()) + storage_offset=elem.storage_offset(), + ) def __init__(self, elem: torch.Tensor, scale: float): self.elem = elem @@ -100,42 +97,39 @@ def __torch_dispatch__(cls, func, types, args, kwargs): if func in (aten.detach.default, aten._to_copy.default): new_elem = func(args[0].elem, *args[1:], **kwargs) return cls(new_elem, args[0].scale) - # Implementations for certain ops would be added to ``OP_TABLE``. - # We omit this for brevity. + # 某些操作的实现将添加到 ``OP_TABLE``。 + # 为简洁起见,我们在此省略。 OP_TABLE = dict() if func in OP_TABLE: - return OP_TABLE[func](func, args, kwargs) - raise NotImplementedError(f"Unsupported function {func}") + return OP_TABLE[func](func, args, kwargs) + raise NotImplementedError(f"不支持的函数 {func}") + ################################################################################# -# Let us create an ``nn.Linear`` layer of ``dtype`` ``torch.float32`` where the weight is -# a ``MyQuantizedLinearWeight`` and try to convert it to ``torch.bfloat16``. -# Observe that the weight's ``dtype`` changes as expected. However, the ``dtype`` -# of the subclass' payload (``elem``) does not change. 
+# 让我们创建一个 ``dtype`` 为 ``torch.float32`` 的 ``nn.Linear`` 层, +# 其权重是 ``MyQuantizedLinearWeight``。然后尝试将其转换为 ``torch.bfloat16``。 +# 观察到权重的 ``dtype`` 如预期般改变了。但是子类的有效载荷(``elem``)的 ``dtype`` 没有改变。 m = nn.Linear(3, 5, dtype=torch.float32) m.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5)) -print(f"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") +print(f"之前: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") m.bfloat16() -print(f"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") +print(f"之后: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") print(f"m.weight.dtype: {m.weight.dtype}") print(f"m.weight.elem.dtype: {m.weight.elem.dtype}") print(f"m.bias.dtype: {m.bias.dtype}") ################################################################################ -# To this end, we introduce a global config -# ``torch.__future__.set_swap_module_params_on_conversion`` that will use -# ``swap_tensors`` to swap the parameters of the module while preserving -# references in place of ``.data`` setting. When this config is set, -# ``swap_tensors`` will be used during the conversion, which ensures that -# the ``dtype`` of the payload is properly converted. +# 为此,我们引入了一个全局配置 ``torch.__future__.set_swap_module_params_on_conversion`` +# 它将使用 ``swap_tensors`` 交换模块的参数,同时保留 ``.data`` 设置中的引用。 +# 设置此配置后,在转换期间将使用 ``swap_tensors``,从而确保有效载荷的 ``dtype`` 正确转换。 torch.__future__.set_swap_module_params_on_conversion(True) m = nn.Linear(3, 5, dtype=torch.float32) m.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5)) -print(f"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") +print(f"之前: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") m.bfloat16() -print(f"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") +print(f"之后: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") print(f"m.weight.dtype: {m.weight.dtype}") print(f"m.weight.elem.dtype: {m.weight.elem.dtype}") print(f"m.bias.dtype: {m.bias.dtype}") @@ -144,42 +138,33 @@ def __torch_dispatch__(cls, func, types, args, kwargs): ################################################################################ # ``nn.Module.load_state_dict()`` # -------------------------------- -# Depending on the value of the ``assign`` keyword argument passed -# to ``load_state_dict()``, there are two ways to load the ``state_dict``: +# 根据传递给 ``load_state_dict()`` 的 ``assign`` 关键字参数的值, +# 有两种方式加载 ``state_dict``: # -# * ``assign=False``: preserves the properties of ``module.param`` and only takes the values -# from ``state_dict['param_name']`` -# * ``assign=True``: preserves the properties and values of ``state_dict['param_name']``. +# * ``assign=False``: 保留 ``module.param`` 的属性,只从 ``state_dict['param_name']`` 中获取值 +# * ``assign=True``: 保留 ``state_dict['param_name']`` 的属性和值。 # # -# Previously, these were implemented with in-place ``copy_`` and ``__setattr__`` respectively. -# With the existing implementation, each approach had its own limitations -- ``assign=False`` -# imposes the constraint that the type of the parameter in the ``state_dict`` must -# be the same as the type of the parameter in the module while ``assign=True`` imposes -# the constraint that anything that holds references to the module's parameters must -# be initialized after ``nn.Module.load_state_dict()``. 
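###############################################################################
# A small hedged sketch (not part of the original recipe) of the ordering
# constraint described above: with ``assign=True`` the parameter objects in the
# module are replaced, so anything referencing them, such as an optimizer,
# should only be created after loading.

import torch
import torch.nn as nn

module = nn.Linear(3, 5)
checkpoint_state = nn.Linear(3, 5).state_dict()
module.load_state_dict(checkpoint_state, assign=True)
# Create the optimizer after load_state_dict so it holds the new parameters.
optimizer = torch.optim.SGD(module.parameters(), lr=1e-2)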
+# 之前,这些分别是通过就地 ``copy_`` 和 ``__setattr__`` 实现的。 +# 在现有实现中,每种方法都有自己的限制 - ``assign=False`` 要求 ``state_dict`` 中的参数类型 +# 必须与模块中的参数类型相同,而 ``assign=True`` 要求在 ``nn.Module.load_state_dict()`` 之后 +# 初始化任何持有模块参数引用的对象。 # -# Now, we address both constraints by adding a ``swap_tensors`` path to ``load_state_dict()`` -# and introducing a new extension point ``torch.Tensor.module_load(self, other, assign=False)``. -# When the ``swap_tensors`` path is enabled via the ``__future__`` mentioned above, -# we can use a ``__torch_function__`` handler for ``module_load`` to apply a -# custom transformation to the value in the ``state_dict``. The result of this -# transformation will be swapped with the parameter in the module. +# 现在,我们通过在 ``load_state_dict()`` 中添加 ``swap_tensors`` 路径并引入新的扩展点 +# ``torch.Tensor.module_load(self, other, assign=False)`` 来解决这两个限制。 +# 当启用上述 ``__future__`` 时,我们可以使用 ``module_load`` 的 ``__torch_function__`` 处理程序 +# 对 ``state_dict`` 中的值应用自定义转换。转换的结果将与模块中的参数交换。 # -# In the following example, we will use the ``MyQuantizedLinearWeight`` subclass -# defined above to illustrate how we can use these features to apply a -# custom quantization scheme to the weights of a linear layer when -# loading the ``state_dict``. +# 在下面的示例中,我们将使用上面定义的 ``MyQuantizedLinearWeight`` 子类 +# 来说明如何使用这些功能在加载 ``state_dict`` 时对线性层的权重应用自定义量化方案。 # -# Recall that the ``__torch_function__`` handler for ``module_load`` will be -# invoked if either ``self`` or ``other`` (in this case ``param`` or -# ``state_dict[param_key]``) are ``MyQuantizedLinearWeight`` subclasses. +# 回顾一下,如果 ``self`` 或 ``other``(在本例中是 ``param`` 或 ``state_dict[param_key]``) +# 是 ``MyQuantizedLinearWeight`` 子类,则会调用 ``module_load`` 的 ``__torch_function__`` 处理程序。 # -# Assume that we expect the ``state_dict`` to contain plain tensors and the -# module to contain ``MyQuantizedLinearWeight`` parameters where we want the -# tensors in the ``state_dict`` to be transformed into the subclass. Then we -# can define a ``__torch_function__`` handler for ``torch.Tensor.module_load`` -# as such: +# 假设我们期望 ``state_dict`` 包含普通张量,而模块包含 ``MyQuantizedLinearWeight`` 参数, +# 我们希望将 ``state_dict`` 中的张量转换为子类。那么我们可以为 ``torch.Tensor.module_load`` 定义 +# 一个 ``__torch_function__`` 处理程序,如下所示: + @classmethod def custom_torch_function(cls, func, types, args=(), kwargs=None): @@ -191,51 +176,48 @@ def custom_torch_function(cls, func, types, args=(), kwargs=None): return MyQuantizedLinearWeight(src, dest.scale) else: with torch._C.DisableTorchFunctionSubclass(): - return func(*args, **kwargs) + return func(*args, **kwargs) + MyQuantizedLinearWeight.__torch_function__ = custom_torch_function ################################################################################# -# First, let us create a skeleton of a model on the meta device to avoid -# materializing storages. We convert all weights in the modules to -# ``MyQuantizedLinearWeight`` subclasses while leaving biases intact. +# 首先,让我们在 meta 设备上创建一个模型框架,以避免实例化存储。 +# 我们将模块中的所有权重转换为 ``MyQuantizedLinearWeight`` 子类,同时保留偏置不变。 + def fn(m): if isinstance(m, nn.Linear): requires_grad = m.weight.requires_grad m.weight = torch.nn.Parameter( - MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad - ) + MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad + ) + with torch.device("meta"): m = nn.Linear(3, 5) m.apply(fn) ################################################################################# -# We can then load the ``state_dict``. 
Observe that we use ``assign=True`` because -# for biases, we want to preserve the properties of the tensor in the ``state_dict`` -# (for example, we do not want the bias to be on the ``meta`` device after loading). +# 然后我们可以加载 ``state_dict``。注意我们使用 ``assign=True``,因为对于偏置, +# 我们希望保留 ``state_dict`` 中张量的属性(例如,我们不希望偏置在加载后位于 ``meta`` 设备上)。 torch.__future__.set_swap_module_params_on_conversion(True) -print(f"Before: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") -print(f"m.state_dict() before load_state_dict():\n {m.state_dict()}") +print(f"之前: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") +print(f"load_state_dict() 之前的 m.state_dict():\n {m.state_dict()}") state_dict = nn.Linear(3, 5).state_dict() print(f"state_dict:\n {state_dict}") m.load_state_dict(state_dict, assign=True) -print(f"After: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") -print(f"m.state_dict() after load_state_dict():\n {m.state_dict()}") +print(f"之后: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") +print(f"load_state_dict() 之后的 m.state_dict():\n {m.state_dict()}") ################################################################################# -# The above is a toy example of how we can use the new extension point in -# ``nn.Module.load_state_dict()``. One can also imagine alternate scenarios such -# as when we have tensor subclasses in the ``state_dict`` and plain ``nn.Parameters``/ -# tensors in the module or when both are tensor subclasses. Based on the use -# case, we can define the ``__torch_function__`` handler for ``module_load`` -# to apply the transforms as needed. +# 上面是一个如何使用 ``nn.Module.load_state_dict()`` 中的新扩展点的玩具示例。 +# 我们还可以想象其他场景,例如当 ``state_dict`` 中有张量子类而模块中有普通 ``nn.Parameters``/张量时, +# 或者两者都是张量子类时。根据使用场景,我们可以定义 ``module_load`` 的 ``__torch_function__`` 处理程序 +# 来应用所需的转换。 # -# Conclusion +# 结论 # ---------- -# In this recipe, we learned about ``swap_tensors``, the importance -# of preserving references for parameters in ``nn.Module`` as well as how to -# use the two new extension points that are gated by -# ``torch.__future__.set_swap_module_params_on_conversion``. +# 在本教程中,我们学习了 ``swap_tensors``、在 ``nn.Module`` 中保留参数引用的重要性, +# 以及如何使用由 ``torch.__future__.set_swap_module_params_on_conversion`` 控制的两个新扩展点。 diff --git a/docs/_downloads/edc021e5f7c55efead2a89b91cdfae27/module_load_state_dict_tips.py b/docs/_downloads/edc021e5f7c55efead2a89b91cdfae27/module_load_state_dict_tips.py index 17c812b..1ed96c3 100644 --- a/docs/_downloads/edc021e5f7c55efead2a89b91cdfae27/module_load_state_dict_tips.py +++ b/docs/_downloads/edc021e5f7c55efead2a89b91cdfae27/module_load_state_dict_tips.py @@ -1,26 +1,25 @@ """ - -Tips for Loading an ``nn.Module`` from a Checkpoint +从检查点加载 ``nn.Module`` 的技巧 =================================================== -**Author:** `Mikayla Gawarecki `_ +**作者:** `Mikayla Gawarecki `_ -If you're loading a checkpoint and want to reduce compute and memory as much as possible, -this tutorial shares some recommended practices. In particular, we will discuss +如果你要加载一个检查点并希望尽可能减少计算和内存的使用,本教程将分享一些推荐的做法。特别是我们将讨论以下几点: -1. The ``mmap`` keyword argument on ``torch.load`` -2. The ``torch.device()`` context manager -3. The ``assign`` keyword argument on ``nn.Module.load_state_dict()`` +1. ``torch.load`` 中的 ``mmap`` 关键字参数 +2. ``torch.device()`` 上下文管理器 +3. ``nn.Module.load_state_dict()`` 中的 ``assign`` 关键字参数 .. note:: - This recipe requires PyTorch 2.1.0 or later. 
+ 本教程需要 PyTorch 2.1.0 或更高版本。 """ +import time ############################################################################### -# Let us consider a simple ``nn.Module`` that contains a list of Linear layers: +# 让我们考虑一个简单的 ``nn.Module``,它包含一个线性层列表: import torch from torch import nn -import time + class SomeModule(torch.nn.Module): def __init__(self, size): @@ -32,141 +31,122 @@ def forward(self, x): m = SomeModule(1000) -torch.save(m.state_dict(), 'checkpoint.pth') +torch.save(m.state_dict(), "checkpoint.pth") ################################################################################# -# The following snippet demonstrates the use of the the ``mmap`` keyword argument -# to ``torch.load``, the ``torch.device()`` context manager and the ``assign`` -# keyword argument to ``nn.Module.load_state_dict()``. +# 以下代码片段演示了如何使用 ``torch.load`` 中的 ``mmap`` 关键字参数、``torch.device()`` 上下文管理器和 ``nn.Module.load_state_dict()`` 中的 ``assign`` 关键字参数。 -state_dict = torch.load('checkpoint.pth', mmap=True) -with torch.device('meta'): - meta_m = SomeModule(1000) +state_dict = torch.load("checkpoint.pth", mmap=True) +with torch.device("meta"): + meta_m = SomeModule(1000) meta_m.load_state_dict(state_dict, assign=True) ############################################################################# -# Compare the snippet below to the one above: +# 将下面的代码片段与上面的进行比较: -state_dict = torch.load('checkpoint.pth') +state_dict = torch.load("checkpoint.pth") m = SomeModule(1000) m.load_state_dict(state_dict) ############################################################################# -# The second example does not use any of the features listed above and will be -# less compute and memory efficient for loading a checkpoint. In the following -# sections, we will discuss each of the features in further detail. +# 第二个示例没有使用上面列出的任何特性,因此在加载检查点时计算和内存效率会较低。在下面的部分中,我们将详细讨论每个特性。 ##################################################################################### -# Using ``torch.load(mmap=True)`` +# 使用 ``torch.load(mmap=True)`` # ------------------------------- -# First, let us consider what happens when we load the checkpoint with ``torch.load``. -# When we save a checkpoint with ``torch.save``, tensor storages are tagged with the device they are -# saved on. With ``torch.load``, tensor storages will be loaded to the device -# they were tagged with (unless this behavior is overridden using the -# ``map_location`` flag). For ease of explanation, let us assume that the tensors -# were saved on CPU. This means that on the first line all tensor storages will be -# loaded into CPU RAM, which can be undesirable when: -# -# * CPU RAM is smaller than the size of the checkpoint. -# * Waiting for the entire checkpoint to be loaded into RAM before performing, for example, some per-tensor processing. +# 首先,让我们考虑使用 ``torch.load`` 加载检查点时会发生什么。 +# 当我们使用 ``torch.save`` 保存检查点时,张量存储会被标记为保存时所在的设备。 +# 使用 ``torch.load`` 时,张量存储将被加载到它们被标记的设备上(除非使用 ``map_location`` 标志覆盖此行为)。 +# 为了解释方便,我们假设张量是保存在 CPU 上的。这意味着在第一行中,所有张量存储将被加载到 CPU 内存中,在以下情况下这是不可取的: + +# * CPU 内存小于检查点的大小。 +# * 在执行一些每张量处理之前等待整个检查点被加载到内存中。 start_time = time.time() -state_dict = torch.load('checkpoint.pth') +state_dict = torch.load("checkpoint.pth") end_time = time.time() -print(f"loading time without mmap={end_time - start_time}") +print(f"不使用 mmap 的加载时间={end_time - start_time}") ################################################################################# -# The ``mmap`` keyword argument to ``torch.load`` attempts to solve the above two -# problems. 
As its name implies, the ``mmap`` keyword argument to ``torch.load`` -# makes use of an `mmap call `_ -# which maps a file on disk into virtual memory and lets the OS handle loading and -# unloading into physical memory automatically. When this flag is passed, tensor -# storages will be memory-mapped. +# ``torch.load`` 中的 ``mmap`` 关键字参数试图解决上述两个问题。 +# 顾名思义,``torch.load`` 中的 ``mmap`` 关键字参数使用了 `mmap 调用 `_, +# 它将磁盘上的文件映射到虚拟内存中,并让操作系统自动处理加载和卸载到物理内存。 +# 当传递此标志时,张量存储将被内存映射。 start_time = time.time() -state_dict = torch.load('checkpoint.pth', mmap=True) +state_dict = torch.load("checkpoint.pth", mmap=True) end_time = time.time() -print(f"loading time with mmap={end_time - start_time}") +print(f"使用 mmap 的加载时间={end_time - start_time}") + ###################################################################################### -# As mentioned above, one can use this argument to do per-tensor processing on a -# checkpoint without loading all tensor storages into CPU memory upfront. For example: +# 如上所述,可以使用此参数在不将所有张量存储加载到 CPU 内存中的情况下对检查点执行每张量处理。例如: def my_special_routine(t, device): - # this could be a much fancier operation + # 这可能是一个更复杂的操作 return t.to(dtype=torch.bfloat16, device=device) + def my_processing_function(key, device): t = state_dict[key] processed_t = my_special_routine(t, device) del t state_dict[key] = processed_t + for key in state_dict.keys(): - device = torch.device('cuda') + device = torch.device("cuda") my_processing_function(key, device) ################################################## -# Using ``torch.device('meta')`` +# 使用 ``torch.device('meta')`` # ------------------------------ -# Next, let's consider the creation of the module. +# 接下来,让我们考虑模块的创建。 m = SomeModule(1000) ####################################################################################################### -# This allocates memory for all parameters/buffers and initializes them per -# the default initialization schemes defined in ``SomeModule.__init__()``, which -# is wasteful when we want to load a checkpoint for the following reasons: -# -# * The result of the initialization kernels will be overwritten by ``load_state_dict()`` without ever being used, so -# initialization is wasteful. -# * We are allocating memory for these parameters/buffers in RAM while ``torch.load`` of the saved state dictionary also -# allocates memory in RAM for the parameters/buffers in the checkpoint. -# -# In order to solve these two problems, we can use the ``torch.device()`` -# context manager with ``device='meta'`` when we instantiate the ``nn.Module()``. -# -# The `torch.device() `_ -# context manager makes sure that factory calls will be performed as if they -# were passed the specified ``device`` as an argument. Tensors on ``torch.device('meta')`` do not -# carry data. However, they possess all other metadata a tensor carries such as ``.size()``, ``.stride()``, -# ``.requires_grad``, and others. 
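###############################################################################
# A quick illustrative sketch (not part of the original recipe) of the point
# above: a ``meta`` tensor carries shape, stride, and dtype metadata, but no
# storage that could be read back.

import torch

meta_tensor = torch.empty(4, 8, device="meta")
print(meta_tensor.size(), meta_tensor.stride(), meta_tensor.dtype, meta_tensor.requires_grad)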
-with torch.device('meta'): - new_m = SomeModule(1000) +# 这将为所有参数/缓冲区分配内存并根据 ``SomeModule.__init__()`` 中定义的默认初始化方案对其进行初始化, +# 当我们想要加载检查点时,这是浪费的,原因如下: + +# * 初始化内核的结果将被 ``load_state_dict()`` 覆盖而从未被使用,因此初始化是浪费的。 +# * 我们在 RAM 中为这些参数/缓冲区分配了内存,而 ``torch.load`` 保存的状态字典也在 RAM 中为检查点中的参数/缓冲区分配了内存。 + +# 为了解决这两个问题,我们可以在实例化 ``nn.Module()`` 时使用 ``device='meta'`` 的 ``torch.device()`` 上下文管理器。 + +# `torch.device() `_ +# 上下文管理器确保工厂调用将被视为传递了指定的 ``device`` 作为参数。 +# 在 ``torch.device('meta')`` 上的张量不携带数据。 +# 但是,它们具有张量所携带的其他元数据,如 ``.size()``, ``.stride()``, ``.requires_grad`` 等。 +with torch.device("meta"): + new_m = SomeModule(1000) ######################################################## -# Using ``load_state_dict(assign=True)`` +# 使用 ``load_state_dict(assign=True)`` # -------------------------------------- -# Next, we consider the loading of the state dictionary. +# 接下来,我们考虑加载状态字典。 m.load_state_dict(state_dict) ###################################################################################### -# ``nn.Module.load_state_dict()`` is usually implemented via an in-place -# ``param_in_model.copy_(param_in_state_dict)``. This means that the parameter/buffer -# with the corresponding key in the state dictionary is copied into the -# parameter/buffer in the ``nn.Module``. -# -# However, an in-place copy into a tensor on the ``meta`` device is a no-op. -# In order to avoid this, we can pass the ``assign=True`` keyword argument to -# ``load_state_dict()``. -# -# A caveat here is that since optimizers hold a reference to -# ``nn.Module.parameters()``, the optimizer must be initialized after the module -# is loaded from state dict if ``assign=True`` is passed. +# ``nn.Module.load_state_dict()`` 通常是通过 ``param_in_model.copy_(param_in_state_dict)`` 的就地复制实现的。 +# 这意味着状态字典中对应键的参数/缓冲区将被复制到 ``nn.Module`` 中的参数/缓冲区。 + +# 然而,对 ``meta`` 设备上的张量进行就地复制是无操作的。 +# 为了避免这种情况,我们可以在 ``load_state_dict()`` 中传递 ``assign=True`` 关键字参数。 + +# 这里的一个警告是,由于优化器持有对 ``nn.Module.parameters()`` 的引用, +# 如果传递了 ``assign=True``,则必须在从状态字典加载模块后初始化优化器。 -# As of PyTorch 2.3.0, one can use ``torch.__future__.set_swap_module_params_on_conversion`` to -# avoid this caveat. This `recipe `_ -# provides more details. +# 从 PyTorch 2.3.0 开始,可以使用 ``torch.__future__.set_swap_module_params_on_conversion`` 来避免这个警告。 +# 这个 `教程 `_ 提供了更多细节。 new_m.load_state_dict(state_dict, assign=True) -# Before 2.3.0, this MUST be done AFTER the load_state_dict with assign. -# In versions >= 2.3.0, one can consider setting ``torch.__future__.set_swap_module_params_on_conversion`` +# 在 2.3.0 之前,这一步必须在 load_state_dict 使用 assign 之后完成。 +# 在版本 >= 2.3.0 中,可以考虑设置 ``torch.__future__.set_swap_module_params_on_conversion`` opt = torch.optim.SGD(new_m.parameters(), lr=1e-3) ############################################################################### -# Conclusion +# 结论 # ------------- # -# To recap, in this tutorial we learned about ``torch.load(mmap=True)``, the -# ``torch.device()`` context manager with ``device=meta``, and -# ``nn.Module.load_state_dict(assign=True)`` as well as how these tools could -# be used to aid when loading a model from a checkpoint. 
+# 总结一下,在本教程中,我们学习了 ``torch.load(mmap=True)``、``device='meta'`` 的 ``torch.device()`` 上下文管理器和 ``nn.Module.load_state_dict(assign=True)`` +# 以及如何在从检查点加载模型时使用这些工具来提高效率。 diff --git a/docs/_downloads/f84ef04333132b77a0663044e78e7cbb/dynamic_quantization.ipynb b/docs/_downloads/f84ef04333132b77a0663044e78e7cbb/dynamic_quantization.ipynb index ccd8d74..8d2c578 100644 --- a/docs/_downloads/f84ef04333132b77a0663044e78e7cbb/dynamic_quantization.ipynb +++ b/docs/_downloads/f84ef04333132b77a0663044e78e7cbb/dynamic_quantization.ipynb @@ -15,7 +15,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# Dynamic Quantization\n\nIn this recipe you will see how to take advantage of Dynamic\nQuantization to accelerate inference on an LSTM-style recurrent neural\nnetwork. This reduces the size of the model weights and speeds up model\nexecution.\n\n## Introduction\n\nThere are a number of trade-offs that can be made when designing neural\nnetworks. During model development and training you can alter the\nnumber of layers and number of parameters in a recurrent neural network\nand trade-off accuracy against model size and/or model latency or\nthroughput. Such changes can take lot of time and compute resources\nbecause you are iterating over the model training. Quantization gives\nyou a way to make a similar trade off between performance and model\naccuracy with a known model after training is completed.\n\nYou can give it a try in a single session and you will certainly reduce\nyour model size significantly and may get a significant latency\nreduction without losing a lot of accuracy.\n\n## What is dynamic quantization?\n\nQuantizing a network means converting it to use a reduced precision\ninteger representation for the weights and/or activations. This saves on\nmodel size and allows the use of higher throughput math operations on\nyour CPU or GPU.\n\nWhen converting from floating point to integer values you are\nessentially multiplying the floating point value by some scale factor\nand rounding the result to a whole number. The various quantization\napproaches differ in the way they approach determining that scale\nfactor.\n\nThe key idea with dynamic quantization as described here is that we are\ngoing to determine the scale factor for activations dynamically based on\nthe data range observed at runtime. This ensures that the scale factor\nis \"tuned\" so that as much signal as possible about each observed\ndataset is preserved.\n\nThe model parameters on the other hand are known during model conversion\nand they are converted ahead of time and stored in INT8 form.\n\nArithmetic in the quantized model is done using vectorized INT8\ninstructions. Accumulation is typically done with INT16 or INT32 to\navoid overflow. This higher precision value is scaled back to INT8 if\nthe next layer is quantized or converted to FP32 for output.\n\nDynamic quantization is relatively free of tuning parameters which makes\nit well suited to be added into production pipelines as a standard part\nof converting LSTM models to deployment.\n\n\n\n
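###############################################################################
# A minimal hedged sketch (not part of the original recipe) of the single entry
# point discussed here, applied to a toy linear model; the LSTM version appears
# later in the recipe.

import torch
import torch.nn as nn
import torch.quantization

fp32_model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)
print(int8_model)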

<div class="alert alert-info"><h4>Note</h4><p>
Limitations on the approach taken here\n\n\n This recipe provides a quick introduction to the dynamic quantization\n features in PyTorch and the workflow for using it. Our focus is on\n explaining the specific functions used to convert the model. We will\n make a number of significant simplifications in the interest of brevity\n and clarity
</p></div>
\n\n\n1. You will start with a minimal LSTM network\n2. You are simply going to initialize the network with a random hidden\n state\n3. You are going to test the network with random inputs\n4. You are not going to train the network in this tutorial\n5. You will see that the quantized form of this network is smaller and\n runs faster than the floating point network we started with\n6. You will see that the output values are generally in the same\n ballpark as the output of the FP32 network, but we are not\n demonstrating here the expected accuracy loss on a real trained\n network\n\nYou will see how dynamic quantization is done and be able to see\nsuggestive reductions in memory use and latency times. Providing a\ndemonstration that the technique can preserve high levels of model\naccuracy on a trained LSTM is left to a more advanced tutorial. If you\nwant to move right away to that more rigorous treatment please proceed\nto the [advanced dynamic quantization\ntutorial](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)_.\n\n## Steps\n\nThis recipe has 5 steps.\n\n1. Set Up - Here you define a very simple LSTM, import modules, and establish\n some random input tensors.\n\n2. Do the Quantization - Here you instantiate a floating point model and then create quantized\n version of it.\n\n3. Look at Model Size - Here you show that the model size gets smaller.\n\n4. Look at Latency - Here you run the two models and compare model runtime (latency).\n\n5. Look at Accuracy - Here you run the two models and compare outputs.\n\n\n### 1: Set Up\nThis is a straightforward bit of code to set up for the rest of the\nrecipe.\n\nThe unique module we are importing here is torch.quantization which\nincludes PyTorch's quantized operators and conversion functions. 
We also\ndefine a very simple LSTM model and set up some inputs.\n" + "\n# \u52a8\u6001\u91cf\u5316\n\n\u5728\u8fd9\u4e2a\u793a\u4f8b\u4e2d,\u60a8\u5c06\u770b\u5230\u5982\u4f55\u5229\u7528\u52a8\u6001\u91cf\u5316\u6765\u52a0\u901f LSTM \u98ce\u683c\u7684\u5faa\u73af\u795e\u7ecf\u7f51\u7edc\u7684\u63a8\u7406\u3002\u8fd9\u53ef\u4ee5\u51cf\u5c0f\u6a21\u578b\u6743\u91cd\u7684\u5927\u5c0f,\u5e76\u52a0\u5feb\u6a21\u578b\u6267\u884c\u901f\u5ea6\u3002\n\n## \u4ecb\u7ecd\n\n\u5728\u8bbe\u8ba1\u795e\u7ecf\u7f51\u7edc\u65f6,\u53ef\u4ee5\u505a\u51fa\u591a\u79cd\u6743\u8861\u3002\u5728\u6a21\u578b\u5f00\u53d1\u548c\u8bad\u7ec3\u671f\u95f4,\u60a8\u53ef\u4ee5\u6539\u53d8\u5faa\u73af\u795e\u7ecf\u7f51\u7edc\u4e2d\u7684\u5c42\u6570\u548c\u53c2\u6570\u6570\u91cf,\u5728\u6a21\u578b\u5927\u5c0f\u548c/\u6216\u6a21\u578b\u5ef6\u8fdf\u6216\u541e\u5410\u91cf\u4e0e\u7cbe\u5ea6\u4e4b\u95f4\u8fdb\u884c\u6743\u8861\u3002\u7531\u4e8e\u60a8\u9700\u8981\u91cd\u590d\u6a21\u578b\u8bad\u7ec3\u8fc7\u7a0b,\u56e0\u6b64\u8fd9\u79cd\u6539\u53d8\u9700\u8981\u5927\u91cf\u7684\u65f6\u95f4\u548c\u8ba1\u7b97\u8d44\u6e90\u3002\u91cf\u5316\u4e3a\u60a8\u63d0\u4f9b\u4e86\u4e00\u79cd\u5728\u5df2\u77e5\u6a21\u578b\u4e0a\u5728\u6027\u80fd\u548c\u6a21\u578b\u7cbe\u5ea6\u4e4b\u95f4\u8fdb\u884c\u6743\u8861\u7684\u65b9\u5f0f,\u800c\u65e0\u9700\u91cd\u65b0\u8bad\u7ec3\u6a21\u578b\u3002\n\n\u60a8\u53ef\u4ee5\u5728\u5355\u4e2a\u4f1a\u8bdd\u4e2d\u5c1d\u8bd5\u4e00\u4e0b,\u60a8\u80af\u5b9a\u4f1a\u663e\u8457\u51cf\u5c0f\u6a21\u578b\u5927\u5c0f,\u5e76\u53ef\u80fd\u5728\u4e0d\u4f1a\u635f\u5931\u592a\u591a\u7cbe\u5ea6\u7684\u60c5\u51b5\u4e0b\u83b7\u5f97\u663e\u8457\u7684\u5ef6\u8fdf\u51cf\u5c11\u3002\n\n## \u4ec0\u4e48\u662f\u52a8\u6001\u91cf\u5316?\n\n\u91cf\u5316\u7f51\u7edc\u610f\u5473\u7740\u5c06\u5176\u8f6c\u6362\u4e3a\u4f7f\u7528\u8f83\u4f4e\u7cbe\u5ea6\u7684\u6574\u6570\u8868\u793a\u5f62\u5f0f\u6765\u8868\u793a\u6743\u91cd\u548c/\u6216\u6fc0\u6d3b\u3002\u8fd9\u53ef\u4ee5\u51cf\u5c0f\u6a21\u578b\u5927\u5c0f,\u5e76\u5141\u8bb8\u5728 CPU \u6216 GPU \u4e0a\u4f7f\u7528\u66f4\u9ad8\u541e\u5410\u91cf\u7684\u6570\u5b66\u8fd0\u7b97\u3002\n\n\u4ece\u6d6e\u70b9\u6570\u8f6c\u6362\u4e3a\u6574\u6570\u503c\u65f6,\u60a8\u5b9e\u9645\u4e0a\u662f\u5c06\u6d6e\u70b9\u6570\u4e58\u4ee5\u67d0\u4e2a\u6bd4\u4f8b\u56e0\u5b50,\u7136\u540e\u5c06\u7ed3\u679c\u820d\u5165\u4e3a\u6574\u6570\u3002\u4e0d\u540c\u7684\u91cf\u5316\u65b9\u6cd5\u5728\u786e\u5b9a\u8be5\u6bd4\u4f8b\u56e0\u5b50\u7684\u65b9\u5f0f\u4e0a\u6709\u6240\u4e0d\u540c\u3002\n\n\u8fd9\u91cc\u4ecb\u7ecd\u7684\u52a8\u6001\u91cf\u5316\u7684\u5173\u952e\u601d\u60f3\u662f,\u6211\u4eec\u5c06\u6839\u636e\u8fd0\u884c\u65f6\u89c2\u5bdf\u5230\u7684\u6570\u636e\u8303\u56f4\u52a8\u6001\u786e\u5b9a\u6fc0\u6d3b\u7684\u6bd4\u4f8b\u56e0\u5b50\u3002\u8fd9\u53ef\u786e\u4fdd\u6bd4\u4f8b\u56e0\u5b50\u88ab\"\u8c03\u6574\"\u4e3a\u5c3d\u53ef\u80fd\u4fdd\u7559\u6bcf\u4e2a\u89c2\u5bdf\u5230\u7684\u6570\u636e\u96c6\u7684\u4fe1\u53f7\u3002\n\n\u53e6\u4e00\u65b9\u9762,\u6a21\u578b\u53c2\u6570\u5728\u6a21\u578b\u8f6c\u6362\u671f\u95f4\u662f\u5df2\u77e5\u7684,\u5b83\u4eec\u4f1a\u63d0\u524d\u8f6c\u6362\u5e76\u4ee5 INT8 \u5f62\u5f0f\u5b58\u50a8\u3002\n\n\u91cf\u5316\u6a21\u578b\u4e2d\u7684\u7b97\u672f\u8fd0\u7b97\u4f7f\u7528\u77e2\u91cf\u5316\u7684 INT8 \u6307\u4ee4\u5b8c\u6210\u3002\u7d2f\u52a0\u901a\u5e38\u4f7f\u7528 INT16 \u6216 INT32 \u6765\u907f\u514d\u6ea2\u51fa\u3002\u5982\u679c\u4e0b\u4e00\u5c42\u662f\u91cf\u5316\u7684,\u5219\u5c06\u6b64\u8f83\u9ad8\u7cbe\u5ea6\u503c\u7f29\u653e\u56de 
INT8;\u5982\u679c\u662f\u8f93\u51fa,\u5219\u5c06\u5176\u8f6c\u6362\u4e3a FP32\u3002\n\n\u52a8\u6001\u91cf\u5316\u76f8\u5bf9\u6765\u8bf4\u6ca1\u6709\u592a\u591a\u9700\u8981\u8c03\u6574\u7684\u53c2\u6570,\u56e0\u6b64\u975e\u5e38\u9002\u5408\u4f5c\u4e3a\u5c06 LSTM \u6a21\u578b\u8f6c\u6362\u4e3a\u90e8\u7f72\u7684\u6807\u51c6\u90e8\u5206\u6dfb\u52a0\u5230\u751f\u4ea7\u7ba1\u9053\u4e2d\u3002\n\n

<div class="alert alert-info"><h4>Note</h4><p>
\u672c\u793a\u4f8b\u4e2d\u91c7\u7528\u7684\u65b9\u6cd5\u7684\u5c40\u9650\u6027\n\n \u672c\u793a\u4f8b\u63d0\u4f9b\u4e86\u5bf9 PyTorch \u4e2d\u52a8\u6001\u91cf\u5316\u529f\u80fd\u7684\u5feb\u901f\u4ecb\u7ecd,\u4ee5\u53ca\u4f7f\u7528\u5b83\u7684\u5de5\u4f5c\u6d41\u7a0b\u3002\u6211\u4eec\u7684\u91cd\u70b9\u662f\u89e3\u91ca\u7528\u4e8e\u8f6c\u6362\u6a21\u578b\u7684\u7279\u5b9a\u51fd\u6570\u3002\u4e3a\u4e86\u7b80\u6d01\u548c\u6e05\u6670,\u6211\u4eec\u505a\u51fa\u4e86\u4e00\u4e9b\u91cd\u5927\u7b80\u5316,\u5305\u62ec:
</p></div>
\n\n1. \u60a8\u5c06\u4ece\u4e00\u4e2a\u6700\u5c0f\u7684 LSTM \u7f51\u7edc\u5f00\u59cb\n2. \u60a8\u53ea\u9700\u7528\u968f\u673a\u9690\u85cf\u72b6\u6001\u521d\u59cb\u5316\u7f51\u7edc\n3. \u60a8\u5c06\u4f7f\u7528\u968f\u673a\u8f93\u5165\u6765\u6d4b\u8bd5\u7f51\u7edc\n4. \u60a8\u4e0d\u4f1a\u5728\u672c\u6559\u7a0b\u4e2d\u8bad\u7ec3\u7f51\u7edc\n5. \u60a8\u5c06\u770b\u5230,\u4e0e\u6211\u4eec\u5f00\u59cb\u65f6\u7684\u6d6e\u70b9\u7f51\u7edc\u76f8\u6bd4,\u91cf\u5316\u540e\u7684\u7f51\u7edc\u66f4\u5c0f\u4e14\u8fd0\u884c\u901f\u5ea6\u66f4\u5feb\n6. \u60a8\u5c06\u770b\u5230,\u91cf\u5316\u7f51\u7edc\u4ea7\u751f\u7684\u8f93\u51fa\u5f20\u91cf\u503c\u4e0e FP32 \u7f51\u7edc\u8f93\u51fa\u7684\u503c\u5728\u540c\u4e00\u6570\u91cf\u7ea7,\u4f46\u6211\u4eec\u5e76\u672a\u5728\u8fd9\u91cc\u5c55\u793a\u8be5\u6280\u672f\u5728\u7ecf\u8fc7\u8bad\u7ec3\u7684 LSTM \u4e0a\u80fd\u591f\u4fdd\u7559\u8f83\u9ad8\u6a21\u578b\u7cbe\u5ea6\u7684\u60c5\u51b5\n\n\u60a8\u5c06\u4e86\u89e3\u5982\u4f55\u8fdb\u884c\u52a8\u6001\u91cf\u5316,\u5e76\u80fd\u591f\u770b\u5230\u5185\u5b58\u4f7f\u7528\u548c\u5ef6\u8fdf\u65f6\u95f4\u7684\u6f5c\u5728\u51cf\u5c0f\u3002\u5173\u4e8e\u8be5\u6280\u672f\u5728\u7ecf\u8fc7\u8bad\u7ec3\u7684 LSTM \u4e0a\u80fd\u591f\u4fdd\u7559\u8f83\u9ad8\u6a21\u578b\u7cbe\u5ea6\u7684\u6f14\u793a,\u5c06\u7559\u5f85\u66f4\u9ad8\u7ea7\u7684\u6559\u7a0b\u3002\u5982\u679c\u60a8\u60f3\u76f4\u63a5\u8fdb\u5165\u66f4\u4e25\u683c\u7684\u5904\u7406,\u8bf7\u7ee7\u7eed\u5b66\u4e60 [\u9ad8\u7ea7\u52a8\u6001\u91cf\u5316\u6559\u7a0b](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)_\u3002\n\n## \u6b65\u9aa4\n\n\u672c\u793a\u4f8b\u5305\u542b 5 \u4e2a\u6b65\u9aa4\u3002\n\n1. \u8bbe\u7f6e - \u5728\u8fd9\u91cc,\u60a8\u5b9a\u4e49\u4e00\u4e2a\u975e\u5e38\u7b80\u5355\u7684 LSTM,\u5bfc\u5165\u6a21\u5757,\u5e76\u5efa\u7acb\u4e00\u4e9b\u968f\u673a\u8f93\u5165\u5f20\u91cf\u3002\n\n2. \u6267\u884c\u91cf\u5316 - \u5728\u8fd9\u91cc,\u60a8\u5b9e\u4f8b\u5316\u4e00\u4e2a\u6d6e\u70b9\u6a21\u578b,\u7136\u540e\u521b\u5efa\u5176\u91cf\u5316\u7248\u672c\u3002\n\n3. \u67e5\u770b\u6a21\u578b\u5927\u5c0f - \u5728\u8fd9\u91cc,\u60a8\u663e\u793a\u6a21\u578b\u5927\u5c0f\u53d8\u5c0f\u4e86\u3002\n\n4. \u67e5\u770b\u5ef6\u8fdf - \u5728\u8fd9\u91cc,\u60a8\u8fd0\u884c\u4e24\u4e2a\u6a21\u578b\u5e76\u6bd4\u8f83\u6a21\u578b\u8fd0\u884c\u65f6\u95f4(\u5ef6\u8fdf)\u3002\n\n5. 
\u67e5\u770b\u7cbe\u5ea6 - \u5728\u8fd9\u91cc,\u60a8\u8fd0\u884c\u4e24\u4e2a\u6a21\u578b\u5e76\u6bd4\u8f83\u8f93\u51fa\u3002\n\n### 1: \u8bbe\u7f6e\n\u8fd9\u662f\u4e00\u6bb5\u76f4\u63a5\u7684\u4ee3\u7801,\u7528\u4e8e\u4e3a\u672c\u793a\u4f8b\u7684\u5176\u4f59\u90e8\u5206\u505a\u51c6\u5907\u3002\n\n\u6211\u4eec\u5728\u8fd9\u91cc\u5bfc\u5165\u7684\u552f\u4e00\u6a21\u5757\u662f torch.quantization,\u5b83\u5305\u542b\u4e86 PyTorch \u7684\u91cf\u5316\u7b97\u5b50\u548c\u8f6c\u6362\u51fd\u6570\u3002\u6211\u4eec\u8fd8\u5b9a\u4e49\u4e86\u4e00\u4e2a\u975e\u5e38\u7b80\u5355\u7684 LSTM \u6a21\u578b,\u5e76\u8bbe\u7f6e\u4e86\u4e00\u4e9b\u8f93\u5165\u3002\n" ] }, { @@ -26,14 +26,14 @@ }, "outputs": [], "source": [ - "# import the modules used here in this recipe\nimport torch\nimport torch.quantization\nimport torch.nn as nn\nimport copy\nimport os\nimport time\n\n# define a very, very simple LSTM for demonstration purposes\n# in this case, we are wrapping ``nn.LSTM``, one layer, no preprocessing or postprocessing\n# inspired by\n# `Sequence Models and Long Short-Term Memory Networks tutorial `__.\nclass lstm_for_demonstration(nn.Module):\n \"\"\"Elementary Long Short Term Memory style model which simply wraps ``nn.LSTM``\n Not to be used for anything other than demonstration.\n \"\"\"\n def __init__(self,in_dim,out_dim,depth):\n super(lstm_for_demonstration,self).__init__()\n self.lstm = nn.LSTM(in_dim,out_dim,depth)\n\n def forward(self,inputs,hidden):\n out,hidden = self.lstm(inputs,hidden)\n return out, hidden\n\n\ntorch.manual_seed(29592) # set the seed for reproducibility\n\n#shape parameters\nmodel_dimension=8\nsequence_length=20\nbatch_size=1\nlstm_depth=1\n\n# random data for input\ninputs = torch.randn(sequence_length,batch_size,model_dimension)\n# hidden is actually is a tuple of the initial hidden state and the initial cell state\nhidden = (torch.randn(lstm_depth,batch_size,model_dimension), torch.randn(lstm_depth,batch_size,model_dimension))" + "# \u5bfc\u5165\u672c\u793a\u4f8b\u4e2d\u4f7f\u7528\u7684\u6a21\u5757\nimport copy\nimport os\nimport time\n\nimport torch\nimport torch.nn as nn\nimport torch.quantization\n\n\n# \u4e3a\u6f14\u793a\u76ee\u7684\u5b9a\u4e49\u4e00\u4e2a\u975e\u5e38\u7b80\u5355\u7684 LSTM\n# \u5728\u8fd9\u79cd\u60c5\u51b5\u4e0b,\u6211\u4eec\u53ea\u662f\u5305\u88c5\u4e86 ``nn.LSTM``\u3001\u4e00\u5c42,\u6ca1\u6709\u9884\u5904\u7406\u6216\u540e\u5904\u7406\n# \u53d7\u5230\u4ee5\u4e0b\u6559\u7a0b\u7684\u542f\u53d1:\n# `\u5e8f\u5217\u6a21\u578b\u548c\u957f\u77ed\u671f\u8bb0\u5fc6\u7f51\u7edc\u6559\u7a0b `_, \u4f5c\u8005 Robert Guthrie\n# \u548c `\u52a8\u6001\u91cf\u5316\u6559\u7a0b `__\u3002\nclass lstm_for_demonstration(nn.Module):\n \"\"\"\u57fa\u672c\u7684\u957f\u77ed\u671f\u8bb0\u5fc6\u98ce\u683c\u6a21\u578b,\u53ea\u662f\u5305\u88c5\u4e86 ``nn.LSTM``\n \u4e0d\u5e94\u7528\u4e8e\u9664\u6f14\u793a\u4e4b\u5916\u7684\u4efb\u4f55\u5176\u4ed6\u7528\u9014\u3002\n \"\"\"\n\n def __init__(self, in_dim, out_dim, depth):\n super(lstm_for_demonstration, self).__init__()\n self.lstm = nn.LSTM(in_dim, out_dim, depth)\n\n def forward(self, inputs, hidden):\n out, hidden = self.lstm(inputs, hidden)\n return out, hidden\n\n\ntorch.manual_seed(29592) # \u8bbe\u7f6e\u79cd\u5b50\u4ee5\u83b7\u5f97\u53ef\u91cd\u590d\u7ed3\u679c\n\n# \u5f62\u72b6\u53c2\u6570\nmodel_dimension = 8\nsequence_length = 20\nbatch_size = 1\nlstm_depth = 1\n\n# \u968f\u673a\u8f93\u5165\u6570\u636e\ninputs = torch.randn(sequence_length, batch_size, model_dimension)\n# hidden 
\u5b9e\u9645\u4e0a\u662f\u521d\u59cb\u9690\u85cf\u72b6\u6001\u548c\u521d\u59cb\u7ec6\u80de\u72b6\u6001\u7684\u5143\u7ec4\nhidden = (\n torch.randn(lstm_depth, batch_size, model_dimension),\n torch.randn(lstm_depth, batch_size, model_dimension),\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 2: Do the Quantization\n\nNow we get to the fun part. First we create an instance of the model\ncalled ``float\\_lstm`` then we are going to quantize it. We're going to use\nthe [torch.quantization.quantize_dynamic](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic)_ function, which takes the model, then a list of the submodules\nwhich we want to\nhave quantized if they appear, then the datatype we are targeting. This\nfunction returns a quantized version of the original model as a new\nmodule.\n\nThat's all it takes.\n\n\n" + "### 2: \u6267\u884c\u91cf\u5316\n\n\u73b0\u5728\u6211\u4eec\u6765\u6267\u884c\u6709\u8da3\u7684\u90e8\u5206\u3002\u9996\u5148,\u6211\u4eec\u521b\u5efa\u4e00\u4e2a\u540d\u4e3a ``float_lstm`` \u7684\u6a21\u578b\u5b9e\u4f8b,\u7136\u540e\u6211\u4eec\u5c06\u5bf9\u5176\u8fdb\u884c\u91cf\u5316\u3002\u6211\u4eec\u5c06\u4f7f\u7528 [torch.quantization.quantize_dynamic](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic)_ \u51fd\u6570,\u5b83\u63a5\u53d7\u6a21\u578b\u3001\u6211\u4eec\u5e0c\u671b\u91cf\u5316\u7684\u5b50\u6a21\u5757\u5217\u8868(\u5982\u679c\u5b58\u5728)\u4ee5\u53ca\u76ee\u6807\u6570\u636e\u7c7b\u578b\u3002\u6b64\u51fd\u6570\u8fd4\u56de\u539f\u59cb\u6a21\u578b\u7684\u91cf\u5316\u7248\u672c,\u4f5c\u4e3a\u4e00\u4e2a\u65b0\u6a21\u5757\u3002\n\n\u5c31\u8fd9\u4e48\u7b80\u5355\u3002\n\n\n" ] }, { @@ -44,14 +44,14 @@ }, "outputs": [], "source": [ - "# here is our floating point instance\nfloat_lstm = lstm_for_demonstration(model_dimension, model_dimension,lstm_depth)\n\n# this is the call that does the work\nquantized_lstm = torch.quantization.quantize_dynamic(\n float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8\n)\n\n# show the changes that were made\nprint('Here is the floating point version of this module:')\nprint(float_lstm)\nprint('')\nprint('and now the quantized version:')\nprint(quantized_lstm)" + "# \u8fd9\u662f\u6211\u4eec\u7684\u6d6e\u70b9\u5b9e\u4f8b\nfloat_lstm = lstm_for_demonstration(model_dimension, model_dimension, lstm_depth)\n\n# \u8fd9\u662f\u6267\u884c\u91cf\u5316\u7684\u8c03\u7528\nquantized_lstm = torch.quantization.quantize_dynamic(\n float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8\n)\n\n# \u663e\u793a\u6240\u505a\u7684\u66f4\u6539\nprint(\"\u8fd9\u662f\u8be5\u6a21\u5757\u7684\u6d6e\u70b9\u7248\u672c:\")\nprint(float_lstm)\nprint(\"\")\nprint(\"\u73b0\u5728\u662f\u91cf\u5316\u7248\u672c:\")\nprint(quantized_lstm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 3. Look at Model Size\nWe've quantized the model. What does that get us? Well the first\nbenefit is that we've replaced the FP32 model parameters with INT8\nvalues (and some recorded scale factors). This means about 75% less data\nto store and move around. With the default values the reduction shown\nbelow will be less than 75% but if you increase the model size above\n(for example you can set model dimension to something like 80) this will\nconverge towards 4x smaller as the stored model size dominated more and\nmore by the parameter values.\n\n\n" + "### 3. 
\u67e5\u770b\u6a21\u578b\u5927\u5c0f\n\u6211\u4eec\u5df2\u7ecf\u91cf\u5316\u4e86\u6a21\u578b\u3002\u8fd9\u7ed9\u6211\u4eec\u5e26\u6765\u4e86\u4ec0\u4e48\u597d\u5904?\u597d\u5904\u4e4b\u4e00\u662f\u6211\u4eec\u7528 INT8 \u503c(\u548c\u4e00\u4e9b\u8bb0\u5f55\u7684\u6bd4\u4f8b\u56e0\u5b50)\u66ff\u6362\u4e86 FP32 \u6a21\u578b\u53c2\u6570\u3002\u8fd9\u610f\u5473\u7740\u5b58\u50a8\u548c\u79fb\u52a8\u6570\u636e\u7684\u5927\u5c0f\u51cf\u5c0f\u4e86\u7ea6 75%\u3002\u4f7f\u7528\u9ed8\u8ba4\u503c\u65f6,\u4e0b\u9762\u663e\u793a\u7684\u51cf\u5c0f\u91cf\u5c06\u5c0f\u4e8e 75%,\u4f46\u5982\u679c\u60a8\u5c06\u6a21\u578b\u5927\u5c0f\u589e\u52a0\u5230\u66f4\u5927\u503c(\u4f8b\u5982\u5c06 model_dimension \u8bbe\u7f6e\u4e3a 80),\u968f\u7740\u5b58\u50a8\u7684\u6a21\u578b\u5927\u5c0f\u8d8a\u6765\u8d8a\u591a\u5730\u7531\u53c2\u6570\u503c\u4e3b\u5bfc,\u51cf\u5c0f\u91cf\u5c06\u8d8b\u8fd1\u4e8e 4 \u500d\u3002\n\n\n" ] }, { @@ -62,14 +62,14 @@ }, "outputs": [], "source": [ - "def print_size_of_model(model, label=\"\"):\n torch.save(model.state_dict(), \"temp.p\")\n size=os.path.getsize(\"temp.p\")\n print(\"model: \",label,' \\t','Size (KB):', size/1e3)\n os.remove('temp.p')\n return size\n\n# compare the sizes\nf=print_size_of_model(float_lstm,\"fp32\")\nq=print_size_of_model(quantized_lstm,\"int8\")\nprint(\"{0:.2f} times smaller\".format(f/q))" + "def print_size_of_model(model, label=\"\"):\n torch.save(model.state_dict(), \"temp.p\")\n size = os.path.getsize(\"temp.p\")\n print(\"\u6a21\u578b: \", label, \" \\t\", \"\u5927\u5c0f (KB):\", size / 1e3)\n os.remove(\"temp.p\")\n return size\n\n\n# \u6bd4\u8f83\u5927\u5c0f\nf = print_size_of_model(float_lstm, \"fp32\")\nq = print_size_of_model(quantized_lstm, \"int8\")\nprint(\"{0:.2f} \u500d\u66f4\u5c0f\".format(f / q))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 4. Look at Latency\nThe second benefit is that the quantized model will typically run\nfaster. This is due to a combinations of effects including at least:\n\n1. Less time spent moving parameter data in\n2. Faster INT8 operations\n\nAs you will see the quantized version of this super-simple network runs\nfaster. This will generally be true of more complex networks but as they\nsay \"your mileage may vary\" depending on a number of factors including\nthe structure of the model and the hardware you are running on.\n\n\n" + "### 4. \u67e5\u770b\u5ef6\u8fdf\n\u7b2c\u4e8c\u4e2a\u597d\u5904\u662f\u91cf\u5316\u6a21\u578b\u901a\u5e38\u4f1a\u8fd0\u884c\u5f97\u66f4\u5feb\u3002\u8fd9\u662f\u7531\u4e8e\u591a\u79cd\u6548\u679c\u7684\u7ec4\u5408,\u81f3\u5c11\u5305\u62ec:\n\n1. \u51cf\u5c11\u4e86\u79fb\u52a8\u53c2\u6570\u6570\u636e\u6240\u82b1\u8d39\u7684\u65f6\u95f4\n2. 
INT8 \u64cd\u4f5c\u66f4\u5feb\n\n\u5982\u60a8\u6240\u89c1,\u8fd9\u4e2a\u8d85\u7ea7\u7b80\u5355\u7684\u7f51\u7edc\u7684\u91cf\u5316\u7248\u672c\u8fd0\u884c\u901f\u5ea6\u66f4\u5feb\u3002\u5bf9\u4e8e\u66f4\u590d\u6742\u7684\u7f51\u7edc\u901a\u5e38\u4e5f\u662f\u5982\u6b64,\u4f46\u6b63\u5982\u4ed6\u4eec\u6240\u8bf4,\"\u60a8\u7684\u91cc\u7a0b\u53ef\u80fd\u4f1a\u6709\u6240\u4e0d\u540c\",\u8fd9\u53d6\u51b3\u4e8e\u8bb8\u591a\u56e0\u7d20,\u5305\u62ec\u6a21\u578b\u7684\u7ed3\u6784\u548c\u60a8\u8fd0\u884c\u7684\u786c\u4ef6\u3002\n\n\n" ] }, { @@ -80,7 +80,7 @@ }, "outputs": [], "source": [ - "# compare the performance\nprint(\"Floating point FP32\")" + "# \u6bd4\u8f83\u6027\u80fd\nprint(\"\u6d6e\u70b9 FP32\")" ] }, { @@ -98,7 +98,7 @@ }, "outputs": [], "source": [ - "print(\"Quantized INT8\")" + "print(\"\u91cf\u5316 INT8\")" ] }, { @@ -112,7 +112,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 5: Look at Accuracy\nWe are not going to do a careful look at accuracy here because we are\nworking with a randomly initialized network rather than a properly\ntrained one. However, I think it is worth quickly showing that the\nquantized network does produce output tensors that are \"in the same\nballpark\" as the original one.\n\nFor a more detailed analysis please see the more advanced tutorials\nreferenced at the end of this recipe.\n\n\n" + "### 5: \u67e5\u770b\u7cbe\u5ea6\n\u6211\u4eec\u4e0d\u4f1a\u5728\u8fd9\u91cc\u4ed4\u7ec6\u67e5\u770b\u7cbe\u5ea6,\u56e0\u4e3a\u6211\u4eec\u4f7f\u7528\u7684\u662f\u968f\u673a\u521d\u59cb\u5316\u7684\u7f51\u7edc,\u800c\u4e0d\u662f\u7ecf\u8fc7\u6b63\u786e\u8bad\u7ec3\u7684\u7f51\u7edc\u3002\u4f46\u662f,\u6211\u8ba4\u4e3a\u503c\u5f97\u5feb\u901f\u5c55\u793a\u4e00\u4e0b\u91cf\u5316\u7f51\u7edc\u786e\u5b9e\u4ea7\u751f\u4e86\u4e0e\u539f\u59cb\u7f51\u7edc\"\u540c\u4e00\u6570\u91cf\u7ea7\"\u7684\u8f93\u51fa\u5f20\u91cf\u503c\u3002\n\n\u6709\u5173\u66f4\u8be6\u7ec6\u7684\u5206\u6790,\u8bf7\u53c2\u9605\u672c\u793a\u4f8b\u7ed3\u5c3e\u5904\u5f15\u7528\u7684\u66f4\u9ad8\u7ea7\u6559\u7a0b\u3002\n\n\n" ] }, { @@ -123,14 +123,14 @@ }, "outputs": [], "source": [ - "# run the float model\nout1, hidden1 = float_lstm(inputs, hidden)\nmag1 = torch.mean(abs(out1)).item()\nprint('mean absolute value of output tensor values in the FP32 model is {0:.5f} '.format(mag1))\n\n# run the quantized model\nout2, hidden2 = quantized_lstm(inputs, hidden)\nmag2 = torch.mean(abs(out2)).item()\nprint('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2))\n\n# compare them\nmag3 = torch.mean(abs(out1-out2)).item()\nprint('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3,mag3/mag1*100))" + "# \u8fd0\u884c\u6d6e\u70b9\u6a21\u578b\nout1, hidden1 = float_lstm(inputs, hidden)\nmag1 = torch.mean(abs(out1)).item()\nprint(\"FP32 \u6a21\u578b\u4e2d\u8f93\u51fa\u5f20\u91cf\u503c\u7684\u7edd\u5bf9\u503c\u5747\u503c\u4e3a {0:.5f} \".format(mag1))\n\n# \u8fd0\u884c\u91cf\u5316\u6a21\u578b\nout2, hidden2 = quantized_lstm(inputs, hidden)\nmag2 = torch.mean(abs(out2)).item()\nprint(\"INT8 \u6a21\u578b\u4e2d\u8f93\u51fa\u5f20\u91cf\u503c\u7684\u7edd\u5bf9\u503c\u5747\u503c\u4e3a {0:.5f}\".format(mag2))\n\n# \u6bd4\u8f83\u5b83\u4eec\nmag3 = torch.mean(abs(out1 - out2)).item()\nprint(\n \"\u8f93\u51fa\u5f20\u91cf\u4e4b\u95f4\u5dee\u503c\u7684\u7edd\u5bf9\u503c\u5747\u503c\u4e3a {0:.5f}\uff0c\u6216\u5360 {1:.2f} \u767e\u5206\u6bd4\".format(\n mag3, mag3 / mag1 * 100\n )\n)" ] }, { "cell_type": "markdown", "metadata": {}, 
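The timing cells above are shown only partially in this diff (just the changed ``print`` lines). As a minimal sketch of how the latency comparison could be run (not taken from the original notebook, which may use a different timing mechanism), assuming the ``float_lstm``, ``quantized_lstm``, ``inputs`` and ``hidden`` objects from the setup step, one could time both models with the standard-library ``timeit`` module:

.. code:: python3

    import timeit

    # Assumes float_lstm, quantized_lstm, inputs and hidden from the setup cell above.
    # Total wall-clock time for 100 forward passes of each model; smaller is faster.
    fp32_seconds = timeit.timeit(lambda: float_lstm(inputs, hidden), number=100)
    int8_seconds = timeit.timeit(lambda: quantized_lstm(inputs, hidden), number=100)

    print(f"FP32 total over 100 runs: {fp32_seconds:.4f} s")
    print(f"INT8 total over 100 runs: {int8_seconds:.4f} s")
    print(f"observed speedup: {fp32_seconds / int8_seconds:.2f}x")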
"source": [ - "## Learn More\nWe've explained what dynamic quantization is, what benefits it brings,\nand you have used the ``torch.quantization.quantize_dynamic()`` function\nto quickly quantize a simple LSTM model.\n\nThis was a fast and high level treatment of this material; for more\ndetail please continue learning with [(beta) Dynamic Quantization on an LSTM Word Language Model Tutorial](https://pytorch.org/tutorials/advanced/dynamic\\_quantization\\_tutorial.html).\n\n\n## Additional Resources\n\n* [Quantization API Documentaion](https://pytorch.org/docs/stable/quantization.html)\n* [(beta) Dynamic Quantization on BERT](https://pytorch.org/tutorials/intermediate/dynamic\\_quantization\\_bert\\_tutorial.html)\n* [(beta) Dynamic Quantization on an LSTM Word Language Model](https://pytorch.org/tutorials/advanced/dynamic\\_quantization\\_tutorial.html)\n* [Introduction to Quantization on PyTorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)\n\n\n" + "## \u4e86\u89e3\u66f4\u591a\n\u6211\u4eec\u5df2\u7ecf\u89e3\u91ca\u4e86\u4ec0\u4e48\u662f\u52a8\u6001\u91cf\u5316,\u5b83\u5e26\u6765\u4e86\u4ec0\u4e48\u597d\u5904,\u60a8\u5df2\u7ecf\u4f7f\u7528 ``torch.quantization.quantize_dynamic()`` \u51fd\u6570\u5feb\u901f\u91cf\u5316\u4e86\u4e00\u4e2a\u7b80\u5355\u7684 LSTM \u6a21\u578b\u3002\n\n\u8fd9\u662f\u5bf9\u8be5\u6750\u6599\u7684\u5feb\u901f\u548c\u9ad8\u7ea7\u5904\u7406;\u8981\u4e86\u89e3\u66f4\u591a\u8be6\u7ec6\u4fe1\u606f,\u8bf7\u7ee7\u7eed\u5b66\u4e60 [(beta) \u52a8\u6001\u91cf\u5316 LSTM \u8bcd\u8bed\u8a00\u6a21\u578b\u6559\u7a0b](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)\u3002\n\n\n## \u5176\u4ed6\u8d44\u6e90\n\n* [\u91cf\u5316 API \u6587\u6863](https://pytorch.org/docs/stable/quantization.html)\n* [(beta) \u52a8\u6001\u91cf\u5316 BERT](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html)\n* [(beta) \u52a8\u6001\u91cf\u5316 LSTM \u8bcd\u8bed\u8a00\u6a21\u578b](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)\n* [PyTorch \u91cf\u5316\u4ecb\u7ecd](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)\n\n\n" ] } ], diff --git a/docs/_downloads/fcca435d443f10eec1be8769b2b3a010/module_load_state_dict_tips.ipynb b/docs/_downloads/fcca435d443f10eec1be8769b2b3a010/module_load_state_dict_tips.ipynb index 0519cbb..2ef76b8 100644 --- a/docs/_downloads/fcca435d443f10eec1be8769b2b3a010/module_load_state_dict_tips.ipynb +++ b/docs/_downloads/fcca435d443f10eec1be8769b2b3a010/module_load_state_dict_tips.ipynb @@ -15,14 +15,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n# Tips for Loading an ``nn.Module`` from a Checkpoint\n**Author:** [Mikayla Gawarecki](https://github.com/mikaylagawarecki)\n\nIf you're loading a checkpoint and want to reduce compute and memory as much as possible,\nthis tutorial shares some recommended practices. In particular, we will discuss\n\n1. The ``mmap`` keyword argument on ``torch.load``\n2. The ``torch.device()`` context manager\n3. The ``assign`` keyword argument on ``nn.Module.load_state_dict()``\n\n

Note: This recipe requires PyTorch 2.1.0 or later.

\n" + "\n# \u4ece\u68c0\u67e5\u70b9\u52a0\u8f7d ``nn.Module`` \u7684\u6280\u5de7\n**\u4f5c\u8005:** [Mikayla Gawarecki](https://github.com/mikaylagawarecki)\n\n\u5982\u679c\u4f60\u8981\u52a0\u8f7d\u4e00\u4e2a\u68c0\u67e5\u70b9\u5e76\u5e0c\u671b\u5c3d\u53ef\u80fd\u51cf\u5c11\u8ba1\u7b97\u548c\u5185\u5b58\u7684\u4f7f\u7528\uff0c\u672c\u6559\u7a0b\u5c06\u5206\u4eab\u4e00\u4e9b\u63a8\u8350\u7684\u505a\u6cd5\u3002\u7279\u522b\u662f\u6211\u4eec\u5c06\u8ba8\u8bba\u4ee5\u4e0b\u51e0\u70b9:\n\n1. ``torch.load`` \u4e2d\u7684 ``mmap`` \u5173\u952e\u5b57\u53c2\u6570\n2. ``torch.device()`` \u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\n3. ``nn.Module.load_state_dict()`` \u4e2d\u7684 ``assign`` \u5173\u952e\u5b57\u53c2\u6570\n\n

Note: 本教程需要 PyTorch 2.1.0 或更高版本。

\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let us consider a simple ``nn.Module`` that contains a list of Linear layers:\n\n" + "\u8ba9\u6211\u4eec\u8003\u8651\u4e00\u4e2a\u7b80\u5355\u7684 ``nn.Module``\uff0c\u5b83\u5305\u542b\u4e00\u4e2a\u7ebf\u6027\u5c42\u5217\u8868:\n\n" ] }, { @@ -33,14 +44,14 @@ }, "outputs": [], "source": [ - "import torch\nfrom torch import nn\nimport time\n\nclass SomeModule(torch.nn.Module):\n def __init__(self, size):\n super().__init__()\n self.linears = nn.ModuleList([nn.Linear(size, size) for i in range(10)])\n\n def forward(self, x):\n return self.linears(x)\n\n\nm = SomeModule(1000)\ntorch.save(m.state_dict(), 'checkpoint.pth')" + "import torch\nfrom torch import nn\n\n\nclass SomeModule(torch.nn.Module):\n def __init__(self, size):\n super().__init__()\n self.linears = nn.ModuleList([nn.Linear(size, size) for i in range(10)])\n\n def forward(self, x):\n return self.linears(x)\n\n\nm = SomeModule(1000)\ntorch.save(m.state_dict(), \"checkpoint.pth\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The following snippet demonstrates the use of the the ``mmap`` keyword argument\nto ``torch.load``, the ``torch.device()`` context manager and the ``assign``\nkeyword argument to ``nn.Module.load_state_dict()``.\n\n" + "\u4ee5\u4e0b\u4ee3\u7801\u7247\u6bb5\u6f14\u793a\u4e86\u5982\u4f55\u4f7f\u7528 ``torch.load`` \u4e2d\u7684 ``mmap`` \u5173\u952e\u5b57\u53c2\u6570\u3001``torch.device()`` \u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\u548c ``nn.Module.load_state_dict()`` \u4e2d\u7684 ``assign`` \u5173\u952e\u5b57\u53c2\u6570\u3002\n\n" ] }, { @@ -51,14 +62,14 @@ }, "outputs": [], "source": [ - "state_dict = torch.load('checkpoint.pth', mmap=True)\nwith torch.device('meta'):\n meta_m = SomeModule(1000)\nmeta_m.load_state_dict(state_dict, assign=True)" + "state_dict = torch.load(\"checkpoint.pth\", mmap=True)\nwith torch.device(\"meta\"):\n meta_m = SomeModule(1000)\nmeta_m.load_state_dict(state_dict, assign=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compare the snippet below to the one above:\n\n" + "\u5c06\u4e0b\u9762\u7684\u4ee3\u7801\u7247\u6bb5\u4e0e\u4e0a\u9762\u7684\u8fdb\u884c\u6bd4\u8f83:\n\n" ] }, { @@ -69,21 +80,21 @@ }, "outputs": [], "source": [ - "state_dict = torch.load('checkpoint.pth')\nm = SomeModule(1000)\nm.load_state_dict(state_dict)" + "state_dict = torch.load(\"checkpoint.pth\")\nm = SomeModule(1000)\nm.load_state_dict(state_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The second example does not use any of the features listed above and will be\nless compute and memory efficient for loading a checkpoint. 
In the following\nsections, we will discuss each of the features in further detail.\n\n" + "\u7b2c\u4e8c\u4e2a\u793a\u4f8b\u6ca1\u6709\u4f7f\u7528\u4e0a\u9762\u5217\u51fa\u7684\u4efb\u4f55\u7279\u6027\uff0c\u56e0\u6b64\u5728\u52a0\u8f7d\u68c0\u67e5\u70b9\u65f6\u8ba1\u7b97\u548c\u5185\u5b58\u6548\u7387\u4f1a\u8f83\u4f4e\u3002\u5728\u4e0b\u9762\u7684\u90e8\u5206\u4e2d\uff0c\u6211\u4eec\u5c06\u8be6\u7ec6\u8ba8\u8bba\u6bcf\u4e2a\u7279\u6027\u3002\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Using ``torch.load(mmap=True)``\nFirst, let us consider what happens when we load the checkpoint with ``torch.load``.\nWhen we save a checkpoint with ``torch.save``, tensor storages are tagged with the device they are\nsaved on. With ``torch.load``, tensor storages will be loaded to the device\nthey were tagged with (unless this behavior is overridden using the\n``map_location`` flag). For ease of explanation, let us assume that the tensors\nwere saved on CPU. This means that on the first line all tensor storages will be\nloaded into CPU RAM, which can be undesirable when:\n\n* CPU RAM is smaller than the size of the checkpoint.\n* Waiting for the entire checkpoint to be loaded into RAM before performing, for example, some per-tensor processing.\n\n" + "## \u4f7f\u7528 ``torch.load(mmap=True)``\n\u9996\u5148\uff0c\u8ba9\u6211\u4eec\u8003\u8651\u4f7f\u7528 ``torch.load`` \u52a0\u8f7d\u68c0\u67e5\u70b9\u65f6\u4f1a\u53d1\u751f\u4ec0\u4e48\u3002\n\u5f53\u6211\u4eec\u4f7f\u7528 ``torch.save`` \u4fdd\u5b58\u68c0\u67e5\u70b9\u65f6\uff0c\u5f20\u91cf\u5b58\u50a8\u4f1a\u88ab\u6807\u8bb0\u4e3a\u4fdd\u5b58\u65f6\u6240\u5728\u7684\u8bbe\u5907\u3002\n\u4f7f\u7528 ``torch.load`` \u65f6\uff0c\u5f20\u91cf\u5b58\u50a8\u5c06\u88ab\u52a0\u8f7d\u5230\u5b83\u4eec\u88ab\u6807\u8bb0\u7684\u8bbe\u5907\u4e0a(\u9664\u975e\u4f7f\u7528 ``map_location`` \u6807\u5fd7\u8986\u76d6\u6b64\u884c\u4e3a)\u3002\n\u4e3a\u4e86\u89e3\u91ca\u65b9\u4fbf\uff0c\u6211\u4eec\u5047\u8bbe\u5f20\u91cf\u662f\u4fdd\u5b58\u5728 CPU \u4e0a\u7684\u3002\u8fd9\u610f\u5473\u7740\u5728\u7b2c\u4e00\u884c\u4e2d\uff0c\u6240\u6709\u5f20\u91cf\u5b58\u50a8\u5c06\u88ab\u52a0\u8f7d\u5230 CPU \u5185\u5b58\u4e2d\uff0c\u5728\u4ee5\u4e0b\u60c5\u51b5\u4e0b\u8fd9\u662f\u4e0d\u53ef\u53d6\u7684:\n\n" ] }, { @@ -94,14 +105,14 @@ }, "outputs": [], "source": [ - "start_time = time.time()\nstate_dict = torch.load('checkpoint.pth')\nend_time = time.time()\nprint(f\"loading time without mmap={end_time - start_time}\")" + "# * CPU \u5185\u5b58\u5c0f\u4e8e\u68c0\u67e5\u70b9\u7684\u5927\u5c0f\u3002\n# * \u5728\u6267\u884c\u4e00\u4e9b\u6bcf\u5f20\u91cf\u5904\u7406\u4e4b\u524d\u7b49\u5f85\u6574\u4e2a\u68c0\u67e5\u70b9\u88ab\u52a0\u8f7d\u5230\u5185\u5b58\u4e2d\u3002\n\nstart_time = time.time()\nstate_dict = torch.load(\"checkpoint.pth\")\nend_time = time.time()\nprint(f\"\u4e0d\u4f7f\u7528 mmap \u7684\u52a0\u8f7d\u65f6\u95f4={end_time - start_time}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The ``mmap`` keyword argument to ``torch.load`` attempts to solve the above two\nproblems. As its name implies, the ``mmap`` keyword argument to ``torch.load``\nmakes use of an [mmap call](https://man7.org/linux/man-pages/man2/mmap.2.html)\nwhich maps a file on disk into virtual memory and lets the OS handle loading and\nunloading into physical memory automatically. 
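A small, hedged aside that is not part of the original recipe: the ``map_location`` flag mentioned above can override the device tag recorded in the checkpoint, for example to force every storage onto CPU. A minimal sketch, reusing the ``checkpoint.pth`` file saved earlier in this recipe:

.. code:: python3

    import torch

    # Remap all storages to CPU at load time, regardless of the device
    # they were tagged with when the checkpoint was saved.
    state_dict = torch.load("checkpoint.pth", map_location="cpu")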
When this flag is passed, tensor\nstorages will be memory-mapped.\n\n" + "``torch.load`` \u4e2d\u7684 ``mmap`` \u5173\u952e\u5b57\u53c2\u6570\u8bd5\u56fe\u89e3\u51b3\u4e0a\u8ff0\u4e24\u4e2a\u95ee\u9898\u3002\n\u987e\u540d\u601d\u4e49\uff0c``torch.load`` \u4e2d\u7684 ``mmap`` \u5173\u952e\u5b57\u53c2\u6570\u4f7f\u7528\u4e86 [mmap \u8c03\u7528](https://man7.org/linux/man-pages/man2/mmap.2.html),\n\u5b83\u5c06\u78c1\u76d8\u4e0a\u7684\u6587\u4ef6\u6620\u5c04\u5230\u865a\u62df\u5185\u5b58\u4e2d,\u5e76\u8ba9\u64cd\u4f5c\u7cfb\u7edf\u81ea\u52a8\u5904\u7406\u52a0\u8f7d\u548c\u5378\u8f7d\u5230\u7269\u7406\u5185\u5b58\u3002\n\u5f53\u4f20\u9012\u6b64\u6807\u5fd7\u65f6,\u5f20\u91cf\u5b58\u50a8\u5c06\u88ab\u5185\u5b58\u6620\u5c04\u3002\n\n" ] }, { @@ -112,14 +123,14 @@ }, "outputs": [], "source": [ - "start_time = time.time()\nstate_dict = torch.load('checkpoint.pth', mmap=True)\nend_time = time.time()\nprint(f\"loading time with mmap={end_time - start_time}\")" + "start_time = time.time()\nstate_dict = torch.load(\"checkpoint.pth\", mmap=True)\nend_time = time.time()\nprint(f\"\u4f7f\u7528 mmap \u7684\u52a0\u8f7d\u65f6\u95f4={end_time - start_time}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "As mentioned above, one can use this argument to do per-tensor processing on a\ncheckpoint without loading all tensor storages into CPU memory upfront. For example:\n\n" + "\u5982\u4e0a\u6240\u8ff0,\u53ef\u4ee5\u4f7f\u7528\u6b64\u53c2\u6570\u5728\u4e0d\u5c06\u6240\u6709\u5f20\u91cf\u5b58\u50a8\u52a0\u8f7d\u5230 CPU \u5185\u5b58\u4e2d\u7684\u60c5\u51b5\u4e0b\u5bf9\u68c0\u67e5\u70b9\u6267\u884c\u6bcf\u5f20\u91cf\u5904\u7406\u3002\u4f8b\u5982:\n\n" ] }, { @@ -130,14 +141,14 @@ }, "outputs": [], "source": [ - "def my_special_routine(t, device):\n # this could be a much fancier operation\n return t.to(dtype=torch.bfloat16, device=device)\n\ndef my_processing_function(key, device):\n t = state_dict[key]\n processed_t = my_special_routine(t, device)\n del t\n state_dict[key] = processed_t\n\nfor key in state_dict.keys():\n device = torch.device('cuda')\n my_processing_function(key, device)" + "def my_special_routine(t, device):\n # \u8fd9\u53ef\u80fd\u662f\u4e00\u4e2a\u66f4\u590d\u6742\u7684\u64cd\u4f5c\n return t.to(dtype=torch.bfloat16, device=device)\n\n\ndef my_processing_function(key, device):\n t = state_dict[key]\n processed_t = my_special_routine(t, device)\n del t\n state_dict[key] = processed_t\n\n\nfor key in state_dict.keys():\n device = torch.device(\"cuda\")\n my_processing_function(key, device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Using ``torch.device('meta')``\nNext, let's consider the creation of the module.\n\n" + "## \u4f7f\u7528 ``torch.device('meta')``\n\u63a5\u4e0b\u6765,\u8ba9\u6211\u4eec\u8003\u8651\u6a21\u5757\u7684\u521b\u5efa\u3002\n\n" ] }, { @@ -155,7 +166,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This allocates memory for all parameters/buffers and initializes them per\nthe default initialization schemes defined in ``SomeModule.__init__()``, which\nis wasteful when we want to load a checkpoint for the following reasons:\n\n* The result of the initialization kernels will be overwritten by ``load_state_dict()`` without ever being used, so\n initialization is wasteful.\n* We are allocating memory for these parameters/buffers in RAM while ``torch.load`` of the saved state dictionary also\n allocates memory in RAM for the parameters/buffers in the checkpoint.\n\nIn order to solve these two problems, we can use the 
``torch.device()``\ncontext manager with ``device='meta'`` when we instantiate the ``nn.Module()``.\n\nThe [torch.device()](https://pytorch.org/docs/main/tensor_attributes.html#torch-device)\ncontext manager makes sure that factory calls will be performed as if they\nwere passed the specified ``device`` as an argument. Tensors on ``torch.device('meta')`` do not\ncarry data. However, they possess all other metadata a tensor carries such as ``.size()``, ``.stride()``,\n``.requires_grad``, and others.\n\n" + "\u8fd9\u5c06\u4e3a\u6240\u6709\u53c2\u6570/\u7f13\u51b2\u533a\u5206\u914d\u5185\u5b58\u5e76\u6839\u636e ``SomeModule.__init__()`` \u4e2d\u5b9a\u4e49\u7684\u9ed8\u8ba4\u521d\u59cb\u5316\u65b9\u6848\u5bf9\u5176\u8fdb\u884c\u521d\u59cb\u5316,\n\u5f53\u6211\u4eec\u60f3\u8981\u52a0\u8f7d\u68c0\u67e5\u70b9\u65f6,\u8fd9\u662f\u6d6a\u8d39\u7684,\u539f\u56e0\u5982\u4e0b:\n\n" ] }, { @@ -166,14 +177,14 @@ }, "outputs": [], "source": [ - "with torch.device('meta'):\n new_m = SomeModule(1000)" + "# * \u521d\u59cb\u5316\u5185\u6838\u7684\u7ed3\u679c\u5c06\u88ab ``load_state_dict()`` \u8986\u76d6\u800c\u4ece\u672a\u88ab\u4f7f\u7528,\u56e0\u6b64\u521d\u59cb\u5316\u662f\u6d6a\u8d39\u7684\u3002\n# * \u6211\u4eec\u5728 RAM \u4e2d\u4e3a\u8fd9\u4e9b\u53c2\u6570/\u7f13\u51b2\u533a\u5206\u914d\u4e86\u5185\u5b58,\u800c ``torch.load`` \u4fdd\u5b58\u7684\u72b6\u6001\u5b57\u5178\u4e5f\u5728 RAM \u4e2d\u4e3a\u68c0\u67e5\u70b9\u4e2d\u7684\u53c2\u6570/\u7f13\u51b2\u533a\u5206\u914d\u4e86\u5185\u5b58\u3002\n\n# \u4e3a\u4e86\u89e3\u51b3\u8fd9\u4e24\u4e2a\u95ee\u9898,\u6211\u4eec\u53ef\u4ee5\u5728\u5b9e\u4f8b\u5316 ``nn.Module()`` \u65f6\u4f7f\u7528 ``device='meta'`` \u7684 ``torch.device()`` \u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\u3002\n\n# `torch.device() `_\n# \u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\u786e\u4fdd\u5de5\u5382\u8c03\u7528\u5c06\u88ab\u89c6\u4e3a\u4f20\u9012\u4e86\u6307\u5b9a\u7684 ``device`` \u4f5c\u4e3a\u53c2\u6570\u3002\n# \u5728 ``torch.device('meta')`` \u4e0a\u7684\u5f20\u91cf\u4e0d\u643a\u5e26\u6570\u636e\u3002\n# \u4f46\u662f,\u5b83\u4eec\u5177\u6709\u5f20\u91cf\u6240\u643a\u5e26\u7684\u5176\u4ed6\u5143\u6570\u636e,\u5982 ``.size()``, ``.stride()``, ``.requires_grad`` \u7b49\u3002\nwith torch.device(\"meta\"):\n new_m = SomeModule(1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Using ``load_state_dict(assign=True)``\nNext, we consider the loading of the state dictionary.\n\n" + "## \u4f7f\u7528 ``load_state_dict(assign=True)``\n\u63a5\u4e0b\u6765,\u6211\u4eec\u8003\u8651\u52a0\u8f7d\u72b6\u6001\u5b57\u5178\u3002\n\n" ] }, { @@ -191,7 +202,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "``nn.Module.load_state_dict()`` is usually implemented via an in-place\n``param_in_model.copy_(param_in_state_dict)``. 
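As a supplementary illustration of the point above (not part of the original recipe): tensors created under the ``meta`` device allocate no storage, yet they still report their shape, strides, and other metadata. A minimal sketch:

.. code:: python3

    import torch

    # Factory calls under the meta device allocate no storage...
    with torch.device("meta"):
        t = torch.empty(1000, 1000)

    # ...but the usual metadata is still available.
    print(t.device)         # meta
    print(t.size())         # torch.Size([1000, 1000])
    print(t.stride())       # (1000, 1)
    print(t.requires_grad)  # False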
This means that the parameter/buffer\nwith the corresponding key in the state dictionary is copied into the\nparameter/buffer in the ``nn.Module``.\n\nHowever, an in-place copy into a tensor on the ``meta`` device is a no-op.\nIn order to avoid this, we can pass the ``assign=True`` keyword argument to\n``load_state_dict()``.\n\nA caveat here is that since optimizers hold a reference to\n``nn.Module.parameters()``, the optimizer must be initialized after the module\nis loaded from state dict if ``assign=True`` is passed.\n\n" + "``nn.Module.load_state_dict()`` \u901a\u5e38\u662f\u901a\u8fc7 ``param_in_model.copy_(param_in_state_dict)`` \u7684\u5c31\u5730\u590d\u5236\u5b9e\u73b0\u7684\u3002\n\u8fd9\u610f\u5473\u7740\u72b6\u6001\u5b57\u5178\u4e2d\u5bf9\u5e94\u952e\u7684\u53c2\u6570/\u7f13\u51b2\u533a\u5c06\u88ab\u590d\u5236\u5230 ``nn.Module`` \u4e2d\u7684\u53c2\u6570/\u7f13\u51b2\u533a\u3002\n\n" ] }, { @@ -202,14 +213,14 @@ }, "outputs": [], "source": [ - "# As of PyTorch 2.3.0, one can use ``torch.__future__.set_swap_module_params_on_conversion`` to\n# avoid this caveat. This `recipe `_\n# provides more details.\n\nnew_m.load_state_dict(state_dict, assign=True)\n# Before 2.3.0, this MUST be done AFTER the load_state_dict with assign.\n# In versions >= 2.3.0, one can consider setting ``torch.__future__.set_swap_module_params_on_conversion``\nopt = torch.optim.SGD(new_m.parameters(), lr=1e-3)" + "# \u7136\u800c,\u5bf9 ``meta`` \u8bbe\u5907\u4e0a\u7684\u5f20\u91cf\u8fdb\u884c\u5c31\u5730\u590d\u5236\u662f\u65e0\u64cd\u4f5c\u7684\u3002\n# \u4e3a\u4e86\u907f\u514d\u8fd9\u79cd\u60c5\u51b5,\u6211\u4eec\u53ef\u4ee5\u5728 ``load_state_dict()`` \u4e2d\u4f20\u9012 ``assign=True`` \u5173\u952e\u5b57\u53c2\u6570\u3002\n\n# \u8fd9\u91cc\u7684\u4e00\u4e2a\u8b66\u544a\u662f,\u7531\u4e8e\u4f18\u5316\u5668\u6301\u6709\u5bf9 ``nn.Module.parameters()`` \u7684\u5f15\u7528,\n# \u5982\u679c\u4f20\u9012\u4e86 ``assign=True``,\u5219\u5fc5\u987b\u5728\u4ece\u72b6\u6001\u5b57\u5178\u52a0\u8f7d\u6a21\u5757\u540e\u521d\u59cb\u5316\u4f18\u5316\u5668\u3002\n\n# \u4ece PyTorch 2.3.0 \u5f00\u59cb,\u53ef\u4ee5\u4f7f\u7528 ``torch.__future__.set_swap_module_params_on_conversion`` \u6765\u907f\u514d\u8fd9\u4e2a\u8b66\u544a\u3002\n# \u8fd9\u4e2a `\u6559\u7a0b `_ \u63d0\u4f9b\u4e86\u66f4\u591a\u7ec6\u8282\u3002\n\nnew_m.load_state_dict(state_dict, assign=True)\n# \u5728 2.3.0 \u4e4b\u524d,\u8fd9\u4e00\u6b65\u5fc5\u987b\u5728 load_state_dict \u4f7f\u7528 assign \u4e4b\u540e\u5b8c\u6210\u3002\n# \u5728\u7248\u672c >= 2.3.0 \u4e2d,\u53ef\u4ee5\u8003\u8651\u8bbe\u7f6e ``torch.__future__.set_swap_module_params_on_conversion``\nopt = torch.optim.SGD(new_m.parameters(), lr=1e-3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Conclusion\n\nTo recap, in this tutorial we learned about ``torch.load(mmap=True)``, the\n``torch.device()`` context manager with ``device=meta``, and\n``nn.Module.load_state_dict(assign=True)`` as well as how these tools could\nbe used to aid when loading a model from a checkpoint.\n\n" + "## \u7ed3\u8bba\n\n\u603b\u7ed3\u4e00\u4e0b,\u5728\u672c\u6559\u7a0b\u4e2d,\u6211\u4eec\u5b66\u4e60\u4e86 ``torch.load(mmap=True)``\u3001``device='meta'`` \u7684 ``torch.device()`` \u4e0a\u4e0b\u6587\u7ba1\u7406\u5668\u548c ``nn.Module.load_state_dict(assign=True)``\n\u4ee5\u53ca\u5982\u4f55\u5728\u4ece\u68c0\u67e5\u70b9\u52a0\u8f7d\u6a21\u578b\u65f6\u4f7f\u7528\u8fd9\u4e9b\u5de5\u5177\u6765\u63d0\u9ad8\u6548\u7387\u3002\n\n" ] } ], diff --git a/docs/_sources/recipes/profile_with_itt.rst.txt 
b/docs/_sources/recipes/profile_with_itt.rst.txt index 7ddb1ab..0d8e794 100644 --- a/docs/_sources/recipes/profile_with_itt.rst.txt +++ b/docs/_sources/recipes/profile_with_itt.rst.txt @@ -1,112 +1,112 @@ -Profiling PyTorch workloads with The Instrumentation and Tracing Technology (ITT) API +使用 Instrumentation and Tracing Technology (ITT) API 分析 PyTorch 工作负载 ===================================================================================== -In this recipe, you will learn: +在本教程中,您将学习: -* What is Intel® VTune™ Profiler -* What is Instrumentation and Tracing Technology (ITT) API -* How to visualize PyTorch model hierarchy in Intel® VTune™ Profiler -* A short sample code showcasing how to use PyTorch ITT APIs +* 什么是 Intel® VTune™ Profiler +* 什么是 Instrumentation and Tracing Technology (ITT) API +* 如何在 Intel® VTune™ Profiler 中可视化 PyTorch 模型层次结构 +* 一个简短的示例代码,展示如何使用 PyTorch ITT API -Requirements +要求 ------------ -* PyTorch 1.13 or later +* PyTorch 1.13 或更高版本 * Intel® VTune™ Profiler -The instructions for installing PyTorch are available at `pytorch.org `__. +安装 PyTorch 的说明可在 `pytorch.org `__ 上找到。 -What is Intel® VTune™ Profiler +什么是 Intel® VTune™ Profiler ------------------------------ -Intel® VTune™ Profiler is a performance analysis tool for serial and multithreaded applications. For those who are familiar with Intel Architecture, Intel® VTune™ Profiler provides a rich set of metrics to help users understand how the application executed on Intel platforms, and thus have an idea where the performance bottleneck is. +Intel® VTune™ Profiler 是一款用于串行和多线程应用程序的性能分析工具。对于熟悉 Intel 架构的人来说,Intel® VTune™ Profiler 提供了丰富的指标集,帮助用户了解应用程序在 Intel 平台上的执行情况,从而了解性能瓶颈所在。 -More detailed information, including a Getting Started guide, are available `on the Intel website `__. +更多详细信息,包括入门指南,可在 `Intel 网站 `__ 上找到。 -What is Instrumentation and Tracing Technology (ITT) API +什么是 Instrumentation and Tracing Technology (ITT) API -------------------------------------------------------- -`The Instrumentation and Tracing Technology API (ITT API) `_ provided by the Intel® VTune™ Profiler enables target application to generate and control the collection of trace data during its execution. +`Instrumentation and Tracing Technology API (ITT API) `_ 由 Intel® VTune™ Profiler 提供,使目标应用程序能够在执行期间生成和控制跟踪数据的收集。 -The advantage of ITT feature is to label time span of individual PyTorch operators, as well as customized regions, on Intel® VTune™ Profiler GUI. When users find anything abnormal, it will be very helpful to locate which operator behaved unexpectedly. +ITT 功能的优势在于能够在 Intel® VTune™ Profiler GUI 上标记单个 PyTorch 算子和自定义区域的时间跨度。当用户发现任何异常时,这将非常有助于定位哪个算子表现异常。 .. note:: - The ITT API had been integrated into PyTorch since 1.13. Users don't need to invoke the original ITT C/C++ APIs, but only need to invoke the Python APIs in PyTorch. More detailed information can be found at `PyTorch Docs `__. + ITT API 已在 PyTorch 1.13 中集成。用户无需调用原始的 ITT C/C++ API,只需调用 PyTorch 中的 Python API 即可。更多详细信息可在 `PyTorch 文档 `__ 中找到。 -How to visualize PyTorch model hierarchy in Intel® VTune™ Profiler +如何在 Intel® VTune™ Profiler 中可视化 PyTorch 模型层次结构 ------------------------------------------------------------------ -Two types of usage are provided in PyTorch: +PyTorch 提供了两种使用方式: -1. Implicit invocation: By default, all operators that are registered by following the PyTorch operator registration mechanism will be labeled by ITT feature automatically when its feature is enabled. +1. 隐式调用: 默认情况下,所有通过 PyTorch 算子注册机制注册的算子在启用 ITT 功能时都会自动标记。 -2. 
Explicit invocation: If customized labeling is needed, users can use APIs mentioned at `PyTorch Docs `__ explicitly to label a desired range. +2. 显式调用: 如果需要自定义标记,用户可以在 `PyTorch 文档 `__ 中使用显式 API 对所需范围进行标记。 -To enable explicit invocation, code which are expected to be labeled should be invoked under a `torch.autograd.profiler.emit_itt()` scope. For example: +要启用显式调用,需要在 `torch.autograd.profiler.emit_itt()` 作用域下调用预期标记的代码。例如: .. code:: python3 with torch.autograd.profiler.emit_itt(): -Launch Intel® VTune™ Profiler +启动 Intel® VTune™ Profiler ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -To verify the functionality, you need to start an Intel® VTune™ Profiler instance. Please check the `Intel® VTune™ Profiler User Guide `__ for steps to launch Intel® VTune™ Profiler. +要验证功能,您需要启动一个 Intel® VTune™ Profiler 实例。启动 Intel® VTune™ Profiler 的步骤请查看 `Intel® VTune™ Profiler 用户指南 `__。 -Once you get the Intel® VTune™ Profiler GUI launched, you should see a user interface as below: +一旦启动了 Intel® VTune™ Profiler GUI,您应该会看到如下用户界面: .. figure:: /_static/img/itt_tutorial/vtune_start.png :width: 100% :align: center -Three sample results are available on the left side navigation bar under `sample (matrix)` project. If you do not want profiling results appear in this default sample project, you can create a new project via the button `New Project...` under the blue `Configure Analysis...` button. To start a new profiling, click the blue `Configure Analysis...` button to initiate configuration of the profiling. +左侧导航栏下的 `sample (matrix)` 项目中有三个示例结果。如果您不希望分析结果出现在此默认示例项目中,可以通过蓝色 `Configure Analysis...` 按钮下的 `New Project...` 按钮创建一个新项目。要启动新的分析,请单击蓝色的 `Configure Analysis...` 按钮以开始配置分析。 -Configure Profiling +配置分析 ~~~~~~~~~~~~~~~~~~~ -Once you click the `Configure Analysis...` button, you should see the screen below: +单击 `Configure Analysis...` 按钮后,您应该会看到如下界面: .. figure:: /_static/img/itt_tutorial/vtune_config.png :width: 100% :align: center -The right side of the windows is split into 3 parts: `WHERE` (top left), `WHAT` (bottom left), and `HOW` (right). With `WHERE`, you can assign a machine where you want to run the profiling on. With `WHAT`, you can set the path of the application that you want to profile. To profile a PyTorch script, it is recommended to wrap all manual steps, including activating a Python environment and setting required environment variables, into a bash script, then profile this bash script. In the screenshot above, we wrapped all steps into the `launch.sh` bash script and profile `bash` with the parameter to be ``. On the right side `HOW`, you can choose whatever type that you would like to profile. Intel® VTune™ Profiler provides a bunch of profiling types that you can choose from. Details can be found at `Intel® VTune™ Profiler User Guide `__. +窗口的右侧分为三部分: `WHERE`(左上角)、`WHAT`(左下角)和 `HOW`(右侧)。在 `WHERE` 中,您可以指定要在哪台机器上运行分析。在 `WHAT` 中,您可以设置要分析的应用程序的路径。要分析 PyTorch 脚本,建议将所有手动步骤(包括激活 Python 环境和设置所需环境变量)封装到一个 bash 脚本中,然后对该 bash 脚本进行分析。在上面的截图中,我们将所有步骤封装到 `launch.sh` bash 脚本中,并将 `bash` 的参数设置为 `` 的路径。在右侧的 `HOW` 中,您可以选择要分析的类型。Intel® VTune™ Profiler 提供了多种可选的分析类型。详情请查看 `Intel® VTune™ Profiler 用户指南 `__。 -Read Profiling Result +读取分析结果 ~~~~~~~~~~~~~~~~~~~~~ -With a successful profiling with ITT, you can open `Platform` tab of the profiling result to see labels in the Intel® VTune™ Profiler timeline. +成功进行了带有 ITT 的分析后,您可以打开分析结果的 `Platform` 选项卡,在 Intel® VTune™ Profiler 时间线上查看标记。 .. 
figure:: /_static/img/itt_tutorial/vtune_timeline.png :width: 100% :align: center -The timeline shows the main thread as a `python` thread on the top, and individual OpenMP threads below. Labeled PyTorch operators and customized regions are shown in the main thread row. All operators starting with `aten::` are operators labeled implicitly by the ITT feature in PyTorch. Labels `iteration_N` are explicitly labeled with specific APIs `torch.profiler.itt.range_push()`, `torch.profiler.itt.range_pop()` or `torch.profiler.itt.range()` scope. Please check the sample code in the next section for details. +时间线显示了顶部的主线程作为 `python` 线程,下面是各个 OpenMP 线程。标记的 PyTorch 算子和自定义区域显示在主线程行中。所有以 `aten::` 开头的算子都是由 PyTorch 中的 ITT 功能隐式标记的。标签 `iteration_N` 是使用特定的 API `torch.profiler.itt.range_push()`、`torch.profiler.itt.range_pop()` 或 `torch.profiler.itt.range()` 作用域显式标记的。请查看下一节中的示例代码以了解详情。 .. note:: - Red boxes marked with `convolution` and `reorder` are labeled from Intel® oneAPI Deep Neural Network Library (oneDNN). + 时间线中标记为 `convolution` 和 `reorder` 的红色框是由 Intel® oneAPI Deep Neural Network Library (oneDNN) 标记的。 -As illustrated on the right side navigation bar, brown portions in the timeline rows show CPU usage of individual threads. The percerntage of height of a thread row that the brown portion occupies at a timestamp aligns with that of the CPU usage in that thread at that timestamp. Thus, it is intuitive from this timeline to understand the followings: +如右侧导航栏所示,时间线行中的棕色部分显示了各个线程的 CPU 使用情况。在某个时间点,棕色部分在线程行中所占的高度百分比与该线程在该时间点的 CPU 使用率相对应。因此,从这个时间线可以直观地了解以下几点: -1. How well CPU cores are utilized on each thread. -2. How balance CPU cores are utilized on all threads. Do all threads have good CPU usage? -3. How well OpenMP threads are synchronized. Are there jitters when starting OpenMP threads or OpenMP threads finish. +1. 每个线程的 CPU 核心利用率如何。 +2. 所有线程的 CPU 核心利用率是否平衡。所有线程的 CPU 使用情况是否良好? +3. OpenMP 线程是否同步良好。启动 OpenMP 线程或 OpenMP 线程完成时是否存在抖动? -Of course there are much more enriched sets of profiling features that Intel® VTune™ Profiler provides to help you understand a performance issue. When you understand the root cause of a performance issue, you can get it fixed. More detailed usage instructions are available at `Intel® VTune™ Profiler User Guide `__. +当然,Intel® VTune™ Profiler 还提供了更多丰富的分析功能,帮助您了解性能问题的根源。一旦您了解了性能问题的根源,就可以加以修复。更多详细的使用说明可在 `Intel® VTune™ Profiler 用户指南 `__ 中找到。 -A short sample code showcasing how to use PyTorch ITT APIs +一个简短的示例代码,展示如何使用 PyTorch ITT API ---------------------------------------------------------- -The sample code below is the script that was used for profiling in the screenshots above. +下面的示例代码就是在上面的截图中用于分析的脚本。 -The topology is formed by two operators, `Conv2d` and `Linear`. Three iterations of inference were performed. Each iteration was labeled by PyTorch ITT APIs as text string `iteration_N`. Either pair of `torch.profile.itt.range_push` and `torch.profile.itt.range_pop` or `torch.profile.itt.range` scope does the customized labeling feature. +该拓扑由两个算子 `Conv2d` 和 `Linear` 组成。进行了三次推理迭代,每次迭代都使用 PyTorch ITT API 标记为文本字符串 `iteration_N`。无论是使用 `torch.profile.itt.range_push` 和 `torch.profile.itt.range_pop` 的配对,还是使用 `torch.profile.itt.range` 作用域,都可以实现自定义标记功能。 .. code:: python3 @@ -132,12 +132,12 @@ The topology is formed by two operators, `Conv2d` and `Linear`. 
Three iterations x = torch.rand(10, 3, 244, 244) with torch.autograd.profiler.emit_itt(): for i in range(3) - # Labeling a region with pair of range_push and range_pop + # 使用 range_push 和 range_pop 配对标记区域 #torch.profiler.itt.range_push(f'iteration_{i}') #m(x) #torch.profiler.itt.range_pop() - # Labeling a region with range scope + # 使用 range 作用域标记区域 with torch.profiler.itt.range(f'iteration_{i}'): m(x) @@ -145,7 +145,7 @@ The topology is formed by two operators, `Conv2d` and `Linear`. Three iterations main() -The `launch.sh` bash script, mentioned in the Intel® VTune™ Profiler GUI screenshot, to wrap all manual steps is shown below. +下面是在 Intel® VTune™ Profiler GUI 截图中提到的 `launch.sh` bash 脚本,用于封装所有手动步骤。 .. code:: bash @@ -153,8 +153,8 @@ The `launch.sh` bash script, mentioned in the Intel® VTune™ Profiler GUI scre #!/bin/bash - # Retrieve the directory path where the path contains both the sample.py and launch.sh so that this bash script can be invoked from any directory + # 获取包含 sample.py 和 launch.sh 的目录路径,以便从任何目录调用此 bash 脚本 BASEFOLDER=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) - + <激活 Python 环境> cd ${BASEFOLDER} python sample.py diff --git a/docs/_sources/recipes/recipes/Captum_Recipe.rst.txt b/docs/_sources/recipes/recipes/Captum_Recipe.rst.txt index 914c72b..fcfe8f4 100644 --- a/docs/_sources/recipes/recipes/Captum_Recipe.rst.txt +++ b/docs/_sources/recipes/recipes/Captum_Recipe.rst.txt @@ -18,150 +18,143 @@ .. _sphx_glr_recipes_recipes_Captum_Recipe.py: -Model Interpretability using Captum +使用 Captum 进行模型可解释性 =================================== -.. GENERATED FROM PYTHON SOURCE LINES 9-22 +.. GENERATED FROM PYTHON SOURCE LINES 7-16 -Captum helps you understand how the data features impact your model -predictions or neuron activations, shedding light on how your model -operates. +Captum 可以帮助您了解数据特征如何影响模型的预测或神经元激活,从而揭示模型的工作原理。 -Using Captum, you can apply a wide range of state-of-the-art feature -attribution algorithms such as \ ``Guided GradCam``\ and -\ ``Integrated Gradients``\ in a unified way. +使用 Captum,您可以统一地应用广泛的最先进的特征归因算法,如 ``Guided GradCam`` 和 ``Integrated Gradients``。 -In this recipe you will learn how to use Captum to: +在本教程中,您将学习如何使用 Captum: -- Attribute the predictions of an image classifier to their corresponding image features. -- Visualize the attribution results. +- 将图像分类器的预测归因于相应的图像特征。 +- 可视化归因结果。 -.. GENERATED FROM PYTHON SOURCE LINES 25-28 +.. GENERATED FROM PYTHON SOURCE LINES 18-21 -Before you begin +开始之前 ---------------- -.. GENERATED FROM PYTHON SOURCE LINES 31-36 +.. GENERATED FROM PYTHON SOURCE LINES 23-26 -Make sure Captum is installed in your active Python environment. Captum -is available both on GitHub, as a ``pip`` package, or as a ``conda`` -package. For detailed instructions, consult the installation guide at -https://captum.ai/ +确保在您的活跃 Python 环境中安装了 Captum。Captum 可以在 GitHub 上获取,也可以作为 ``pip`` 包或 ``conda`` 包获取。 +有关详细说明,请查阅安装指南 https://captum.ai/ -.. GENERATED FROM PYTHON SOURCE LINES 39-43 +.. GENERATED FROM PYTHON SOURCE LINES 28-30 -For a model, we use a built-in image classifier in PyTorch. Captum can -reveal which parts of a sample image support certain predictions made by -the model. +对于模型,我们使用 PyTorch 中的内置图像分类器。Captum 可以揭示样本图像的哪些部分支持了模型做出的某些预测。 -.. GENERATED FROM PYTHON SOURCE LINES 43-70 +.. GENERATED FROM PYTHON SOURCE LINES 30-63 .. 
code-block:: default + from io import BytesIO + import requests import torchvision - from torchvision import models, transforms from PIL import Image - import requests - from io import BytesIO + from torchvision import models, transforms - model = torchvision.models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval() + model = torchvision.models.resnet18( + weights=models.ResNet18_Weights.IMAGENET1K_V1 + ).eval() - response = requests.get("https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg") + response = requests.get( + "https://image.freepik.com/free-photo/two-beautiful-puppies-cat-dog_58409-6024.jpg" + ) img = Image.open(BytesIO(response.content)) - center_crop = transforms.Compose([ - transforms.Resize(256), - transforms.CenterCrop(224), - ]) - - normalize = transforms.Compose([ - transforms.ToTensor(), # converts the image to a tensor with values between 0 and 1 - transforms.Normalize( # normalize to follow 0-centered imagenet pixel RGB distribution - mean=[0.485, 0.456, 0.406], - std=[0.229, 0.224, 0.225] - ) - ]) + center_crop = transforms.Compose( + [ + transforms.Resize(256), + transforms.CenterCrop(224), + ] + ) + + normalize = transforms.Compose( + [ + transforms.ToTensor(), # 将图像转换为值在 0 到 1 之间的张量 + transforms.Normalize( # 归一化以遵循 0 均值的 ImageNet 像素 RGB 分布 + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] + ), + ] + ) input_img = normalize(center_crop(img)).unsqueeze(0) +.. GENERATED FROM PYTHON SOURCE LINES 64-67 -.. GENERATED FROM PYTHON SOURCE LINES 71-74 - -Computing Attribution +计算归因 --------------------- -.. GENERATED FROM PYTHON SOURCE LINES 77-83 +.. GENERATED FROM PYTHON SOURCE LINES 69-73 -Among the top-3 predictions of the models are classes 208 and 283 which -correspond to dog and cat. +在模型的前 3 个预测中,类别 208 和 283 分别对应于狗和猫。 -Let us attribute each of these predictions to the corresponding part of -the input, using Captum’s \ ``Occlusion``\ algorithm. +让我们使用 Captum 的 ``Occlusion`` 算法将这些预测归因于输入的相应部分。 -.. GENERATED FROM PYTHON SOURCE LINES 83-108 +.. GENERATED FROM PYTHON SOURCE LINES 73-101 .. code-block:: default - from captum.attr import Occlusion + from captum.attr import Occlusion occlusion = Occlusion(model) - strides = (3, 9, 9) # smaller = more fine-grained attribution but slower - target=208, # Labrador index in ImageNet - sliding_window_shapes=(3,45, 45) # choose size enough to change object appearance - baselines = 0 # values to occlude the image with. 0 corresponds to gray + strides = (3, 9, 9) # 步长越小,归因越细粒度,但速度越慢 + target = (208,) # ImageNet 中的拉布拉多索引 + sliding_window_shapes = (3, 45, 45) # 选择足以改变对象外观的大小 + baselines = 0 # 用于遮挡图像的值。0 对应灰色 - attribution_dog = occlusion.attribute(input_img, - strides = strides, - target=target, - sliding_window_shapes=sliding_window_shapes, - baselines=baselines) + attribution_dog = occlusion.attribute( + input_img, + strides=strides, + target=target, + sliding_window_shapes=sliding_window_shapes, + baselines=baselines, + ) - target=283, # Persian cat index in ImageNet - attribution_cat = occlusion.attribute(input_img, - strides = strides, - target=target, - sliding_window_shapes=sliding_window_shapes, - baselines=0) + target = (283,) # ImageNet 中的波斯猫索引 + attribution_cat = occlusion.attribute( + input_img, + strides=strides, + target=target, + sliding_window_shapes=sliding_window_shapes, + baselines=0, + ) +.. GENERATED FROM PYTHON SOURCE LINES 102-108 -.. 
GENERATED FROM PYTHON SOURCE LINES 109-119 +除了 ``Occlusion`` 之外,Captum 还提供了许多算法,如 ``Integrated Gradients``、``Deconvolution``、 +``GuidedBackprop``、``Guided GradCam``、``DeepLift`` 和 ``GradientShap``。所有这些算法都是 ``Attribution`` 的子类, +在初始化时需要将您的模型作为可调用的 ``forward_func``传入,并具有 ``attribute(...)`` 方法,该方法以统一的格式返回归因结果。 -Besides ``Occlusion``, Captum features many algorithms such as -\ ``Integrated Gradients``\ , \ ``Deconvolution``\ , -\ ``GuidedBackprop``\ , \ ``Guided GradCam``\ , \ ``DeepLift``\ , and -\ ``GradientShap``\ . All of these algorithms are subclasses of -``Attribution`` which expects your model as a callable ``forward_func`` -upon initialization and has an ``attribute(...)`` method which returns -the attribution result in a unified format. +让我们可视化计算出的图像归因结果。 -Let us visualize the computed attribution results in case of images. +.. GENERATED FROM PYTHON SOURCE LINES 110-113 -.. GENERATED FROM PYTHON SOURCE LINES 122-125 - -Visualizing the Results +可视化结果 ----------------------- -.. GENERATED FROM PYTHON SOURCE LINES 128-132 +.. GENERATED FROM PYTHON SOURCE LINES 115-117 -Captum’s \ ``visualization``\ utility provides out-of-the-box methods -to visualize attribution results both for pictorial and for textual -inputs. +Captum 的 ``visualization`` 实用程序提供了开箱即用的方法,用于可视化图像和文本输入的归因结果。 -.. GENERATED FROM PYTHON SOURCE LINES 132-164 +.. GENERATED FROM PYTHON SOURCE LINES 117-154 .. code-block:: default @@ -169,61 +162,62 @@ inputs. import numpy as np from captum.attr import visualization as viz - # Convert the compute attribution tensor into an image-like numpy array - attribution_dog = np.transpose(attribution_dog.squeeze().cpu().detach().numpy(), (1,2,0)) + # 将计算出的归因张量转换为类似图像的 numpy 数组 + attribution_dog = np.transpose( + attribution_dog.squeeze().cpu().detach().numpy(), (1, 2, 0) + ) vis_types = ["heat_map", "original_image"] - vis_signs = ["all", "all"] # "positive", "negative", or "all" to show both - # positive attribution indicates that the presence of the area increases the prediction score - # negative attribution indicates distractor areas whose absence increases the score - - _ = viz.visualize_image_attr_multiple(attribution_dog, - np.array(center_crop(img)), - vis_types, - vis_signs, - ["attribution for dog", "image"], - show_colorbar = True - ) + vis_signs = ["all", "all"] # "positive"、"negative" 或 "all" 以显示两者 + # 正归因表示该区域的存在会增加预测分数 + # 负归因表示该区域的缺失会增加预测分数 + _ = viz.visualize_image_attr_multiple( + attribution_dog, + np.array(center_crop(img)), + vis_types, + vis_signs, + ["attribution for dog", "image"], + show_colorbar=True, + ) - attribution_cat = np.transpose(attribution_cat.squeeze().cpu().detach().numpy(), (1,2,0)) - _ = viz.visualize_image_attr_multiple(attribution_cat, - np.array(center_crop(img)), - ["heat_map", "original_image"], - ["all", "all"], # positive/negative attribution or all - ["attribution for cat", "image"], - show_colorbar = True - ) + attribution_cat = np.transpose( + attribution_cat.squeeze().cpu().detach().numpy(), (1, 2, 0) + ) + _ = viz.visualize_image_attr_multiple( + attribution_cat, + np.array(center_crop(img)), + ["heat_map", "original_image"], + ["all", "all"], # 正/负归因或全部 + ["attribution for cat", "image"], + show_colorbar=True, + ) -.. GENERATED FROM PYTHON SOURCE LINES 165-169 +.. GENERATED FROM PYTHON SOURCE LINES 155-158 -If your data is textual, ``visualization.visualize_text()`` offers a -dedicated view to explore attribution on top of the input text. 
Find out -more at http://captum.ai/tutorials/IMDB_TorchText_Interpret +如果您的数据是文本,``visualization.visualize_text()`` 提供了一个专用视图,用于探索输入文本的归因。 +更多信息请访问 http://captum.ai/tutorials/IMDB_TorchText_Interpret -.. GENERATED FROM PYTHON SOURCE LINES 172-175 +.. GENERATED FROM PYTHON SOURCE LINES 160-163 -Final Notes +最后注意 ----------- -.. GENERATED FROM PYTHON SOURCE LINES 178-191 +.. GENERATED FROM PYTHON SOURCE LINES 165-175 -Captum can handle most model types in PyTorch across modalities -including vision, text, and more. With Captum you can: \* Attribute a -specific output to the model input as illustrated above. \* Attribute a -specific output to a hidden-layer neuron (see Captum API reference). \* -Attribute a hidden-layer neuron response to the model input (see Captum -API reference). +Captum 可以处理 PyTorch 中包括视觉、文本等各种模态的大多数模型类型。使用 Captum 您可以: +* 将特定输出归因于模型输入,如上所示。 +* 将特定输出归因于隐藏层神经元(参见 Captum API 参考)。 +* 将隐藏层神经元响应归因于模型输入(参见 Captum API 参考)。 -For complete API of the supported methods and a list of tutorials, -consult our website http://captum.ai +有关支持方法的完整 API 和教程列表,请查阅我们的网站 http://captum.ai -Another useful post by Gilbert Tanner: +Gilbert Tanner 的另一篇有用文章: https://gilberttanner.com/blog/interpreting-pytorch-models-with-captum diff --git a/docs/_sources/recipes/recipes/benchmark.rst.txt b/docs/_sources/recipes/recipes/benchmark.rst.txt index b2c609a..a96ca7e 100644 --- a/docs/_sources/recipes/recipes/benchmark.rst.txt +++ b/docs/_sources/recipes/recipes/benchmark.rst.txt @@ -20,58 +20,48 @@ PyTorch Benchmark ==================================== -This recipe provides a quick-start guide to using PyTorch -``benchmark`` module to measure and compare code performance. +本教程提供了使用 PyTorch ``benchmark`` 模块来测量和比较代码性能的快速入门指南。 -Introduction +介绍 ------------ -Benchmarking is an important step in writing code. It helps -us validate that our code meets performance expectations, -compare different approaches to solving the same problem and -prevent performance regressions. - -There are many options when it comes to benchmarking PyTorch code -including the Python builtin ``timeit`` module. However, benchmarking -PyTorch code has many caveats that can be easily overlooked such as -managing the number of threads and synchronizing CUDA devices. Moreover, -generating Tensor inputs for benchmarking can be quite tedious. - -This recipe demonstrates how to use PyTorch ``benchmark`` module to avoid -common mistakes while making it easier to compare performance of -different code, generate input for benchmarking and more. - -Setup +基准测试是编写代码时的一个重要步骤。它帮助我们验证代码是否满足性能预期,比较解决同一问题的不同方法,并防止性能裂化。 + +对于基准测试 PyTorch 代码有许多选择,包括 Python 内置的 ``timeit`` 模块。 +然而,基准测试 PyTorch 代码有许多容易被忽视的注意事项,例如管理线程数量和同步 CUDA 设备。 +此外,为基准测试生成张量输入可能相当繁琐。 + +本教程演示了如何使用 PyTorch ``benchmark`` 模块来避免常见错误,同时更容易比较不同代码的性能、为基准测试生成输入等。 + +设置 ----- -Before we begin, install ``torch`` if it isn’t already available. +在开始之前,如果尚未安装 ``torch``,请先安装。 :: pip install torch -.. GENERATED FROM PYTHON SOURCE LINES 36-56 +.. GENERATED FROM PYTHON SOURCE LINES 27-45 -Steps +具体步骤 ----- -1. Defining functions to benchmark -2. Benchmarking with ``timeit.Timer`` -3. Benchmarking with ``torch.utils.benchmark.Timer`` -4. Benchmarking with ``Blocked Autorange`` -5. Comparing benchmark results -6. Saving/Loading benchmark results -7. Generating inputs with ``Fuzzed Parameters`` -8. Collecting instruction counts with ``Callgrind`` +1. 定义要基准测试的函数 +2. 使用 ``timeit.Timer`` 进行基准测试 +3. 使用 ``torch.utils.benchmark.Timer`` 进行基准测试 +4. 使用 ``Blocked Autorange`` 进行基准测试 +5. 比较基准测试结果 +6. 保存/加载基准测试结果 +7. 
使用 ``Fuzzed Parameters`` 生成输入 +8. 使用 ``Callgrind`` 收集指令计数 -1. Defining functions to benchmark +1. 定义要基准测试的函数 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -As of the time of this writing, `torch.dot `__ -does not support batched mode, so we will compare two approaches to -implementing it using existing ``torch`` operators: one approach uses a -combination of ``mul`` and ``sum`` while the other reduces the problem to ``bmm``. +在撰写本文时, `torch.dot `__ +不支持批量模式,因此我们将比较使用现有 ``torch`` 运算符实现它的两种方法:一种方法使用 ``mul`` 和 ``sum`` 的组合,另一种方法使用 ``bmm``。 -.. GENERATED FROM PYTHON SOURCE LINES 56-79 +.. GENERATED FROM PYTHON SOURCE LINES 45-68 .. code-block:: default @@ -80,12 +70,12 @@ combination of ``mul`` and ``sum`` while the other reduces the problem to ``bmm` def batched_dot_mul_sum(a, b): - '''Computes batched dot by multiplying and summing''' + """Computes batched dot by multiplying and summing""" return a.mul(b).sum(-1) def batched_dot_bmm(a, b): - '''Computes batched dot by reducing to ``bmm``''' + """Computes batched dot by reducing to ``bmm``""" a = a.reshape(-1, 1, a.shape[-1]) b = b.reshape(-1, b.shape[-1], 1) return torch.bmm(a, b).flatten(-3) @@ -99,17 +89,15 @@ combination of ``mul`` and ``sum`` while the other reduces the problem to ``bmm` -.. GENERATED FROM PYTHON SOURCE LINES 80-87 +.. GENERATED FROM PYTHON SOURCE LINES 69-74 -2. Benchmarking with ``timeit.Timer`` +2. 使用 ``timeit.Timer`` 进行基准测试 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +首先,让我们使用 Python 内置的 ``timeit`` 模块对代码进行基准测试。 +我们在这里保持基准测试代码简单,以便我们可以比较 ``timeit`` 和 ``torch.utils.benchmark`` 的默认设置。 -First, let's benchmark the code using Python's builtin ``timeit`` module. -We keep the benchmark code simple here so we can compare the defaults -of ``timeit`` and ``torch.utils.benchmark``. - -.. GENERATED FROM PYTHON SOURCE LINES 87-103 +.. GENERATED FROM PYTHON SOURCE LINES 74-92 .. code-block:: default @@ -117,20 +105,22 @@ of ``timeit`` and ``torch.utils.benchmark``. import timeit t0 = timeit.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + ) t1 = timeit.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + ) - print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') - print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') + print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") + print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") -.. GENERATED FROM PYTHON SOURCE LINES 104-110 +.. GENERATED FROM PYTHON SOURCE LINES 93-99 .. code-block:: none :caption: Output @@ -139,18 +129,15 @@ of ``timeit`` and ``torch.utils.benchmark``. bmm(x, x): 70.0 us -.. GENERATED FROM PYTHON SOURCE LINES 113-121 +.. GENERATED FROM PYTHON SOURCE LINES 102-107 -3. Benchmarking with ``torch.utils.benchmark.Timer`` +3. 使用 ``torch.utils.benchmark.Timer`` 进行基准测试 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +PyTorch ``benchmark``模块的设计使得对于那些曾经使用过 ``timeit`` 模块的人来说,它看起来很熟悉。 +然而,它的默认设置使得它更容易且更安全地用于对 PyTorch 代码进行基准测试。 +首先让我们对比一下基本API的使用。 -PyTorch ``benchmark`` module was designed to be familiar to those who -have used the ``timeit`` module before. However, its defaults make it -easier and safer to use for benchmarking PyTorch code. Let's first -compare the same basic API as above. - - -.. 
GENERATED FROM PYTHON SOURCE LINES 121-137 +.. GENERATED FROM PYTHON SOURCE LINES 107-125 .. code-block:: default @@ -158,20 +145,22 @@ compare the same basic API as above. import torch.utils.benchmark as benchmark t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + ) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + ) print(t0.timeit(100)) print(t1.timeit(100)) -.. GENERATED FROM PYTHON SOURCE LINES 138-152 +.. GENERATED FROM PYTHON SOURCE LINES 126-140 .. code-block:: none :caption: Output @@ -188,53 +177,50 @@ compare the same basic API as above. 1 measurement, 100 runs , 1 thread -.. GENERATED FROM PYTHON SOURCE LINES 154-169 +.. GENERATED FROM PYTHON SOURCE LINES 142-152 -Even though the APIs are the same for the basic functionality, there -are some important differences. ``benchmark.Timer.timeit()`` returns the -time per run as opposed to the total runtime like ``timeit.Timer.timeit()`` -does. PyTorch ``benchmark`` module also provides formatted string -representations for printing the results. +虽然基本功能的API是相同的,但是还是有一些重要的区别。 +``benchmark.Timer.timeit()``返回的是每次运行的时间,而不是 ``timeit.Timer.timeit()`` 返回的总运行时间。 +PyTorch ``benchmark``模块还提供了格式化的字符串表示,用于打印结果。 -Another important difference, and the reason why the results diverge -is that PyTorch benchmark module runs in a single thread by default. -We can change the number of threads with the ``num_threads`` argument. +另一个重要的区别,也是结果不同的原因,是PyTorch基准测试模块默认在单线程中运行。 +我们可以使用``num_threads``参数来更改线程数量。 -``torch.utils.benchmark.Timer`` takes several additional arguments -including: ``label``, ``sub_label``, ``description`` and ``env`` which change -the __repr__ of the measurement object returned and are used for -grouping the results (more on this later). +``torch.utils.benchmark.Timer``接受几个额外的参数,包括: ``label``、``sub_label``、``description``和``env``, +这些参数会改变返回的测量对象的__repr__,并用于对结果进行分组(稍后会详细介绍)。 -.. GENERATED FROM PYTHON SOURCE LINES 169-192 +.. GENERATED FROM PYTHON SOURCE LINES 152-177 .. code-block:: default num_threads = torch.get_num_threads() - print(f'Benchmarking on {num_threads} threads') + print(f"Benchmarking on {num_threads} threads") t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}, + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, num_threads=num_threads, - label='Multithreaded batch dot', - sub_label='Implemented using mul and sum') + label="Multithreaded batch dot", + sub_label="Implemented using mul and sum", + ) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}, + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, num_threads=num_threads, - label='Multithreaded batch dot', - sub_label='Implemented using bmm') + label="Multithreaded batch dot", + sub_label="Implemented using bmm", + ) print(t0.timeit(100)) print(t1.timeit(100)) -.. GENERATED FROM PYTHON SOURCE LINES 193-207 +.. GENERATED FROM PYTHON SOURCE LINES 178-192 .. code-block:: none :caption: Output @@ -251,42 +237,42 @@ grouping the results (more on this later). 
68.21 us 1 measurement, 100 runs , 40 threads -.. GENERATED FROM PYTHON SOURCE LINES 209-217 +.. GENERATED FROM PYTHON SOURCE LINES 194-200 -Running ``benchmark`` with all threads available gives similar results -as the ``timeit`` module. More importantly, which version is faster -depends on how many threads we run the code with. This is why it's -important to benchmark the code with thread settings that are -representative of real use cases. Another important thing to remember -is to synchronize CPU and CUDA when benchmarking on the GPU. Let's run -the above benchmarks again on a CUDA tensor and see what happens. +使用所有可用线程运行 ``benchmark`` 会得到与 ``timeit`` 模块类似的结果。 +更重要的是,哪个版本更快取决于我们使用多少线程运行代码。 +这就是为什么在基准测试时,使用与实际用例相符的线程设置非常重要。 +另一个需要记住的重要事情是,在 GPU 上进行基准测试时,要同步CPU和CUDA。 +让我们再次在CUDA张量上运行上面的基准测试,看看会发生什么。 -.. GENERATED FROM PYTHON SOURCE LINES 217-236 +.. GENERATED FROM PYTHON SOURCE LINES 200-221 .. code-block:: default - x = torch.randn(10000, 1024, device='cuda') + x = torch.randn(10000, 1024, device="cuda") t0 = timeit.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + ) t1 = timeit.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + ) # Ran each twice to show difference before/after warm-up - print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') - print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us') - print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') - print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us') + print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") + print(f"mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us") + print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") + print(f"bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us") -.. GENERATED FROM PYTHON SOURCE LINES 237-245 +.. GENERATED FROM PYTHON SOURCE LINES 222-230 .. code-block:: none :caption: Output @@ -297,27 +283,29 @@ the above benchmarks again on a CUDA tensor and see what happens. bmm(x, x): 22.4 us -.. GENERATED FROM PYTHON SOURCE LINES 245-260 +.. GENERATED FROM PYTHON SOURCE LINES 230-247 .. code-block:: default t0 = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}) + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + ) t1 = benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}) + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + ) # Run only once since benchmark module does warm-up for us print(t0.timeit(100)) print(t1.timeit(100)) -.. GENERATED FROM PYTHON SOURCE LINES 261-275 +.. GENERATED FROM PYTHON SOURCE LINES 248-262 .. code-block:: none :caption: Output @@ -334,39 +322,28 @@ the above benchmarks again on a CUDA tensor and see what happens. 1 measurement, 100 runs , 1 thread -.. GENERATED FROM PYTHON SOURCE LINES 277-288 +.. GENERATED FROM PYTHON SOURCE LINES 264-270 -The results reveal something interesting. The first run of the ``bmm`` -version using the ``timeit`` module takes much longer than the second -run. 
This is because ``bmm`` calls into `cuBLAS` which needs to be -loaded the first time it's called which takes some time. This is why -it's important to do a warm-up run before benchmarking, luckily for -us, PyTorch's ``benchmark`` module takes care of that. +结果揭示了一些有趣的事情。使用 `timeit` 模块运行 `bmm` 版本的第一次运行比第二次运行慢很多。 +这是因为 `bmm` 需要调用 `cuBLAS`,第一次调用时需要加载它,这需要一些时间。 +这就是为什么在基准测试之前做一次预热运行很重要,幸运的是, PyTorch 的 `benchmark` 模块为我们处理了这个问题。 -The difference in the results between ``timeit`` and ``benchmark`` modules -is because the `timeit` module is not synchronizing CUDA and is thus only -timing the time to launch the kernel. PyTorch's ``benchmark`` module does -the synchronization for us. +`timeit` 模块和 `benchmark` 模块之间结果的差异是因为 `timeit` 模块没有同步 CUDA,因此只计时了启动内核的时间。 +PyTorch 的 `benchmark` 模块为我们做了同步。 -.. GENERATED FROM PYTHON SOURCE LINES 291-306 +.. GENERATED FROM PYTHON SOURCE LINES 273-282 -4. Benchmarking with `Blocked Autorange` +4. 使用 `Blocked Autorange` 进行基准测试 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -While ``timeit.Timer.autorange`` takes a single continuous measurement -of at least 0.2 seconds, `torch.utils.benchmark.blocked_autorange` -takes many measurements whose times total at least 0.2 seconds (which -can be changed by the `min_run_time` parameter) subject to the constraint -that timing overhead is a small fraction of the overall measurement. -This is accomplished by first running with an increasing number of runs -per loop until the runtime is much larger than measurement overhead -(which also serves as a warm up), and then taking measurements until -the target time is reached. This has the useful properties that it wastes -less data and allows us to compute statistics to estimate the reliability -of the measurements. +虽然 `timeit.Timer.autorange` 采取至少 0.2 秒的单次连续测量, +但 `torch.utils.benchmark.blocked_autorange` 采取多次测量,其总时间至少为 0.2 秒(可通过 `min_run_time` 参数更改), +并且测量开销只占总体测量的一小部分。 +这是通过首先以递增的循环次数运行,直到运行时间远大于测量开销(这也起到了热身的作用), +然后进行测量直到达到目标时间。这有一个有用的特性,即它浪费的数据更少,并且允许我们计算统计数据来估计测量的可靠性。 -.. GENERATED FROM PYTHON SOURCE LINES 306-313 +.. GENERATED FROM PYTHON SOURCE LINES 282-289 .. code-block:: default @@ -378,7 +355,7 @@ of the measurements. print(m1) -.. GENERATED FROM PYTHON SOURCE LINES 314-328 +.. GENERATED FROM PYTHON SOURCE LINES 290-304 .. code-block:: none :caption: Output @@ -395,12 +372,11 @@ of the measurements. 2 measurements, 1000 runs per measurement, 1 thread -.. GENERATED FROM PYTHON SOURCE LINES 330-332 +.. GENERATED FROM PYTHON SOURCE LINES 306-307 -We can also inspect the individual statistics from the returned -measurements object. +我们还可以查看返回的测量对象中获得的各个统计数据。 -.. GENERATED FROM PYTHON SOURCE LINES 332-336 +.. GENERATED FROM PYTHON SOURCE LINES 307-311 .. code-block:: default @@ -409,7 +385,7 @@ measurements object. print(f"Median: {m0.median * 1e6:6.2f} us") -.. GENERATED FROM PYTHON SOURCE LINES 337-343 +.. GENERATED FROM PYTHON SOURCE LINES 312-318 .. code-block:: none :caption: Output @@ -418,22 +394,19 @@ measurements object. Median: 231.79 us -.. GENERATED FROM PYTHON SOURCE LINES 345-357 +.. GENERATED FROM PYTHON SOURCE LINES 320-329 -5. Comparing benchmark results +5. 比较基准测试结果 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -So far we've been comparing our two versions of batched dot against a -single input. In practice, we want to try a combination of inputs as -well as different number of threads. The ``Compare`` class helps display -the results of many measurements in a formatted table. 
It uses the -annotations described above (`label`, `sub_label`, `num_threads`, etc.) as -well as `description` to group and organize the table. Let's use -``Compare`` to see how our functions perform for different input sizes -and number of threads. +到目前为止,我们一直在比较我们的两个批量点积版本对同一输入的表现。 +在实践中,我们希望尝试不同的输入组合以及不同的线程数量。 +`Compare` 类帮助我们以格式化表格的形式显示多个测量结果。 +它使用上面描述的注释( `label`、 `sub_label`、 `num_threads` 等)以及 `description` 来对表格进行分组和组织。 +让我们使用 `Compare` 来看看我们的函数在不同的输入大小和线程数量下的表现如何。 -.. GENERATED FROM PYTHON SOURCE LINES 357-393 +.. GENERATED FROM PYTHON SOURCE LINES 329-369 .. code-block:: default @@ -447,40 +420,44 @@ and number of threads. for b, n in product(sizes, sizes): # label and sub_label are the rows # description is the column - label = 'Batched dot' - sub_label = f'[{b}, {n}]' + label = "Batched dot" + sub_label = f"[{b}, {n}]" x = torch.ones((b, n)) for num_threads in [1, 4, 16, 32]: - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals={'x': x}, - num_threads=num_threads, - label=label, - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals={'x': x}, - num_threads=num_threads, - label=label, - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals={"x": x}, + num_threads=num_threads, + label=label, + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals={"x": x}, + num_threads=num_threads, + label=label, + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.print() -.. GENERATED FROM PYTHON SOURCE LINES 394-470 +.. GENERATED FROM PYTHON SOURCE LINES 370-446 .. code-block:: none :caption: Output [--------------- Batched dot ----------------] - | mul/sum | bmm + | mul/sum | bmm 1 threads: ----------------------------------- [1, 1] | 5.9 | 11.2 [1, 64] | 6.4 | 11.4 @@ -553,16 +530,14 @@ and number of threads. Times are in microseconds (us). -.. GENERATED FROM PYTHON SOURCE LINES 472-478 +.. GENERATED FROM PYTHON SOURCE LINES 448-452 -The results above indicate that the version which reduces to ``bmm`` -is better for larger tensors running on multiple threads, while for -smaller and/or single thread code, the other version is better. +上面的结果表明,对于在多线程上运行的较大张量, `bmm` 的版本效果更好, +而对于较小和/或单线程代码,另一个版本效果更好。 -``Compare`` also provides functions for changing the table format +`Compare` 还提供了用于更改表格格式的函数 - -.. GENERATED FROM PYTHON SOURCE LINES 478-484 +.. GENERATED FROM PYTHON SOURCE LINES 452-458 .. code-block:: default @@ -573,26 +548,22 @@ smaller and/or single thread code, the other version is better. -.. GENERATED FROM PYTHON SOURCE LINES 485-501 +.. GENERATED FROM PYTHON SOURCE LINES 459-471 -6. Saving/Loading benchmark results +6. 保存/加载基准测试结果 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -`Measurements` (and ``CallgrindStats`` which are described in section 8) -can be serialized by the ``pickle`` module. This makes A/B testing easy, as you can collect -measurements from two separate environments, pickle them, and then -load both in a single environment. 
Timer even takes an `env` -constructor argument so that such A/B testing works seamlessly. +`Measurements` (和第8节中描述的 `CallgrindStats` )可以通过 `pickle` 模块序列化。 +这使得A/B测试变得很容易,因为您可以从两个独立的环境中收集测量结果, +将它们序列化,然后在单个环境中加载两者。Timer甚至接受一个 `env` +构造函数参数,以便这种A/B测试可以无缝衔接。 -Let's imagine that rather than two Python functions, the add/sum -and ``bmm`` approaches were in two different builds of PyTorch. -The example below demonstrates how one might A/B test them. For -simplicity, we only use a subset of shapes, and simply round trip -results through pickle rather than actually using multiple environments -and writing results to disk. +假设 add/sum 和 `bmm` 方法不是两个Python函数,而是 PyTorch 的两个不同版本。 +下面的示例演示了如何进行A/B测试。为了简单起见,我们只使用了一部分数据, +并简单地通过pickle来回传结果,而不是实际使用多个环境并将结果写入磁盘。 -.. GENERATED FROM PYTHON SOURCE LINES 501-525 +.. GENERATED FROM PYTHON SOURCE LINES 471-497 .. code-block:: default @@ -600,16 +571,18 @@ and writing results to disk. import pickle ab_test_results = [] - for env in ('environment A: mul/sum', 'environment B: bmm'): + for env in ("environment A: mul/sum", "environment B: bmm"): for b, n in ((1, 1), (1024, 10000), (10000, 1)): x = torch.ones((b, n)) - dot_fn = (batched_dot_mul_sum if env == 'environment A: mul/sum' else batched_dot_bmm) + dot_fn = ( + batched_dot_mul_sum if env == "environment A: mul/sum" else batched_dot_bmm + ) m = benchmark.Timer( - stmt='batched_dot(x, x)', - globals={'x': x, 'batched_dot': dot_fn}, + stmt="batched_dot(x, x)", + globals={"x": x, "batched_dot": dot_fn}, num_threads=1, - label='Batched dot', - description=f'[{b}, {n}]', + label="Batched dot", + description=f"[{b}, {n}]", env=env, ).blocked_autorange(min_run_time=1) ab_test_results.append(pickle.dumps(m)) @@ -621,7 +594,7 @@ and writing results to disk. compare.print() -.. GENERATED FROM PYTHON SOURCE LINES 526-537 +.. GENERATED FROM PYTHON SOURCE LINES 498-509 .. code-block:: none :caption: Output @@ -635,47 +608,50 @@ and writing results to disk. Times are in microseconds (us). -.. GENERATED FROM PYTHON SOURCE LINES 537-543 +.. GENERATED FROM PYTHON SOURCE LINES 509-515 .. code-block:: default - # And just to show that we can round trip all of the results from earlier: + # 仅为展示可以将之前所有的结果通过 pickle 进行回传: round_tripped_results = pickle.loads(pickle.dumps(results)) - assert(str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results))) + assert str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results)) -.. GENERATED FROM PYTHON SOURCE LINES 544-555 +.. GENERATED FROM PYTHON SOURCE LINES 516-524 -7. Generating inputs with `Fuzzed Parameters` +7. 使用 `Fuzzed Parameters` 生成输入 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -As we've seen in the previous section, there can be some stark -performance differences depending on the input tensors. Hence, it -is a good idea to run benchmarks on a number of different inputs. -However, creating all these input tensors can be tedious which is -where ``torch.utils.benchmark.Fuzzer`` and related classes come in. -Let's take a look at how we can use the ``Fuzzer`` to create some test -cases for the benchmark. +正如我们在上一节中看到的,根据输入张量的不同,性能差异可能会很大。 +因此,在多个不同的输入上运行基准测试是一个好主意。 +但是,创建所有这些输入张量可能会很麻烦,这就是 `torch.utils.benchmark.Fuzzer` +和相关类的用武之地。让我们看看如何使用 `Fuzzer` 来创建一些用于基准测试的测试用例。 -.. GENERATED FROM PYTHON SOURCE LINES 555-596 +.. GENERATED FROM PYTHON SOURCE LINES 524-575 .. 
code-block:: default - from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias + from torch.utils.benchmark import FuzzedParameter, FuzzedTensor, Fuzzer, ParameterAlias - # Generates random tensors with 128 to 10000000 elements and sizes k0 and k1 chosen from a - # ``loguniform`` distribution in [1, 10000], 40% of which will be discontiguous on average. + # 生成随机张量,元素数量在 128 到 10000000 之间,大小 k0 和 k1 从 [1, 10000] 的 `loguniform` 分布中选择, + # 其中平均 40% 将是不连续的。 example_fuzzer = Fuzzer( - parameters = [ - FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'), - FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'), + parameters=[ + FuzzedParameter("k0", minval=1, maxval=10000, distribution="loguniform"), + FuzzedParameter("k1", minval=1, maxval=10000, distribution="loguniform"), ], - tensors = [ - FuzzedTensor('x', size=('k0', 'k1'), min_elements=128, max_elements=10000000, probability_contiguous=0.6) + tensors=[ + FuzzedTensor( + "x", + size=("k0", "k1"), + min_elements=128, + max_elements=10000000, + probability_contiguous=0.6, + ) ], seed=0, ) @@ -683,36 +659,40 @@ cases for the benchmark. results = [] for tensors, tensor_params, params in example_fuzzer.take(10): # description is the column label - sub_label=f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.print() -.. GENERATED FROM PYTHON SOURCE LINES 597-616 +.. GENERATED FROM PYTHON SOURCE LINES 576-595 .. code-block:: none :caption: Output [--------------------- Batched dot ---------------------] - | mul/sum | bmm + | mul/sum | bmm 1 threads: ---------------------------------------------- 725 x 257 | 87 | 180 49 x 383 | 15 | 30 @@ -725,19 +705,17 @@ cases for the benchmark. 78 x 5 (discontiguous) | 9 | 20 187 x 1 | 12 | 10 - Times are in microseconds (us). + Times are in microseconds (us). -.. GENERATED FROM PYTHON SOURCE LINES 618-624 +.. GENERATED FROM PYTHON SOURCE LINES 597-601 -There is a lot of flexibility for defining your own ``fuzzers`` which -is great for creating a powerful set of inputs to benchmark. But to -make things even simpler, PyTorch benchmark module comes with some -built-in ``fuzzers`` for common benchmarking needs. Let's take a look at -how we can use one of these built-in ``fuzzers``. 
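It can help to see what ``Fuzzer.take()`` yields for each generated example: a dict of tensors keyed by the names given to ``FuzzedTensor``, per-tensor metadata, and the sampled parameter values. A minimal sketch, assuming the ``example_fuzzer`` defined above:

.. code-block:: default


    # Sketch only: unpack the (tensors, tensor_params, params) triple that
    # ``example_fuzzer.take(n)`` yields for each generated test case.
    for tensors, tensor_params, params in example_fuzzer.take(3):
        x = tensors["x"]                          # the generated tensor itself
        print(
            x.shape,                              # (k0, k1)
            tensor_params["x"]["is_contiguous"],  # per-tensor metadata
            params["k0"],                         # sampled parameter values
            params["k1"],
        )
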
+定义自己的 `fuzzers` 有很大的灵活性,这对于创建强大的输入集进行基准测试非常有用。 +但为了让事情变得更简单, PyTorch 基准测试模块为常见的基准测试需求提供了一些内置的 `fuzzers`。 +让我们看看如何使用其中一个内置的 `fuzzers` 。 -.. GENERATED FROM PYTHON SOURCE LINES 624-652 +.. GENERATED FROM PYTHON SOURCE LINES 601-633 .. code-block:: default @@ -746,23 +724,27 @@ how we can use one of these built-in ``fuzzers``. results = [] for tensors, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10): - sub_label=f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" - results.append(benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='from __main__ import batched_dot_mul_sum', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='mul/sum', - ).blocked_autorange(min_run_time=1)) - results.append(benchmark.Timer( - stmt='batched_dot_bmm(x, x)', - setup='from __main__ import batched_dot_bmm', - globals=tensors, - label='Batched dot', - sub_label=sub_label, - description='bmm', - ).blocked_autorange(min_run_time=1)) + sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" + results.append( + benchmark.Timer( + stmt="batched_dot_mul_sum(x, x)", + setup="from __main__ import batched_dot_mul_sum", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="mul/sum", + ).blocked_autorange(min_run_time=1) + ) + results.append( + benchmark.Timer( + stmt="batched_dot_bmm(x, x)", + setup="from __main__ import batched_dot_bmm", + globals=tensors, + label="Batched dot", + sub_label=sub_label, + description="bmm", + ).blocked_autorange(min_run_time=1) + ) compare = benchmark.Compare(results) compare.trim_significant_figures() @@ -770,13 +752,13 @@ how we can use one of these built-in ``fuzzers``. compare.print() -.. GENERATED FROM PYTHON SOURCE LINES 653-672 +.. GENERATED FROM PYTHON SOURCE LINES 634-653 .. code-block:: none :caption: Output [----------------------- Batched dot ------------------------] - | mul/sum | bmm + | mul/sum | bmm 1 threads: --------------------------------------------------- 64 x 473 (discontiguous) | 10000 | 40000 16384 x 12642115 (discontiguous) | 31 | 78 @@ -792,33 +774,27 @@ how we can use one of these built-in ``fuzzers``. Times are in microseconds (us). -.. GENERATED FROM PYTHON SOURCE LINES 674-697 +.. GENERATED FROM PYTHON SOURCE LINES 655-672 -8. Collecting instruction counts with ``Callgrind`` +8. 使用 `Callgrind` 收集指令计数 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -One of the challenges of optimizing code is the variation and opacity of -wall time. There are many sources of non-determinism, from adaptive clock -speeds to resource contention with other processes. Furthermore, end-to-end -time gives no insight into where time is being spent, which is really what -we're interested in when optimizing code. +优化代码的一个挑战是时间的变化和不透明性。有许多不确定性的来源, +从自适应时钟速度到与其他进程的资源争用。此外,端到端时间并不能揭示时间花费在哪里, +而这正是我们在优化代码时感兴趣的。 -A complementary approach is to also collect instruction counts. These counts -are a proxy metric and do not capture all aspects of performance -(e.g. memory or I/O bound tasks), however they do have several useful -properties. Instruction counts are reproducible, insensitive to environmental -variation, and offer fine grained insight into where a program is spending -cycles. 
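Before the full C++ comparison below, a minimal sketch of collecting instruction counts in isolation may be useful. It assumes ``valgrind`` is installed on the system; because ``collect_callgrind`` runs the statement in a fresh subprocess, the workload is kept entirely inside the ``setup`` string:

.. code-block:: default


    from torch.utils.benchmark import Timer

    # Keep the statement self-contained so the Callgrind subprocess can
    # reproduce it without importing anything from __main__.
    timer = Timer(
        stmt="x.mul(x).sum(-1)",
        setup="import torch; x = torch.ones((64, 64))",
    )

    stats = timer.collect_callgrind()   # requires valgrind to be installed
    # ``denoise=True`` drops Python interpreter functions that are known to
    # jitter, leaving a reproducible total instruction count.
    print(stats.counts(denoise=True))
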
+一种补充方法是也收集指令计数。这些计数是一种代理指标,并不能捕获性能的所有方面 +(例如内存或I/O绑定任务),但它们确实具有一些有用的特性。指令计数是可重复的, +不受环境变化的影响,并且可以提供对程序在哪里花费周期的细粒度洞察。 -To see the utility of instruction counts, let us look at how we might -reduce the overhead of `batched_dot_mul_sum`. The obvious solution is to -move it to C++, so we avoid going between Python and C++ multiple times. +为了看到指令计数的实用性,让我们看看如何减少 `batched_dot_mul_sum` 的开销。 +显而易见的解决方案是将其移至 C++ ,这样我们就可以避免在 Python 和 C++ 之间多次来回切换。 -Fortunately, the source is nearly identical. One question that we have to ask -in C++ is whether we should take arguments by value or reference. +幸运的是,源代码几乎是相同的。在 C++ 中我们必须问的一个问题是, +我们是通过值还是引用来传递参数。 -.. GENERATED FROM PYTHON SOURCE LINES 697-774 +.. GENERATED FROM PYTHON SOURCE LINES 672-757 .. code-block:: default @@ -842,25 +818,26 @@ in C++ is whether we should take arguments by value or reference. """ - # PyTorch makes it easy to test our C++ implementations by providing a utility - # to JIT compile C++ source into Python extensions: + # PyTorch 提供一个实用程序来 JIT 编译 C++ 源代码为 Python 扩展, + # 使得测试我们的 C++ 实现变得很容易: import os + from torch.utils import cpp_extension + cpp_lib = cpp_extension.load_inline( - name='cpp_lib', + name="cpp_lib", cpp_sources=batched_dot_src, - extra_cflags=['-O3'], + extra_cflags=["-O3"], extra_include_paths=[ - # `load_inline` needs to know where to find ``pybind11`` headers. - os.path.join(os.getenv('CONDA_PREFIX'), 'include') + # `load_inline`需要知道`pybind11`头文件的位置。 + os.path.join(os.getenv("CONDA_PREFIX"), "include") ], - functions=['batched_dot_mul_sum_v0', 'batched_dot_mul_sum_v1'] + functions=["batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"], ) - # `load_inline` will create a shared object that is loaded into Python. When we collect - # instruction counts Timer will create a subprocess, so we need to re-import it. The - # import process is slightly more complicated for C extensions, but that's all we're - # doing here. + # `load_inline` 将创建一个共享对象,并加载到Python中。当我们收集指令计数时, + # Timer将创建一个子进程,因此我们需要重新导入它。对于C扩展,导入过程略有不同, + # 但这就是我们在这里所做的。 module_import_str = f"""\ # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path import importlib.util @@ -869,38 +846,45 @@ in C++ is whether we should take arguments by value or reference. spec.loader.exec_module(cpp_lib)""" import textwrap + + def pretty_print(result): """Import machinery for ``cpp_lib.so`` can get repetitive to look at.""" - print(repr(result).replace(textwrap.indent(module_import_str, " "), " import cpp_lib")) + print( + repr(result).replace( + textwrap.indent(module_import_str, " "), " import cpp_lib" + ) + ) t_baseline = benchmark.Timer( - stmt='batched_dot_mul_sum(x, x)', - setup='''\ + stmt="batched_dot_mul_sum(x, x)", + setup="""\ from __main__ import batched_dot_mul_sum - x = torch.randn(2, 2)''') + x = torch.randn(2, 2)""", + ) t0 = benchmark.Timer( - stmt='cpp_lib.batched_dot_mul_sum_v0(x, x)', - setup=f'''\ + stmt="cpp_lib.batched_dot_mul_sum_v0(x, x)", + setup=f"""\ {module_import_str} - x = torch.randn(2, 2)''') + x = torch.randn(2, 2)""", + ) t1 = benchmark.Timer( - stmt='cpp_lib.batched_dot_mul_sum_v1(x, x)', - setup=f'''\ + stmt="cpp_lib.batched_dot_mul_sum_v1(x, x)", + setup=f"""\ {module_import_str} - x = torch.randn(2, 2)''') + x = torch.randn(2, 2)""", + ) - # Moving to C++ did indeed reduce overhead, but it's hard to tell which - # calling convention is more efficient. v1 (call with references) seems to - # be a bit faster, but it's within measurement error. 
+ # 转移到 C++ 确实减少了开销,但很难判断哪种调用约定更有效。v1(使用引用调用)似乎稍快一些,但在测量误差范围内。 pretty_print(t_baseline.blocked_autorange()) pretty_print(t0.blocked_autorange()) pretty_print(t1.blocked_autorange()) -.. GENERATED FROM PYTHON SOURCE LINES 775-803 +.. GENERATED FROM PYTHON SOURCE LINES 758-786 .. code-block:: none :caption: Output @@ -931,31 +915,26 @@ in C++ is whether we should take arguments by value or reference. 1 measurement, 100000 runs , 1 thread -.. GENERATED FROM PYTHON SOURCE LINES 803-843 +.. GENERATED FROM PYTHON SOURCE LINES 786-820 .. code-block:: default - # Let's use ``Callgrind`` to determine which is better. + # 让我们使用 ``Callgrind`` 来确定哪种方式更好。 stats_v0 = t0.collect_callgrind() stats_v1 = t1.collect_callgrind() pretty_print(stats_v0) pretty_print(stats_v1) - # `.as_standardized` removes file names and some path prefixes, and makes - # it easier to read the function symbols. + # `.as_standardized` 移除了文件名和某些路径前缀,使函数符号更易读。 stats_v0 = stats_v0.as_standardized() stats_v1 = stats_v1.as_standardized() - # `.delta` diffs the instruction counts, and `.denoise` removes several - # functions in the Python interpreter that are known to have significant - # jitter. + # `.delta` 对指令计数进行差分, `.denoise` 则移除了 Python 解释器中已知存在显著抖动的几个函数。 delta = stats_v1.delta(stats_v0).denoise() - # `.transform` is a convenience API for transforming function names. It is - # useful for increasing cancelation when ``diff-ing`` instructions, as well as - # just generally improving readability. + # `.transform` 是一个转换函数名的便利 API。它在进行 ``diff-ing`` 时很有用,因为可以增加抵消,同时也能提高可读性。 replacements = ( ("???:void pybind11", "pybind11"), ("batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"), @@ -966,17 +945,16 @@ in C++ is whether we should take arguments by value or reference. for before, after in replacements: delta = delta.transform(lambda l: l.replace(before, after)) - # We can use print options to control how much of the function to display. + # 我们可以使用打印选项来控制显示函数的多少内容。 torch.set_printoptions(linewidth=160) - # Once parsed, the instruction counts make clear that passing `a` and `b` - # by reference is more efficient as it skips some ``c10::TensorImpl`` bookkeeping - # for the intermediate Tensors, and is also works better with ``pybind11``. This - # is consistent with our noisy wall time observations. + # 解析后,指令计数清楚地表明,通过引用传递 `a` 和 `b` 更有效, + # 因为它跳过了一些 `c10::TensorImpl` 中间张量的簿记操作,并且与 `pybind11` 也更兼容。 + # 这与我们有噪声时间观察结果一致。 print(delta) -.. GENERATED FROM PYTHON SOURCE LINES 844-879 +.. GENERATED FROM PYTHON SOURCE LINES 821-856 .. code-block:: @@ -1014,12 +992,12 @@ in C++ is whether we should take arguments by value or reference. Total: -13693 -.. GENERATED FROM PYTHON SOURCE LINES 882-889 +.. GENERATED FROM PYTHON SOURCE LINES 859-866 -Learn More +学习更多 ---------- -Take a look at these other recipes to continue your learning: +查看其他教程继续学习: - `PyTorch Profiler `_ diff --git a/docs/_sources/recipes/recipes/dynamic_quantization.rst.txt b/docs/_sources/recipes/recipes/dynamic_quantization.rst.txt index bc9eec2..e5c7382 100644 --- a/docs/_sources/recipes/recipes/dynamic_quantization.rst.txt +++ b/docs/_sources/recipes/recipes/dynamic_quantization.rst.txt @@ -18,338 +18,271 @@ .. _sphx_glr_recipes_recipes_dynamic_quantization.py: -Dynamic Quantization +动态量化 ==================== -In this recipe you will see how to take advantage of Dynamic -Quantization to accelerate inference on an LSTM-style recurrent neural -network. This reduces the size of the model weights and speeds up model -execution. 
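For orientation, the single call this recipe revolves around can be sketched on a toy model first; the layer set and dtype below mirror the recipe, while the two-layer ``nn.Linear`` stack is only a stand-in for the LSTM used later:

.. code-block:: default


    import torch
    import torch.nn as nn

    # Toy stand-in model: dynamic quantization swaps the Linear layers for
    # int8-weight versions while the rest of the module is left untouched.
    model_fp32 = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
    model_int8 = torch.quantization.quantize_dynamic(
        model_fp32,          # module to convert
        {nn.Linear},         # layer types to replace
        dtype=torch.qint8,   # target weight dtype
    )

    # Same calling convention as the fp32 model; outputs differ only by a
    # small quantization error.
    x = torch.randn(4, 8)
    print((model_fp32(x) - model_int8(x)).abs().max())
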
+在这个示例中,您将看到如何利用动态量化来加速 LSTM 风格的循环神经网络的推理。这可以减小模型权重的大小,并加快模型执行速度。 -Introduction +介绍 ------------- -There are a number of trade-offs that can be made when designing neural -networks. During model development and training you can alter the -number of layers and number of parameters in a recurrent neural network -and trade-off accuracy against model size and/or model latency or -throughput. Such changes can take lot of time and compute resources -because you are iterating over the model training. Quantization gives -you a way to make a similar trade off between performance and model -accuracy with a known model after training is completed. +在设计神经网络时,可以做出多种权衡。在模型开发和训练期间,您可以改变循环神经网络中的层数和参数数量,在模型大小和/或模型延迟或吞吐量与精度之间进行权衡。由于您需要重复模型训练过程,因此这种改变需要大量的时间和计算资源。量化为您提供了一种在已知模型上在性能和模型精度之间进行权衡的方式,而无需重新训练模型。 -You can give it a try in a single session and you will certainly reduce -your model size significantly and may get a significant latency -reduction without losing a lot of accuracy. +您可以在单个会话中尝试一下,您肯定会显著减小模型大小,并可能在不会损失太多精度的情况下获得显著的延迟减少。 -What is dynamic quantization? +什么是动态量化? ----------------------------- -Quantizing a network means converting it to use a reduced precision -integer representation for the weights and/or activations. This saves on -model size and allows the use of higher throughput math operations on -your CPU or GPU. +量化网络意味着将其转换为使用较低精度的整数表示形式来表示权重和/或激活。这可以减小模型大小,并允许在 CPU 或 GPU 上使用更高吞吐量的数学运算。 -When converting from floating point to integer values you are -essentially multiplying the floating point value by some scale factor -and rounding the result to a whole number. The various quantization -approaches differ in the way they approach determining that scale -factor. +从浮点数转换为整数值时,您实际上是将浮点数乘以某个比例因子,然后将结果舍入为整数。不同的量化方法在确定该比例因子的方式上有所不同。 -The key idea with dynamic quantization as described here is that we are -going to determine the scale factor for activations dynamically based on -the data range observed at runtime. This ensures that the scale factor -is "tuned" so that as much signal as possible about each observed -dataset is preserved. +这里介绍的动态量化的关键思想是,我们将根据运行时观察到的数据范围动态确定激活的比例因子。这可确保比例因子被"调整"为尽可能保留每个观察到的数据集的信号。 -The model parameters on the other hand are known during model conversion -and they are converted ahead of time and stored in INT8 form. +另一方面,模型参数在模型转换期间是已知的,它们会提前转换并以 INT8 形式存储。 -Arithmetic in the quantized model is done using vectorized INT8 -instructions. Accumulation is typically done with INT16 or INT32 to -avoid overflow. This higher precision value is scaled back to INT8 if -the next layer is quantized or converted to FP32 for output. +量化模型中的算术运算使用矢量化的 INT8 指令完成。累加通常使用 INT16 或 INT32 来避免溢出。如果下一层是量化的,则将此较高精度值缩放回 INT8;如果是输出,则将其转换为 FP32。 -Dynamic quantization is relatively free of tuning parameters which makes -it well suited to be added into production pipelines as a standard part -of converting LSTM models to deployment. +动态量化相对来说没有太多需要调整的参数,因此非常适合作为将 LSTM 模型转换为部署的标准部分添加到生产管道中。 +.. note:: + 本示例中采用的方法的局限性 + 本示例提供了对 PyTorch 中动态量化功能的快速介绍,以及使用它的工作流程。我们的重点是解释用于转换模型的特定函数。为了简洁和清晰,我们做出了一些重大简化,包括: -.. note:: - Limitations on the approach taken here - - - This recipe provides a quick introduction to the dynamic quantization - features in PyTorch and the workflow for using it. Our focus is on - explaining the specific functions used to convert the model. We will - make a number of significant simplifications in the interest of brevity - and clarity - - -1. You will start with a minimal LSTM network -2. You are simply going to initialize the network with a random hidden - state -3. 
You are going to test the network with random inputs -4. You are not going to train the network in this tutorial -5. You will see that the quantized form of this network is smaller and - runs faster than the floating point network we started with -6. You will see that the output values are generally in the same - ballpark as the output of the FP32 network, but we are not - demonstrating here the expected accuracy loss on a real trained - network - -You will see how dynamic quantization is done and be able to see -suggestive reductions in memory use and latency times. Providing a -demonstration that the technique can preserve high levels of model -accuracy on a trained LSTM is left to a more advanced tutorial. If you -want to move right away to that more rigorous treatment please proceed -to the `advanced dynamic quantization -tutorial `__. - -Steps -------------- +1. 您将从一个最小的 LSTM 网络开始 +2. 您只需用随机隐藏状态初始化网络 +3. 您将使用随机输入来测试网络 +4. 您不会在本教程中训练网络 +5. 您将看到,与我们开始时的浮点网络相比,量化后的网络更小且运行速度更快 +6. 您将看到,量化网络产生的输出张量值与 FP32 网络输出的值在同一数量级,但我们并未在这里展示该技术在经过训练的 LSTM 上能够保留较高模型精度的情况 -This recipe has 5 steps. +您将了解如何进行动态量化,并能够看到内存使用和延迟时间的潜在减小。关于该技术在经过训练的 LSTM 上能够保留较高模型精度的演示,将留待更高级的教程。如果您想直接进入更严格的处理,请继续学习 `高级动态量化教程 `__。 + +步骤 +------------- -1. Set Up - Here you define a very simple LSTM, import modules, and establish - some random input tensors. +本示例包含 5 个步骤。 -2. Do the Quantization - Here you instantiate a floating point model and then create quantized - version of it. +1. 设置 - 在这里,您定义一个非常简单的 LSTM,导入模块,并建立一些随机输入张量。 -3. Look at Model Size - Here you show that the model size gets smaller. +2. 执行量化 - 在这里,您实例化一个浮点模型,然后创建其量化版本。 -4. Look at Latency - Here you run the two models and compare model runtime (latency). +3. 查看模型大小 - 在这里,您显示模型大小变小了。 -5. Look at Accuracy - Here you run the two models and compare outputs. +4. 查看延迟 - 在这里,您运行两个模型并比较模型运行时间(延迟)。 +5. 查看精度 - 在这里,您运行两个模型并比较输出。 -1: Set Up +1: 设置 ~~~~~~~~~~~~~~~ -This is a straightforward bit of code to set up for the rest of the -recipe. +这是一段直接的代码,用于为本示例的其余部分做准备。 -The unique module we are importing here is torch.quantization which -includes PyTorch's quantized operators and conversion functions. We also -define a very simple LSTM model and set up some inputs. +我们在这里导入的唯一模块是 torch.quantization,它包含了 PyTorch 的量化算子和转换函数。我们还定义了一个非常简单的 LSTM 模型,并设置了一些输入。 -.. GENERATED FROM PYTHON SOURCE LINES 119-160 +.. GENERATED FROM PYTHON SOURCE LINES 64-111 .. code-block:: default - # import the modules used here in this recipe - import torch - import torch.quantization - import torch.nn as nn + # 导入本示例中使用的模块 import copy import os import time - # define a very, very simple LSTM for demonstration purposes - # in this case, we are wrapping ``nn.LSTM``, one layer, no preprocessing or postprocessing - # inspired by - # `Sequence Models and Long Short-Term Memory Networks tutorial `__. + import torch + import torch.nn as nn + import torch.quantization + + + # 为演示目的定义一个非常简单的 LSTM + # 在这种情况下,我们只是包装了 ``nn.LSTM``、一层,没有预处理或后处理 + # 受到以下教程的启发: + # `序列模型和长短期记忆网络教程 `_, 作者 Robert Guthrie + # 和 `动态量化教程 `__。 class lstm_for_demonstration(nn.Module): - """Elementary Long Short Term Memory style model which simply wraps ``nn.LSTM`` - Not to be used for anything other than demonstration. 
- """ - def __init__(self,in_dim,out_dim,depth): - super(lstm_for_demonstration,self).__init__() - self.lstm = nn.LSTM(in_dim,out_dim,depth) + """基本的长短期记忆风格模型,只是包装了 ``nn.LSTM`` + 不应用于除演示之外的任何其他用途。 + """ - def forward(self,inputs,hidden): - out,hidden = self.lstm(inputs,hidden) - return out, hidden + def __init__(self, in_dim, out_dim, depth): + super(lstm_for_demonstration, self).__init__() + self.lstm = nn.LSTM(in_dim, out_dim, depth) + def forward(self, inputs, hidden): + out, hidden = self.lstm(inputs, hidden) + return out, hidden - torch.manual_seed(29592) # set the seed for reproducibility - #shape parameters - model_dimension=8 - sequence_length=20 - batch_size=1 - lstm_depth=1 + torch.manual_seed(29592) # 设置种子以获得可重复结果 - # random data for input - inputs = torch.randn(sequence_length,batch_size,model_dimension) - # hidden is actually is a tuple of the initial hidden state and the initial cell state - hidden = (torch.randn(lstm_depth,batch_size,model_dimension), torch.randn(lstm_depth,batch_size,model_dimension)) + # 形状参数 + model_dimension = 8 + sequence_length = 20 + batch_size = 1 + lstm_depth = 1 + + # 随机输入数据 + inputs = torch.randn(sequence_length, batch_size, model_dimension) + # hidden 实际上是初始隐藏状态和初始细胞状态的元组 + hidden = ( + torch.randn(lstm_depth, batch_size, model_dimension), + torch.randn(lstm_depth, batch_size, model_dimension), + ) -.. GENERATED FROM PYTHON SOURCE LINES 161-174 +.. GENERATED FROM PYTHON SOURCE LINES 112-119 -2: Do the Quantization +2: 执行量化 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Now we get to the fun part. First we create an instance of the model -called ``float\_lstm`` then we are going to quantize it. We're going to use -the `torch.quantization.quantize_dynamic `__ function, which takes the model, then a list of the submodules -which we want to -have quantized if they appear, then the datatype we are targeting. This -function returns a quantized version of the original model as a new -module. +现在我们来执行有趣的部分。首先,我们创建一个名为 ``float_lstm`` 的模型实例,然后我们将对其进行量化。我们将使用 `torch.quantization.quantize_dynamic `__ 函数,它接受模型、我们希望量化的子模块列表(如果存在)以及目标数据类型。此函数返回原始模型的量化版本,作为一个新模块。 -That's all it takes. +就这么简单。 -.. GENERATED FROM PYTHON SOURCE LINES 174-191 +.. GENERATED FROM PYTHON SOURCE LINES 119-136 .. code-block:: default - # here is our floating point instance - float_lstm = lstm_for_demonstration(model_dimension, model_dimension,lstm_depth) + # 这是我们的浮点实例 + float_lstm = lstm_for_demonstration(model_dimension, model_dimension, lstm_depth) - # this is the call that does the work + # 这是执行量化的调用 quantized_lstm = torch.quantization.quantize_dynamic( float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8 ) - # show the changes that were made - print('Here is the floating point version of this module:') + # 显示所做的更改 + print("这是该模块的浮点版本:") print(float_lstm) - print('') - print('and now the quantized version:') + print("") + print("现在是量化版本:") print(quantized_lstm) -.. GENERATED FROM PYTHON SOURCE LINES 192-203 +.. GENERATED FROM PYTHON SOURCE LINES 137-141 -3. Look at Model Size +3. 查看模型大小 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We've quantized the model. What does that get us? Well the first -benefit is that we've replaced the FP32 model parameters with INT8 -values (and some recorded scale factors). This means about 75% less data -to store and move around. 
With the default values the reduction shown -below will be less than 75% but if you increase the model size above -(for example you can set model dimension to something like 80) this will -converge towards 4x smaller as the stored model size dominated more and -more by the parameter values. +我们已经量化了模型。这给我们带来了什么好处?好处之一是我们用 INT8 值(和一些记录的比例因子)替换了 FP32 模型参数。这意味着存储和移动数据的大小减小了约 75%。使用默认值时,下面显示的减小量将小于 75%,但如果您将模型大小增加到更大值(例如将 model_dimension 设置为 80),随着存储的模型大小越来越多地由参数值主导,减小量将趋近于 4 倍。 -.. GENERATED FROM PYTHON SOURCE LINES 203-217 +.. GENERATED FROM PYTHON SOURCE LINES 141-157 .. code-block:: default + def print_size_of_model(model, label=""): torch.save(model.state_dict(), "temp.p") - size=os.path.getsize("temp.p") - print("model: ",label,' \t','Size (KB):', size/1e3) - os.remove('temp.p') + size = os.path.getsize("temp.p") + print("模型: ", label, " \t", "大小 (KB):", size / 1e3) + os.remove("temp.p") return size - # compare the sizes - f=print_size_of_model(float_lstm,"fp32") - q=print_size_of_model(quantized_lstm,"int8") - print("{0:.2f} times smaller".format(f/q)) + + # 比较大小 + f = print_size_of_model(float_lstm, "fp32") + q = print_size_of_model(quantized_lstm, "int8") + print("{0:.2f} 倍更小".format(f / q)) -.. GENERATED FROM PYTHON SOURCE LINES 218-231 +.. GENERATED FROM PYTHON SOURCE LINES 158-167 -4. Look at Latency +4. 查看延迟 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The second benefit is that the quantized model will typically run -faster. This is due to a combinations of effects including at least: +第二个好处是量化模型通常会运行得更快。这是由于多种效果的组合,至少包括: -1. Less time spent moving parameter data in -2. Faster INT8 operations +1. 减少了移动参数数据所花费的时间 +2. INT8 操作更快 -As you will see the quantized version of this super-simple network runs -faster. This will generally be true of more complex networks but as they -say "your mileage may vary" depending on a number of factors including -the structure of the model and the hardware you are running on. +如您所见,这个超级简单的网络的量化版本运行速度更快。对于更复杂的网络通常也是如此,但正如他们所说,"您的里程可能会有所不同",这取决于许多因素,包括模型的结构和您运行的硬件。 -.. GENERATED FROM PYTHON SOURCE LINES 231-235 +.. GENERATED FROM PYTHON SOURCE LINES 167-171 .. code-block:: default - # compare the performance - print("Floating point FP32") + # 比较性能 + print("浮点 FP32") -.. GENERATED FROM PYTHON SOURCE LINES 236-239 +.. GENERATED FROM PYTHON SOURCE LINES 172-175 .. code-block:: python %timeit float_lstm.forward(inputs, hidden) -.. GENERATED FROM PYTHON SOURCE LINES 239-242 +.. GENERATED FROM PYTHON SOURCE LINES 175-178 .. code-block:: default - print("Quantized INT8") + print("量化 INT8") -.. GENERATED FROM PYTHON SOURCE LINES 243-246 +.. GENERATED FROM PYTHON SOURCE LINES 179-182 .. code-block:: python %timeit quantized_lstm.forward(inputs,hidden) -.. GENERATED FROM PYTHON SOURCE LINES 249-260 +.. GENERATED FROM PYTHON SOURCE LINES 185-191 -5: Look at Accuracy +5: 查看精度 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We are not going to do a careful look at accuracy here because we are -working with a randomly initialized network rather than a properly -trained one. However, I think it is worth quickly showing that the -quantized network does produce output tensors that are "in the same -ballpark" as the original one. +我们不会在这里仔细查看精度,因为我们使用的是随机初始化的网络,而不是经过正确训练的网络。但是,我认为值得快速展示一下量化网络确实产生了与原始网络"同一数量级"的输出张量值。 -For a more detailed analysis please see the more advanced tutorials -referenced at the end of this recipe. +有关更详细的分析,请参阅本示例结尾处引用的更高级教程。 -.. GENERATED FROM PYTHON SOURCE LINES 260-276 +.. GENERATED FROM PYTHON SOURCE LINES 191-211 .. 
code-block:: default - # run the float model + # 运行浮点模型 out1, hidden1 = float_lstm(inputs, hidden) mag1 = torch.mean(abs(out1)).item() - print('mean absolute value of output tensor values in the FP32 model is {0:.5f} '.format(mag1)) + print("FP32 模型中输出张量值的绝对值均值为 {0:.5f} ".format(mag1)) - # run the quantized model + # 运行量化模型 out2, hidden2 = quantized_lstm(inputs, hidden) mag2 = torch.mean(abs(out2)).item() - print('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2)) - - # compare them - mag3 = torch.mean(abs(out1-out2)).item() - print('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3,mag3/mag1*100)) + print("INT8 模型中输出张量值的绝对值均值为 {0:.5f}".format(mag2)) + + # 比较它们 + mag3 = torch.mean(abs(out1 - out2)).item() + print( + "输出张量之间差值的绝对值均值为 {0:.5f},或占 {1:.2f} 百分比".format( + mag3, mag3 / mag1 * 100 + ) + ) -.. GENERATED FROM PYTHON SOURCE LINES 277-295 +.. GENERATED FROM PYTHON SOURCE LINES 212-227 -Learn More +了解更多 ------------ -We've explained what dynamic quantization is, what benefits it brings, -and you have used the ``torch.quantization.quantize_dynamic()`` function -to quickly quantize a simple LSTM model. +我们已经解释了什么是动态量化,它带来了什么好处,您已经使用 ``torch.quantization.quantize_dynamic()`` 函数快速量化了一个简单的 LSTM 模型。 -This was a fast and high level treatment of this material; for more -detail please continue learning with `(beta) Dynamic Quantization on an LSTM Word Language Model Tutorial `_. +这是对该材料的快速和高级处理;要了解更多详细信息,请继续学习 `(beta) 动态量化 LSTM 词语言模型教程 `_。 -Additional Resources +其他资源 -------------------- -* `Quantization API Documentaion `_ -* `(beta) Dynamic Quantization on BERT `_ -* `(beta) Dynamic Quantization on an LSTM Word Language Model `_ -* `Introduction to Quantization on PyTorch `_ +* `量化 API 文档 `_ +* `(beta) 动态量化 BERT `_ +* `(beta) 动态量化 LSTM 词语言模型 `_ +* `PyTorch 量化介绍 `_ diff --git a/docs/_sources/recipes/recipes/index.rst.txt b/docs/_sources/recipes/recipes/index.rst.txt index 3102833..3b93220 100644 --- a/docs/_sources/recipes/recipes/index.rst.txt +++ b/docs/_sources/recipes/recipes/index.rst.txt @@ -91,18 +91,18 @@ PyTorch Recipes .. raw:: html -
+
.. only:: html .. image:: /recipes/recipes/images/thumb/sphx_glr_tensorboard_with_pytorch_thumb.png - :alt: How to use TensorBoard with PyTorch + :alt: 如何在PyTorch中使用TensorBoard :ref:`sphx_glr_recipes_recipes_tensorboard_with_pytorch.py` .. raw:: html -
How to use TensorBoard with PyTorch
+
如何在PyTorch中使用TensorBoard
@@ -159,52 +159,52 @@ PyTorch Recipes .. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_warmstarting_model_using_parameters_from_a_different_model_thumb.png - :alt: PyTorch 使用不同模型的参数对模型进行热启动 + .. image:: /recipes/recipes/images/thumb/sphx_glr_reasoning_about_shapes_thumb.png + :alt: 在PyTorch中推理形状 - :ref:`sphx_glr_recipes_recipes_warmstarting_model_using_parameters_from_a_different_model.py` + :ref:`sphx_glr_recipes_recipes_reasoning_about_shapes.py` .. raw:: html -
PyTorch 使用不同模型的参数对模型进行热启动
+
在PyTorch中推理形状
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_saving_and_loading_a_general_checkpoint_thumb.png - :alt: PyTorch 保存和加载通用检查点 + .. image:: /recipes/recipes/images/thumb/sphx_glr_warmstarting_model_using_parameters_from_a_different_model_thumb.png + :alt: PyTorch 使用不同模型的参数对模型进行热启动 - :ref:`sphx_glr_recipes_recipes_saving_and_loading_a_general_checkpoint.py` + :ref:`sphx_glr_recipes_recipes_warmstarting_model_using_parameters_from_a_different_model.py` .. raw:: html -
PyTorch 保存和加载通用检查点
+
PyTorch 使用不同模型的参数对模型进行热启动
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_reasoning_about_shapes_thumb.png - :alt: Reasoning about Shapes in PyTorch + .. image:: /recipes/recipes/images/thumb/sphx_glr_saving_and_loading_a_general_checkpoint_thumb.png + :alt: PyTorch 保存和加载通用检查点 - :ref:`sphx_glr_recipes_recipes_reasoning_about_shapes.py` + :ref:`sphx_glr_recipes_recipes_saving_and_loading_a_general_checkpoint.py` .. raw:: html -
Reasoning about Shapes in PyTorch
+
PyTorch 保存和加载通用检查点
@@ -278,103 +278,103 @@ PyTorch Recipes .. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_module_load_state_dict_tips_thumb.png - :alt: Tips for Loading an ``nn.Module`` from a Checkpoint + .. image:: /recipes/recipes/images/thumb/sphx_glr_timer_quick_start_thumb.png + :alt: Timer快速入门 - :ref:`sphx_glr_recipes_recipes_module_load_state_dict_tips.py` + :ref:`sphx_glr_recipes_recipes_timer_quick_start.py` .. raw:: html -
Tips for Loading an ``nn.Module`` from a Checkpoint
+
Timer快速入门
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_timer_quick_start_thumb.png - :alt: Timer快速入门 + .. image:: /recipes/recipes/images/thumb/sphx_glr_Captum_Recipe_thumb.png + :alt: 使用 Captum 进行模型可解释性 - :ref:`sphx_glr_recipes_recipes_timer_quick_start.py` + :ref:`sphx_glr_recipes_recipes_Captum_Recipe.py` .. raw:: html -
Timer快速入门
+
使用 Captum 进行模型可解释性
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_zeroing_out_gradients_thumb.png - :alt: PyTorch 中清零梯度 + .. image:: /recipes/recipes/images/thumb/sphx_glr_dynamic_quantization_thumb.png + :alt: 动态量化 - :ref:`sphx_glr_recipes_recipes_zeroing_out_gradients.py` + :ref:`sphx_glr_recipes_recipes_dynamic_quantization.py` .. raw:: html -
PyTorch 中清零梯度
+
动态量化
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_dynamic_quantization_thumb.png - :alt: Dynamic Quantization + .. image:: /recipes/recipes/images/thumb/sphx_glr_zeroing_out_gradients_thumb.png + :alt: PyTorch 中清零梯度 - :ref:`sphx_glr_recipes_recipes_dynamic_quantization.py` + :ref:`sphx_glr_recipes_recipes_zeroing_out_gradients.py` .. raw:: html -
Dynamic Quantization
+
PyTorch 中清零梯度
.. raw:: html -
+
.. only:: html - .. image:: /recipes/recipes/images/thumb/sphx_glr_Captum_Recipe_thumb.png - :alt: Model Interpretability using Captum + .. image:: /recipes/recipes/images/thumb/sphx_glr_module_load_state_dict_tips_thumb.png + :alt: 从检查点加载 ``nn.Module`` 的技巧 - :ref:`sphx_glr_recipes_recipes_Captum_Recipe.py` + :ref:`sphx_glr_recipes_recipes_module_load_state_dict_tips.py` .. raw:: html -
Model Interpretability using Captum
+
从检查点加载 ``nn.Module`` 的技巧
.. raw:: html -
+
.. only:: html .. image:: /recipes/recipes/images/thumb/sphx_glr_swap_tensors_thumb.png - :alt: Extension points in ``nn.Module`` for ``load_state_dict`` and tensor subclasses + :alt: 在 ``nn.Module`` 中为 ``load_state_dict`` 和张量子类提供扩展点 :ref:`sphx_glr_recipes_recipes_swap_tensors.py` .. raw:: html -
Extension points in ``nn.Module`` for ``load_state_dict`` and tensor subclasses
+
在 ``nn.Module`` 中为 ``load_state_dict`` 和张量子类提供扩展点
@@ -414,7 +414,7 @@ PyTorch Recipes .. raw:: html -
+
.. only:: html @@ -442,18 +442,18 @@ PyTorch Recipes /recipes/recipes/saving_and_loading_models_for_inference /recipes/recipes/what_is_state_dict /recipes/recipes/loading_data_recipe + /recipes/recipes/reasoning_about_shapes /recipes/recipes/warmstarting_model_using_parameters_from_a_different_model /recipes/recipes/saving_and_loading_a_general_checkpoint - /recipes/recipes/reasoning_about_shapes /recipes/recipes/save_load_across_devices /recipes/recipes/defining_a_neural_network /recipes/recipes/saving_multiple_models_in_one_file /recipes/recipes/tuning_guide - /recipes/recipes/module_load_state_dict_tips /recipes/recipes/timer_quick_start - /recipes/recipes/zeroing_out_gradients - /recipes/recipes/dynamic_quantization /recipes/recipes/Captum_Recipe + /recipes/recipes/dynamic_quantization + /recipes/recipes/zeroing_out_gradients + /recipes/recipes/module_load_state_dict_tips /recipes/recipes/swap_tensors /recipes/recipes/profiler_recipe /recipes/recipes/amp_recipe diff --git a/docs/_sources/recipes/recipes/module_load_state_dict_tips.rst.txt b/docs/_sources/recipes/recipes/module_load_state_dict_tips.rst.txt index ff5df56..4a0970f 100644 --- a/docs/_sources/recipes/recipes/module_load_state_dict_tips.rst.txt +++ b/docs/_sources/recipes/recipes/module_load_state_dict_tips.rst.txt @@ -18,31 +18,38 @@ .. _sphx_glr_recipes_recipes_module_load_state_dict_tips.py: -Tips for Loading an ``nn.Module`` from a Checkpoint +从检查点加载 ``nn.Module`` 的技巧 =================================================== -**Author:** `Mikayla Gawarecki `_ +**作者:** `Mikayla Gawarecki `_ -If you're loading a checkpoint and want to reduce compute and memory as much as possible, -this tutorial shares some recommended practices. In particular, we will discuss +如果你要加载一个检查点并希望尽可能减少计算和内存的使用,本教程将分享一些推荐的做法。特别是我们将讨论以下几点: -1. The ``mmap`` keyword argument on ``torch.load`` -2. The ``torch.device()`` context manager -3. The ``assign`` keyword argument on ``nn.Module.load_state_dict()`` +1. ``torch.load`` 中的 ``mmap`` 关键字参数 +2. ``torch.device()`` 上下文管理器 +3. ``nn.Module.load_state_dict()`` 中的 ``assign`` 关键字参数 .. note:: - This recipe requires PyTorch 2.1.0 or later. + 本教程需要 PyTorch 2.1.0 或更高版本。 -.. GENERATED FROM PYTHON SOURCE LINES 20-21 +.. GENERATED FROM PYTHON SOURCE LINES 15-18 + +.. code-block:: default + + + import time -Let us consider a simple ``nn.Module`` that contains a list of Linear layers: -.. GENERATED FROM PYTHON SOURCE LINES 21-37 +.. GENERATED FROM PYTHON SOURCE LINES 19-20 + +让我们考虑一个简单的 ``nn.Module``,它包含一个线性层列表: + +.. GENERATED FROM PYTHON SOURCE LINES 20-36 .. code-block:: default import torch from torch import nn - import time + class SomeModule(torch.nn.Module): def __init__(self, size): @@ -54,164 +61,152 @@ Let us consider a simple ``nn.Module`` that contains a list of Linear layers: m = SomeModule(1000) - torch.save(m.state_dict(), 'checkpoint.pth') + torch.save(m.state_dict(), "checkpoint.pth") -.. GENERATED FROM PYTHON SOURCE LINES 38-41 +.. GENERATED FROM PYTHON SOURCE LINES 37-38 -The following snippet demonstrates the use of the the ``mmap`` keyword argument -to ``torch.load``, the ``torch.device()`` context manager and the ``assign`` -keyword argument to ``nn.Module.load_state_dict()``. +以下代码片段演示了如何使用 ``torch.load`` 中的 ``mmap`` 关键字参数、``torch.device()`` 上下文管理器和 ``nn.Module.load_state_dict()`` 中的 ``assign`` 关键字参数。 -.. GENERATED FROM PYTHON SOURCE LINES 41-47 +.. GENERATED FROM PYTHON SOURCE LINES 38-44 .. 
code-block:: default - state_dict = torch.load('checkpoint.pth', mmap=True) - with torch.device('meta'): - meta_m = SomeModule(1000) + state_dict = torch.load("checkpoint.pth", mmap=True) + with torch.device("meta"): + meta_m = SomeModule(1000) meta_m.load_state_dict(state_dict, assign=True) -.. GENERATED FROM PYTHON SOURCE LINES 48-49 +.. GENERATED FROM PYTHON SOURCE LINES 45-46 -Compare the snippet below to the one above: +将下面的代码片段与上面的进行比较: -.. GENERATED FROM PYTHON SOURCE LINES 49-54 +.. GENERATED FROM PYTHON SOURCE LINES 46-51 .. code-block:: default - state_dict = torch.load('checkpoint.pth') + state_dict = torch.load("checkpoint.pth") m = SomeModule(1000) m.load_state_dict(state_dict) -.. GENERATED FROM PYTHON SOURCE LINES 55-58 +.. GENERATED FROM PYTHON SOURCE LINES 52-53 -The second example does not use any of the features listed above and will be -less compute and memory efficient for loading a checkpoint. In the following -sections, we will discuss each of the features in further detail. +第二个示例没有使用上面列出的任何特性,因此在加载检查点时计算和内存效率会较低。在下面的部分中,我们将详细讨论每个特性。 -.. GENERATED FROM PYTHON SOURCE LINES 60-72 +.. GENERATED FROM PYTHON SOURCE LINES 55-61 -Using ``torch.load(mmap=True)`` +使用 ``torch.load(mmap=True)`` ------------------------------- -First, let us consider what happens when we load the checkpoint with ``torch.load``. -When we save a checkpoint with ``torch.save``, tensor storages are tagged with the device they are -saved on. With ``torch.load``, tensor storages will be loaded to the device -they were tagged with (unless this behavior is overridden using the -``map_location`` flag). For ease of explanation, let us assume that the tensors -were saved on CPU. This means that on the first line all tensor storages will be -loaded into CPU RAM, which can be undesirable when: +首先,让我们考虑使用 ``torch.load`` 加载检查点时会发生什么。 +当我们使用 ``torch.save`` 保存检查点时,张量存储会被标记为保存时所在的设备。 +使用 ``torch.load`` 时,张量存储将被加载到它们被标记的设备上(除非使用 ``map_location`` 标志覆盖此行为)。 +为了解释方便,我们假设张量是保存在 CPU 上的。这意味着在第一行中,所有张量存储将被加载到 CPU 内存中,在以下情况下这是不可取的: -* CPU RAM is smaller than the size of the checkpoint. -* Waiting for the entire checkpoint to be loaded into RAM before performing, for example, some per-tensor processing. - -.. GENERATED FROM PYTHON SOURCE LINES 72-78 +.. GENERATED FROM PYTHON SOURCE LINES 61-70 .. code-block:: default + # * CPU 内存小于检查点的大小。 + # * 在执行一些每张量处理之前等待整个检查点被加载到内存中。 + start_time = time.time() - state_dict = torch.load('checkpoint.pth') + state_dict = torch.load("checkpoint.pth") end_time = time.time() - print(f"loading time without mmap={end_time - start_time}") + print(f"不使用 mmap 的加载时间={end_time - start_time}") -.. GENERATED FROM PYTHON SOURCE LINES 79-85 +.. GENERATED FROM PYTHON SOURCE LINES 71-75 -The ``mmap`` keyword argument to ``torch.load`` attempts to solve the above two -problems. As its name implies, the ``mmap`` keyword argument to ``torch.load`` -makes use of an `mmap call `_ -which maps a file on disk into virtual memory and lets the OS handle loading and -unloading into physical memory automatically. When this flag is passed, tensor -storages will be memory-mapped. +``torch.load`` 中的 ``mmap`` 关键字参数试图解决上述两个问题。 +顾名思义,``torch.load`` 中的 ``mmap`` 关键字参数使用了 `mmap 调用 `_, +它将磁盘上的文件映射到虚拟内存中,并让操作系统自动处理加载和卸载到物理内存。 +当传递此标志时,张量存储将被内存映射。 -.. GENERATED FROM PYTHON SOURCE LINES 85-91 +.. GENERATED FROM PYTHON SOURCE LINES 75-82 .. 
code-block:: default start_time = time.time() - state_dict = torch.load('checkpoint.pth', mmap=True) + state_dict = torch.load("checkpoint.pth", mmap=True) end_time = time.time() - print(f"loading time with mmap={end_time - start_time}") + print(f"使用 mmap 的加载时间={end_time - start_time}") + -.. GENERATED FROM PYTHON SOURCE LINES 92-94 +.. GENERATED FROM PYTHON SOURCE LINES 83-84 -As mentioned above, one can use this argument to do per-tensor processing on a -checkpoint without loading all tensor storages into CPU memory upfront. For example: +如上所述,可以使用此参数在不将所有张量存储加载到 CPU 内存中的情况下对检查点执行每张量处理。例如: -.. GENERATED FROM PYTHON SOURCE LINES 94-108 +.. GENERATED FROM PYTHON SOURCE LINES 84-100 .. code-block:: default def my_special_routine(t, device): - # this could be a much fancier operation + # 这可能是一个更复杂的操作 return t.to(dtype=torch.bfloat16, device=device) + def my_processing_function(key, device): t = state_dict[key] processed_t = my_special_routine(t, device) del t state_dict[key] = processed_t + for key in state_dict.keys(): - device = torch.device('cuda') + device = torch.device("cuda") my_processing_function(key, device) -.. GENERATED FROM PYTHON SOURCE LINES 109-112 +.. GENERATED FROM PYTHON SOURCE LINES 101-104 -Using ``torch.device('meta')`` +使用 ``torch.device('meta')`` ------------------------------ -Next, let's consider the creation of the module. +接下来,让我们考虑模块的创建。 -.. GENERATED FROM PYTHON SOURCE LINES 112-114 +.. GENERATED FROM PYTHON SOURCE LINES 104-106 .. code-block:: default m = SomeModule(1000) -.. GENERATED FROM PYTHON SOURCE LINES 115-132 +.. GENERATED FROM PYTHON SOURCE LINES 107-109 -This allocates memory for all parameters/buffers and initializes them per -the default initialization schemes defined in ``SomeModule.__init__()``, which -is wasteful when we want to load a checkpoint for the following reasons: +这将为所有参数/缓冲区分配内存并根据 ``SomeModule.__init__()`` 中定义的默认初始化方案对其进行初始化, +当我们想要加载检查点时,这是浪费的,原因如下: -* The result of the initialization kernels will be overwritten by ``load_state_dict()`` without ever being used, so - initialization is wasteful. -* We are allocating memory for these parameters/buffers in RAM while ``torch.load`` of the saved state dictionary also - allocates memory in RAM for the parameters/buffers in the checkpoint. +.. GENERATED FROM PYTHON SOURCE LINES 109-122 -In order to solve these two problems, we can use the ``torch.device()`` -context manager with ``device='meta'`` when we instantiate the ``nn.Module()``. +.. code-block:: default -The `torch.device() `_ -context manager makes sure that factory calls will be performed as if they -were passed the specified ``device`` as an argument. Tensors on ``torch.device('meta')`` do not -carry data. However, they possess all other metadata a tensor carries such as ``.size()``, ``.stride()``, -``.requires_grad``, and others. -.. GENERATED FROM PYTHON SOURCE LINES 132-135 + # * 初始化内核的结果将被 ``load_state_dict()`` 覆盖而从未被使用,因此初始化是浪费的。 + # * 我们在 RAM 中为这些参数/缓冲区分配了内存,而 ``torch.load`` 保存的状态字典也在 RAM 中为检查点中的参数/缓冲区分配了内存。 -.. code-block:: default + # 为了解决这两个问题,我们可以在实例化 ``nn.Module()`` 时使用 ``device='meta'`` 的 ``torch.device()`` 上下文管理器。 - with torch.device('meta'): - new_m = SomeModule(1000) + # `torch.device() `_ + # 上下文管理器确保工厂调用将被视为传递了指定的 ``device`` 作为参数。 + # 在 ``torch.device('meta')`` 上的张量不携带数据。 + # 但是,它们具有张量所携带的其他元数据,如 ``.size()``, ``.stride()``, ``.requires_grad`` 等。 + with torch.device("meta"): + new_m = SomeModule(1000) -.. GENERATED FROM PYTHON SOURCE LINES 136-139 +.. 
GENERATED FROM PYTHON SOURCE LINES 123-126 -Using ``load_state_dict(assign=True)`` +使用 ``load_state_dict(assign=True)`` -------------------------------------- -Next, we consider the loading of the state dictionary. +接下来,我们考虑加载状态字典。 -.. GENERATED FROM PYTHON SOURCE LINES 139-142 +.. GENERATED FROM PYTHON SOURCE LINES 126-129 .. code-block:: default @@ -219,45 +214,38 @@ Next, we consider the loading of the state dictionary. m.load_state_dict(state_dict) -.. GENERATED FROM PYTHON SOURCE LINES 143-155 +.. GENERATED FROM PYTHON SOURCE LINES 130-132 -``nn.Module.load_state_dict()`` is usually implemented via an in-place -``param_in_model.copy_(param_in_state_dict)``. This means that the parameter/buffer -with the corresponding key in the state dictionary is copied into the -parameter/buffer in the ``nn.Module``. +``nn.Module.load_state_dict()`` 通常是通过 ``param_in_model.copy_(param_in_state_dict)`` 的就地复制实现的。 +这意味着状态字典中对应键的参数/缓冲区将被复制到 ``nn.Module`` 中的参数/缓冲区。 -However, an in-place copy into a tensor on the ``meta`` device is a no-op. -In order to avoid this, we can pass the ``assign=True`` keyword argument to -``load_state_dict()``. +.. GENERATED FROM PYTHON SOURCE LINES 132-147 -A caveat here is that since optimizers hold a reference to -``nn.Module.parameters()``, the optimizer must be initialized after the module -is loaded from state dict if ``assign=True`` is passed. +.. code-block:: default -.. GENERATED FROM PYTHON SOURCE LINES 155-165 -.. code-block:: default + # 然而,对 ``meta`` 设备上的张量进行就地复制是无操作的。 + # 为了避免这种情况,我们可以在 ``load_state_dict()`` 中传递 ``assign=True`` 关键字参数。 + # 这里的一个警告是,由于优化器持有对 ``nn.Module.parameters()`` 的引用, + # 如果传递了 ``assign=True``,则必须在从状态字典加载模块后初始化优化器。 - # As of PyTorch 2.3.0, one can use ``torch.__future__.set_swap_module_params_on_conversion`` to - # avoid this caveat. This `recipe `_ - # provides more details. + # 从 PyTorch 2.3.0 开始,可以使用 ``torch.__future__.set_swap_module_params_on_conversion`` 来避免这个警告。 + # 这个 `教程 `_ 提供了更多细节。 new_m.load_state_dict(state_dict, assign=True) - # Before 2.3.0, this MUST be done AFTER the load_state_dict with assign. - # In versions >= 2.3.0, one can consider setting ``torch.__future__.set_swap_module_params_on_conversion`` + # 在 2.3.0 之前,这一步必须在 load_state_dict 使用 assign 之后完成。 + # 在版本 >= 2.3.0 中,可以考虑设置 ``torch.__future__.set_swap_module_params_on_conversion`` opt = torch.optim.SGD(new_m.parameters(), lr=1e-3) -.. GENERATED FROM PYTHON SOURCE LINES 166-173 +.. GENERATED FROM PYTHON SOURCE LINES 148-153 -Conclusion +结论 ------------- -To recap, in this tutorial we learned about ``torch.load(mmap=True)``, the -``torch.device()`` context manager with ``device=meta``, and -``nn.Module.load_state_dict(assign=True)`` as well as how these tools could -be used to aid when loading a model from a checkpoint. +总结一下,在本教程中,我们学习了 ``torch.load(mmap=True)``、``device='meta'`` 的 ``torch.device()`` 上下文管理器和 ``nn.Module.load_state_dict(assign=True)`` +以及如何在从检查点加载模型时使用这些工具来提高效率。 .. rst-class:: sphx-glr-timing diff --git a/docs/_sources/recipes/recipes/reasoning_about_shapes.rst.txt b/docs/_sources/recipes/recipes/reasoning_about_shapes.rst.txt index e5b67d8..768c242 100644 --- a/docs/_sources/recipes/recipes/reasoning_about_shapes.rst.txt +++ b/docs/_sources/recipes/recipes/reasoning_about_shapes.rst.txt @@ -18,29 +18,25 @@ .. 
_sphx_glr_recipes_recipes_reasoning_about_shapes.py: -Reasoning about Shapes in PyTorch +在PyTorch中推理形状 ================================= -When writing models with PyTorch, it is commonly the case that the parameters -to a given layer depend on the shape of the output of the previous layer. For -example, the ``in_features`` of an ``nn.Linear`` layer must match the -``size(-1)`` of the input. For some layers, the shape computation involves -complex equations, for example convolution operations. +在使用PyTorch编写模型时,通常会遇到某一层的参数取决于前一层输出的形状的情况。例如, +``nn.Linear``层的``in_features``必须与输入的``size(-1)``相匹配。对于某些层,形状计算涉及复杂的等式,例如卷积运算。 -One way around this is to run the forward pass with random inputs, but this is -wasteful in terms of memory and compute. +一种解决方法是使用随机输入进行前向传播,但这在内存和计算方面是浪费的。 -Instead, we can make use of the ``meta`` device to determine the output shapes -of a layer without materializing any data. +相反,我们可以使用``meta``设备来确定层的输出形状,而无需实际化任何数据。 -.. GENERATED FROM PYTHON SOURCE LINES 17-31 +.. GENERATED FROM PYTHON SOURCE LINES 12-27 .. code-block:: default - import torch import timeit + import torch + t = torch.rand(2, 3, 10, 10, device="meta") conv = torch.nn.Conv2d(3, 5, 2, device="meta") start = timeit.default_timer() @@ -48,16 +44,15 @@ of a layer without materializing any data. end = timeit.default_timer() print(out) - print(f"Time taken: {end-start}") + print(f"所需时间: {end-start}") -.. GENERATED FROM PYTHON SOURCE LINES 32-34 +.. GENERATED FROM PYTHON SOURCE LINES 28-29 -Observe that since data is not materialized, passing arbitrarily large -inputs will not significantly alter the time taken for shape computation. +观察到,由于没有实际化数据,即使传入任意大的输入,用于形状计算的时间也不会显著改变。 -.. GENERATED FROM PYTHON SOURCE LINES 34-44 +.. GENERATED FROM PYTHON SOURCE LINES 29-39 .. code-block:: default @@ -68,15 +63,15 @@ inputs will not significantly alter the time taken for shape computation. end = timeit.default_timer() print(out) - print(f"Time taken: {end-start}") + print(f"所需时间: {end-start}") -.. GENERATED FROM PYTHON SOURCE LINES 45-46 +.. GENERATED FROM PYTHON SOURCE LINES 40-41 -Consider an arbitrary network such as the following: +考虑以下任意网络: -.. GENERATED FROM PYTHON SOURCE LINES 46-71 +.. GENERATED FROM PYTHON SOURCE LINES 41-66 .. code-block:: default @@ -98,7 +93,7 @@ Consider an arbitrary network such as the following: def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) - x = torch.flatten(x, 1) # flatten all dimensions except batch + x = torch.flatten(x, 1) # 展平除批次维度外的所有维度 x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) @@ -106,22 +101,21 @@ Consider an arbitrary network such as the following: -.. GENERATED FROM PYTHON SOURCE LINES 72-74 +.. GENERATED FROM PYTHON SOURCE LINES 67-68 -We can view the intermediate shapes within an entire network by registering a -forward hook to each layer that prints the shape of the output. +我们可以通过为每一层注册一个前向钩子来打印输出的形状,从而查看整个网络中间层的形状。 -.. GENERATED FROM PYTHON SOURCE LINES 74-89 +.. GENERATED FROM PYTHON SOURCE LINES 68-83 .. code-block:: default + def fw_hook(module, input, output): - print(f"Shape of output to {module} is {output.shape}.") + print(f"{module}的输出形状为{output.shape}。") - # Any tensor created within this torch.device context manager will be - # on the meta device. 
+ # 在此torch.device上下文管理器中创建的任何张量都将在meta设备上。 with torch.device("meta"): net = Net() inp = torch.randn((1024, 3, 32, 32)) diff --git a/docs/_sources/recipes/recipes/swap_tensors.rst.txt b/docs/_sources/recipes/recipes/swap_tensors.rst.txt index 64f007f..2d1f324 100644 --- a/docs/_sources/recipes/recipes/swap_tensors.rst.txt +++ b/docs/_sources/recipes/recipes/swap_tensors.rst.txt @@ -18,102 +18,98 @@ .. _sphx_glr_recipes_recipes_swap_tensors.py: -Extension points in ``nn.Module`` for ``load_state_dict`` and tensor subclasses +在 ``nn.Module`` 中为 ``load_state_dict`` 和张量子类提供扩展点 =============================================================================== -**Author:** `Mikayla Gawarecki `_ +**作者:** `Mikayla Gawarecki `_ -This recipe introduces a new utility function ``torch.utils.swap_tensors`` -as well as two new extension points where it has been integrated in -``nn.Module``: +本教程介绍了一个新的实用函数 ``torch.utils.swap_tensors``, +以及在 ``nn.Module`` 中集成它的两个新扩展点: -* ``nn.Module.to()`` and related methods +* ``nn.Module.to()`` 和相关方法 * ``nn.Module.load_state_dict()`` .. note:: - This recipe requires PyTorch 2.3.0 or later. + 本教程需要 PyTorch 2.3.0 或更高版本。 -.. GENERATED FROM PYTHON SOURCE LINES 18-22 +.. GENERATED FROM PYTHON SOURCE LINES 17-21 ``torch.utils.swap_tensors`` ---------------------------- -``torch.utils.swap_tensors`` (hereafter referred to as ``swap_tensors``) is a -utility function that takes in two Python tensors and swaps them. +``torch.utils.swap_tensors``(以下简称为 ``swap_tensors``) 是一个 +实用函数,它接受两个 Python 张量并交换它们。 -.. GENERATED FROM PYTHON SOURCE LINES 22-31 +.. GENERATED FROM PYTHON SOURCE LINES 21-31 .. code-block:: default import torch import torch.nn as nn + t1 = torch.arange(2) t2 = torch.arange(3) - print(f"Before swapping, t1: {t1}, t2: {t2}") + print(f"交换前, t1: {t1}, t2: {t2}") torch.utils.swap_tensors(t1, t2) - print(f"After swapping, t1: {t1}, t2: {t2}") + print(f"交换后, t1: {t1}, t2: {t2}") -.. GENERATED FROM PYTHON SOURCE LINES 32-45 +.. GENERATED FROM PYTHON SOURCE LINES 32-43 -More specifically, ``swap_tensors`` swaps the Python ``__class__``, ``__dict__`` -and ``__slots__`` of the two tensors, as well as their associated ``at::Tensor``. +更具体地说,``swap_tensors`` 交换了两个张量的 Python ``__class__``、``__dict__`` +和 ``__slots__``,以及它们相关的 ``at::Tensor``。 -Application to ``nn.Module`` +应用于 ``nn.Module`` ---------------------------- -This utility is pertinent to ``nn.Module`` when a Python object outside -of the module holds a reference to parameters of the module. If an ``nn.Module`` -modifies any of its parameters out of place, the object holding references to -the parameters will not see the change. A classic example of this is the -optimizer, which holds a reference to the parameters of the ``nn.Module``. -This leads to a silent correctness issue where the ``optimizer.step()`` will -run without error but the weights of the ``nn.Module`` will not be updated. +当 ``nn.Module`` 之外的 Python 对象持有该模块参数的引用时,此实用函数就很有用。 +如果 ``nn.Module`` 就地修改了任何参数,持有这些参数引用的对象将无法看到更改。 +一个典型的例子是优化器,它持有 ``nn.Module`` 参数的引用。 +这会导致一个潜在的正确性问题,即 ``optimizer.step()`` 会无错误运行, +但 ``nn.Module`` 的权重不会被更新。 -.. GENERATED FROM PYTHON SOURCE LINES 45-54 +.. GENERATED FROM PYTHON SOURCE LINES 43-52 .. 
code-block:: default mod = torch.nn.Linear(1, 2, bias=False) optimizer = torch.optim.SGD(mod.parameters()) - print(f"weight in mod: {mod.weight}") - print(f"weight in optimizer: {optimizer.param_groups[0]['params']}") + print(f"mod 中的权重: {mod.weight}") + print(f"优化器中的权重: {optimizer.param_groups[0]['params']}") mod.weight = torch.nn.Parameter(2 * mod.weight) - print(f"weight in mod: {mod.weight}") - print(f"weight in optimizer: {optimizer.param_groups[0]['params']}") + print(f"mod 中的权重: {mod.weight}") + print(f"优化器中的权重: {optimizer.param_groups[0]['params']}") -.. GENERATED FROM PYTHON SOURCE LINES 55-77 +.. GENERATED FROM PYTHON SOURCE LINES 53-71 -``nn.Module.to()`` and related methods +``nn.Module.to()`` 和相关方法 -------------------------------------- -This includes methods that change the device of the module (such as ``nn.Module.cpu()``), -methods that change the ``dtype`` of the module (such as ``nn.Module.float()``) -as well as methods that allow the module to be materialized -(such as ``nn.Module.to_empty()``). +这包括改变模块设备的方法(如 ``nn.Module.cpu()``)、 +改变模块 ``dtype`` 的方法(如 ``nn.Module.float()``)、 +以及允许模块实例化的方法(如 ``nn.Module.to_empty()``)。 -At first glance, it might be non-intuitive that these methods are able to -modify the parameters of the module in-place. The existing approach has been -to use a nasty hack dating back from the first days of PyTorch. +乍一看,这些方法能够就地修改模块的参数可能看起来不太直观。 +现有的方法是使用一种追溯到 PyTorch 最初几天的丑陋黑客手段。 -Notably, the existing approach does not work in these cases: +值得注意的是,现有方法在以下情况下无法工作: -* when using ``__torch_dispatch__`` subclasses -* when ``param`` and ``new_param`` do not have the same Python ``type()`` -* For tensors with special C++ representations (such as sparse tensors and ``XLA`` tensors) +* 使用 ``__torch_dispatch__`` 子类 +* ``param`` 和 ``new_param`` 的 Python ``type()`` 不同 +* 对于具有特殊 C++ 表示的张量(如稀疏张量和 ``XLA`` 张量) -In the following part of this recipe, we will define a toy ``__torch_dispatch__`` -subclass ``MyQuantizedLinearWeight`` that represents quantized linear weights. -This subclass will be used for illustration purposes throughout the rest of -the tutorial. For brevity, we omit most of the ``__torch_dispatch__`` -implementation. +在本教程的下一部分,我们将定义一个玩具 ``__torch_dispatch__`` 子类 ``MyQuantizedLinearWeight`` +来表示量化的线性权重。在本教程的剩余部分,我们将使用这个子类进行说明。 +为简洁起见,我们省略了大部分 ``__torch_dispatch__`` 实现。 -.. GENERATED FROM PYTHON SOURCE LINES 77-110 +.. GENERATED FROM PYTHON SOURCE LINES 71-108 .. code-block:: default + aten = torch.ops.aten + class MyQuantizedLinearWeight(torch.Tensor): @staticmethod def __new__(cls, elem, scale): @@ -124,7 +120,8 @@ implementation. layout=elem.layout, device=elem.device, strides=elem.stride(), - storage_offset=elem.storage_offset()) + storage_offset=elem.storage_offset(), + ) def __init__(self, elem: torch.Tensor, scale: float): self.elem = elem @@ -138,46 +135,43 @@ implementation. if func in (aten.detach.default, aten._to_copy.default): new_elem = func(args[0].elem, *args[1:], **kwargs) return cls(new_elem, args[0].scale) - # Implementations for certain ops would be added to ``OP_TABLE``. - # We omit this for brevity. + # 某些操作的实现将添加到 ``OP_TABLE``。 + # 为简洁起见,我们在此省略。 OP_TABLE = dict() if func in OP_TABLE: - return OP_TABLE[func](func, args, kwargs) - raise NotImplementedError(f"Unsupported function {func}") + return OP_TABLE[func](func, args, kwargs) + raise NotImplementedError(f"不支持的函数 {func}") + -.. GENERATED FROM PYTHON SOURCE LINES 111-115 +.. 
GENERATED FROM PYTHON SOURCE LINES 109-112 -Let us create an ``nn.Linear`` layer of ``dtype`` ``torch.float32`` where the weight is -a ``MyQuantizedLinearWeight`` and try to convert it to ``torch.bfloat16``. -Observe that the weight's ``dtype`` changes as expected. However, the ``dtype`` -of the subclass' payload (``elem``) does not change. +让我们创建一个 ``dtype`` 为 ``torch.float32`` 的 ``nn.Linear`` 层, +其权重是 ``MyQuantizedLinearWeight``。然后尝试将其转换为 ``torch.bfloat16``。 +观察到权重的 ``dtype`` 如预期般改变了。但是子类的有效载荷(``elem``)的 ``dtype`` 没有改变。 -.. GENERATED FROM PYTHON SOURCE LINES 115-125 +.. GENERATED FROM PYTHON SOURCE LINES 112-122 .. code-block:: default m = nn.Linear(3, 5, dtype=torch.float32) m.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5)) - print(f"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") + print(f"之前: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") m.bfloat16() - print(f"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") + print(f"之后: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") print(f"m.weight.dtype: {m.weight.dtype}") print(f"m.weight.elem.dtype: {m.weight.elem.dtype}") print(f"m.bias.dtype: {m.bias.dtype}") -.. GENERATED FROM PYTHON SOURCE LINES 126-132 +.. GENERATED FROM PYTHON SOURCE LINES 123-126 -To this end, we introduce a global config -``torch.__future__.set_swap_module_params_on_conversion`` that will use -``swap_tensors`` to swap the parameters of the module while preserving -references in place of ``.data`` setting. When this config is set, -``swap_tensors`` will be used during the conversion, which ensures that -the ``dtype`` of the payload is properly converted. +为此,我们引入了一个全局配置 ``torch.__future__.set_swap_module_params_on_conversion`` +它将使用 ``swap_tensors`` 交换模块的参数,同时保留 ``.data`` 设置中的引用。 +设置此配置后,在转换期间将使用 ``swap_tensors``,从而确保有效载荷的 ``dtype`` 正确转换。 -.. GENERATED FROM PYTHON SOURCE LINES 132-144 +.. GENERATED FROM PYTHON SOURCE LINES 126-138 .. code-block:: default @@ -185,61 +179,52 @@ the ``dtype`` of the payload is properly converted. torch.__future__.set_swap_module_params_on_conversion(True) m = nn.Linear(3, 5, dtype=torch.float32) m.weight = torch.nn.Parameter(MyQuantizedLinearWeight(m.weight, 0.5)) - print(f"Before: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") + print(f"之前: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") m.bfloat16() - print(f"After: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") + print(f"之后: id(m.weight)={id(m.weight)}, id(m.bias)={id(m.bias)}") print(f"m.weight.dtype: {m.weight.dtype}") print(f"m.weight.elem.dtype: {m.weight.elem.dtype}") print(f"m.bias.dtype: {m.bias.dtype}") torch.__future__.set_swap_module_params_on_conversion(False) -.. GENERATED FROM PYTHON SOURCE LINES 145-183 +.. GENERATED FROM PYTHON SOURCE LINES 139-167 ``nn.Module.load_state_dict()`` -------------------------------- -Depending on the value of the ``assign`` keyword argument passed -to ``load_state_dict()``, there are two ways to load the ``state_dict``: +根据传递给 ``load_state_dict()`` 的 ``assign`` 关键字参数的值, +有两种方式加载 ``state_dict``: -* ``assign=False``: preserves the properties of ``module.param`` and only takes the values - from ``state_dict['param_name']`` -* ``assign=True``: preserves the properties and values of ``state_dict['param_name']``. +* ``assign=False``: 保留 ``module.param`` 的属性,只从 ``state_dict['param_name']`` 中获取值 +* ``assign=True``: 保留 ``state_dict['param_name']`` 的属性和值。 -Previously, these were implemented with in-place ``copy_`` and ``__setattr__`` respectively. 
-With the existing implementation, each approach had its own limitations -- ``assign=False`` -imposes the constraint that the type of the parameter in the ``state_dict`` must -be the same as the type of the parameter in the module while ``assign=True`` imposes -the constraint that anything that holds references to the module's parameters must -be initialized after ``nn.Module.load_state_dict()``. +之前,这些分别是通过就地 ``copy_`` 和 ``__setattr__`` 实现的。 +在现有实现中,每种方法都有自己的限制 - ``assign=False`` 要求 ``state_dict`` 中的参数类型 +必须与模块中的参数类型相同,而 ``assign=True`` 要求在 ``nn.Module.load_state_dict()`` 之后 +初始化任何持有模块参数引用的对象。 -Now, we address both constraints by adding a ``swap_tensors`` path to ``load_state_dict()`` -and introducing a new extension point ``torch.Tensor.module_load(self, other, assign=False)``. -When the ``swap_tensors`` path is enabled via the ``__future__`` mentioned above, -we can use a ``__torch_function__`` handler for ``module_load`` to apply a -custom transformation to the value in the ``state_dict``. The result of this -transformation will be swapped with the parameter in the module. +现在,我们通过在 ``load_state_dict()`` 中添加 ``swap_tensors`` 路径并引入新的扩展点 +``torch.Tensor.module_load(self, other, assign=False)`` 来解决这两个限制。 +当启用上述 ``__future__`` 时,我们可以使用 ``module_load`` 的 ``__torch_function__`` 处理程序 +对 ``state_dict`` 中的值应用自定义转换。转换的结果将与模块中的参数交换。 -In the following example, we will use the ``MyQuantizedLinearWeight`` subclass -defined above to illustrate how we can use these features to apply a -custom quantization scheme to the weights of a linear layer when -loading the ``state_dict``. +在下面的示例中,我们将使用上面定义的 ``MyQuantizedLinearWeight`` 子类 +来说明如何使用这些功能在加载 ``state_dict`` 时对线性层的权重应用自定义量化方案。 -Recall that the ``__torch_function__`` handler for ``module_load`` will be -invoked if either ``self`` or ``other`` (in this case ``param`` or -``state_dict[param_key]``) are ``MyQuantizedLinearWeight`` subclasses. +回顾一下,如果 ``self`` 或 ``other``(在本例中是 ``param`` 或 ``state_dict[param_key]``) +是 ``MyQuantizedLinearWeight`` 子类,则会调用 ``module_load`` 的 ``__torch_function__`` 处理程序。 -Assume that we expect the ``state_dict`` to contain plain tensors and the -module to contain ``MyQuantizedLinearWeight`` parameters where we want the -tensors in the ``state_dict`` to be transformed into the subclass. Then we -can define a ``__torch_function__`` handler for ``torch.Tensor.module_load`` -as such: +假设我们期望 ``state_dict`` 包含普通张量,而模块包含 ``MyQuantizedLinearWeight`` 参数, +我们希望将 ``state_dict`` 中的张量转换为子类。那么我们可以为 ``torch.Tensor.module_load`` 定义 +一个 ``__torch_function__`` 处理程序,如下所示: -.. GENERATED FROM PYTHON SOURCE LINES 183-198 +.. GENERATED FROM PYTHON SOURCE LINES 167-184 .. code-block:: default + @classmethod def custom_torch_function(cls, func, types, args=(), kwargs=None): kwargs = {} if kwargs is None else kwargs @@ -250,70 +235,67 @@ as such: return MyQuantizedLinearWeight(src, dest.scale) else: with torch._C.DisableTorchFunctionSubclass(): - return func(*args, **kwargs) + return func(*args, **kwargs) + MyQuantizedLinearWeight.__torch_function__ = custom_torch_function -.. GENERATED FROM PYTHON SOURCE LINES 199-202 +.. GENERATED FROM PYTHON SOURCE LINES 185-187 -First, let us create a skeleton of a model on the meta device to avoid -materializing storages. We convert all weights in the modules to -``MyQuantizedLinearWeight`` subclasses while leaving biases intact. +首先,让我们在 meta 设备上创建一个模型框架,以避免实例化存储。 +我们将模块中的所有权重转换为 ``MyQuantizedLinearWeight`` 子类,同时保留偏置不变。 -.. GENERATED FROM PYTHON SOURCE LINES 202-214 +.. 
GENERATED FROM PYTHON SOURCE LINES 187-201 .. code-block:: default + def fn(m): if isinstance(m, nn.Linear): requires_grad = m.weight.requires_grad m.weight = torch.nn.Parameter( - MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad - ) + MyQuantizedLinearWeight(m.weight, 0.5), requires_grad=requires_grad + ) + with torch.device("meta"): m = nn.Linear(3, 5) m.apply(fn) -.. GENERATED FROM PYTHON SOURCE LINES 215-218 +.. GENERATED FROM PYTHON SOURCE LINES 202-204 -We can then load the ``state_dict``. Observe that we use ``assign=True`` because -for biases, we want to preserve the properties of the tensor in the ``state_dict`` -(for example, we do not want the bias to be on the ``meta`` device after loading). +然后我们可以加载 ``state_dict``。注意我们使用 ``assign=True``,因为对于偏置, +我们希望保留 ``state_dict`` 中张量的属性(例如,我们不希望偏置在加载后位于 ``meta`` 设备上)。 -.. GENERATED FROM PYTHON SOURCE LINES 218-228 +.. GENERATED FROM PYTHON SOURCE LINES 204-214 .. code-block:: default torch.__future__.set_swap_module_params_on_conversion(True) - print(f"Before: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") - print(f"m.state_dict() before load_state_dict():\n {m.state_dict()}") + print(f"之前: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") + print(f"load_state_dict() 之前的 m.state_dict():\n {m.state_dict()}") state_dict = nn.Linear(3, 5).state_dict() print(f"state_dict:\n {state_dict}") m.load_state_dict(state_dict, assign=True) - print(f"After: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") - print(f"m.state_dict() after load_state_dict():\n {m.state_dict()}") + print(f"之后: id(weight)={id(m.weight)}, id(bias)={id(m.bias)}") + print(f"load_state_dict() 之后的 m.state_dict():\n {m.state_dict()}") -.. GENERATED FROM PYTHON SOURCE LINES 229-242 +.. GENERATED FROM PYTHON SOURCE LINES 215-224 -The above is a toy example of how we can use the new extension point in -``nn.Module.load_state_dict()``. One can also imagine alternate scenarios such -as when we have tensor subclasses in the ``state_dict`` and plain ``nn.Parameters``/ -tensors in the module or when both are tensor subclasses. Based on the use -case, we can define the ``__torch_function__`` handler for ``module_load`` -to apply the transforms as needed. +上面是一个如何使用 ``nn.Module.load_state_dict()`` 中的新扩展点的玩具示例。 +我们还可以想象其他场景,例如当 ``state_dict`` 中有张量子类而模块中有普通 ``nn.Parameters``/张量时, +或者两者都是张量子类时。根据使用场景,我们可以定义 ``module_load`` 的 ``__torch_function__`` 处理程序 +来应用所需的转换。 -Conclusion +结论 ---------- -In this recipe, we learned about ``swap_tensors``, the importance -of preserving references for parameters in ``nn.Module`` as well as how to -use the two new extension points that are gated by -``torch.__future__.set_swap_module_params_on_conversion``. +在本教程中,我们学习了 ``swap_tensors``、在 ``nn.Module`` 中保留参数引用的重要性, +以及如何使用由 ``torch.__future__.set_swap_module_params_on_conversion`` 控制的两个新扩展点。 .. rst-class:: sphx-glr-timing diff --git a/docs/_sources/recipes/recipes/tensorboard_with_pytorch.rst.txt b/docs/_sources/recipes/recipes/tensorboard_with_pytorch.rst.txt index 441e887..71fd2a2 100644 --- a/docs/_sources/recipes/recipes/tensorboard_with_pytorch.rst.txt +++ b/docs/_sources/recipes/recipes/tensorboard_with_pytorch.rst.txt @@ -18,71 +18,70 @@ .. _sphx_glr_recipes_recipes_tensorboard_with_pytorch.py: -How to use TensorBoard with PyTorch +如何在PyTorch中使用TensorBoard =================================== -TensorBoard is a visualization toolkit for machine learning experimentation. 
-TensorBoard allows tracking and visualizing metrics such as loss and accuracy, -visualizing the model graph, viewing histograms, displaying images and much more. -In this tutorial we are going to cover TensorBoard installation, -basic usage with PyTorch, and how to visualize data you logged in TensorBoard UI. +TensorBoard是一个用于机器学习实验的可视化工具包。 +TensorBoard允许跟踪和可视化指标,如损失和准确率, +可视化模型图,查看直方图,显示图像等。 +在本教程中,我们将介绍TensorBoard的安装、 +在PyTorch中的基本用法,以及如何在TensorBoard UI中可视化您记录的数据。 -Installation +安装 ---------------------- -PyTorch should be installed to log models and metrics into TensorBoard log -directory. The following command will install PyTorch 1.4+ via -Anaconda (recommended): +应安装PyTorch以将模型和指标记录到TensorBoard日志 +目录。以下命令将通过Anaconda(推荐)安装PyTorch 1.4+: .. code-block:: sh - $ conda install pytorch torchvision -c pytorch + $ conda install pytorch torchvision -c pytorch -or pip +或者使用pip: .. code-block:: sh $ pip install torch torchvision -.. GENERATED FROM PYTHON SOURCE LINES 30-36 +.. GENERATED FROM PYTHON SOURCE LINES 29-35 -Using TensorBoard in PyTorch +在PyTorch中使用TensorBoard ----------------------------- -Let’s now try using TensorBoard with PyTorch! Before logging anything, -we need to create a ``SummaryWriter`` instance. +现在让我们尝试在PyTorch中使用TensorBoard!在记录任何内容之前, +我们需要创建一个 ``SummaryWriter`` 实例。 -.. GENERATED FROM PYTHON SOURCE LINES 36-41 +.. GENERATED FROM PYTHON SOURCE LINES 35-41 .. code-block:: default import torch from torch.utils.tensorboard import SummaryWriter + writer = SummaryWriter() .. GENERATED FROM PYTHON SOURCE LINES 42-44 -Writer will output to ``./runs/`` directory by default. +写入器默认将输出到 ``./runs/`` 目录。 -.. GENERATED FROM PYTHON SOURCE LINES 47-59 +.. GENERATED FROM PYTHON SOURCE LINES 47-58 -Log scalars +记录标量 ----------- -In machine learning, it’s important to understand key metrics such as -loss and how they change during training. Scalar helps to save -the loss value of each training step, or the accuracy after each epoch. +在机器学习中,了解关键指标(如损失)及其在训练期间的变化非常重要。 +标量可用于保存每个训练步骤的损失值或每个epoch的准确率。 -To log a scalar value, use -``add_scalar(tag, scalar_value, global_step=None, walltime=None)``. -For example, lets create a simple linear regression training, and -log loss value using ``add_scalar`` +要记录标量值,请使用 +``add_scalar(tag, scalar_value, global_step=None, walltime=None)``。 +例如,让我们创建一个简单的线性回归训练,并 +使用 ``add_scalar`` 记录损失值 -.. GENERATED FROM PYTHON SOURCE LINES 59-80 +.. GENERATED FROM PYTHON SOURCE LINES 58-81 .. code-block:: default @@ -92,7 +91,8 @@ log loss value using ``add_scalar`` model = torch.nn.Linear(1, 1) criterion = torch.nn.MSELoss() - optimizer = torch.optim.SGD(model.parameters(), lr = 0.1) + optimizer = torch.optim.SGD(model.parameters(), lr=0.1) + def train_model(iter): for epoch in range(iter): @@ -102,24 +102,25 @@ log loss value using ``add_scalar`` optimizer.zero_grad() loss.backward() optimizer.step() - + + train_model(10) writer.flush() -.. GENERATED FROM PYTHON SOURCE LINES 81-89 +.. GENERATED FROM PYTHON SOURCE LINES 82-90 -Call ``flush()`` method to make sure that all pending events -have been written to disk. +调用 ``flush()`` 方法以确保所有待处理事件 +已写入磁盘。 -See `torch.utils.tensorboard tutorials `_ -to find more TensorBoard visualization types you can log. +请参阅 `torch.utils.tensorboard 教程 `_ +以了解您可以记录的更多TensorBoard可视化类型。 -If you do not need the summary writer anymore, call ``close()`` method. +如果您不再需要摘要写入器,请调用 ``close()`` 方法。 -.. GENERATED FROM PYTHON SOURCE LINES 89-92 +.. GENERATED FROM PYTHON SOURCE LINES 90-93 .. 
code-block:: default @@ -127,45 +128,43 @@ If you do not need the summary writer anymore, call ``close()`` method. writer.close() -.. GENERATED FROM PYTHON SOURCE LINES 93-122 +.. GENERATED FROM PYTHON SOURCE LINES 94-121 -Run TensorBoard +运行TensorBoard ---------------- -Install TensorBoard through the command line to visualize data you logged +通过命令行安装TensorBoard以可视化您记录的数据 .. code-block:: sh pip install tensorboard -Now, start TensorBoard, specifying the root log directory you used above. -Argument ``logdir`` points to directory where TensorBoard will look to find -event files that it can display. TensorBoard will recursively walk -the directory structure rooted at ``logdir``, looking for ``.*tfevents.*`` files. +现在,启动TensorBoard,指定您之前使用的根日志目录。 +参数 ``logdir`` 指向TensorBoard将查找可显示的事件文件的目录。 +TensorBoard将递归遍历 ``logdir`` 根目录下的目录结构,寻找 ``.*tfevents.*`` 文件。 .. code-block:: sh tensorboard --logdir=runs -Go to the URL it provides OR to `http://localhost:6006/ `_ +转到它提供的URL或 `http://localhost:6006/ `_ .. image:: ../../_static/img/thumbnails/tensorboard_scalars.png :scale: 40 % -This dashboard shows how the loss and accuracy change with every epoch. -You can use it to also track training speed, learning rate, and other -scalar values. It’s helpful to compare these metrics across different -training runs to improve your model. +此仪表板显示了损失和准确率如何随着每个epoch而变化。 +您可以使用它来跟踪训练速度、学习率和其他标量值。 +比较不同训练运行的这些指标有助于改进您的模型。 -.. GENERATED FROM PYTHON SOURCE LINES 125-131 +.. GENERATED FROM PYTHON SOURCE LINES 124-130 -Learn More +了解更多 ---------------------------- -- `torch.utils.tensorboard `_ docs -- `Visualizing models, data, and training with TensorBoard `_ tutorial +- `torch.utils.tensorboard `_ 文档 +- `使用TensorBoard可视化模型、数据和训练 `_ 教程 diff --git a/docs/_sources/recipes/recipes_index.rst.txt b/docs/_sources/recipes/recipes_index.rst.txt index b19ac42..0e56f12 100644 --- a/docs/_sources/recipes/recipes_index.rst.txt +++ b/docs/_sources/recipes/recipes_index.rst.txt @@ -103,7 +103,7 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu .. customcarditem:: :header: PyTorch Benchmark - :card_description: Learn how to use PyTorch's benchmark module to measure and compare the performance of your code + :card_description: 学习如何使用 PyTorch Benchmark 模块来测量和比较代码性能 :image: ../_static/img/thumbnails/cropped/profiler.png :link: ../recipes/recipes/benchmark.html :tags: Basics diff --git a/docs/_sources/recipes/torch_compile_backend_ipex.rst.txt b/docs/_sources/recipes/torch_compile_backend_ipex.rst.txt index 8d38a68..0d8613d 100644 --- a/docs/_sources/recipes/torch_compile_backend_ipex.rst.txt +++ b/docs/_sources/recipes/torch_compile_backend_ipex.rst.txt @@ -1,18 +1,17 @@ -Intel® Extension for PyTorch* Backend +Intel® PyTorch* 扩展后端 ===================================== -To work better with `torch.compile`, Intel® Extension for PyTorch* implements a backend ``ipex``. -It targets to improve hardware resource usage efficiency on Intel platforms for better performance. -The `ipex` backend is implemented with further customizations designed in Intel® Extension for -PyTorch* for the model compilation. +为了更好地与 `torch.compile` 协作,Intel® PyTorch* 扩展实现了一个名为 `ipex` 的后端。 +它旨在提高 Intel 平台上的硬件资源使用效率,从而获得更好的性能。 +`ipex` 后端是通过 Intel® PyTorch* 扩展中进一步的定制设计来实现模型编译的。 -Usage Example +使用示例 ~~~~~~~~~~~~~ -Train FP32 +FP32 训练 ---------- -Check the example below to learn how to utilize the `ipex` backend with `torch.compile` for model training with FP32 data type. +查看下面的示例,了解如何将 `ipex` 后端与 `torch.compile` 一起使用,进行 FP32 数据类型的模型训练。 .. 
code:: python @@ -44,10 +43,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9) model.train() - #################### code changes #################### + #################### 代码修改 #################### import intel_extension_for_pytorch as ipex - # Invoke the following API optionally, to apply frontend optimizations + # 可选择调用以下 API,应用前端优化 model, optimizer = ipex.optimize(model, optimizer=optimizer) compile_model = torch.compile(model, backend="ipex") @@ -61,10 +60,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c optimizer.step() -Train BF16 +BF16 训练 ---------- -Check the example below to learn how to utilize the `ipex` backend with `torch.compile` for model training with BFloat16 data type. +查看下面的示例,了解如何将 `ipex` 后端与 `torch.compile` 一起使用,进行 BFloat16 数据类型的模型训练。 .. code:: python @@ -96,10 +95,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9) model.train() - #################### code changes #################### + #################### 代码修改 #################### import intel_extension_for_pytorch as ipex - # Invoke the following API optionally, to apply frontend optimizations + # 可选择调用以下 API,应用前端优化 model, optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer) compile_model = torch.compile(model, backend="ipex") @@ -114,10 +113,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c optimizer.step() -Inference FP32 +FP32 推理 -------------- -Check the example below to learn how to utilize the `ipex` backend with `torch.compile` for model inference with FP32 data type. +查看下面的示例,了解如何将 `ipex` 后端与 `torch.compile` 一起使用,进行 FP32 数据类型的模型推理。 .. code:: python @@ -128,10 +127,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c model.eval() data = torch.rand(1, 3, 224, 224) - #################### code changes #################### + #################### 代码修改 #################### import intel_extension_for_pytorch as ipex - # Invoke the following API optionally, to apply frontend optimizations + # 可选择调用以下 API,应用前端优化 model = ipex.optimize(model, weights_prepack=False) compile_model = torch.compile(model, backend="ipex") @@ -141,10 +140,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c compile_model(data) -Inference BF16 +BF16 推理 -------------- -Check the example below to learn how to utilize the `ipex` backend with `torch.compile` for model inference with BFloat16 data type. +查看下面的示例,了解如何将 `ipex` 后端与 `torch.compile` 一起使用,进行 BFloat16 数据类型的模型推理。 .. code:: python @@ -155,10 +154,10 @@ Check the example below to learn how to utilize the `ipex` backend with `torch.c model.eval() data = torch.rand(1, 3, 224, 224) - #################### code changes #################### + #################### 代码修改 #################### import intel_extension_for_pytorch as ipex - # Invoke the following API optionally, to apply frontend optimizations + # 可选择调用以下 API,应用前端优化 model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False) compile_model = torch.compile(model, backend="ipex") diff --git a/docs/_sources/recipes/torch_logs.rst.txt b/docs/_sources/recipes/torch_logs.rst.txt index c182145..a568748 100644 --- a/docs/_sources/recipes/torch_logs.rst.txt +++ b/docs/_sources/recipes/torch_logs.rst.txt @@ -18,9 +18,9 @@ .. 
_sphx_glr_recipes_torch_logs.py: -(beta) Using TORCH_LOGS python API with torch.compile +(Beta) 使用 TORCH_LOGS python API 与 torch.compile ========================================================================================== -**Author:** `Michael Lazos `_ +**作者:** `Michael Lazos `_ .. GENERATED FROM PYTHON SOURCE LINES 6-9 @@ -30,101 +30,91 @@ import logging -.. GENERATED FROM PYTHON SOURCE LINES 10-18 +.. GENERATED FROM PYTHON SOURCE LINES 10-17 -This tutorial introduces the ``TORCH_LOGS`` environment variable, as well as the Python API, and -demonstrates how to apply it to observe the phases of ``torch.compile``. +本教程介绍了 ``TORCH_LOGS`` 环境变量以及 Python API,并演示了如何将其应用于观察 ``torch.compile`` 的各个阶段。 .. note:: - This tutorial requires PyTorch 2.2.0 or later. + 本教程需要 PyTorch 2.2.0 或更高版本。 -.. GENERATED FROM PYTHON SOURCE LINES 22-32 +.. GENERATED FROM PYTHON SOURCE LINES 21-28 -Setup +设置 ~~~~~~~~~~~~~~~~~~~~~ -In this example, we'll set up a simple Python function which performs an elementwise -add and observe the compilation process with ``TORCH_LOGS`` Python API. +在这个例子中,我们将设置一个简单的 Python 函数,执行元素级加法,并使用 ``TORCH_LOGS`` Python API 观察编译过程。 .. note:: - There is also an environment variable ``TORCH_LOGS``, which can be used to - change logging settings at the command line. The equivalent environment - variable setting is shown for each example. + 还有一个环境变量 ``TORCH_LOGS``,可用于在命令行中更改日志设置。每个示例都显示了等效的环境变量设置。 -.. GENERATED FROM PYTHON SOURCE LINES 32-81 +.. GENERATED FROM PYTHON SOURCE LINES 28-74 .. code-block:: default import torch - # exit cleanly if we are on a device that doesn't support torch.compile + # 如果设备不支持 torch.compile,则干净地退出 if torch.cuda.get_device_capability() < (7, 0): - print("Skipping because torch.compile is not supported on this device.") + print("跳过,因为此设备不支持 torch.compile。") else: + @torch.compile() def fn(x, y): z = x + y return z + 2 - inputs = (torch.ones(2, 2, device="cuda"), torch.zeros(2, 2, device="cuda")) - - # print separator and reset dynamo - # between each example + # 在每个示例之间打印分隔符并重置 dynamo def separator(name): print(f"==================={name}=========================") torch._dynamo.reset() - - separator("Dynamo Tracing") - # View dynamo tracing - # TORCH_LOGS="+dynamo" + separator("Dynamo 跟踪") + # 查看 dynamo 跟踪 + # TORCH_LOGS="+dynamo" torch._logging.set_logs(dynamo=logging.DEBUG) fn(*inputs) - separator("Traced Graph") - # View traced graph - # TORCH_LOGS="graph" + separator("跟踪的图形") + # 查看跟踪的图形 + # TORCH_LOGS="graph" torch._logging.set_logs(graph=True) fn(*inputs) - separator("Fusion Decisions") - # View fusion decisions - # TORCH_LOGS="fusion" + separator("融合决策") + # 查看融合决策 + # TORCH_LOGS="fusion" torch._logging.set_logs(fusion=True) fn(*inputs) - separator("Output Code") - # View output code generated by inductor - # TORCH_LOGS="output_code" + separator("输出代码") + # 查看 inductor 生成的输出代码 + # TORCH_LOGS="output_code" torch._logging.set_logs(output_code=True) fn(*inputs) separator("") -.. GENERATED FROM PYTHON SOURCE LINES 82-97 +.. GENERATED FROM PYTHON SOURCE LINES 75-87 -Conclusion +结论 ~~~~~~~~~~ -In this tutorial we introduced the TORCH_LOGS environment variable and python API -by experimenting with a small number of the available logging options. -To view descriptions of all available options, run any python script -which imports torch and set TORCH_LOGS to "help". 
+在本教程中,我们介绍了 TORCH_LOGS 环境变量和 python API,并通过实验了一小部分可用的日志选项。 +要查看所有可用选项的描述,请运行任何导入 torch 的 python 脚本,并将 TORCH_LOGS 设置为 "help"。 -Alternatively, you can view the `torch._logging documentation`_ to see -descriptions of all available logging options. +或者,您可以查看 `torch._logging 文档`_ 以查看所有可用日志选项的描述。 -For more information on torch.compile, see the `torch.compile tutorial`_. +有关 torch.compile 的更多信息,请参阅 `torch.compile 教程`_。 -.. _torch._logging documentation: https://pytorch.org/docs/main/logging.html -.. _torch.compile tutorial: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html +.. _torch._logging 文档: https://pytorch.org/docs/main/logging.html +.. _torch.compile 教程: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html .. rst-class:: sphx-glr-timing diff --git a/docs/_sources/recipes/torchscript_inference.rst.txt b/docs/_sources/recipes/torchscript_inference.rst.txt index 2f904e4..50e18e3 100644 --- a/docs/_sources/recipes/torchscript_inference.rst.txt +++ b/docs/_sources/recipes/torchscript_inference.rst.txt @@ -1,13 +1,16 @@ -TorchScript for Deployment +TorchScript 部署 ========================== -In this recipe, you will learn: +在本教程中,您将学习: - What TorchScript is - How to export your trained model in TorchScript format - How to load your TorchScript model in C++ and do inference +- TorchScript 是什么 +- 如何将训练好的模型导出为 TorchScript 格式 +- 如何在 C++ 中加载 TorchScript 模型并进行推理 -Requirements +环境要求 ------------ - PyTorch 1.5 @@ -15,31 +18,25 @@ Requirements - libtorch 1.5 - C++ compiler -The instructions for installing the three PyTorch components are -available at `pytorch.org`_. The C++ compiler will depend on your -platform. +安装这三个 PyTorch 组件的说明可在 `pytorch.org_` 上找到。C++ 编译器则取决于您的平台。 -What is TorchScript? + + +什么是 TorchScript? -------------------- -**TorchScript** is an intermediate representation of a PyTorch model -(subclass of ``nn.Module``) that can then be run in a high-performance -environment like C++. It’s a high-performance subset of Python that is -meant to be consumed by the **PyTorch JIT Compiler,** which performs -run-time optimization on your model’s computation. TorchScript is the -recommended model format for doing scaled inference with PyTorch models. -For more information, see the PyTorch `Introduction to TorchScript -tutorial`_, the `Loading A TorchScript Model in C++ tutorial`_, and the -`full TorchScript documentation`_, all of which are available on -`pytorch.org`_. - -How to Export Your Model +**TorchScript** 是 PyTorch 模型( ``nn.Module`` 的子类)的中间表示,可以在高性能环境(如 C++)中运行。 +它是 Python 的一个高性能子集,旨在被 **PyTorch JIT 编译器** 使用,后者会对模型的计算进行运行时优化。 +TorchScript 是使用 PyTorch 模型进行大规模推理的推荐模型格式。更多信息, +请参阅 `pytorch.org_` 上的 `PyTorch TorchScript 入门教程`、 `在 C++ 中加载 TorchScript 模型教程` +和 `完整的 TorchScript 文档_` 。 + +如何导出模型 ------------------------ -As an example, let’s take a pretrained vision model. All of the -pretrained models in TorchVision are compatible with TorchScript. +作为示例,让我们使用一个预训练的视觉模型。TorchVision 中的所有预训练模型都与 TorchScript 兼容。 -Run the following Python 3 code, either in a script or from the REPL: +运行以下 Python 3 代码,可以在脚本中或从 REPL 中运行: .. 
code:: python3 @@ -47,9 +44,9 @@ Run the following Python 3 code, either in a script or from the REPL: import torch.nn.functional as F import torchvision.models as models - r18 = models.resnet18(pretrained=True) # We now have an instance of the pretrained model - r18_scripted = torch.jit.script(r18) # *** This is the TorchScript export - dummy_input = torch.rand(1, 3, 224, 224) # We should run a quick test + r18 = models.resnet18(pretrained=True) # 现在我们有一个预训练模型的实例 + r18_scripted = torch.jit.script(r18) # *** 这是 TorchScript 导出 + dummy_input = torch.rand(1, 3, 224, 224) # 快速测试一下 Let’s do a sanity check on the equivalence of the two models: diff --git a/docs/objects.inv b/docs/objects.inv index d7a2820..3f43671 100644 Binary files a/docs/objects.inv and b/docs/objects.inv differ diff --git a/docs/recipes/compiling_optimizer.html b/docs/recipes/compiling_optimizer.html index 7829c9a..b50721b 100644 --- a/docs/recipes/compiling_optimizer.html +++ b/docs/recipes/compiling_optimizer.html @@ -42,7 +42,7 @@ - +