
# Using PyTorch 1.0 and ONNX with Fabric for Deep Learning

PyTorch is a key part of IBM's open source and product offerings. IBM has contributors supporting the open source PyTorch codebase, and we are adding multi-architecture support to PyTorch by enabling builds for the Power architecture. Other interesting projects have come out of IBM Research as well, such as Large Model Support and an open source framework for sequence-to-sequence (seq2seq) models in PyTorch.

Fabric for Deep Learning supports the distributed deep learning training capability found in PyTorch 1.0. FfDL can provision the requested number of nodes and GPUs with a shared file system on Kubernetes, which lets each node easily initialize and synchronize with the collective process group. From there, users can update gradients with various point-to-point, collective, or multi-GPU collective communication operations.
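As a minimal sketch of that initialization (the mount path and environment variable names here are illustrative assumptions, not FfDL-specific values):

```python
import os
import torch.distributed as dist

# Each learner joins the same process group. The shared file system
# mounted across nodes acts as the rendezvous point, so no external
# coordination service is needed. WORLD_SIZE/RANK and the mount path
# are illustrative; the platform would inject the real values per learner.
dist.init_process_group(
    backend="gloo",                              # or "nccl" / "mpi"
    init_method="file:///mnt/shared/ffdl_sync",  # file on the shared volume
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
)
```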


## Distributed training leveraging PyTorch 1.0

Fabric for Deep Learning (FfDL) now supports both PyTorch 1.0 and the ONNX model format. We also provide several examples to demonstrate how to get started with defining the PyTorch process group with different types of communication back ends, then train the model with distributed data parallelism.
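A toy version of that distributed data parallel pattern might look like the following (stand-in model and data; assumes the process group from the previous snippet is already initialized):

```python
import torch
from torch.nn.parallel import DistributedDataParallel

# Assumes dist.init_process_group() has already been called (see above).
model = torch.nn.Linear(784, 10)                 # stand-in for a real model
ddp_model = DistributedDataParallel(model)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 784)                    # stand-in for a data loader shard
labels = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = loss_fn(ddp_model(inputs), labels)
loss.backward()                                  # DDP all-reduces gradients here
optimizer.step()
```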

We've fully tested FfDL with the new PyTorch distributed training mechanisms using GLOO, NCCL, and MPI communication back ends to sync the model parameters.

|     | GLOO | MPI | NCCL |
|-----|------|-----|------|
| CPU | ✓    | ✓   | ✗    |
| GPU | ✓    | ✓   | ✓    |

Tested PyTorch 1.0 distributed examples have been added to the repository, covering distributed training with each of these communication back ends.

In addition, we support PyTorch 0.4.1 distributed training leveraging Uber's Horovod mechanism.
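The core of the Horovod pattern differs from the native process-group approach: instead of wrapping the model, you wrap the optimizer. A minimal sketch (toy model; typically launched with `horovodrun`):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(784, 10)                 # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers
# via Horovod's ring-allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```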

## Tech Preview for ONNX

FfDL has also added a tech preview of ONNX integration, which is a key feature of PyTorch 1.0.

To save models in ONNX format, you run your usual training functions and then export the trained model with the native torch.onnx.export function, much as you would save a regular PyTorch model. This removes the conversion layers between the different training and serving frameworks in your organization. After your model is converted to ONNX, you can load it into any serving back end that supports the format and start using it.
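For example, exporting takes only a couple of lines; note that torch.onnx.export traces the model with a sample input, so its shape matters (toy model below as a stand-in for your trained one):

```python
import torch

model = torch.nn.Linear(784, 10)                 # stand-in for your trained model
model.eval()

# Export traces the model with a sample input of the expected shape.
dummy_input = torch.randn(1, 784)
torch.onnx.export(model, dummy_input, "model.onnx")
```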

## Complete the pipeline: Deploy your ONNX-based models using Seldon with Intel nGraph

To complete the pipeline, Fabric for Deep Learning integrates with Seldon. Apart from serving PyTorch and TensorFlow models, Seldon recently announced the ability to serve ONNX models with Intel's nGraph back end, which is designed to optimize inference performance on CPUs.
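The nGraph back end runs server-side inside Seldon, but code that consumes an ONNX model is back-end agnostic. As a rough local illustration, here is the exported model from above loaded with onnxruntime, used purely as a stand-in runtime (it is not part of the FfDL/Seldon stack described here):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

x = np.random.randn(1, 784).astype(np.float32)   # matches the export shape above
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)                          # -> (1, 10)
```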

With this, we can craft an end-to-end pipeline that converts FfDL-trained models to ONNX and serves them with Seldon. Furthermore, because FfDL can save trained models to Object Storage using the Flex volume on Kubernetes, we have improved the Seldon integration to load the saved model directly from the Flex volume. This saves disk space in the serving image, generalizes the model wrapper definition, and improves scalability.