PyTorch DNNL_aarch64 build manual for FUJITSU Software Compiler Package (PyTorch v1.13.1)
- Introduction
- Environment and Prerequisites
- Installation Instructions
- Troubleshooting
- List of Software Version
This document contains instructions for installing PyTorch on a Fujitsu Supercomputer PRIMEHPC FX1000 or FX700.
It also provides sample instructions for installing and running several important models optimized for the FX1000 and FX700.
The PyTorch build instructions support installation both on systems with direct access to the Internet (or via a proxy) and on systems without an external connection. In this manual, the former is referred to as "online installation" and the latter as "offline installation".
For offline installation, you first download a set of necessary files beforehand on a system connected to the Internet (the "download system"), and then transfer them to the system to be installed (the "target system").
The following terms and abbreviations are used in this manual.
| Terms/Abbr. | Meaning |
|---|---|
| Online Installation | Install PyTorch on a system with direct access to the Internet (or via proxy) |
| Offline Installation | Install PyTorch on a system that does not have direct access to the Internet |
| Target system | System on which PyTorch is to be installed |
| Download system | System for downloading the necessary files in advance for offline installation |
| TCS | FX1000's job execution scheduler and compiler library environment (Technical Computing Suite) |
| CP | FX700's compiler library environment (Compiler Package) |
- OS: UNIX or Linux
- The following software must be available: bash, python, wget, git, unzip, tar, and curl
- Access to the target system
- Sufficient free space in the file system
The total amount of downloaded data is about 35GB, broken down as follows.

| Modules | Download Size |
|---|---|
| PyTorch source | 0.5GB |
| Extra files needed for PyTorch build | 1.5GB |
| Sample Model OpenNMT | 5GB |
| Sample Model BERT | 1GB |
| Sample Model Mask R-CNN | 26GB |
| Total | 35GB |
The download directory is under the PyTorch source directory.
- PRIMEHPC FX1000 or FX700
- For FX1000
- RHEL 8.x must be installed
- If you want to use FCC, Technical Computing Suite V4.0L20 must be installed
- For FX700
- RHEL 8.x or CentOS 8.x must be installed
- If you want to use FCC, Compiler Package V10L20 must be installed
- The following packages and commands must already be installed (a sketch of installing them follows this list):
  make gcc cmake libffi-devel gcc-gfortran numactl git patch unzip tk tcsh tcl lsof python3 pciutils
- At least 100GB of free storage space
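If any of these are missing, they could be installed with dnf on RHEL 8, for example (a sketch assuming root privileges; the package names are those listed above):

$ sudo dnf install -y make gcc cmake libffi-devel gcc-gfortran numactl \
    git patch unzip tk tcsh tcl lsof python3 pciutils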
Please note that building and running on NFS may cause unexpected problems depending on the performance and configuration of the NFS server.
It is recommended to use locally-attached storage or network storage that is fast enough.
The directory structure after installation looks like this.
The directories PREFIX, VENV_PATH, and TCSDS_PATH are specified in the configuration file env.src.
These three directories, plus PYTORCH_TOP, must be independent of each other.
(Make sure that no directory is located under another.)
PREFIX (where local binaries are stored)
+- bin (Python, etc.)
+- lib
VENV_PATH (location of python modules needed to run PyTorch)
+- bin (activate)
+- lib (packages to be installed by pip)
TCSDS_PATH (Fujitsu compiler; already installed before this procedure)
+- bin (fcc, FCC, etc.)
+- lib64
PYTORCH_TOP (complete PyTorch source, transferred from the download system or downloaded from https://www.github.com/fujitsu/pytorch)
+- aten, c10, torch, caffe2 ...
+- third_party
+- scripts
   +- fujitsu (PyTorch build scripts)
      +- down (downloaded files will be stored here)
      +- opennmt_build_pack, bert_build_pack, mask_r_cnn_build_pack (sample model sources and training data will be extracted under here)
If your environment requires a proxy for external access, please set the following environment variables.
(Replace "user", "pass", "proxy_url", and "port" with the ones appropriate for your environment.)
$ export http_proxy=http://user:pass@proxy_url:port
$ export https_proxy=https://user:pass@proxy_url:port
Note: curl, wget, git, and pip3 recognize the above environment variables, so there is no need to edit rc files or .gitconfig.
On both Intel x86 and A64FX, arithmetic on subnormal numbers is either computed exactly as defined in IEEE-754 (the default) or flushed to zero for performance reasons. The behavior can be changed on a per-process basis, but PyTorch does not do so, which may make training or inference significantly slower, possibly 10 to 100 times slower than would normally be expected, depending on the computation and input data.
For PyTorch optimized for FX1000/700, subnormal numbers are handled the same way as in the OSS version. Instructions and notes for changing this behavior are given in sections 3.3-A and 4.
The general installation flow is as follows:
- Preparation (Common for online/offline installation)
- Download (Offline installation only)
- Build (Common for online/offline installation)
$ git clone https://github.com/fujitsu/pytorch.git
$ cd pytorch # From now on, we'll call this directory PYTORCH_TOP
$ git checkout -b r1.13_for_a64fx origin/r1.13_for_a64fx
$ cd scripts/fujitsu
In the following examples, /home/user/pytorch is used as PYTORCH_TOP.
'env.src' is the configuration file, located in $PYTORCH_TOP/scripts/fujitsu.
The configuration is divided into two parts.
- Control of the Build

| Flag Name | Default Value | Meaning | Remarks |
|---|---|---|---|
| fjenv_use_venv | true | Use VENV when true | 'false' is not tested. |
| fjenv_use_fcc | true | Use FCC when true; otherwise, use GCC | 'false' is not tested. |
| fjenv_offline_install | false | 'true' for offline installation | |

Note that these flags are defined as shell variables in 'env.src', but they can also be set as environment variables outside of 'env.src'. In that case, the environment variable setting takes precedence over the setting in 'env.src'.
- Set up the build directories.

For the directory configuration, refer to the diagram in Chapter 2.3.

| Variable name | Meaning | Supplemental information |
|---|---|---|
| PREFIX | Directory where the executables generated by this build procedure are installed | |
| VENV_PATH | Name of the directory where the VENV is installed | Valid when use_venv=true |
| TCSDS_PATH | Name of the base directory for TCS and CP (base directory: a directory containing bin, lib, etc.) | Valid when use_fcc=true |

It is not necessary to alter settings other than those mentioned above.
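As an illustrative sketch only (the paths below are hypothetical placeholders; set them to suit your environment), the relevant part of env.src might look like this:

fjenv_use_venv=true             # use a Python virtual environment (default)
fjenv_use_fcc=true              # build with the Fujitsu compiler (default)
fjenv_offline_install=false     # set to 'true' for offline installation
PREFIX=/home/user/local         # local binaries (Python, etc.)
VENV_PATH=/home/user/venv       # Python modules needed to run PyTorch
TCSDS_PATH=/opt/fujitsu/tcsds   # Fujitsu compiler base directory (contains bin, lib64)

The flags can also be overridden per invocation without editing the file, e.g.:

$ fjenv_offline_install=true bash 1_python.sh download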
This section is only for offline installation. If you are installing on an Internet-connected system, skip this section and go to 3.3.
Run the shell scripts whose names start with a number in the scripts/fujitsu directory, one by one in numerical order, with the argument download.
$ pwd
/home/user/pytorch/scripts/fujitsu # $PYTORCH_TOP/scripts/fujitsu
$ bash 1_python.sh download # Download Python
$ bash 3_venv.sh download # Download Python modules for PyTorch
$ bash 4_numpy_scipy.sh download # Download NumPy and SciPy
$ bash 5_pytorch.sh download # Download Modules for PyTorch build
$ bash 6_vision.sh download # Download TorchVision
$ bash 7_horovod.sh download # Download Horovod
$ bash 8_libtcmalloc.sh download # Download tcmalloc
The scripts are designed so that they will not download files that have already been downloaded.
If you want to download files again, run each script with the clean argument first, and then run it with download.
Please note that clean has higher priority than download, so if you specify clean download or download clean, only clean is performed.
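For example, to force a fresh download of the files for the PyTorch build step:

$ bash 5_pytorch.sh clean      # delete the files downloaded earlier
$ bash 5_pytorch.sh download   # download them again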
The sample models are located under opennmt_build_pack, bert_build_pack, and mask_r_cnn_build_pack.
Run the shell scripts whose names start with a number in each directory, one by one in numerical order, with the argument download.
The sample models and training data are downloaded into each model directory; everything else is downloaded into $PYTORCH_TOP/scripts/fujitsu/down.
The scripts are designed so that they will not download files that have already been downloaded.
If you want to download files again, run each script with the clean argument first, and then run it with download.
Note that the training data is not deleted by clean, because downloading or re-creating the training data usually takes a lot of time.
If you really want to delete the training data, remove the following data directories manually.
$PYTORCH_TOP/scripts/fujitsu/opennmt_build_pack/dataset
$PYTORCH_TOP/scripts/fujitsu/bert_build_pack/hostdata
$PYTORCH_TOP/scripts/fujitsu/mask_r_cnn_build_pack/dataset
$PYTORCH_TOP/scripts/fujitsu/mask_r_cnn_build_pack/weights
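For example (double-check the paths first; this deletes the training data irrevocably):

$ rm -rf $PYTORCH_TOP/scripts/fujitsu/opennmt_build_pack/dataset
$ rm -rf $PYTORCH_TOP/scripts/fujitsu/bert_build_pack/hostdata
$ rm -rf $PYTORCH_TOP/scripts/fujitsu/mask_r_cnn_build_pack/dataset
$ rm -rf $PYTORCH_TOP/scripts/fujitsu/mask_r_cnn_build_pack/weights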
Transfer everything under $PYTORCH_TOP to the target system.
We do not describe the transfer method, as it depends on your system configuration.
Use scp, ftp, a shared filesystem, or any other method appropriate for your system.
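One possible sketch using tar and scp (the host name and paths here are hypothetical):

$ cd /home/user                           # parent directory of PYTORCH_TOP on the download system
$ tar czf pytorch.tar.gz pytorch          # archive the whole source tree, including downloaded files
$ scp pytorch.tar.gz user@target:/home/user/
$ ssh user@target 'cd /home/user && tar xzf pytorch.tar.gz'   # extract on the target system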
Run the shell scripts whose names start with a number, in numerical order, one after the other.
The following example shows how to install with an interactive shell.
If you are using a job control system, you can, for example,
create a batch script that executes the series of build scripts,
and then submit that batch script.
In that case, it is recommended to add a shell command that terminates the batch script when a build script fails (such as set -e in bash), as in the sketch below.
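A minimal sketch of such a batch script, assuming bash (any directives required by your job scheduler are omitted):

#!/bin/bash
set -e                                  # stop at the first failing build script
cd /home/user/pytorch/scripts/fujitsu   # $PYTORCH_TOP/scripts/fujitsu
for script in 1_python.sh 3_venv.sh 4_numpy_scipy.sh 5_pytorch.sh \
              6_vision.sh 7_horovod.sh 8_libtcmalloc.sh; do
    bash "$script"
done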
[option] is an optional flag to pass to the script. If omitted, the build is executed.
The scripts are designed so that they will not build again when the binaries already exist.
If you want to build again, run each script with the rebuild argument.
Do not confuse rebuild with clean: if clean is specified, all the downloaded files are deleted, which requires you to download them again (and, for offline installation, transfer them again).
$ pwd
/home/user/pytorch/scripts/fujitsu # $PYTORCH_TOP/scripts/fujitsu
$ bash 1_python.sh [option] # Build and install Python (10 min.)
$ bash 3_venv.sh [option] # Create VENV (1 min.)
$ bash 4_numpy_scipy.sh [option] # Build NumPy and SciPy (90 min.)
$ bash 5_pytorch.sh [option] # Build PyTorch (120 min.)
$ bash 6_vision.sh [option] # Build TorchVision (20 min.)
$ bash 7_horovod.sh [option] # Build Horovod (6 min.)
$ bash 8_libtcmalloc.sh [option] # Install tcmalloc (1 min.)
Change the default behavior of operations involving subnormal numbers to flush-to-zero: change line 99 of 5_pytorch.sh as follows.
Note: The change is effective from import torch until the process is terminated; it also affects python modules other than PyTorch (e.g. NumPy).
97 # 'setup.py' in PyTorch ensures that CFLAGS is used for both C and C++ compiler,
98 # but just in case...
99 CFLAGS=-O3 CXXFLAGS=-O3 python3 setup.py build -j $MAX_JOBS install
|
V
99 CFLAGS=-Kfast python3 setup.py build -j $MAX_JOBS install
To verify the build, run ResNet-50.
The following shows how to confirm execution.
Since the execution speed of deep learning models can vary by 10-20%,
use the execution speed described in this manual as a guide;
if your result is within that range, your build is OK.
Also, keep in mind that the settings of the sample model are not optimal.
$ pwd
/home/user/pytorch/scripts/fujitsu # $PYTORCH_TOP/scripts/fujitsu
$ bash run1proc.sh # 1 node, 1 process, 48 cores, dummy data.
$ bash run1node.sh # 1 node, 4 processes, 12 cores/process, dummy data.
run1proc.sh outputs the execution time of one step.
The following is an example of output from run1proc.sh on FX1000 (2.0GHz: using "freq=2000").
The processing time per iteration is expected to be about 2.5 sec.
$ bash run1proc.sh
>> script option: Namespace(batch=256, itr=20, lr=0.001, momentum=0.9, weight_decay=0.0, type='cpu_mkltensor', trace=False)
## Start Training
[ 1] loss: 7.244 time: 3.203 s
[ 2] loss: 5.030 time: 2.540 s
[ 3] loss: 1.218 time: 2.564 s
[ 4] loss: 0.012 time: 2.562 s
[ 5] loss: 0.000 time: 2.564 s # Since we are training with dummy data,
[ 6] loss: 0.000 time: 2.551 s # the loss value goes to zero immediately.
[ 7] loss: 0.000 time: 2.566 s
[ 8] loss: 0.000 time: 2.545 s
[ 9] loss: 0.000 time: 2.542 s
(snip)
run1node.sh performs data-parallel training with 4 processes.
It uses a different model than the one used in run1proc.sh, with different parameters.
The following is an example output of run1node.sh on FX1000 (2.0GHz: using "freq=2000").
If the total value over the 4 processes is more than 100 img/sec, the build is fine.
(snip)
Running benchmark...
Iter #0: 26.6 img/sec per CPU # This value is for the rank0 process.
Iter #1: 26.6 img/sec per CPU
Iter #2: 26.5 img/sec per CPU
Iter #3: 26.6 img/sec per CPU
Iter #4: 26.6 img/sec per CPU
Img/sec per CPU: 26.6 +-0.0
Total img/sec on 4 CPU(s): 106.3 +-0.1 # This value is the sum of 4 processes
The sample models can be found under opennmt_build_pack, bert_build_pack, and mask_r_cnn_build_pack.
Run the shell scripts whose names start with a number in those directories, in numerical order, one after the other.
The details of the build and verification are described below.
Since the execution speed of deep learning models can vary by 10-20%, use the execution speed described in this manual as a guide; if your result is within that range, your build is OK.
CAUTION: The sample models provided here are slightly modified from the originals for operation checks and performance analysis purposes; for example, the random number seed may be fixed for profile collection, or the model may be set to abort after a certain number of iterations. Please do not use these models as-is for actual training.
Also, please keep in mind that the settings of the sample models are not optimal.
Located in the opennmt_build_pack distribution.
https://github.com/OpenNMT/OpenNMT-py/tree/1.1.1
Tag: v1.1.1 (2020/3/20)
$ pwd
/home/user/pytorch/scripts/fujitsu/opennmt_build_pack # $PYTORCH_TOP/scripts/fujitsu/opennmt_build_pack
$ bash 4_sentencepiece.sh [option] # Install Sentencepiece (< 1 min.)
$ bash 8_opennmt.sh [option] # Build and Install OpenNMT (1 min.)
$ bash 9_dataset.sh [option] # Generate Training Data (60 min.)
$ bash run1proc.sh # Run (1 node, 1 proc., 24 cores)
$ bash run1node.sh # Run (1 node, 1 proc., 24 cores/proc)
Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.
The following is an example of the results (see the line marked with the arrow).
[2023-03-16 14:21:06,703 INFO] number of examples: 493128
[2023-03-16 14:21:18,714 INFO] Step 1/ 20; acc: 0.00; ppl: 8242.63; xent: 9.02; lr: 0.00000; 1256/2112 tok/s; 2 sec
[2023-03-16 14:21:20,091 INFO] Step 2/ 20; acc: 0.00; ppl: 8214.91; xent: 9.01; lr: 0.00000; 1503/2384 tok/s; 3 sec
(snip)
[2023-03-16 14:21:44,641 INFO] Step 19/ 20; acc: 0.00; ppl: 8092.40; xent: 9.00; lr: 0.00000; 2140/2556 tok/s; 28 sec
[2023-03-16 14:21:46,032 INFO] Step 20/ 20; acc: 0.00; ppl: 8077.04; xent: 9.00; lr: 0.00000; 2128/2565 tok/s; 29 sec
[2023-03-16 14:21:46,836 INFO] Saving checkpoint /work/users/isv18/isv1803/check_v113/pytorch/scripts/fujitsu/opennmt_build_pack/save_model_step_20.pt
[2023-03-16 14:21:54,708 INFO] Total Data loading time (ignore first step): 216.47787 ms
--> [2023-03-16 14:21:54,710 INFO] Average : 1.4552 [sec/3850batch], 2564.7872 [tok/s]
[2023-03-16 14:21:54,722 INFO] Write to aarch64/run_log.csv
For FX1000 (2.0 GHz: using "freq=2000"), the expected result of run1proc.sh is about 2600 tok/s, and the expected result of run1node.sh is about 4400 tok/s.
Located in the bert_build_pack distribution.
https://github.com/huggingface/transformers/tree/v3.4.0
Tag: v3.4.0 (2020/10/20)
Note: Previously, we provided two tasks, pre-training and fine-tuning, but since the arithmetic processing is almost the same for both, we now provide only the pre-training task, which is the more computationally challenging of the two.
$ pwd
/home/user/pytorch/scripts/fujitsu/bert_build_pack # $PYTORCH_TOP/scripts/fujitsu/bert_build_pack
$ bash 3_rust.sh [option] # Install Rust. ( 1 min.)
$ bash 4_sentencepiece.sh [option] # Install Sentencepiece. ( 1 min.)
$ bash 5_tokenizers.sh [option] # Install Tokenizer. ( 1 min.)
$ bash 7_transformers.sh [option] # Build and Install BERT. ( 1 min.)
$ bash 8_dataset.sh [option] # Generate Training Data ( 1 min.)
$ bash run1proc.sh # Run (1 node, 1 proc., 24 cores)
$ bash run1node.sh # Run (1 node, 2 proc., 24 cores/proc.)
Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.
The following is an example of the results (see the line marked with the arrow).
Note that printing the statistics for each iteration was added to confirm the operation of the sample model.
The loss values displayed are the accumulated sums of the loss values of each iteration, so they increase monotonically. Also, the average speed shown is for the profile interval (by default, from the 10th to the 18th iteration).
03/16/2023 14:28:14 - INFO - __main__ - ***** Running training *****
03/16/2023 14:28:14 - INFO - __main__ - Num examples = 4800
03/16/2023 14:28:14 - INFO - __main__ - Num Epochs = 1
03/16/2023 14:28:14 - INFO - __main__ - Instantaneous batch size per device = 2
03/16/2023 14:28:14 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
03/16/2023 14:28:14 - INFO - __main__ - Gradient Accumulation steps = 1
03/16/2023 14:28:14 - INFO - __main__ - Total optimization steps = 20
03/16/2023 14:28:16 - INFO - __main__ - step = 0, time = 1.199 loss = 3.12283
03/16/2023 14:28:17 - INFO - __main__ - step = 1, time = 0.785 loss = 6.41185 # Accumulated loss is shown
(snip)
03/16/2023 14:28:31 - INFO - __main__ - step = 18, time = 0.661 loss = 59.91796
03/16/2023 14:28:32 - INFO - __main__ - step = 19, time = 0.661 loss = 62.79688
--> 03/16/2023 14:28:32 - INFO - __main__ - Averages: 0.7534 s/iter 2.6548 batch/s
03/16/2023 14:28:32 - INFO - __main__ - Epoch Overhead: 0.0001 s
For FX1000 (2.0 GHz: using "freq=2000"), the expected result of run1proc.sh is about 2.6 batch/s, and the expected result of run1node.sh is about 4.5 batch/s.
Located in the mask_r_cnn_build_pack distribution.
https://github.com/facebookresearch/detectron2/tree/v0.2.1
Tag: v0.2.1 (2020/7/30)
$ pwd
/home/user/pytorch/scripts/fujitsu/mask_r_cnn_build_pack # $PYTORCH_TOP/scripts/fujitsu/mask_r_cnn_build_pack
$ bash 5_detectron2.sh [option] # Build and Install Mask R-CNN (10 min.)
$ bash 9_dataset.sh [option] # Generate Training Data (60 min.)
$ bash run1proc.sh # Run (1 node, 1 proc., 24 cores)
$ bash run1node.sh # Run (1 node, 2 proc., 24 cores/proc.)
Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.
The following is an example of the results (see the line marked with the arrow).
Note that printing the statistics for each iteration was added to confirm the operation of the sample model. The average speed shown is for the profile interval (by default, from the 10th to the 19th iteration).
[03/16 15:34:04 d2.engine.train_loop]: Starting training from iteration 0
(snip)
[03/16 15:34:52 d2.utils.events]: eta: 0:00:04 iter: 18 total_loss: 2.723 loss_cls: 0.836 loss_box_reg: 0.012 loss_mask: 0.699 loss_rpn_cls: 0.695 loss_rpn_loc: 0.543 time: 2.4641 data_time: 0.3141 lr: 0.000047
[03/16 15:34:56 d2.utils.events]: eta: 0:00:02 iter: 19 total_loss: 2.798 loss_cls: 0.915 loss_box_reg: 0.013 loss_mask: 0.698 loss_rpn_cls: 0.661 loss_rpn_loc: 0.636 time: 2.5034 data_time: 0.3192 lr: 0.000050
--> [03/16 15:34:56 d2.engine.train_loop]: Averages: 2.5142 s/iter 0.7955 image/s
[03/16 15:34:56 d2.engine.hooks]: Overall training speed: 18 iterations in 0:00:45 (2.5034 s / it)
For FX1000 (2.0 GHz: using "freq=2000"), the expected result of run1proc.sh is about 0.79 image/s, and the expected result of run1node.sh is about 1.2 image/s.
When all of the following conditions are met, you will get a "cannot execute binary file: Exec format error" message.
- Offline installation is being performed.
- The download system is other than FX1000 or FX700 (e.g. PRIMERGY or other x86 server).
- The download system and target system share the network storage, and you are trying to install on it.
- You have already built PyTorch and are going to build a sample model later.
In this case, please do one of the following:
- Download everything first, then build.
- Separate the download directory and the build directory.
Possibly calculations involving subnormal numbers have occurred.
In PyTorch, handling of subnormal numbers is altered with 'torch.set_flush_denormal(True)'.
This affects the behavior of the running python process until it terminates; note that it also affects modules other than PyTorch (e.g. NumPy).
You can revert to the exact calculation mode with 'torch.set_flush_denormal(False)'.
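A minimal sketch of toggling this within a Python session (run with the VENV built above activated):

$ python3 - <<'EOF'
import torch
torch.set_flush_denormal(True)    # flush subnormal numbers to zero for this whole process (also affects NumPy, etc.)
# ... run the computation that was slowed down by subnormal numbers ...
torch.set_flush_denormal(False)   # revert to exact IEEE-754 handling
EOF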
When OpenNMT runs in process-thread parallel with OMP_WAIT_POLICY=active, performance can degrade by a factor of 100. In this case, run without OMP_WAIT_POLICY=active.
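For example, make sure the variable is not set before launching the run script:

$ unset OMP_WAIT_POLICY
$ bash run1node.sh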
Software | Version | License | Remarks |
---|---|---|---|
Python | 3.9.x (2021/10/4 or thereafter) | GPL | 'x' depends on the installation date (use the latest commit in the branch 3.9) |
PyTorch | 1.13.1 (2022/12/8) | BSD-3 | |
oneDNN | v2.7.0 (2022/7/20) | Apache 2.0 | Same version as used for TensorFlow (PyTorch v1.13.1 itself uses oneDNN v2.6) |
Horovod | 0.26.1 (2022/10/14) | Apache 2.0 | |
NumPy | 1.22.x (2021/12/30 or thereafter) | BSD-3 | 'x' depends on the installation date (use the latest commit in the branch 1.22) |
SciPy | 1.7.x (2021/6/19 or thereafter) | BSD-3 | 'x' depends on the installation date (use the latest commit in the branch 1.7) |
For other software modules, basically the latest versions available at the time of installation are used.
Package Version
----------------------- -------------------
absl-py 1.4.0
astunparse 1.6.3
attrs 22.2.0
beniget 0.4.1
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 3.0.1
click 8.1.3
cloudpickle 2.2.1
ConfigArgParse 1.5.3
contourpy 1.0.7
cycler 0.11.0
Cython 0.29.33
detectron2 0.2
exceptiongroup 1.1.0
expecttest 0.1.4
filelock 3.9.0
Flask 2.2.3
fonttools 4.38.0
future 0.18.3
fvcore 0.1.5.post20221221
gast 0.5.3
google-auth 2.16.1
google-auth-oauthlib 0.4.6
grpcio 1.51.3
horovod 0.26.1
hypothesis 6.67.1
idna 3.4
importlib-metadata 6.0.0
importlib-resources 5.12.0
iniconfig 2.0.0
iopath 0.1.10
itsdangerous 2.1.2
Jinja2 3.1.2
joblib 1.2.0
kiwisolver 1.4.4
Markdown 3.4.1
MarkupSafe 2.1.2
matplotlib 3.7.0
mock 5.0.1
mpmath 1.2.1
numpy 1.22.4
oauthlib 3.2.2
OpenNMT-py 1.1.1
packaging 23.0
Pillow 7.2.0
pip 22.0.4
pluggy 1.0.0
ply 3.11
portalocker 2.7.0
protobuf 3.20.3
psutil 5.9.4
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybind11 2.10.3
pycocotools 2.0.6
pycparser 2.21
pydot 1.4.2
pyonmttok 1.36.0
pyparsing 3.0.9
pytest 7.2.1
python-dateutil 2.8.2
pythran 0.12.1
PyYAML 6.0
regex 2022.10.31
requests 2.28.2
requests-oauthlib 1.3.1
rsa 4.9
sacremoses 0.0.53
SciPy 1.7.3
semantic-version 2.10.0
sentencepiece 0.1.97
setuptools 67.2.0
setuptools-rust 1.5.2
six 1.16.0
sortedcontainers 2.4.0
sympy 1.11.1
tabulate 0.9.0
tensorboard 2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
termcolor 2.2.0
tokenizers 0.9.2
tomli 2.0.1
torch 1.13.0a0+git4e8ea0e
torchtext 0.4.0
torchvision 0.14.1a0+5e8e2f1
tqdm 4.30.0
transformers 3.4.0
types-dataclasses 0.6.6
typing_extensions 4.4.0
urllib3 1.26.14
waitress 2.1.2
Werkzeug 2.2.3
wheel 0.38.4
yacs 0.1.8
zipp 3.15.0
Copyright RIKEN, Japan 2021-2023
Copyright FUJITSU LIMITED 2021-2023