Marian v1.11.0

emjotde released this 08 Feb 16:52

[1.11.0] - 2022-02-08

Added

Parallelized data reading with e.g. --data-threads 8
Top-k sampling during decoding with e.g. --output-sampling topk 10
Improved mixed precision training with --fp16
Set FFN width in decoder independently from encoder with e.g. --transformer-dim-ffn 4096 --transformer-decoder-dim-ffn 2048
Adds option --add-lsh to marian-conv which allows the LSH to be memory-mapped.
Early stopping based on first, all, or any validation metrics via --early-stopping-on
Compute 8.6 support if using CUDA>=11.1
Support for RMSNorm as a drop-in replacement for LayerNorm, from Biao Zhang; Rico Sennrich (2019). Root Mean Square Layer Normalization. Enabled in Transformer model via --transformer-postprocess dar instead of dan.
Extend suppression of unwanted output symbols, specifically "\n" from default vocabulary if generated by SentencePiece with byte-fallback. Can be deactivated with --allow-special
Allow for fine-grained CPU intrinsics overrides when BUILD_ARCH != native e.g. -DBUILD_ARCH=x86-64 -DCOMPILE_AVX512=off
Adds custom bias epilogue kernel.
Adds support for fusing ReLU and bias addition into GEMMs when using CUDA 11.
Better suppression of unwanted output symbols, specifically "\n" from SentencePiece with byte-fallback. Can be deactivated with --allow-special
Display decoder time statistics with marian-decoder --stat-freq 10 ...
Support for MS-internal binary shortlist
Local/global sharding with MPI training via --sharding local
fp16 support for factors.
Correct training with fp16 via --fp16.
Dynamic cost-scaling with --cost-scaling.
Dynamic gradient-scaling with --dynamic-gradient-scaling.
Add unit tests for binary files.
Fix compilation with OMP
Added --model-mmap option to enable mmap loading for CPU-based translation
Compute aligned memory sizes using exact sizing
Support for loading lexical shortlist from a binary blob
Integrate a shortlist converter (which can convert a text lexical shortlist to a binary shortlist) into marian-conv with --shortlist option
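
The RMSNorm entry above refers to a normalization that rescales by the root mean square of the activations and skips the mean subtraction that LayerNorm performs. The following is a minimal Python sketch of that difference, not Marian's implementation: the function names, the placement of `eps` inside the square root, and the omission of a bias term in `rms_norm` are assumptions made for illustration.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square only (no mean subtraction)."""
    # eps inside the square root is an assumption of this sketch
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

print(layer_norm(x, ones, zeros))  # output is centered around zero
print(rms_norm(x, ones))           # output keeps the sign and offset of x
```

With unit gain, the RMSNorm output has a mean square close to 1 but is not re-centered, which is why it needs one fewer reduction over the hidden dimension than LayerNorm.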

Fixed

Fix AVX2 and AVX512 detection on MacOS
Add GCC11 support into FBGEMM
Added pragma to ignore unused-private-field error on elementType_ on macOS
Do not set guided alignments for case augmented data if vocab is not factored
Various fixes to enable LSH in Quicksand
Added support to MPIWrapper::bcast (and similar) for count of type size_t
Adding new validation metrics when training is restarted and --reset-valid-stalled is used
Missing depth-scaling in transformer FFN
Fixed an issue when loading intgemm16 models from unaligned memory.
Fix building marian with gcc 9.3+ and FBGEMM
Find MKL installed under Ubuntu 20.04 via apt-get
Support for CUDA 11.
General improvements and fixes for MPI handling, which was essentially non-functional before (syncing, random seeds, deadlocks during saving, validation etc.)
Allow compiling with -DUSE_MPI=on together with -DUSE_STATIC_LIBS=on, although MPI is still linked dynamically since it has so many dependencies.
Fix building server with Boost 1.75
Missing implementation for cos/tan expression operator
Fixed loading binary models on architectures where size_t != uint64_t.
Missing float template specialisation for elem::Plus
Broken links to MNIST data sets
Enforce validation for the task alias in training mode.

Changed

MacOS marian uses Apple Accelerate framework by default, as opposed to openblas/mkl.
Optimize LSH for speed by treating it as a shortlist generator. No option changes in decoder
Set REQUIRED_BIAS_ALIGNMENT = 16 in tensors/gpu/prod.cpp to avoid memory-misalignment on certain Ampere GPUs.
For BUILD_ARCH != native, enable all intrinsics types by default; individual types can be disabled like this: -DCOMPILE_AVX512=off
Moved FBGEMM pointer to commit c258054 for gcc 9.3+ fix
Change compile options a la -DCOMPILE_CUDA_SM35 to -DCOMPILE_KEPLER, -DCOMPILE_MAXWELL, -DCOMPILE_PASCAL, -DCOMPILE_VOLTA, -DCOMPILE_TURING and -DCOMPILE_AMPERE
Disable -DCOMPILE_KEPLER, -DCOMPILE_MAXWELL by default.
Dropped support for legacy graph groups.
Developer documentation framework based on Sphinx+Doxygen+Breathe+Exhale
Expression graph documentation (#788)
Graph operators documentation (#801)
Remove unused variable from expression graph
Factor groups and concatenation: doc/factors.md
