feat: Downloading mined bitext #24

Closed
wants to merge 57 commits into from
57 commits
26d62ae
[nllb] No Language Left Behind @ 200
vedanuj Jul 6, 2022
aea9a95
Backtransled -> Backtranslated (#4551)
jumasheff Jul 7, 2022
9bb3732
remove unused cluster.data_dir
annasun28 Jul 7, 2022
3884ab3
install.md update fairscale branch
annasun28 Jul 7, 2022
bffdfe6
Merge pull request #4554 from facebookresearch/nllb-oss-update
vedanuj Jul 7, 2022
fcf0d53
README: fix link to flores-200 (#4556)
raphaelmerx Jul 8, 2022
d34deb7
Mention Hugging Face integration (#4601)
LysandreJik Jul 26, 2022
5fba639
Fix NLLB readme (#4621)
LysandreJik Aug 1, 2022
f87107c
Fix Santali language code: sat_Beng -> sat_Olck (#4576)
elbayadm Aug 2, 2022
eb61e22
[nllb] Release NLLB-200 translations (#4673)
vedanuj Aug 29, 2022
ec0ae71
copied in XSTS open-sourcing files and readme from previous attempt (…
Lichtphyz Feb 7, 2023
d3f0583
Initial data exploration + fixes for the original NLLB branch
gordicaleksa Aug 29, 2023
4e7a31d
Update README
gordicaleksa Aug 29, 2023
61c69be
Fix datasets structure - get filtering to run
gordicaleksa Aug 30, 2023
f057ab7
Refactor the code
gordicaleksa Aug 30, 2023
d5b1094
Refactor and robustify dataset structure modification scripts
gordicaleksa Aug 31, 2023
d1cd0f4
Analyze filtered data num sentences statistics
gordicaleksa Sep 1, 2023
d49027f
Improve data analysis scripts, robustify train pipeline
gordicaleksa Sep 2, 2023
f63570e
Refactor TIL so as to support the new ds structure
gordicaleksa Sep 2, 2023
72707bc
Refactor TICO so as to support the new ds structure
gordicaleksa Sep 2, 2023
ea88222
Refactor indic NLP dataset
gordicaleksa Sep 2, 2023
fe5e114
Refactor lingala songs + robustify the other funcs
gordicaleksa Sep 2, 2023
3146cd5
Refactor ffr & covid datasets
gordicaleksa Sep 2, 2023
867ed83
(fix) Update indic nlp - rename to bcp47
gordicaleksa Sep 2, 2023
7141c7b
Refactor all of the primary datasets download funcs
gordicaleksa Sep 2, 2023
d3d9521
Complete refactor of download funcs + validation check
gordicaleksa Sep 2, 2023
a5990e1
Add support for automatic download of NLLB-Seed
gordicaleksa Sep 3, 2023
d534cd3
Deprecate ds structure modification script
gordicaleksa Sep 3, 2023
7b5e98f
Update README with new download instructions
gordicaleksa Sep 3, 2023
2337d7c
Line length analysis
gordicaleksa Sep 3, 2023
ea9402a
Add some eval datasets to our download scripts + improve dedup analysis
gordicaleksa Sep 3, 2023
86c7cc9
added Apex pre-install instructions
lavaman131 Sep 4, 2023
21cf526
Merge pull request #1 from lavaman131/nllb_replication
gordicaleksa Sep 4, 2023
746e07e
Prepare Flores202 for eval, small fixes for analyze data script
gordicaleksa Sep 4, 2023
0783a56
Minor logic change in data analyze script
gordicaleksa Sep 5, 2023
a3ef373
Update README - lang champs
gordicaleksa Sep 5, 2023
705a0b0
Start working on a new-joiner friendly getting started README
gordicaleksa Sep 5, 2023
0efe0ab
Improve the getting started instructions
gordicaleksa Sep 5, 2023
a52807e
Include requirements.txt for data download script
gordicaleksa Sep 5, 2023
229c625
Fix broken links, add new packages to requirements.txt
gordicaleksa Sep 5, 2023
0124ed7
Add requests package
gordicaleksa Sep 5, 2023
25adea6
Refine sections 4 and 5 of getting started README
gordicaleksa Sep 5, 2023
5ce9941
Update the main README
gordicaleksa Sep 5, 2023
83ca61f
Fix anchor links in the README, update gitignore
gordicaleksa Sep 5, 2023
5998fb0
Update the banner image in the main README
gordicaleksa Sep 8, 2023
d2de6da
Add support for text sampling during training + chrf++ eval
gordicaleksa Sep 10, 2023
b844e02
Add explanation of the MT data pipeline in fairseq
gordicaleksa Sep 10, 2023
7d00c19
Wrap up data pipeline document
gordicaleksa Sep 11, 2023
2877fa5
Refine the data pipeline doc
gordicaleksa Sep 11, 2023
a0b130f
Final data pipeline document version
gordicaleksa Sep 11, 2023
5c4b93d
fixed minor type hint error for python < 3.9
lavaman131 Sep 11, 2023
89681e0
Merge branch 'gordicaleksa:nllb_replication' into nllb_replication
lavaman131 Sep 11, 2023
0cae656
Merge pull request #9 from lavaman131/nllb_replication
gordicaleksa Sep 11, 2023
9dc9478
After batch by size func analysis updated the data pipeline README desc
gordicaleksa Sep 11, 2023
780f5a5
Improve project description + add the roadmap
gordicaleksa Sep 12, 2023
58e2acc
Add OPUS download and cleaning scripts
gordicaleksa Sep 15, 2023
02a75ce
mined bitext script
vienneraphael Oct 3, 2023
48 changes: 27 additions & 21 deletions .circleci/config.yml
@@ -6,9 +6,9 @@ version: 2.1
# -------------------------------------------------------------------------------------
gpu: &gpu
environment:
CUDA_VERSION: "11.1"
CUDA_VERSION: "11.2"
machine:
image: ubuntu-1604-cuda-11.1:202012-01
image: ubuntu-2004-cuda-11.2:202103-01
resource_class: gpu.nvidia.medium.multi


@@ -23,10 +23,11 @@ install_dep_common: &install_dep_common
command: |
source activate fairseq
pip install --upgrade setuptools
pip install bitarray boto3 deepspeed editdistance fastBPE iopath ipdb ipython pyarrow pytest sacremoses sentencepiece subword-nmt hydra-core==1.0.7 omegaconf==2.0.6
pip install bitarray boto3 deepspeed editdistance fastBPE iopath ipdb ipython pyarrow pytest sacremoses sentencepiece subword-nmt hydra-core==1.2.0 omegaconf==2.2.2
pip install statsmodels==0.12.2 more_itertools submitit boto3 editdistance transformers sklearn scipy cython Jinja2==2.11.3
pip install --progress-bar off pytest
pip install --progress-bar off fairscale
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda111 -U
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda112 -U
python -c 'import torch; print("Torch version:", torch.__version__)'
python -m torch.utils.collect_env

@@ -36,40 +37,45 @@ install_dep_fused_ops: &install_dep_fused_ops
working_directory: ~/
command: |
source activate fairseq
git clone https://github.com/NVIDIA/apex
cd apex
git checkout e2083df5eb96643c61613b9df48dd4eea6b07690
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ~/
git clone --depth=1 --branch v2.4 https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -e .

if [ ! -d "apex" ]; then
git clone https://github.com/NVIDIA/apex
cd apex
git checkout e2083df5eb96643c61613b9df48dd4eea6b07690
sed -i '101,107 s/^/#/' setup.py
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ~/
fi
if [ ! -d "Megatron-LM" ]; then
git clone --depth=1 --branch v2.4 https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -e .
cd ~/
fi

install_dep_pt19: &install_dep_pt19
install_dep_pt110: &install_dep_pt110
- run:
name: Install Pytorch Dependencies
command: |
source activate fairseq
pip install --upgrade setuptools
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
python -c 'import torch; print("Torch version:", torch.__version__)'

install_dep_pt18: &install_dep_pt18
install_dep_pt19: &install_dep_pt19
- run:
name: Install Pytorch Dependencies
command: |
source activate fairseq
pip install --upgrade setuptools
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
python -c 'import torch; print("Torch version:", torch.__version__)'

install_repo: &install_repo
- run:
name: Install Repository
command: |
source activate fairseq
pip install .
pip install -e .
python setup.py build_ext --inplace

run_unittests: &run_unittests
@@ -130,7 +136,7 @@ jobs:
- <<: *install_repo
- <<: *run_unittests

gpu_tests_pt18:
gpu_tests_pt110:
<<: *gpu

working_directory: ~/fairseq-py
@@ -141,7 +147,7 @@ jobs:
- <<: *create_conda_env
- restore_cache:
key: *cache_key
- <<: *install_dep_pt18
- <<: *install_dep_pt110
- <<: *install_dep_common
- <<: *install_dep_fused_ops
- save_cache:
@@ -155,5 +161,5 @@ workflows:
version: 2
build:
jobs:
- gpu_tests_pt18
- gpu_tests_pt19
- gpu_tests_pt110
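Reviewer note: the hunks above set `CUDA_VERSION` to "11.2" while the PyTorch wheels in `install_dep_pt110` still carry the `+cu111` tag. A minimal sketch for extracting the pins from such a pip line and listing the wheel tags that differ from the CUDA version follows. The helper names `parse_pins` and `cuda_tag` are hypothetical and not part of the repository; the sketch only reports the difference and makes no claim about whether it matters in practice.

```python
import re

# Pip command copied from the install_dep_pt110 step above.
PIP_LINE = (
    "pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 "
    "torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html"
)

# name==version with an optional "+local" tag such as "+cu111".
PIN_RE = re.compile(
    r"(?P<name>[A-Za-z0-9_.-]+)==(?P<version>[0-9][0-9A-Za-z.]*)(?:\+(?P<local>\w+))?"
)


def parse_pins(pip_line):
    """Return {package: (version, local_tag_or_None)} for every name==version token."""
    pins = {}
    for token in pip_line.split():
        match = PIN_RE.fullmatch(token)
        if match:
            pins[match["name"]] = (match["version"], match["local"])
    return pins


def cuda_tag(cuda_version):
    """Turn a CUDA_VERSION string like "11.2" into a wheel tag like "cu112"."""
    return "cu" + cuda_version.replace(".", "")


pins = parse_pins(PIP_LINE)
# Packages whose wheel tag differs from the CUDA_VERSION set in the gpu block.
mismatched = sorted(
    name
    for name, (_, local) in pins.items()
    if local is not None and local != cuda_tag("11.2")
)
print(pins)
print("wheels not tagged cu112:", mismatched)
```

Packages pinned without a local tag (here `torchaudio`) are skipped, since there is nothing to compare.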
63 changes: 63 additions & 0 deletions .github/workflows/cpu_tests.yml
@@ -0,0 +1,63 @@
name: cpu_tests

on: [push, pull_request]

jobs:
unittest:

strategy:
fail-fast: false
max-parallel: 12
matrix:
platform: [ubuntu-latest, macos-latest]
python-version: [3.8, 3.9]

runs-on: ${{ matrix.platform }}

steps:
- name: Checkout branch 🛎️
uses: actions/checkout@v2

- name: Setup Conda Environment
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: fairseq
python-version: ${{ matrix.python-version }}
auto-update-conda: true
use-only-tar-bz2: true

- name: Cache Conda Environment
uses: actions/cache@v2
env:
# Increase this value to reset cache if nothing has changed but you still
# want to invalidate the cache
CACHE_NUMBER: 0
with:
path: |
/usr/share/miniconda/envs/
/usr/local/miniconda/envs/
key: fairseq-cpu-${{ matrix.platform }}-python${{ matrix.python-version }}-${{ env.CACHE_NUMBER }}-${{ hashFiles('**/.github/workflows/cpu_tests.yml') }}-${{ hashFiles('**/setup.py') }}


- name: Install Dependencies
shell: bash -l {0}
run: |
conda activate fairseq
git submodule update --init --recursive
pip install torch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 statsmodels==0.12.2 more_itertools submitit boto3 editdistance iopath ipdb ipython pyarrow pytest sacremoses sentencepiece subword-nmt transformers sklearn scipy fairscale Jinja2==2.11.3

- name: Install Repository
shell: bash -l {0}
run: |
conda activate fairseq
python setup.py clean --all
pip install --editable .
python setup.py build_ext --inplace


- name: Run CPU tests
shell: bash -l {0}
run: |
conda activate fairseq
cd tests
pytest --continue-on-collection-errors -v .
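Reviewer note: the workflow above fans out over a 2 x 2 matrix and builds one cache key per combination from the platform, the Python version, `CACHE_NUMBER`, and two file hashes. The sketch below mimics that composition to show why editing `cpu_tests.yml` or `setup.py` invalidates the cache. `digest_files` is a hypothetical stand-in for the `hashFiles()` expression and is not guaranteed to produce the same digest as GitHub Actions; the file contents are placeholders.

```python
import hashlib
import itertools

# Matrix values copied from the workflow above.
PLATFORMS = ["ubuntu-latest", "macos-latest"]
PYTHON_VERSIONS = ["3.8", "3.9"]
CACHE_NUMBER = 0


def digest_files(contents):
    """Stand-in for hashFiles(): one SHA-256 over the per-file SHA-256 digests."""
    hasher = hashlib.sha256()
    for blob in contents:
        hasher.update(hashlib.sha256(blob).digest())
    return hasher.hexdigest()


def cache_key(platform, python_version, workflow_hash, setup_hash):
    """Assemble a key with the same field order as the `key:` line above."""
    return (
        f"fairseq-cpu-{platform}-python{python_version}-"
        f"{CACHE_NUMBER}-{workflow_hash}-{setup_hash}"
    )


# Placeholder file contents; a real run would read the files from the checkout.
workflow_hash = digest_files([b"name: cpu_tests\n"])
setup_hash = digest_files([b"# setup.py placeholder\n"])

# One key per (platform, python-version) combination in the matrix.
keys = [
    cache_key(platform, version, workflow_hash, setup_hash)
    for platform, version in itertools.product(PLATFORMS, PYTHON_VERSIONS)
]
for key in keys:
    print(key)
```

Bumping `CACHE_NUMBER` changes every key at once, which is the manual reset the workflow comment describes.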
24 changes: 5 additions & 19 deletions .github/workflows/build.yml → .github/workflows/lint.yml
@@ -1,4 +1,4 @@
name: build
name: lint_tests

on:
# Trigger the workflow on push to main or any pull request
@@ -11,10 +11,10 @@ jobs:
build:

strategy:
max-parallel: 4
max-parallel: 1
matrix:
platform: [ubuntu-latest, macos-latest]
python-version: [3.8, 3.9]
platform: [ubuntu-latest]
python-version: [3.8]

runs-on: ${{ matrix.platform }}

@@ -26,34 +26,20 @@ jobs:
with:
python-version: ${{ matrix.python-version }}

- name: Conditionally install pytorch
if: matrix.platform == 'windows-latest'
run: pip3 install torch -f https://download.pytorch.org/whl/torch_stable.html

- name: Install locally
run: |
python -m pip install --upgrade pip
git submodule update --init --recursive
python setup.py build_ext --inplace
python -m pip install --editable .

- name: Install optional test requirements
run: |
python -m pip install iopath transformers pyarrow
python -m pip install git+https://github.com/facebookresearch/fairscale.git@main
python -m pip install --editable '.[dev]'

- name: Lint with flake8
run: |
pip install flake8
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics --extend-exclude fairseq/model_parallel/megatron
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --extend-exclude fairseq/model_parallel/megatron

- name: Run tests
run: |
python setup.py test

- name: Lint with black
run: |
pip install black
10 changes: 10 additions & 0 deletions .gitignore
@@ -34,6 +34,9 @@ wheels/
# Checkpoints
checkpoints

# slurm snap shot
slurm_snapshot_code/

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
@@ -117,6 +120,10 @@ ENV/

# data
data-bin/
examples/nllb/data/non_train_datasets/
examples/nllb/data/train_datasets/
examples/nllb/data/eval_datasets/
model_checkpoints/

# reranking
/examples/reranking/rerank_data
@@ -128,6 +135,7 @@ data-bin/
# VSCODE
.vscode/ftp-sync.json
.vscode/settings.json
.vscode/launch.json

# Experimental Folder
experimental/*
@@ -139,3 +147,5 @@ wandb/
nohup.out
multirun
outputs

# data
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -13,14 +13,15 @@ repos:
- id: no-commit-to-branch
args: ['--branch=master']
- id: check-added-large-files
args: ['--maxkb=500']
args: ['--maxkb=2048']
- id: end-of-file-fixer

- repo: https://github.com/ambv/black
rev: 22.1.0
hooks:
- id: black
language_version: python3.8
additional_dependencies: ['click==8.0.4']

- repo: https://gitlab.com/pycqa/flake8
rev: 3.9.2
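Reviewer note: the `check-added-large-files` limit moves from `--maxkb=500` to `--maxkb=2048` in the hunk above. The sketch below illustrates what that threshold change means for two throwaway files. It is a simplification and not the hook's implementation: `oversized_files` is a hypothetical helper, it checks any path rather than only staged additions, and it assumes 1 kB = 1024 bytes.

```python
import os
import tempfile

MAX_KB = 2048  # Limit taken from the updated --maxkb argument above.


def oversized_files(paths, max_kb=MAX_KB):
    """Return the paths whose size on disk exceeds max_kb kilobytes."""
    return [path for path in paths if os.path.getsize(path) > max_kb * 1024]


# Demonstration with two throwaway files: 600 kB and 3000 kB.
with tempfile.TemporaryDirectory() as tmp_dir:
    small = os.path.join(tmp_dir, "small.bin")
    large = os.path.join(tmp_dir, "large.bin")
    with open(small, "wb") as handle:
        handle.write(b"\0" * 600 * 1024)
    with open(large, "wb") as handle:
        handle.write(b"\0" * 3000 * 1024)

    # Files flagged under the new limit and under the previous 500 kB limit.
    flagged_new = [os.path.basename(p) for p in oversized_files([small, large])]
    flagged_old = [
        os.path.basename(p) for p in oversized_files([small, large], max_kb=500)
    ]

print(flagged_new)
print(flagged_old)
```

Under the new limit only the 3000 kB file is flagged, whereas the old 500 kB limit would also have flagged the 600 kB file.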
13 changes: 8 additions & 5 deletions CODE_OF_CONDUCT.md
@@ -23,13 +23,13 @@ include:
Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
professional setting

## Our Responsibilities

@@ -52,10 +52,14 @@ project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <conduct@pytorch.org>. All
reported by contacting the project team at <opensource-conduct@fb.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
@@ -74,4 +78,3 @@ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.ht

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

14 changes: 8 additions & 6 deletions CONTRIBUTING.md
@@ -1,4 +1,4 @@
# Contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq)
# Contributing to fairseq
We want to make contributing to this project as easy and transparent as
possible.

@@ -14,26 +14,28 @@ We actively welcome your pull requests.

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
By contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq),
you agree that your contributions will be licensed under the LICENSE file in
the root directory of this source tree.
By contributing to fairseq, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.

## Pre-commit hooks
In order to ensure your code lints, there are pre-commit hooks configured in the repository which you can install.
After installation, they will automatically run each time you commit.
An abbreviated guide is given below; for more information, refer to [the official pre-commit documentation](https://pre-commit.com/).

### Installation
```
```bash
pip install pre-commit
pre-commit install
```