This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Fix][Docker] Fix the docker image + Fix pretrain_corpus document. (#1378)

* update

* Update ubuntu18.04-devel-gpu.Dockerfile

* fix the docker image

* Update README.md

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update README.md

* fix readme

* Add CPU DockerFile

* update

* update

* Update ubuntu18.04-devel-gpu.Dockerfile

* update

* prepare to add TVM to docker

* try to update

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update install_openmpi.sh

* update

* Create install_llvm.sh

* Update ubuntu18.04-base-gpu.Dockerfile

* Update ubuntu18.04-base-gpu.Dockerfile

* Update run_squad2_albert_base.sh

* Update prepare_squad.py

* Update prepare_squad.py

* Update prepare_squad.py

* fix

* Update README.md

* update

* update

* Update README.md

* Update README.md

* Update ubuntu18.04-devel-gpu.Dockerfile

* update

* Update README.md

* fix

* Update ubuntu18.04-base-cpu.Dockerfile

* update

* add tvm to lazy import

* update

* Update README.md

* update

* Update README.md

* Update run_squad2_albert_base.sh

* update

* update

* update

* update

* update

* Update README.md

* Update install_ubuntu18.04_core.sh

* update

* update

* update

* fix

* Update README.md

* Update run_batch_squad.sh

* update

* Update run_batch_squad.sh

* Update run_batch_squad.sh

* update

* Update README.md

* fix

* Update gluon_nlp_job.sh

* update

* Update README.md

* Update README.md

* Update README.md

* update

* Update README.md

* update

* Update install_python_packages.sh

* Update install_llvm.sh

* Update install_python_packages.sh

* Update install_llvm.sh

* update

* Update install_ubuntu18.04_core.sh

* fix

* Update submit-job.py

* Update submit-job.py

* Update README.md

* Update README.md

* Update prepare_gutenberg.py

* Delete gluon_nlp_cpu_job.sh

* Update prepare_gutenberg.py

* Update prepare_gutenberg.py

* Update prepare_gutenberg.py

* Update conf.py

* update

* Update generate_commands.py

* fix readme

* use os.link for hard link

* Update README.md

* Update README.md

* Update gluon_nlp_job.sh

* Update __init__.py

* Update benchmark_utils.py

* try to use multi-stage build

* Update benchmark_utils.py

* multi-stage build

* Update README.md

* Update README.md

* update

* Update submit-job.py

* fix documentation

* fix

* update

* Update test.sh

* Update test.sh

* Update test.sh

* Update test.sh

* Update README.md

* Update test.sh

* fix

* Update README.md

* Update gluon_nlp_job.sh
sxjscience authored Oct 15, 2020
1 parent d60dae3 commit 02c0ef8
Showing 51 changed files with 1,186 additions and 558 deletions.
15 changes: 10 additions & 5 deletions README.md
@@ -34,16 +34,16 @@ First of all, install the latest MXNet. You may use the following commands:

```bash
# Install the version with CUDA 10.0
-python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.1
-python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
-python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the cpu-only version
-python3 -m pip install -U --pre "mxnet>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet>=2.0.0b20200926" -f https://dist.mxnet.io/python
```


@@ -92,8 +92,13 @@ You may go to [tests](tests) to see how to run the unittests.
You can use Docker to launch a JupyterLab development environment with GluonNLP installed.

```
+# GPU Instance
docker pull gluonai/gluon-nlp:gpu-latest
-docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=4g gluonai/gluon-nlp:gpu-latest
+docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:gpu-latest
+# CPU Instance
+docker pull gluonai/gluon-nlp:cpu-latest
+docker run --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:cpu-latest
```

For more details, you can refer to the guidance in [tools/docker](tools/docker).
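
Reviewer note: `--gpus all` asks Docker to expose the host's NVIDIA GPUs to the container, so it only makes sense for the GPU image. The sketch below builds the two `docker run` command lines from one place, under the assumption that the CPU image is started without that flag. The helper name `docker_run_args` and its parameters are hypothetical and not part of this commit; the ports and image tags are the ones shown in the README.

```python
import shlex


def docker_run_args(image, use_gpu, shm_size='2g', ports=(8888, 8787, 8786)):
    """Build the argument list for launching a GluonNLP JupyterLab container."""
    args = ['docker', 'run']
    if use_gpu:
        # Expose the host GPUs; assumed unnecessary for the CPU image.
        args += ['--gpus', 'all']
    args += ['--rm', '-it']
    for port in ports:
        # Publish each container port on the same host port.
        args += ['-p', f'{port}:{port}']
    args += [f'--shm-size={shm_size}', image]
    return args


# Print the CPU command line; pass the list to subprocess.run to actually launch it.
print(shlex.join(docker_run_args('gluonai/gluon-nlp:cpu-latest', use_gpu=False)))
```

Keeping the arguments as a list avoids shell-quoting problems if the command is later passed to `subprocess.run`.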
8 changes: 4 additions & 4 deletions docs/conf.py
@@ -234,10 +234,10 @@ def setup(app):
         'auto_doc_ref': True
     }, True)
     app.add_transform(AutoStructify)
-    app.add_javascript('google_analytics.js')
-    app.add_javascript('hidebib.js')
-    app.add_javascript('install-options.js')
-    app.add_stylesheet('custom.css')
+    app.add_js_file('google_analytics.js')
+    app.add_js_file('hidebib.js')
+    app.add_js_file('install-options.js')
+    app.add_css_file('custom.css')


 sphinx_gallery_conf = {
9 changes: 1 addition & 8 deletions scripts/benchmarks/benchmark_utils.py
@@ -792,12 +792,9 @@ def train_step():
                 raise NotImplementedError
         timeit.repeat(train_step, repeat=1, number=3)
         mxnet.npx.waitall()
-        for ctx in mx_all_contexts:
-            ctx.empty_cache()
         runtimes = timeit.repeat(train_step, repeat=self._repeat, number=3)
         mxnet.npx.waitall()
-        for ctx in mx_all_contexts:
-            ctx.empty_cache()
+        ctx.empty_cache()
         mxnet.npx.waitall()
         # Profile memory
         if self._use_gpu:
@@ -844,8 +841,6 @@ def run(self):
                         infer_time = np.nan
                         infer_memory = np.nan
                     inference_result[model_name][workload] = (infer_time, infer_memory)
-                    for ctx in mx_all_contexts:
-                        ctx.empty_cache()
                     mxnet.npx.waitall()
                     self.save_to_csv(inference_result, self._inference_out_csv_file)
         if self._profile_train:
@@ -858,8 +853,6 @@ def run(self):
                         train_time = np.nan
                         train_memory = np.nan
                     train_result[model_name][workload] = (train_time, train_memory)
-                    for ctx in mx_all_contexts:
-                        ctx.empty_cache()
                     mxnet.npx.waitall()
                     self.save_to_csv(train_result, self._train_out_csv_file)
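
Reviewer note: the benchmark keeps its warm-up-then-measure pattern. One throwaway `timeit.repeat` call comes first, followed by the measured repeats, each of which runs the step `number=3` times. The sketch below shows that pattern with the standard library only. `measure_step` and `fake_train_step` are hypothetical names, the workload is a pure-Python placeholder rather than an MXNet training step, and reporting the minimum is a common convention assumed here, not something this diff confirms. The real code also calls `mxnet.npx.waitall()`, presumably so that asynchronous work has finished before results are used.

```python
import timeit


def measure_step(step_fn, repeat=3, number=3):
    """Return the best observed per-call time of step_fn, in seconds."""
    # Warm-up run: the result is discarded so one-off setup costs are not measured.
    timeit.repeat(step_fn, repeat=1, number=number)
    # Each entry is the total time for `number` consecutive calls.
    runtimes = timeit.repeat(step_fn, repeat=repeat, number=number)
    return min(runtimes) / number


def fake_train_step():
    # Placeholder workload standing in for a real training step.
    sum(i * i for i in range(10_000))
```

Dividing by `number` matters: each value returned by `timeit.repeat` covers all `number` calls, not a single one.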
14 changes: 7 additions & 7 deletions scripts/datasets/general_nlp_benchmark/README.md
@@ -112,13 +112,13 @@ benchmarking. We select the classical datasets that are also used in

| Dataset | #Train | #Test | Columns | Metrics |
|---------------|---------|---------|-----------------|-----------------|
-| AG | 120000 | 7600 | content, label | acc |
-| IMDB | 25000 | 25000 | content, label | acc |
-| DBpedia | 560000 | 70000 | content, label | acc |
-| Yelp2 | 560000 | 38000 | content, label | acc |
-| Yelp5 | 650000 | 50000 | content, label | acc |
-| Amazon2 | 3600000 | 400000 | content, label | acc |
-| Amazon5 | 3000000 | 650000 | content, label | acc |
+| AG | 120,000 | 7,600 | content, label | acc |
+| IMDB | 25,000 | 25,000 | content, label | acc |
+| DBpedia | 560,000 | 70,000 | content, label | acc |
+| Yelp2 | 560,000 | 38,000 | content, label | acc |
+| Yelp5 | 650,000 | 50,000 | content, label | acc |
+| Amazon2 | 3,600,000 | 400,000 | content, label | acc |
+| Amazon5 | 3,000,000 | 650,000 | content, label | acc |

To obtain the datasets, run:
12 changes: 8 additions & 4 deletions scripts/datasets/pretrain_corpus/README.md
@@ -2,9 +2,11 @@

We provide a series of shared scripts for downloading/preparing the text corpus for pretraining NLP models.
This helps create a unified text corpus for studying the performance of different pretraining algorithms.
-When releasing the datasets, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
+When picking the datasets to support, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
i.e., the dataset needs to be findable, accessible, interoperable, and reusable.

+For all scripts, we can either use `nlp_data SCRIPT_NAME`, or directly call the script.

## Gutenberg BookCorpus
Unfortunately, we are unable to provide the [Toronto BookCorpus dataset](https://yknzhu.wixsite.com/mbweb) due to licensing issues.

@@ -16,14 +18,14 @@ Thus, we utilize the [Project Gutenberg](https://www.gutenberg.org/) as an alter
You can use the following command to download and prepare the Gutenberg corpus.

```bash
-python3 prepare_bookcorpus.py --dataset gutenberg
+python3 prepare_gutenberg.py --save_dir gutenberg
```

Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) for using the data.

## Wikipedia

-Please install [attardi/wikiextractor](https://github.com/attardi/wikiextractor) for preparing the data.
+We used the [attardi/wikiextractor](https://github.com/attardi/wikiextractor) package for preparing the data.

```bash
# Download
@@ -33,7 +35,9 @@ python3 prepare_wikipedia.py --mode download --lang en --date latest -o ./
python3 prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./

```
-The process of downloading and formatting is time consuming, and we offer an alternative solution to download the prepared raw text file from S3 bucket. This raw text file is in English and was dumped at 2020-06-20 being formated by the above very process (` --lang en --date 20200620`).
+Downloading and formatting are time-consuming, so we also offer the prepared raw text file
+for download from an S3 bucket. This raw text file is in English; it was dumped on 2020-06-20
+and formatted by the above process (`--lang en --date 20200620`).

```bash
python3 prepare_wikipedia.py --mode download_prepared -o ./
7 changes: 5 additions & 2 deletions scripts/datasets/pretrain_corpus/prepare_gutenberg.py
@@ -3,7 +3,7 @@
 import zipfile
 from gluonnlp.base import get_data_home_dir
 from gluonnlp.utils.misc import download, load_checksum_stats
-
+import shutil

 _CITATIONS = r"""
 @InProceedings{lahiri:2014:SRW,
@@ -59,11 +59,14 @@ def main(args):
     save_dir = args.dataset if args.save_dir is None else args.save_dir
     if not os.path.exists(save_dir):
         os.makedirs(save_dir, exist_ok=True)
+    print(f'Save to {save_dir}')
     with zipfile.ZipFile(target_download_location) as f:
         for name in f.namelist():
             if name.endswith('.txt'):
                 filename = os.path.basename(name)
-                f.extract(name, os.path.join(save_dir, filename))
+                with f.open(name) as in_file:
+                    with open(os.path.join(save_dir, filename.replace(' ', '_')), 'wb') as out_file:
+                        shutil.copyfileobj(in_file, out_file)


 def cli_main():
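
Reviewer note: the replaced call, `f.extract(name, os.path.join(save_dir, filename))`, passes a second argument that `ZipFile.extract` treats as the directory to extract into. Each book would therefore land under a directory named after the file, with the archive's internal folders recreated below it. The new code streams each member to a flat path instead. The following is a minimal, self-contained sketch of that approach; the function name `extract_txt_flat` is hypothetical and the sketch is not the script's actual `main`.

```python
import os
import shutil
import zipfile


def extract_txt_flat(zip_path, save_dir):
    """Copy every .txt member of zip_path directly into save_dir, ignoring archive folders."""
    os.makedirs(save_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith('.txt'):
                continue
            # Keep only the base name and replace spaces, as the hunk above does.
            out_name = os.path.basename(name).replace(' ', '_')
            out_path = os.path.join(save_dir, out_name)
            # Stream the member to disk without recreating its directory structure.
            with archive.open(name) as in_file, open(out_path, 'wb') as out_file:
                shutil.copyfileobj(in_file, out_file)
            written.append(out_path)
    return written
```

One caveat of flattening: two members with the same base name in different archive folders would overwrite each other.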
8 changes: 5 additions & 3 deletions scripts/datasets/question_answering/README.md
@@ -1,5 +1,6 @@
# Question Answering


## SQuAD
SQuAD datasets are distributed under the [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

@@ -39,7 +40,7 @@ python3 prepare_searchqa.py
nlp_data prepare_searchqa
```

-Directory structure of the searchqa dataset will be as follows
+Directory structure of the SearchQA dataset will be as follows
```
searchqa
├── train.txt
@@ -48,9 +49,10 @@ searchqa
```

## TriviaQA
-[TriviaQA](https://nlp.cs.washington.edu/triviaqa/) is an open domain QA dataset. See more useful scripts in [Offical Github](https://github.com/mandarjoshi90/triviaqa)
+[TriviaQA](https://nlp.cs.washington.edu/triviaqa/) is an open domain QA dataset.
+See more useful scripts in the [official GitHub repository](https://github.com/mandarjoshi90/triviaqa).

-Run the following command to download triviaqa
+Run the following command to download TriviaQA

```bash
python3 prepare_triviaqa.py --version rc # Download TriviaQA version 1.0 for RC (2.5G)
8 changes: 4 additions & 4 deletions scripts/datasets/question_answering/prepare_searchqa.py
@@ -1,7 +1,7 @@
 import os
 import argparse
 from gluonnlp.utils.misc import download, load_checksum_stats
-from gluonnlp.base import get_data_home_dir
+from gluonnlp.base import get_data_home_dir, get_repo_url

 _CURR_DIR = os.path.realpath(os.path.dirname(os.path.realpath(__file__)))
 _BASE_DATASET_PATH = os.path.join(get_data_home_dir(), 'searchqa')
@@ -20,9 +20,9 @@
 """

 _URLS = {
-    'train': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/train.txt',
-    'val': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/val.txt',
-    'test': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/test.txt'
+    'train': get_repo_url() + 'datasets/question_answering/searchqa/train.txt',
+    'val': get_repo_url() + 'datasets/question_answering/searchqa/val.txt',
+    'test': get_repo_url() + 'datasets/question_answering/searchqa/test.txt'
 }
17 changes: 11 additions & 6 deletions scripts/datasets/question_answering/prepare_squad.py
@@ -1,5 +1,6 @@
 import os
 import argparse
+import shutil
 from gluonnlp.utils.misc import download, load_checksum_stats
 from gluonnlp.base import get_data_home_dir

@@ -58,14 +59,18 @@ def main(args):
     download(dev_url, path=os.path.join(args.cache_path, dev_file_name))
     if not os.path.exists(args.save_path):
         os.makedirs(args.save_path)
-    if not os.path.exists(os.path.join(args.save_path, train_file_name))\
+    if not os.path.exists(os.path.join(args.save_path, train_file_name)) \
             or (args.overwrite and args.save_path != args.cache_path):
-        os.symlink(os.path.join(args.cache_path, train_file_name),
-                   os.path.join(args.save_path, train_file_name))
-    if not os.path.exists(os.path.join(args.save_path, dev_file_name))\
+        os.link(os.path.join(args.cache_path, train_file_name),
+                os.path.join(args.save_path, train_file_name))
+    else:
+        print(f'Found {os.path.join(args.save_path, train_file_name)}...skip')
+    if not os.path.exists(os.path.join(args.save_path, dev_file_name)) \
             or (args.overwrite and args.save_path != args.cache_path):
-        os.symlink(os.path.join(args.cache_path, dev_file_name),
-                   os.path.join(args.save_path, dev_file_name))
+        os.link(os.path.join(args.cache_path, dev_file_name),
+                os.path.join(args.save_path, dev_file_name))
+    else:
+        print(f'Found {os.path.join(args.save_path, dev_file_name)}...skip')


 def cli_main():
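
Reviewer note: switching from `os.symlink` to `os.link` makes the file in `save_path` a hard link to the cached download rather than a pointer to its path. Two properties of `os.link` are worth keeping in mind. It raises `FileExistsError` when the destination already exists, which the `--overwrite` branch above may run into, and it fails with `OSError` when source and destination are on different filesystems. The sketch below handles both cases. `link_or_copy` is a hypothetical helper, and the copy fallback is an assumption added for illustration, not behaviour of this commit.

```python
import os
import shutil


def link_or_copy(src, dst, overwrite=False):
    """Hard-link src to dst, copying instead if a hard link cannot be created.

    Returns 'skipped', 'linked', or 'copied'.
    """
    if os.path.exists(dst):
        # Never delete dst when it already is the same file as src.
        if not overwrite or os.path.samefile(src, dst):
            return 'skipped'
        # os.link refuses to replace an existing path, so remove it first.
        os.remove(dst)
    try:
        os.link(src, dst)
        return 'linked'
    except OSError:
        # For example, src and dst live on different filesystems.
        shutil.copy2(src, dst)
        return 'copied'
```

Unlike a symlink, a hard link stays valid if the cache file is later moved or removed, which is likely relevant when the cache and the save directory are mounted separately inside a container.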
6 changes: 3 additions & 3 deletions scripts/datasets/url_checksums/searchqa.txt
@@ -1,3 +1,3 @@
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/train.txt c7e1eb8c34d0525547b91e18b3f8f4d855e35c16 1226681217
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/test.txt 08a928e0f8c129d5b3ca43bf46df117e38be0c27 332064988
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/val.txt c2f65d6b83c26188d5998ab96bc6a38c1a127fcc 170835902
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/train.txt c7e1eb8c34d0525547b91e18b3f8f4d855e35c16 1226681217
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/test.txt 08a928e0f8c129d5b3ca43bf46df117e38be0c27 332064988
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/val.txt c2f65d6b83c26188d5998ab96bc6a38c1a127fcc 170835902
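
Reviewer note: each line of the checksum file carries three whitespace-separated fields: the URL, a 40-character hexadecimal digest, and the size in bytes. Because the URL is part of the record, the entries have to change whenever the download location does, as in this hunk. The sketch below parses one such line and checks a local file against it. It assumes the digest is SHA-1, which the 40-character length suggests but the diff does not state. `parse_checksum_line` and `verify_file` are hypothetical helpers, not GluonNLP's own `load_checksum_stats`.

```python
import hashlib
import os


def parse_checksum_line(line):
    """Split 'URL HEXDIGEST SIZE' into (url, digest, size_in_bytes)."""
    url, digest, size = line.split()
    return url, digest, int(size)


def verify_file(path, digest, size, chunk_size=1 << 20):
    """Check a local file against an expected SHA-1 digest and byte size."""
    # Cheap check first: a size mismatch means the hash cannot match either.
    if os.path.getsize(path) != size:
        return False
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        # Hash in chunks so large corpora do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest() == digest
```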
4 changes: 2 additions & 2 deletions scripts/machine_translation/README.md
@@ -7,8 +7,8 @@ to generate the dataset. Then, run `train_transformer.py` to train the model.
In the following, we give the training script for WMT2014 EN-DE task with yttm tokenizer.
You may first run the following command in [datasets/machine_translation](../datasets/machine_translation).
```bash
-bash ../datasets/machine_translation/wmt2014_ende_base.sh yttm (For transformer_base config)
-bash ../datasets/machine_translation/wmt2014_ende.sh yttm (For transformer_wmt_en_de_big config)
+bash ../datasets/machine_translation/wmt2014_ende_base.sh yttm # (For transformer_base config)
+bash ../datasets/machine_translation/wmt2014_ende.sh yttm # (For transformer_wmt_en_de_big config)
```

Then, you can run the experiment.
8 changes: 8 additions & 0 deletions scripts/question_answering/commands/README.md
@@ -0,0 +1,8 @@
+# Commands For Training on SQuAD
+
+All commands are generated by parsing the template in [run_squad.template](run_squad.template).
+To generate all commands, use the following code.
+
+```bash
+python3 generate_commands.py
+```