To prepare the Python environment:

```shell
poetry install
```
You need to prepare the config files and the `.env` file:

- Copy the base config file and edit `work_dir`:

  ```shell
  cp configs/base_template.yaml configs/base.yaml
  ```

- Create a `.env` file and set `DATA_DIR`:

  ```shell
  echo DATA_DIR="/path/to/data_dir" >> .env
  ```
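Taken together, the setup steps above can be sketched in an isolated directory. This is a hypothetical dry run: the template contents and all paths are placeholders, not the real `base_template.yaml`.

```shell
# Sketch of the setup steps in a throwaway directory (all paths are placeholders)
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p configs
printf 'work_dir: /path/to/work_dir\n' > configs/base_template.yaml  # stand-in for the real template
cp configs/base_template.yaml configs/base.yaml                      # then edit work_dir by hand
echo 'DATA_DIR="/path/to/data_dir"' >> .env
# Verify that both files are in place before training
test -f configs/base.yaml && grep -q DATA_DIR .env && echo "setup ok"
```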
For morphological analysis, you need to convert JumanDIC in advance with the following commands:

```shell
cd /path/to/JumanDIC
git checkout kwja
make kwja
```

and

```shell
poetry run python scripts/preprocessors/preprocess_jumandic.py \
  --input-dir /path/to/JumanDIC \
  --output-dir /path/to/dic_dir
```

Options:

- `--input-dir, -i`: path to the JumanDIC directory.
- `--output-dir, -o`: path to a directory where the processed data are saved.
You must preprocess the Japanese Wikipedia Typo Dataset:

```shell
poetry run python scripts/preprocessors/preprocess_typo.py \
  --input-dir "/path/to/unzipped_typo_dataset_dir"
```

Options:

- `--output-dir, -o`: path to the output directory. Default: `./data`
- `--num-valid-samples, -n`: number of validation examples. Default: `1000`
`scripts/build_datasets.sh` formats KWDLC and the annotated FKC corpus:

```shell
./scripts/build_datasets.sh \
  --jobs 2 \
  --out-dir /path/to/output_dir
```

Options:

- `--jobs`: number of parallel jobs
- `--out-dir`: path to the output directory
NOTE: To train the word module on the Kyoto University Text Corpus, you must have access to it and to the IREX CRL named entity data. If you have access to both, you can format the corpus with the following commands (you may need preprocessing to format the IREX CRL named entity data):

```shell
poetry run python scripts/build_dataset.py \
  ./KyotoCorpus/knp \
  ./kyoto/knp \
  --ne-tags ./IREX_CRL_NE_data.jmn \
  -j 2
poetry run kyoto idsplit \
  --corpus-dir kyoto/knp \
  --output-dir kyoto \
  --train KyotoCorpus/id/full/train.id \
  --valid KyotoCorpus/id/full/dev.id \
  --test KyotoCorpus/id/full/test.id
poetry run python scripts/build_dataset.py \
  ./KyotoCorpus/knp \
  ./kyoto_ed \
  --id ./KyotoCorpus/id/syntax-only \
  -j 32
```
You can train and test the models with the following command:

```shell
# For training and evaluating the word segmenter
poetry run python scripts/train.py -cn char_module devices=[0,1]
```
If you only want to run evaluation after training, use the following command:

```shell
# For evaluating the word segmenter
poetry run python scripts/test.py module=char checkpoint_path="/path/to/checkpoint" devices=[0]
```
```shell
# For debugging the word segmenter
poetry run python scripts/train.py -cn char_module.debug
```
If you are on a machine with MPS devices (e.g. Apple M1), specify `trainer=cpu.debug` to use the CPU:

```shell
# For debugging the word segmenter
poetry run python scripts/train.py -cn char_module.debug trainer=cpu.debug
```
If you are on a machine with GPUs, you can specify the GPUs to use with the `devices` option:

```shell
# For debugging the word segmenter
poetry run python scripts/train.py -cn char_module.debug devices=[0]
```
To run the test suite:

```shell
poetry run pytest
```
- Checkout the `dev` branch
- Make sure the new version is supported in the `_get_model_version` function in `src/kwja/cli/utils.py`
- Update `CHANGELOG.md`
- Edit `pyproject.toml` to update `tool.poetry.version`
- Update dependencies (edit `pyproject.toml` if necessary):

  ```shell
  poetry update
  ```

- Push changes to the `dev` branch and create a pull request to the `main` branch
- If CI passes, merge the pull request
- Checkout the `main` branch and pull changes
- Add a new tag and push changes:

  ```shell
  git tag -a v0.1.0 -m "Release v0.1.0"
  git push --follow-tags
  ```

- Publish to PyPI:

  ```shell
  poetry build
  poetry publish [--username $PYPI_USERNAME] [--password $PYPI_PASSWORD]
  ```

- Rebase the `dev` branch onto the `main` branch:

  ```shell
  git checkout dev
  git rebase main
  git push
  ```
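The tagging step above can be rehearsed safely in a throwaway repository before touching the real one. This is a sketch only: the version number and commit are placeholders, and `git push --follow-tags` is omitted because the throwaway repo has no remote.

```shell
# Dry run of the release-tagging step in a throwaway repo (v0.1.0 is a placeholder)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"
git tag -a v0.1.0 -m "Release v0.1.0"  # annotated tag, as in the procedure above
git tag --list 'v*'                    # prints: v0.1.0
```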