how can i get the data set <seq_lab_corpus> #1

Open
aa452948257 opened this issue Nov 1, 2021 · 6 comments

Comments

@aa452948257

No description provided.

@aa452948257
Author

Can you send me the Noisy Sequence Labeling Data Set? I cannot get the right data by following the README instructions.

@mnamysl
Owner

mnamysl commented Nov 2, 2021

Hi @aa452948257, thank you for your issue report.

Unfortunately, because of licensing/copyright reasons, I cannot send you the data set directly. Following the instructions in README.md, you need to download the original data set and restore the noisy annotations.

Which original data set did you use? What exact error message did you get?

@aa452948257
Author

aa452948257 commented Nov 2, 2021 via email

@mnamysl
Owner

mnamysl commented Nov 2, 2021

After downloading the original data set, please move it to the resources/tasks sub-directory. If you downloaded both original data sets, its content should look like this:

tasks/
├── conll_03
│   ├── dev.txt
│   ├── test.txt
│   └── train.txt
└── ud_english
    ├── en_ewt-ud-dev.conllu
    ├── en_ewt-ud-test.conllu
    └── en_ewt-ud-train.conllu
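
Before running the conversion, it can help to confirm that the layout above is actually in place. The following is only a minimal sketch: the file names are taken from the listing above, while `EXPECTED` and `missing_files` are hypothetical names that are not part of the repository.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED = {
    "conll_03": ["dev.txt", "test.txt", "train.txt"],
    "ud_english": [
        "en_ewt-ud-dev.conllu",
        "en_ewt-ud-test.conllu",
        "en_ewt-ud-train.conllu",
    ],
}


def missing_files(tasks_dir, expected=EXPECTED):
    """Return the expected files that are absent under tasks_dir."""
    tasks_dir = Path(tasks_dir)
    return [
        tasks_dir / corpus / name
        for corpus, names in expected.items()
        for name in names
        if not (tasks_dir / corpus / name).is_file()
    ]
```

Calling `missing_files("resources/tasks")` from the repository root should return an empty list when both original data sets are present. If only one data set was downloaded, pass a reduced `expected` mapping.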

Let's assume that we want to restore the noisy CoNLL data sets. To achieve this, we first need to call the conversion script as follows:

python3 main.py --mode ds_restore --corpus conll03_en

We can validate the checksum by calling:

python3 main.py --mode ds_check --corpus conll03_en

The output should look like this:

...
2021-11-02 16:12:32,714 tess3_01: True
2021-11-02 16:12:32,727 tess4_01: True
2021-11-02 16:12:32,736 tess4_02: True
2021-11-02 16:12:32,744 tess4_03: True
2021-11-02 16:12:32,750 typos: True
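
With several noisy variants per corpus, a failed checksum is easy to overlook in the log. The sketch below scans captured output for lines in the format shown above. It assumes that a failed check is printed as `False` in the same format, which this thread does not show; `failed_checks` is a hypothetical helper, not part of the repository.

```python
import re

# Matches lines such as "2021-11-02 16:12:32,714 tess3_01: True".
LINE = re.compile(r"(\S+): (True|False)\s*$")


def failed_checks(log_text):
    """Return the names whose checksum line ends in False (assumed format)."""
    failed = []
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match and match.group(2) == "False":
            failed.append(match.group(1))
    return failed
```

An empty result means every line that matched the pattern reported `True`. Lines that do not match the pattern are ignored.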

The conversion results are stored in the resources/conversion/conll03_en_* directories. To use the generated noisy data sets for evaluation, copy the files with the _restored suffix to the resources/tasks folder. After completing these steps, the structure of the resources/tasks directory should look as follows:

test/resources/tasks/
├── conll_03
│   ├── dev.txt
│   ├── test.txt
│   └── train.txt
├── conll03_en_tess3_01
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_01
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_02
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_03
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_typos
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
└── ud_english
    ├── en_ewt-ud-dev.conllu
    ├── en_ewt-ud-test.conllu
    └── en_ewt-ud-train.conllu
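
The copy step described above can be scripted rather than done by hand. This is a rough sketch, assuming each resources/conversion/conll03_en_* directory contains the *_restored.txt files directly and that the target directory keeps the same name, as the listing above suggests. `copy_restored` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def copy_restored(conversion_dir, tasks_dir, corpus="conll03_en"):
    """Copy *_restored.txt files from each <corpus>_* conversion directory
    into a same-named directory under tasks_dir. Returns the copied paths."""
    conversion_dir, tasks_dir = Path(conversion_dir), Path(tasks_dir)
    copied = []
    for src_dir in sorted(conversion_dir.glob(f"{corpus}_*")):
        if not src_dir.is_dir():
            continue
        for src in sorted(src_dir.glob("*_restored.txt")):
            dst_dir = tasks_dir / src_dir.name
            # Create the target directory only when there is a file to copy.
            dst_dir.mkdir(parents=True, exist_ok=True)
            copied.append(Path(shutil.copy2(src, dst_dir / src.name)))
    return copied
```

For example, `copy_restored("resources/conversion", "resources/tasks")` run from the repository root would copy only the restored files and leave any other conversion output where it is.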

I hope it helps :-)

@aa452948257
Author

aa452948257 commented Nov 5, 2021 via email

@mnamysl
Owner

mnamysl commented Nov 9, 2021

Thank you for your feedback. Does the same problem also occur with the UD English EWT data set?
