Preprocessing - inconsistent tokenization #11

leonardmq · 2019-05-28T19:17:36Z

Hi there,

I used your data_script.py to read the ACE dataset. It works pretty well, but many tokens are not segmented properly: word boundaries end up in the wrong place. This is even more of an issue because a large number of the incorrectly tokenized words are event triggers.

Here are a few examples:

corres pondencebetwee n his estranged wife and partners while she worked at the law firm 's

hamas and other palestinian militias carried out five suicid ebombing s tha tkille d 12 israelis an dwounde d dozens abu amr said the militants apparently unleashed

military support u s officials speaking on the record deny there are firm plans t opul l troops from germany but senior diplomats and officials say privately that it is being

As a result, I get plenty of event triggers such as ewa (which I take to be a badly segmented war) or opoiso (instead of poison), along with many similar tokens.

I suspect the issue comes from the character offsets provided by the dataset, which I found difficult to work with, if not outright inconsistent.
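To narrow this down, a check along the following lines might help. It is only a rough sketch, not taken from data_script.py: it assumes each annotation gives a start offset, an inclusive end offset, and the annotated string, and that the offsets index into the same text the script tokenizes. The names check_offsets and find_shift are made up for illustration.

```python
def check_offsets(text, mentions):
    """Return the mentions whose annotated string does not match the text span.

    Assumption: each mention is (start, end, expected) with an INCLUSIVE end offset.
    """
    mismatches = []
    for start, end, expected in mentions:
        actual = text[start:end + 1]
        if actual != expected:
            mismatches.append((start, end, expected, actual))
    return mismatches


def find_shift(text, start, expected, max_shift=5):
    """Look for a small offset shift that makes the span match the annotated string.

    Returns the shift (0 means the offset was already right), or None if no
    shift within +/- max_shift characters produces a match.
    """
    # Try the smallest shifts first: 0, -1, 1, -2, 2, ...
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        s = start + shift
        if s >= 0 and text[s:s + len(expected)] == expected:
            return shift
    return None


if __name__ == "__main__":
    text = "the war began"
    # The second mention is deliberately off by one character.
    mentions = [(4, 6, "war"), (3, 5, "war")]
    for start, end, expected, actual in check_offsets(text, mentions):
        shift = find_shift(text, start, expected)
        print(f"expected {expected!r}, got {actual!r}, shift={shift}")
```

If most mismatches share the same shift, or the shift grows steadily through a document, that would point to the offsets being counted over a different version of the text (for example with or without markup or newlines) rather than to the tokenizer itself.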

Did you have similar issues when preprocessing the set?
