Hi there,

I used your data_script.py to read the ACE dataset. It works pretty well, but many tokens are not segmented properly, which is all the more of a problem because a large number of the incorrectly tokenized words are event triggers.
Here are a few examples:
corres pondencebetwee n his estranged wife and partners while she worked at the law firm 's
hamas and other palestinian militias carried out five suicid ebombing s tha tkille d 12 israelis an dwounde d dozens abu amr said the militants apparently unleashed
military support u s officials speaking on the record deny there are firm plans t opul l troops from germany but senior diplomats and officials say privately that it is being
So I end up with plenty of event triggers such as `ewa` (which I take to be a badly segmented `war`) or `opoiso` (instead of `poison`), and plenty of similar tokens.
I suspect the issue comes from the character offsets provided by the dataset, which I found difficult to work with, if not outright inconsistent.
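To narrow this down, a check along the following lines might help. It is only a rough sketch: `strip_tags` and `find_offset_shift` are hypothetical helpers (not part of data_script.py), and it assumes that the offsets count characters with the SGML tags removed and that the end offset is inclusive. If either assumption is wrong for this dataset, the reported shifts should at least make that visible.

```python
import re

# Assumption: markup is simple SGML-style tags such as <DOC> or <TEXT>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(raw):
    """Remove tags but keep every other character (spaces, newlines)."""
    return TAG_RE.sub("", raw)


def find_offset_shift(text, start, end, expected, max_shift=20):
    """Find how far an annotated span is displaced from its expected string.

    `start` and `end` are the annotated character offsets (end inclusive,
    by assumption). Returns the smallest shift s, in absolute value, such
    that text[start + s : end + s + 1] == expected, or None if no shift
    within max_shift matches.
    """
    length = end - start + 1
    for magnitude in range(max_shift + 1):
        shifts = (0,) if magnitude == 0 else (magnitude, -magnitude)
        for shift in shifts:
            lo = start + shift
            if lo < 0:
                continue
            if text[lo:lo + length] == expected:
                return shift
    return None


if __name__ == "__main__":
    raw = "<DOC><TEXT>the war began</TEXT></DOC>"
    text = strip_tags(raw)
    # A correct annotation and one that is displaced by two characters.
    for start, end in [(4, 6), (2, 4)]:
        shift = find_offset_shift(text, start, end, "war")
        print(start, end, repr(text[start:end + 1]), "shift:", shift)
```

A shift that is constant within a document would point to a systematic problem (for example, how tags or newlines are counted), whereas shifts that vary from trigger to trigger would suggest the offsets themselves are inconsistent.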
Did you have similar issues when preprocessing the set?