Event-based segmentation #470
(I thought there was already an issue on that but haven't found one) |
First attempt:
This segmentation works not by FSM but by an XSL stylesheet. The general approach is:
To do: take care of segmentation errors. There are two options:
|
It can take up to two hours for gigantic (INEL) transcripts, so it needs to be implemented differently |
Please also add Pending decision on hyphens... |
Also word-external: On the contrary, the usual
makes Tsakorpus unhappy. |
Different implementation (no XSL) now takes seconds instead of minutes. To do / decide: What will count as a segmentation error? Or make the algorithm accept everything? |
Not necessarily utterance terminators, since it can be in the middle of an utterance. Rather just any punctuation, e.g. |
The exception type at line 105 ("Word characters after double closing round parentheses...") is created with null instead of tierID, so when I later call FSMException.getTierID on it, the result is null as well. Could you take a look at it? |
Hi Thomas, could you please check whether (your version of) the new segmentation throws an error on events like Or maybe the jar is not the latest one? |
The last one at https://exmaralda.org/files/prevDL/webjar.html was from August 19th. I don't think the last changes were made after this, but to make sure, I updated the jar. Please try again. If it doesn't work, I would need a more complete example to reproduce the error. |
Okay, but that is as intended: double round brackets can optionally be followed by punctuation characters, but not preceded by them. Is that a request for change then? |
Yes, please. Just because of cases like this, which I think are no worse than following punctuation. Or are there reasons to disallow that? |
We have only very few cases with these quotes, but it seems to mean what it says: an unintelligible fragment within direct speech. |
Not really, I'm just worried that new rules will break old ones... I tried a fix, there is a new Windows version and a new web jar. |
Thank you, these errors no longer occur with the new jar. Besides, combinations such as somewordhere((…)). got marked as erroneous, which they are. |
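For readers following along, the rule discussed above (punctuation may now precede or follow double round brackets, but word characters directly adjacent to them are an error) can be sketched roughly as follows. This is only an illustration under assumptions: the class name, the method name, and the regular expressions are hypothetical and are not taken from the EXMARaLDA source, which may handle many more cases.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual EXMARaLDA implementation.
class DoubleParenthesesSketch {

    // Assumption: a "word character" is any Unicode letter or digit.
    // A word character directly before "((" ...
    private static final Pattern WORD_BEFORE_OPEN =
            Pattern.compile("[\\p{L}\\p{N}]\\(\\(");
    // ... or directly after "))" counts as an error in this sketch.
    private static final Pattern WORD_AFTER_CLOSE =
            Pattern.compile("\\)\\)[\\p{L}\\p{N}]");

    // Returns true if the event text has word characters glued to "((" or "))".
    static boolean hasAdjacentWordCharacters(String eventText) {
        return WORD_BEFORE_OPEN.matcher(eventText).find()
                || WORD_AFTER_CLOSE.matcher(eventText).find();
    }
}
```

Under these assumptions, an event like `somewordhere((...)).` is flagged, while `"((...))" ` with quotes around the brackets, or `((...)). ` with trailing punctuation, is accepted.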
Would it be possible to add the forward slash "/" to the list of word-external punctuation marks in the segmentator? We use slashes to (a) mark line breaks in poetic texts and to (b) occasionally provide alternative variants of some part of a sentence. Currently the segmentator throws errors ("Non-permissible sequence of word and punctuation characters in event: kanʼixubiʔ // ".) if there are slashes following whitespace. If a slash immediately follows a word, i.e. "kanʼixubiʔ/", it will be treated as a part of the word by the segmentator, which is not the intended behavior. |
Adding or removing characters from word external punctuation is not a big deal. I have added the forward slash now and made a new Windows preview and web jar. If you want to see for yourselves: the characters are defined here: exmaralda/src/org/exmaralda/partitureditor/jexmaralda/segment/InelEventBasedSegmentation.java Line 40 in 651d82b
|
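To make the effect of adding a character to the word-external punctuation list concrete, here is a minimal sketch. It is not the code from InelEventBasedSegmentation.java: the class name, the `tokenize` method, and the contents of the punctuation string are assumptions chosen for illustration, and the real list at the referenced line may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual EXMARaLDA implementation.
class PunctuationSketch {

    // Assumed example list of word-external punctuation, including the slash.
    static final String WORD_EXTERNAL_PUNCTUATION = ".,!?:;/";

    // Splits an event text into word tokens and single-character
    // punctuation tokens; whitespace only separates tokens.
    static List<String> tokenize(String eventText, String punctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < eventText.length(); i++) {
            char c = eventText.charAt(i);
            boolean isPunctuation = punctuation.indexOf(c) >= 0;
            if (isPunctuation || Character.isWhitespace(c)) {
                // A punctuation mark or whitespace ends the current word.
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (isPunctuation) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                // Anything else is treated as part of a word.
                word.append(c);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }
}
```

With the slash in the list, an event like `kanʼixubiʔ/` yields the word and a separate `/` token. If the slash is left out of the list, the same input yields a single token with the slash attached to the word, which is the unintended behavior described above.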
Thank you. Do I take it right that it would be perfectly fine if I commit any further changes to the external punctuation list myself? |
Yes, go ahead:-) But please do it in a branch and send a merge request |
Hi Thomas, |
Okay, I made this a separate issue and will check: |
From May 2023 meeting minutes:
TS: explore the possibility of event-based segmentation. That would eliminate the need for HIAT-based FSM segmentation and allow more flexible transcription conventions
@ts Note: current transcription conventions vary to some extent between INEL corpora but the core is documented in https://doi.org/10.14232/wpcl.2020.5 (with a summary of symbols at the end)