Event-based segmentation #470

sarkipo · 2024-04-19T15:49:33Z

From May 2023 meeting minutes:

TS: explore the possibility of event-based segmentation. That would eliminate the need for HIAT-based FSM segmentation and allow more flexible transcription conventions
@ts Note: current transcription conventions vary to some extent between INEL corpora but the core is documented in https://doi.org/10.14232/wpcl.2020.5 (with a summary of symbols in the end)

sarkipo · 2024-04-19T15:49:38Z

(I thought there was already an issue on that but haven't found one)

berndmoos · 2024-04-21T08:34:27Z

First attempt:

New segmentation option INEL_EVENT_BASED in PE
Same option in wizard for COMA and EXAKT

This segmentation works, not by FSM, but by an XSL stylesheet. The general approach is:

1. Build an "ordinary" segmented transcription, i.e. without additional segmentation
2. Copy the event segmentation, and for each event
   - decide if it's a non-word (starting with double parentheses) or a word
   - parse the event content accordingly (into ts, ats, nts)
3. Group leaves into ts with "INEL:u" according to the ref tier
This seems to work (for the Dolgan corpus), but it is anything but fast (about 20 minutes for all 17 Dolgan transcripts).
The whole setup is analogous to HIAT, i.e. we have utterances, words, etc.; they only have "INEL:" as a prefix instead of "HIAT:".
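The per-event decision described above can be sketched roughly as follows. This is only an illustration, not the code in PE: the class `EventKindSketch`, the enum `Kind` and the method `classify` are hypothetical names, and the only rule taken from this thread is that a non-word event starts with double parentheses.

```java
// Hypothetical sketch; not the actual EXMARaLDA implementation.
class EventKindSketch {

    enum Kind { NON_WORD, WORD }

    // An event counts as a non-word if its text starts with "((";
    // everything else is treated as word material.
    static Kind classify(String eventText) {
        return eventText.startsWith("((") ? Kind.NON_WORD : Kind.WORD;
    }

    public static void main(String[] args) {
        System.out.println(classify("((xxx)) "));  // NON_WORD
        System.out.println(classify("kunnere "));  // WORD
    }
}
```

A real implementation would then parse the event content differently depending on the result, as described in the list above.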

To do: take care of segmentation errors. There are two options:

Decide that there can never be segmentation errors (because the XSL transformation will never fail)
Define segmentation errors in a separate process

berndmoos · 2024-04-24T10:50:28Z

> it is anything but fast

It can take up to two hours for gigantic (INEL) transcripts, so it needs to be implemented differently

berndmoos · 2024-04-26T07:35:18Z

Please also add
: (colon)
; (semicolon)
“ (left double quotation mark)
” (right double quotation mark)

Pending decision on hyphens...

sarkipo · 2024-04-26T14:38:56Z

Also word-external:
« U+00AB (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)
» U+00BB (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK)
‐ U+2010 HYPHEN
‑ U+2011 NON-BREAKING HYPHEN

By contrast, the usual
- U+002D HYPHEN-MINUS
should be word-internal. @git-debase-verbose has just seen that

<ts e="T223" id="tx.w222" n="INEL:w" s="T222">Dʼɨllara</ts>
<nts id="nts_tx.e222_2" n="INEL:ip">-</nts>
<ts e="T223" id="tx.w222" n="INEL:w" s="T222">kunnere</ts>

makes Tsakorpus unhappy.
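To illustrate the intended behaviour, here is a minimal tokenizer sketch in which U+002D HYPHEN-MINUS stays word-internal while the characters listed in this thread are split off as word-external punctuation. The class `WordPunctuationSketch`, the set `WORD_EXTERNAL` and the method `tokenize` are hypothetical, and the character set is deliberately incomplete; it is not the list used by the actual segmentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the punctuation set is illustrative and incomplete.
class WordPunctuationSketch {

    // Word-external characters taken from this thread: colon, semicolon,
    // curly double quotes, angle quotes, U+2010 and U+2011.
    // U+002D HYPHEN-MINUS is deliberately absent, so it stays word-internal.
    static final Set<Character> WORD_EXTERNAL = Set.of(
            ':', ';', '\u201C', '\u201D', '\u00AB', '\u00BB', '\u2010', '\u2011');

    // Splits text into word tokens and single-character punctuation tokens.
    // Whitespace separates tokens and is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean external = WORD_EXTERNAL.contains(c);
            if (external || Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (external) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hyphen-minus does not split the word: one token.
        System.out.println(tokenize("Dillara-kunnere"));
        // The semicolon is split off: three tokens.
        System.out.println(tokenize("Dillara; kunnere"));
    }
}
```

With this rule, a form such as Dʼɨllara-kunnere would come out as one word rather than as two words around a punctuation segment.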

berndmoos · 2024-05-06T06:07:49Z

Different implementation (no XSL) now takes seconds instead of minutes. To do / decide: What will count as a segmentation error? Or make the algorithm accept everything?

berndmoos · 2024-05-23T07:56:36Z

((xxx)). -- there can be utterance terminators after the closing parentheses

sarkipo · 2024-05-23T08:15:20Z

Not necessarily utterance terminators, since it can be in the middle of an utterance. Rather just any punctuation, e.g.
((…)), – or ((…)),. But terminators are perhaps 97% of all cases when something follows. (Also complex ones like ((…))?”).

git-debase-verbose · 2024-08-19T14:03:42Z

In https://github.com/Exmaralda-Org/exmaralda/blob/master/src/org/exmaralda/partitureditor/jexmaralda/segment/InelEventBasedSegmentation.java:

The exception at line 105 ("Word characters after double closing round parentheses...") is created with null instead of tierID, so when I later call FSMException.getTierID on it, the result is null as well. Could you take a look at it?
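One way to guard against this kind of problem is to reject a null tier ID when the exception is constructed. The sketch below uses a hypothetical stand-in class, `TierAwareSegmentationException`; it is not EXMARaLDA's FSMException, whose constructor is not shown in this thread, and the tier ID "tx" is only an example value.

```java
import java.util.Objects;

// Hypothetical stand-in for a segmentation exception; not EXMARaLDA's FSMException.
class TierAwareSegmentationException extends Exception {

    private final String tierID;

    TierAwareSegmentationException(String message, String tierID) {
        super(message);
        // Fail fast at construction time instead of handing out null later.
        this.tierID = Objects.requireNonNull(tierID, "tierID must not be null");
    }

    String getTierID() {
        return tierID;
    }

    public static void main(String[] args) {
        TierAwareSegmentationException e = new TierAwareSegmentationException(
                "Word characters after double closing round parentheses", "tx");
        System.out.println(e.getMessage() + " Tier ID: " + e.getTierID());
    }
}
```

Whether failing fast is preferable to a placeholder tier ID is a design choice; the point is only that callers of getTierID never see null.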

sarkipo · 2024-10-21T13:21:53Z

Hi Thomas, could you please check whether (your version of) the new segmentation throws an error on events like
“((…))
(punctuation preceding double brackets)?
Ours does, although in principle it shouldn't. It also still has Tier ID: null in some errors, namely
Word characters after double closing round parentheses in event: ((… beginning at T311. Tier ID: null
(it is correct that this is an error; only the null is wrong)

Or maybe the jar is not the latest one?

berndmoos · 2024-10-21T13:47:29Z

This one works for me.

Guess I need to check the jar, then...

berndmoos · 2024-10-21T13:51:25Z

The last one at https://exmaralda.org/files/prevDL/webjar.html was from August 19th. I don't think the last changes were made after that, but to make sure, I updated the jar. Please try again. If it doesn't work, I would need a more complete example to reproduce the error.

sarkipo · 2024-10-21T14:55:58Z

Just the double brackets ((…)) work fine, and ones followed by other punctuation, but not the ones preceded by quotes:

sarkipo · 2024-10-21T15:00:46Z

KXN_1978_ThreeShamans_flk.zip

berndmoos · 2024-10-21T15:09:42Z

Okay, but that is as intended: double round brackets can optionally be followed by punctuation characters, but not preceded by them. Is that a request for change then?

sarkipo · 2024-10-21T15:12:10Z

Yes, please. Just because of cases like this, which I think are no worse than following punctuation. Or are there reasons to disallow that?

sarkipo · 2024-10-21T15:13:47Z

We only have very few cases with these quotes, but the notation seems to mean what it says: an unintelligible fragment within direct speech.

berndmoos · 2024-10-22T09:42:45Z

> Or are there reasons to disallow that?

Not really, I'm just worried that new rules will break old ones... I tried a fix, there is a new Windows version and a new web jar.

git-debase-verbose · 2024-10-22T10:50:15Z

Thank you, these errors are no more with the new jar. Besides, combinations such as somewordhere((…)). got marked as erroneous, which they are.
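The rule as it stands after this change (punctuation may precede or follow a double round bracket event, word characters may not be attached to it) can be approximated with a regular expression. This is a sketch under assumptions, not the actual implementation: the class `DoubleBracketEventSketch`, the method `isWellFormed` and the punctuation class are hypothetical and incomplete.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the rule discussed above; not the actual implementation.
class DoubleBracketEventSketch {

    // Illustrative, incomplete punctuation class: a few ASCII marks plus
    // the curly double quotation marks.
    private static final String PUNCT = "[.,;:!?/\u201C\u201D]";

    // Optional punctuation, "((", non-empty content without ")", "))",
    // optional punctuation, optional trailing whitespace.
    private static final Pattern NON_WORD_EVENT = Pattern.compile(
            PUNCT + "*\\(\\([^)]+\\)\\)" + PUNCT + "*\\s*");

    static boolean isWellFormed(String eventText) {
        return NON_WORD_EVENT.matcher(eventText).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("((xxx))."));          // true
        System.out.println(isWellFormed("\u201C((xxx))"));     // true
        System.out.println(isWellFormed("someword((xxx))."));  // false
        System.out.println(isWellFormed("((xxx))word"));       // false
    }
}
```

Because `matches()` requires the whole event text to fit the pattern, any word character before or after the brackets makes the event ill-formed, which corresponds to the errors reported above.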

git-debase-verbose · 2024-10-27T20:18:07Z

Would it be possible to add the forward slash "/" to the list of word-external punctuation marks in the segmentator? We use slashes to (a) mark line breaks in poetic texts and to (b) occasionally provide alternative variants of some part of a sentence.

Currently the segmentator throws errors ("Non-permissible sequence of word and punctuation characters in event: kanʼixubiʔ // ".) if there are slashes following whitespace. If a slash immediately follows a word, i.e. "kanʼixubiʔ/", it will be treated as a part of the word by the segmentator, which is not the intended behavior.

berndmoos · 2024-10-28T14:16:30Z

Adding or removing characters from word external punctuation is not a big deal. I have added the forward slash now and made a new Windows preview and web jar. If you want to see for yourselves: the characters are defined here:

exmaralda/src/org/exmaralda/partitureditor/jexmaralda/segment/InelEventBasedSegmentation.java

Line 40 in 651d82b

String[] WORD_EXTERNAL_PUNCUTATION = {

git-debase-verbose · 2024-10-28T15:33:16Z

Thank you. Do I take it right that it would be perfectly fine if I commit any further changes to the external punctuation list myself?

berndmoos · 2024-10-28T15:38:23Z

Yes, go ahead:-) But please do it in a branch and send a merge request

sarkipo · 2024-11-05T10:32:37Z

Hi Thomas,
The "Generate corpus statistics" function in Coma does not seem to see the # INEL counts, reports zeroes everywhere.

berndmoos · 2024-11-05T10:59:14Z

Okay, I made this a separate issue and will check:

#497

berndmoos added a commit that referenced this issue Apr 20, 2024

#470: First steps for new segmentation method INEL_EVENT_BASED

dba9497

berndmoos self-assigned this Apr 20, 2024

berndmoos added Partitur Editor EXAKT INEL Coma labels Apr 20, 2024

berndmoos added a commit that referenced this issue Apr 21, 2024

#470: Further steps, #417

8b679fb

berndmoos added a commit that referenced this issue Apr 21, 2024

#470 and #417: Smaller corrections

f90c07c

berndmoos added a commit that referenced this issue May 5, 2024

#470

cf80762

berndmoos added a commit that referenced this issue May 6, 2024

#470: Faster algorithm

3e3f99a

berndmoos added a commit that referenced this issue May 6, 2024

#470: Segmentation errors

4499bd3

berndmoos added a commit that referenced this issue Aug 19, 2024

#470, tier id must not be null

5aa3056

berndmoos added a commit that referenced this issue Oct 22, 2024

Allow leading punctuation for double round bracket events (#470)

e34e06b

berndmoos added a commit that referenced this issue Oct 28, 2024

Added forward slash to word external punctuation of INEL segmentation (#470)

651d82b
