Event-based segmentation #470

Open
sarkipo opened this issue Apr 19, 2024 · 25 comments

@sarkipo

sarkipo commented Apr 19, 2024

From May 2023 meeting minutes:

TS: explore the possibility of event-based segmentation. That would eliminate the need for HIAT-based FSM segmentation and allow more flexible transcription conventions
@ts Note: current transcription conventions vary to some extent between INEL corpora, but the core is documented in https://doi.org/10.14232/wpcl.2020.5 (with a summary of symbols at the end)

@sarkipo
Author

sarkipo commented Apr 19, 2024

(I thought there was already an issue on that but haven't found one)

@berndmoos
Collaborator

First attempt:

  • New segmentation option INEL_EVENT_BASED in PE
  • Same option in wizard for COMA and EXAKT

This segmentation works, not by FSM, but by an XSL stylesheet. The general approach is:

  • Build an "ordinary" segmented transcription, i.e. without additional segmentation
  • Copy the event segmentation, and for each event
    • decide if it's a non-word (starting with double parentheses) or a word
    • parse the event content accordingly (into ts, ats, nts)
  • Group leaves into ts with "INEL:u" according to the ref tier

This seems to work (for the Dolgan corpus), but it is anything but fast (about 20 minutes for all 17 Dolgan transcripts).
The whole setup is analogous to HIAT, i.e. we have utterances, words, etc.; they only have "INEL:" as prefix instead of "HIAT:".
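
For illustration only, the per-event step described above (decide whether an event is a non-word or a word, then parse it) might look roughly like the following Python sketch. It is not the XSL stylesheet or any actual EXMARaLDA code: the function name `parse_event`, the kind labels and the punctuation set are all made up here, and the grouping into utterances according to the ref tier is left out.

```python
# Hypothetical punctuation set, for illustration only; the real list is
# defined in the EXMARaLDA source.
PUNCTUATION = set(".,!?…")


def parse_event(text):
    """Return (kind, text) pairs for one event.

    Kinds are "non-word", "word" and "punctuation" (labels made up here).
    """
    stripped = text.strip()
    if stripped.startswith("(("):
        # Non-word event such as ((laughs)): kept as one unit. Punctuation
        # after the closing parentheses is not handled in this sketch.
        return [("non-word", stripped)]
    tokens, word = [], ""
    for ch in text:
        if ch.isspace() or ch in PUNCTUATION:
            if word:  # a separator ends the current word
                tokens.append(("word", word))
                word = ""
            if ch in PUNCTUATION:
                tokens.append(("punctuation", ch))
        else:
            word += ch
    if word:
        tokens.append(("word", word))
    return tokens
```

Under these assumptions, `parse_event("Dʼɨllara kunnere. ")` yields two word tokens followed by one punctuation token, and `parse_event("((laughs))")` yields a single non-word token. Whitespace only separates tokens and is dropped.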

To do: take care of segmentation errors. There are two options:

  • Decide that there can never be segmentation errors (because the XSL transformation will never fail)
  • Define segmentation errors in a separate process

@berndmoos
Collaborator

it is anything but fast

It can take up to two hours for gigantic (INEL) transcripts, so it needs to be implemented differently

@berndmoos
Collaborator

Please also add
: (colon)
; (semicolon)
“ (left double quotation mark)
” (right double quotation mark)

Pending decision on hyphens...

@sarkipo
Author

sarkipo commented Apr 26, 2024

Also word-external:
« U+00AB (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)
» U+00BB (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK)
‐ U+2010 HYPHEN
‑ U+2011 NON-BREAKING HYPHEN

By contrast, the usual
- U+002D HYPHEN-MINUS
should be word-internal. @git-debase-verbose has just seen that

<ts e="T223" id="tx.w222" n="INEL:w" s="T222">Dʼɨllara</ts>
<nts id="nts_tx.e222_2" n="INEL:ip">-</nts>
<ts e="T223" id="tx.w222" n="INEL:w" s="T222">kunnere</ts>

makes Tsakorpus unhappy.
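
To make the request concrete, here is a small Python sketch of a tokenizer in which the characters listed so far in this thread are word-external while U+002D HYPHEN-MINUS stays word-internal. The names `WORD_EXTERNAL` and `tokenize` are hypothetical, the set is not the segmenter's full list, and the real implementation may work quite differently.

```python
import re

# Word-external characters named in this thread so far (not the full list):
# colon, semicolon, curly double quotes, guillemets, U+2010 and U+2011.
WORD_EXTERNAL = ":;\u201C\u201D\u00AB\u00BB\u2010\u2011"

# A token is either one word-external character, or a run of characters
# that are neither whitespace nor word-external. U+002D is not in the set,
# so it stays inside the word.
_CLASS = re.escape(WORD_EXTERNAL)
TOKEN = re.compile("[" + _CLASS + "]|[^\\s" + _CLASS + "]+")


def tokenize(text):
    """Split text into words and single word-external characters."""
    return TOKEN.findall(text)
```

With this set, `tokenize("Dʼɨllara-kunnere")` returns the form as a single token, whereas the same string written with U+2010 HYPHEN is split into three tokens.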

berndmoos added a commit that referenced this issue May 5, 2024
berndmoos added a commit that referenced this issue May 6, 2024
berndmoos added a commit that referenced this issue May 6, 2024
@berndmoos
Collaborator

Different implementation (no XSL) now takes seconds instead of minutes. To do / decide: What will count as a segmentation error? Or make the algorithm accept everything?

@berndmoos
Collaborator

((xxx)). -- there can be utterance terminators after the closing parentheses

@sarkipo
Author

sarkipo commented May 23, 2024

Not necessarily utterance terminators, since it can be in the middle of an utterance. Rather just any punctuation, e.g.
((…)), – or ((…)),. But terminators are perhaps 97% of all cases when something follows. (Also complex ones like ((…))?”).

@git-debase-verbose

In https://github.com/Exmaralda-Org/exmaralda/blob/master/src/org/exmaralda/partitureditor/jexmaralda/segment/InelEventBasedSegmentation.java:

The exception at line 105 ("Word characters after double closing round parentheses...") is created with null instead of tierID, so when I later call FSMException.getTierID on it, the result is null as well. Could you take a look at it?
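
The pattern being asked for, sketched in Python rather than in the project's Java: the check receives the tier ID and hands it on to the exception, so that callers can read it back later. `SegmentationError` and `check_event` are hypothetical names for this illustration and do not mirror the actual FSMException constructor.

```python
class SegmentationError(Exception):
    """Hypothetical exception that records where the error occurred."""

    def __init__(self, message, timeline_id, tier_id):
        super().__init__(message)
        self.timeline_id = timeline_id
        self.tier_id = tier_id


def check_event(text, timeline_id, tier_id):
    """Raise if word characters follow the first closing '))' in the event."""
    close = text.find("))")
    if close != -1 and any(ch.isalnum() for ch in text[close + 2:]):
        raise SegmentationError(
            "Word characters after double closing round parentheses "
            "in event: " + text,
            timeline_id,
            tier_id,  # passing None here would leave the tier ID empty
        )
```

A caller that catches the error can then report both `timeline_id` and `tier_id` instead of an empty tier ID.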

berndmoos added a commit that referenced this issue Aug 19, 2024
@sarkipo
Author

sarkipo commented Oct 21, 2024

Hi Thomas, could you please check whether (your version of) the new segmentation throws an error on events like
“((…))
(punctuation preceding double brackets)?
Ours does, although in principle it shouldn't. It also still has Tier ID: null in some errors, namely
Word characters after double closing round parentheses in event: ((… beginning at T311. Tier ID: null
(this is right to be an error, just the null is wrong)

Or maybe the jar is not the latest one?

@berndmoos
Collaborator

This one works for me.

image

Guess I need to check the jar, then...

@berndmoos
Collaborator

The last one at https://exmaralda.org/files/prevDL/webjar.html was from August 19th. I don't think the last changes were after this, but to make sure, I updated the jar. Please try again. If it doesn't work, I would need a more complete example to reproduce the error.

@sarkipo
Author

sarkipo commented Oct 21, 2024

Just the double brackets ((…)) work fine, and ones followed by other punctuation, but not the ones preceded by quotes:
image

@sarkipo
Author

sarkipo commented Oct 21, 2024

@berndmoos
Collaborator

Okay, but that is as intended: double round brackets can optionally be followed by punctuation characters, but not preceded by them. Is that a request for change then?

@sarkipo
Author

sarkipo commented Oct 21, 2024

Yes, please. Just because of cases like this, which I think are no worse than following punctuation. Or are there reasons to disallow that?

@sarkipo
Author

sarkipo commented Oct 21, 2024

We only have very few cases with these quotes, but it seems to mean what it says: an unintelligible fragment within direct speech.

@berndmoos
Collaborator

Or are there reasons to disallow that?

Not really, I'm just worried that new rules will break old ones... I tried a fix, there is a new Windows version and a new web jar.

@git-debase-verbose

Thank you, these errors no longer occur with the new jar. Besides, combinations such as somewordhere((…)). got marked as erroneous, which they are.
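
The rule as it stands at this point in the thread (punctuation allowed both before and after double round brackets, word characters on neither side) could be expressed roughly as below. This is a Python sketch under assumptions: the punctuation class is illustrative rather than the segmenter's real list, and `is_valid_non_word_event` is a made-up name.

```python
import re

# Illustrative punctuation class; the real segmenter defines its own list.
PUNCT = "[.,!?…:;“”«»–]"

# Optional punctuation, then ((...)), then optional punctuation.
# Surrounding whitespace is tolerated; anything else makes the event invalid.
NON_WORD_EVENT = re.compile(
    r"\s*" + PUNCT + r"*\(\(.*?\)\)" + PUNCT + r"*\s*"
)


def is_valid_non_word_event(text):
    """True if the whole event is ((...)) with only punctuation around it."""
    return NON_WORD_EVENT.fullmatch(text) is not None
```

Under this sketch, `“((…))` and `((…))?”` are accepted, while `somewordhere((…)).` and `((…))abc` are rejected because word characters touch the brackets.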

@git-debase-verbose

Would it be possible to add the forward slash "/" to the list of word-external punctuation marks in the segmentator? We use slashes to (a) mark line breaks in poetic texts and to (b) occasionally provide alternative variants of some part of a sentence.

Currently the segmentator throws errors ("Non-permissible sequence of word and punctuation characters in event: kanʼixubiʔ // ".) if there are slashes following whitespace. If a slash immediately follows a word, i.e. "kanʼixubiʔ/", it will be treated as a part of the word by the segmentator, which is not the intended behavior.

@berndmoos
Collaborator

Adding or removing characters from word external punctuation is not a big deal. I have added the forward slash now and made a new Windows preview and web jar. If you want to see for yourselves: the characters are defined here:

@git-debase-verbose

Thank you. Am I right in taking it that it would be perfectly fine for me to commit any further changes to the external punctuation list myself?

@berndmoos
Collaborator

Yes, go ahead :-) But please do it in a branch and send a merge request.

@sarkipo
Author

sarkipo commented Nov 5, 2024

Hi Thomas,
The "Generate corpus statistics" function in Coma does not seem to see the # INEL counts; it reports zeroes everywhere.

@berndmoos
Collaborator

Okay, I made this a separate issue and will check:

#497
