Broken words count like two words #4

danielrruf · 2018-04-15T17:42:40Z

Words split across two lines are counted and POS-tagged as two words. An example is BGU 1.2.13, where the first part (ὀ) of the token ὀλίγη is POS-tagged as "u" (undefined? This tag does not appear in the list of codes on the Philologic website):

<t p="13" n="7" a="[1]" o="u--------" u="60">
  <f>ὀ</f>
</t>
<t p="14" n="1" a="[1]" o="v3saip---" u="61">
  <f>λίγη</f>
</t>

gcelano · 2018-04-17T15:57:09Z

This is a known problem, unfortunately, and documented in the related forthcoming article. Tokenization of such fragmentary, highly marked-up texts poses a huge number of challenges.
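For readers who want to locate affected tokens in the meantime, here is a minimal sketch of a heuristic check, not the project's own tokenizer. It assumes that `p` is the line number and `n` the token's position within the line, that a tag beginning with `u` marks an unanalysed fragment, and that the `<t>` elements sit inside a wrapper element (the `<tokens>` root here is hypothetical, added only so the sample parses). None of these assumptions is confirmed by the thread.

```python
import xml.etree.ElementTree as ET

# Sample from the report, wrapped in a hypothetical <tokens> root element.
SAMPLE = """<tokens>
<t p="13" n="7" a="[1]" o="u--------" u="60">
  <f>ὀ</f>
</t>
<t p="14" n="1" a="[1]" o="v3saip---" u="61">
  <f>λίγη</f>
</t>
</tokens>"""


def find_broken_word_candidates(xml_text):
    """Return (joined_form, left_u, right_u) for suspected line-broken words.

    Heuristic (an assumption, not the corpus specification): the left token
    carries a tag starting with "u", and the next token is the first token
    (n="1") of a different line (different p).
    """
    root = ET.fromstring(xml_text)
    tokens = list(root.iter("t"))
    candidates = []
    for left, right in zip(tokens, tokens[1:]):
        line_changes = left.get("p") != right.get("p")
        starts_line = right.get("n") == "1"
        untagged = left.get("o", "").startswith("u")
        if line_changes and starts_line and untagged:
            joined = left.findtext("f", "").strip() + right.findtext("f", "").strip()
            candidates.append((joined, left.get("u"), right.get("u")))
    return candidates


print(find_broken_word_candidates(SAMPLE))
```

Subtracting the number of candidates from the raw token count would give a rough corrected word count, but the heuristic will also flag genuine line-final tokens that merely lack an analysis, so the results need manual checking.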
