Broken words count like two words #4

Open
danielrruf opened this issue Apr 15, 2018 · 1 comment

Comments

@danielrruf

Words split across two lines are counted and POS-tagged as two words. For example, in BGU 1.2.13 the first part (ὀ) of the token ὀλίγη is POS-tagged as "u" (undefined?), but this tag doesn't appear in the list of codes on the Philologic website:

<t p="13" n="7" a="[1]" o="u--------" u="60">
  <f>ὀ</f>
</t>
<t p="14" n="1" a="[1]" o="v3saip---" u="61">
  <f>λίγη</f>
</t>

@gcelano
Owner

gcelano commented Apr 17, 2018

Unfortunately, this is a known problem, and it is documented in the related forthcoming article. Tokenizing such fragmentary, highly marked-up texts poses a huge number of challenges.
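
Until the tokenization itself is fixed, the double counting reported above could be worked around in post-processing. The following is only a minimal sketch, not the project's method. It assumes that `p` is the line number, `n` is the token's position within the line, and `o` is the POS tag; none of this is confirmed in the thread. The `<tokens>` wrapper element, the `merge_split_tokens` function, and the `UNDEFINED_TAG` constant are hypothetical names introduced for illustration. The heuristic joins a fragment tagged `u--------` with the first token of a different line that directly follows it, so it will also merge fragments that are undefined for other reasons.

```python
import xml.etree.ElementTree as ET

# The two tokens quoted in the issue, wrapped in a hypothetical <tokens> root
# so that the snippet parses as a single XML document.
SAMPLE = """<tokens>
<t p="13" n="7" a="[1]" o="u--------" u="60">
  <f>ὀ</f>
</t>
<t p="14" n="1" a="[1]" o="v3saip---" u="61">
  <f>λίγη</f>
</t>
</tokens>"""

# Assumption: this tag marks a fragment the tagger could not analyse.
UNDEFINED_TAG = "u--------"


def merge_split_tokens(root):
    """Join an undefined fragment with the first token of a following line."""
    merged = []
    pending = None  # undefined fragment waiting for a possible continuation
    for t in root.iter("t"):
        form = t.findtext("f", default="").strip()
        line, pos, tag = t.get("p"), t.get("n"), t.get("o")
        if pending is not None:
            if pos == "1" and line != pending["line"]:
                # Continuation found: emit one word instead of two.
                merged.append({"form": pending["form"] + form,
                               "tag": tag,
                               "line": pending["line"]})
                pending = None
                continue
            # No continuation: keep the fragment as its own token.
            merged.append(pending)
            pending = None
        token = {"form": form, "tag": tag, "line": line}
        if tag == UNDEFINED_TAG:
            pending = token
        else:
            merged.append(token)
    if pending is not None:
        merged.append(pending)
    return merged


root = ET.fromstring(SAMPLE)
words = merge_split_tokens(root)
print(len(words), words[0]["form"])
```

The merged token keeps the tag that was assigned to the second fragment alone, which may no longer fit the joined form, so merged words would need to be re-tagged before they are used for any statistics.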
