Checker info - SEGM and XPOSTAG fields #18
XPOSTAG: https://docs.google.com/spreadsheets/d/1Is7MGG0h8h0vfHj9C9mnWOD2utPeuvm1ZeYb1dsaejg/edit#gid=0 only ORACC SEGM |
Almost done here - https://github.com/cdli-gh/morphology-pre-annotation-tool/tree/wiplast |
Can you please allow underscores in the "sense" so in the square brackets following a lemma Error: The segm bisagdubak[filing_basket] in line number 3 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P340861.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". Use underscore in SEGM sense. |
Same for the tilde: [~*] - a tilde at the start is valid (not anywhere in between) |
morphemes and tags attached to nouns can also be attached to verbs : Noun tags can also be verb suffixes. So, corresponding morphemes are also part of verb form. |
Error: The xpostag FIN.3-SG-H-A.V.3-SG-P in line number 22 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN']. This is correct, a verb can have more than one prefix : Check in https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0 Verbs can display prefixes from these categories:
Add Pronominal element before base into prefix list. |
in text : Error: The segm V.SUB in line number 9 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". so V.SUB is the XPOSTAG column no? Is the dash inside the sense maybe the problem but error gives other information? If it's the case please permit - inside sense Permit dash in the middle of the segm. |
Error: The xpostag _ in line number 9 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll does not have a base postag out of "['AJ', 'AV', 'NU', 'CNJ', 'DET', 'J', 'N', 'PRP', 'DN', 'EN', 'GN', 'MN', 'PN', 'RN', 'SN', 'TN', 'WN', 'AN', 'CN', 'FN', 'LN', 'ON', 'QN', 'YN', 'V', 'V-PL', 'V-PF', 'V-RDP']". Maybe give a warning, not error, saying that segm or xpostag hasn't been filled if you find an underscore? Tab missing. Error in file. |
Error: The segm us[follow]-'a in line number 15 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102315.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". the normalized morpheme 'a exists, just that sheets doesn't like apostrophes Allow apostrophes. |
Error: The segm Geme'enlila[1] in line number 126 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P142753.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". apostrophes in lemmata are OK Allow apostrophes in lemmata. |
Error: The xpostag NU.3-SG-POSS.GEN in line number 31 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P480072.conll has a suffix 3-SG-POSS not in noun postag suffix list ['L1', 'L2-NH', 'GEN', '3-PL-POSS', '3-SG-H-POSS', '3-SG-NH-POSS', 'DEM2', 'COM', 'PL', 'ERG', 'DAT-NH', 'L3-NH', 'DEM', 'PL', 'ADV', 'EQU', '1-SG-POSS', '1-PL-POSS', 'L4', 'ABS', 'DAT-H', 'L2-H', 'L3-H', 'TERM', 'ABL', '2-SG-POSS', '2-PL-POSS', 'COP-1-SG', 'COP-2-SG', 'COP-3-SG', 'COP-1-PL', 'COP-2-PL', 'COP-3-PL', 'EXCEPT']. you can add NU (numerals) in the list of POS that can see suffixes appended Error in pos tag. |
Forward slash is acceptable in lemma Error: The segm 1/3(disz)[one] in line number 7 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P480072.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". Allow forward slash in lemmata. |
Thanks @epageperron for your extensive testing. I will start adding the fixes one by one. |
@ is also fine in lemmata (my browser crashed so I don't have the example anymore, but it was something like disz@t). Allow @ in lemmata.
From Jinyan:
https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0
Error: The xpostag MID.DAT.V.SUB in line number 210 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P106438.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag MID.DAT.V.SUB in line number 253 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P106438.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix 3-PL-H not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix 3-PL-H not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
Error: The xpostag MID.DAT.V.3-SG-S.SUB in line number 13 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P458899.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].
https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0 Error: The xpostag V.RDP.SUB.GENABL in line number 16 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a suffix RDP not in verb postag suffix list ['1-SG-A', '1-SG-S', '1-SG-P', '2-SG-A', '2-SG-S', '2-SG-P', '3-SG-S', '3-SG-P', '3-SG-A', '3-SG-S-OB', '1-PL-A', '1-PL-S', '1-PL', '2-PL-A', '2-PL-S', '2-PL', '3-PL-S', '3-PL-P', '3-PL', '3-PL-A', 'PF', 'SUB', 'PLEN']. Handle RDP as part of speech. |
Files to check: |
What should I be looking for in those files exactly? |
That note was for me :) |
Error: The segm 1/2(disz)[one] in line number 6 in file /Users/user/annotation_assistant/scr/data/mtaac_gold_corpus/morph/processed/P339103.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". This error appears again? Has forward slash been acceptable in lemma? |
Error: The segm igi4(disz)[one]gal[fraction] in line number 23 in file /Users/user/annotation_assistant/scr/data/mtaac_gold_corpus/morph/processed/P339103.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$". This is correct. This lemma is actually a compound lemma, can it work for the computer? @wangj619 It passes for "igi4(disz)[one]-gal[-fraction] " |
All +1 are now fixed @epageperron @khoidt @wangj619 The segm ones are fixed now. |
As of now, these are the rules concerning the SEGM and XPOSTAG fields of the CDLI-CoNLL format. They might still change, since we keep running into problems while annotating that require us to make decisions on the rules.
SEGM
Contains the lemma, which is composed of a dictionary word and its sense; the sense is appended in square brackets, e.g. udu[sheep] or dab[seize].
For all word types except verbs, there will only be suffixed morphemes, no prefixed ones.
All morphemes except the first element are composed of a dash followed by the morpheme.
Only in the case of verbs, the first prefix will be without a dash, e.g. i[-n]-dab[seize][-ø].
Every morpheme may or may not be enclosed in [].
There are also rules concerning the "slots" for morphemes, but since we are not noting them we will not check the order at this time. We should open a backlog issue to that effect: since we want to democratize the usage of the tool, checking the possible order of morphemes would be an asset for inexperienced annotators.
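To make the SEGM rules above concrete, here is a minimal Python sketch of a pattern that follows them and also accepts the characters requested earlier in this thread (underscore and dash in the sense, apostrophes in morphemes and lemmata, forward slash and @ in lemmata). It is not the checker's actual regular expression. The names `MORPHEME`, `WORD`, `SENSE`, `DASHED` and `SEGM_RE` are made up for illustration, and the optional leading tilde in the sense is only one possible reading of the tilde remark above.

```python
import re

# Hypothetical character classes, chosen from the requests in this thread.
MORPHEME = r"[a-z0-9'ø]+"      # normalized morpheme; apostrophe and ø allowed
WORD = r"[A-Za-z0-9()'/@]+"    # dictionary word; apostrophe, slash and @ allowed
SENSE = r"~?[a-z0-9_-]+"       # sense; assumed: optional leading tilde, underscore, dash

# Any morpheme that is not the first element: "-x" or "[-x]".
DASHED = rf"(?:-{MORPHEME}|\[-{MORPHEME}\])"

SEGM_RE = re.compile(
    rf"^(?:{MORPHEME}{DASHED}*-)?"  # verbs only: prefix chain, first prefix without dash
    rf"{WORD}\[{SENSE}\]"           # lemma: word followed by [sense]
    rf"{DASHED}*$"                  # suffix chain
)


def is_valid_segm(segm: str) -> bool:
    """Return True if the SEGM value matches the sketched pattern."""
    return SEGM_RE.match(segm) is not None
```

With this sketch, `is_valid_segm("i[-n]-dab[seize][-ø]")` and `is_valid_segm("1/3(disz)[one]")` both return `True`, while `is_valid_segm("V.SUB")` returns `False`. The pattern does not check whether the word is actually a verb, nor the order of morphemes, in line with the note on slots above.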
XPOSTAG
This field displays the ETCSRI/ORACC POS tag or the named entity tag, in the position that the lemma occupies in the SEGM field.
For all word types except verbs, there will only be suffixed morphological tags, no prefixed ones.
All morphological tags except the first element are composed of a period followed by the tag.
Only verbs can have prefixes, e.g. FIN.3-SG-H-A.V.3-SG-P.
Tags can contain dashes. The dashes are not meaningful in this context, since the checker should use a map of morphemes and morph tags.
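The XPOSTAG rules above can be sketched in Python as follows. This is an illustration only, not the checker's code. The base tag list is copied from the error message quoted earlier in this thread. The function name `split_xpostag`, the set `VERB_TAGS`, and the choice to treat the first recognized base tag as the pivot are assumptions. Prefix and suffix tags are not validated against the tag lists here.

```python
# Base POS tags as quoted in the checker's error message in this thread.
BASE_TAGS = {
    "AJ", "AV", "NU", "CNJ", "DET", "J", "N", "PRP", "DN", "EN", "GN", "MN",
    "PN", "RN", "SN", "TN", "WN", "AN", "CN", "FN", "LN", "ON", "QN", "YN",
    "V", "V-PL", "V-PF", "V-RDP",
}
# Assumption: the base tags starting with "V" are the verbal ones.
VERB_TAGS = {"V", "V-PL", "V-PF", "V-RDP"}


def split_xpostag(xpostag: str):
    """Split an XPOSTAG into (prefixes, base, suffixes).

    Raises ValueError if no base tag is found, or if a non-verb has prefixes.
    """
    parts = xpostag.split(".")  # tags are separated by periods; dashes stay inside tags
    positions = [i for i, part in enumerate(parts) if part in BASE_TAGS]
    if not positions:
        raise ValueError(f"no base POS tag in {xpostag!r}")
    pivot = positions[0]  # assumption: the first recognized base tag is the pivot
    prefixes, base, suffixes = parts[:pivot], parts[pivot], parts[pivot + 1:]
    if prefixes and base not in VERB_TAGS:
        raise ValueError(f"only verbs can have prefixes: {xpostag!r}")
    return prefixes, base, suffixes
```

For example, `split_xpostag("FIN.3-SG-H-A.V.3-SG-P")` returns `(["FIN", "3-SG-H-A"], "V", ["3-SG-P"])`, and an unfilled field such as `_` raises `ValueError`.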