Spaces in column MISC #1

Open
michmech opened this issue Nov 16, 2022 · 2 comments

Comments

@michmech

When I attempt to train a UDPipe model from this treebank, using UDPipe 1.2.0:

$ udpipe --train mymodel.udpipe UD_Czech-CAC-master/cs_cac-ud-train.conllu

I get the following error message:

Loading training data:
Cannot load training data from file 'UD_Czech-CAC-master/cs_cac-ud-train.conllu':
The CoNLL-U line
'39	vytvrditelné	vytvrditelný	ADJ	AAFP1----1A----	Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|Polarity=Pos	27	acl:relcl	27:acl:relcl	SpaceAfter=No|LDeriv=vytvrdit { přidat k tvrdit }'
contains spaces in column MISC!

Does this mean the treebank is broken? Or is there an option in UDPipe that I could use to get over this?

Thank you,
Michal

@dan-zeman
Member

This line is surprising and I think the part { přidat k tvrdit } should not be there; nothing similar occurs anywhere else in the treebank.

However, spaces in MISC are not an error in general, so UDPipe should not die on them, @foxik. (I think leading or trailing whitespace would trigger a validation error, but there can be a space in the middle of a value, for example when a Latin transliteration of a FORM or LEMMA contains a space.)
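
To make the distinction above concrete, here is a minimal sketch in Python that classifies whitespace in the MISC column of a single token line. The function name `misc_whitespace_issues` is hypothetical, and the rule it encodes (leading or trailing whitespace is suspicious, an inner space is tolerated) follows the description in this comment rather than the official validator, so treat it as an illustration only.

```python
def misc_whitespace_issues(line):
    """Classify whitespace in the MISC column (10th field) of one CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return ["not a 10-column token line"]
    misc = cols[9]
    issues = []
    # Whitespace at either end of MISC is the case expected to fail validation.
    if misc != misc.strip():
        issues.append("leading or trailing whitespace")
    # A space inside a value is the case that UDPipe 1.2.0 rejects.
    if " " in misc.strip():
        issues.append("inner space")
    return issues


# The line quoted in the error message above.
line = (
    "39\tvytvrditelné\tvytvrditelný\tADJ\tAAFP1----1A----\t"
    "Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|Polarity=Pos\t"
    "27\tacl:relcl\t27:acl:relcl\t"
    "SpaceAfter=No|LDeriv=vytvrdit { přidat k tvrdit }"
)
print(misc_whitespace_issues(line))  # ['inner space']
```

For the reported line, only the inner-space case applies, which matches the UDPipe 1.2.0 error.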

@foxik
Member

foxik commented Nov 16, 2022

If I recall correctly, spaces in MISC were not originally allowed in CoNLL-U v2 (maybe in the proposed version), so the implementation in UDPipe 1 did not originally allow them, only in FORM and LEMMA. Spaces in MISC have been allowed since ufal/udpipe@9df115a, but we have not made a release since then (yes, it has long been planned...). Once the release is made, it will work; in the meantime, it is possible to compile UDPipe manually.

Note that this also affects UDPipe 2 (which uses UDPipe 1 for tokenization).
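
For anyone who cannot compile from source, one possible stopgap is to train on a preprocessed copy of the data. The Python sketch below is not something proposed in this thread: the function name `sanitize_misc` and the choice of replacing spaces with an underscore are assumptions made purely for illustration. Note that it alters the MISC annotation, so it is only suitable for a throwaway training copy.

```python
import io


def sanitize_misc(src, dst, replacement="_"):
    """Copy CoNLL-U text from src to dst, replacing spaces in the MISC column.

    Returns the number of token lines that were changed.
    """
    changed = 0
    for line in src:
        body = line.rstrip("\n")
        cols = body.split("\t")
        # Comment lines, blank lines, and anything that is not a
        # 10-column token line are copied through untouched.
        if not body.startswith("#") and len(cols) == 10 and " " in cols[9]:
            cols[9] = cols[9].replace(" ", replacement)
            body = "\t".join(cols)
            changed += 1
        dst.write(body + "\n")
    return changed


# Tiny in-memory demonstration with a made-up sentence.
sample = "# text = x\n1\tx\tx\tX\t_\t_\t0\troot\t_\tNote=a b\n\n"
out = io.StringIO()
n = sanitize_misc(io.StringIO(sample), out)
print(n)  # 1
```

On real data, the same function could be called with two files opened with `encoding="utf-8"`, and the resulting copy passed to `udpipe --train` in place of the original file.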
