This repository has been archived by the owner on Jun 8, 2020. It is now read-only.

Bad segmentation in Arabic project #27

Open
uhallac opened this issue Feb 6, 2018 · 6 comments

Comments

@uhallac

uhallac commented Feb 6, 2018

Can you please check the following Arabic project?
https://www.matecat.com/translate/33409docx/ar-SA-en-GB/1116612-3ffd0e90c8f0

The segmentation seems to have failed. Do you think this is a Matecat-Filters issue?

Thank you.

@giusilvano
Contributor

Hi @uhallac!

Your document contains no punctuation, so the segmenter has no hints about where sentences begin and end. Moreover, all the text is in a single paragraph, so structurally it is correct not to split it into multiple segments.
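To illustrate the point, here is a toy punctuation-based splitter. It is only a sketch: the function name `split_sentences` and the single regular expression are assumptions made for illustration, not the rules the filters actually apply. It shows that text without sentence-final punctuation comes back as one segment.

```python
import re

# Split after sentence-final punctuation (including the Arabic question
# mark) when it is followed by whitespace. Real segmenters use far richer
# rule sets; this pattern is a deliberately simplified assumption.
SENTENCE_BREAK = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text):
    """Return the list of non-empty segments found in one paragraph."""
    parts = SENTENCE_BREAK.split(text.strip())
    return [part for part in parts if part]
```

With no punctuation in the paragraph the pattern never matches, so the whole paragraph stays in a single segment.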

Could you explain in more detail the segmentation you were expecting?

@uhallac
Author

uhallac commented Feb 14, 2018

Hi @giusilvano,

Where are the tags coming from? I don't see any special characters between the words, only spaces.
Thank you.

@giusilvano
Contributor

I checked the source file of your project, and each word seems to carry an ID related to a past revision check. The filters produce tags to preserve these IDs in the target file. We need to discuss internally whether this is useful. Can you confirm that Word's revisions feature was used on this text?
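For anyone who wants to check a file themselves, here is a rough sketch that counts how many runs in a .docx carry such IDs. It assumes the IDs in question are the `w:rsid*` attributes (for example `w:rsidR`) that Word writes on `w:r` elements in `word/document.xml`; that is a guess about this particular file, and the helper name `count_rsid_runs` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_rsid_runs(docx_file):
    """Return (runs_with_rsid, total_runs) for the main document part.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    runs = list(root.iter(f"{{{W_NS}}}r"))
    rsid_prefix = f"{{{W_NS}}}rsid"
    # A run counts if any of its attributes is a w:rsid* attribute.
    with_rsid = sum(
        1 for run in runs
        if any(name.startswith(rsid_prefix) for name in run.attrib)
    )
    return with_rsid, len(runs)
```

If nearly every word sits in its own run with its own ID, that would be consistent with a separate tag appearing between each pair of words.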

@uhallac
Author

uhallac commented Feb 15, 2018

The file was created from a single paragraph of a larger client document that has the same issue. I'm not sure whether the revisions feature was used on it; I couldn't detect any revisions in the LibreOffice editor.

As far as I know, Matecat doesn't allow such documents to be analyzed at all, or am I wrong? This restriction, by the way, is a huge obstacle when using the Matecat API to create projects automatically. In my opinion, the latest revision of a document should be used in such cases.

Thank you.

@giusilvano
Contributor

You are right, MateCat does not allow files with revisions. Our view is that a file with revisions contains many comments and suggestions that a human must accept or reject to bring the document to a consistent state. Moreover, implementing automatic acceptance of revisions is really hard!
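For automated project creation, a pre-check on the client side may help catch such files before upload. The sketch below is not how MateCat performs its own check; it simply assumes that pending revisions show up as `w:ins`, `w:del`, `w:moveFrom`, or `w:moveTo` elements in `word/document.xml`, and the helper name `has_tracked_changes` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Elements that mark tracked insertions, deletions, and moves.
REVISION_TAGS = {
    f"{{{W_NS}}}{name}" for name in ("ins", "del", "moveFrom", "moveTo")
}


def has_tracked_changes(docx_file):
    """Return True if the main document part contains revision markup.

    docx_file may be a path or a binary file-like object. Headers,
    footers, and comments are not inspected in this sketch.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return any(element.tag in REVISION_TAGS for element in root.iter())
```

A script could call this before submitting a file and route flagged documents to a person who accepts or rejects the changes first.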

Anyway, this issue requires a fix in the underlying framework, Okapi. I will report the problem to them, but I can't estimate how long it will take to be addressed.

@uhallac
Author

uhallac commented Feb 15, 2018

Thank you for the information.
