Incorrect handling of hyphens #16
Comments
Note that
let hyphenated = en_us.hyphenate("self-aware");
let collected: Vec<String> = hyphenated.iter().collect();
// collected would be `vec!["self--", "aware"]`
While this is clearly inconsistent, I take you meant to say that
whereas the TeX patterns lean toward doubling the hyphen:
Whether to open new lines with a second hyphen when breaking hyphen-joined compounds is an editorial decision, with some languages having stronger conventions than others. With reference to your example, it should be noted that the American English patterns do not cover hyphens generally; this specific result is produced by a rule to break after the "self" substring regardless of what follows. It should also be noted that the hyphenation API does not handle existing hyphens on its own, as mentioned in the notes about word segmentation; individual dictionaries may do it, but only as a consequence of what's included in the TeX patterns. (I had originally written v0.7 to handle hyphens independently, but ultimately decided against it, since diverging conventions apply, and adopting one is ultimately an editorial decision. I'm not strictly against offering a default, mind, but not before the library sees some extended usage.)
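For readers who want the single-hyphen convention in their own code, here is a minimal sketch of one way to mark break points without doubling a hyphen that is already there. The helper `mark_breaks` is hypothetical and is not part of the hyphenation crate; it works on plain string slices and uses only the standard library.

```rust
/// Hypothetical helper, not part of the `hyphenation` crate: append a break
/// hyphen to every segment except the last, unless the segment already ends
/// with a hyphen of its own.
fn mark_breaks(segments: &[&str]) -> Vec<String> {
    // Index of the final segment, which never receives a break hyphen.
    let last = segments.len().saturating_sub(1);
    segments
        .iter()
        .enumerate()
        .map(|(i, &seg)| {
            if i < last && !seg.ends_with('-') {
                format!("{seg}-")
            } else {
                seg.to_string()
            }
        })
        .collect()
}
```

Under this convention the segments `["self-", "aware"]` are left unchanged, while `["hy", "phen", "ation"]` gain a trailing hyphen on every segment but the last. Other conventions (such as opening the next line with a second hyphen) would need a different rule.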
Thanks, I'm aware of that. In fact, I want to deal with the slices and not with the strings with the added hyphens and I had to dig into the code to realize that
Also, it might be worth noting that the current README has broken asserts:
let segments = hyphenated.into_iter();
let collected : Vec<String> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "ation"]);
I knew about the existence of language specific hyphenation rules, but not about this one.
Don't worry: I'm using a custom slice iterator when the word I'm hyphenating contains special characters, otherwise I just use hyphenation's slice iterator. The two iterators are unified via Either.
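The approach described above could look roughly like the sketch below. It is not the commenter's actual code: `EitherIter` is a local stand-in for an `Either`-style enum, `segments` is a hypothetical dispatcher, and the `Right` branch uses `std::iter::once` purely as a placeholder where the library's slice iterator would go.

```rust
/// Local stand-in for an `Either`-style enum: one type wrapping two
/// different iterators that yield the same item type.
enum EitherIter<L, R> {
    Left(L),
    Right(R),
}

impl<L, R> Iterator for EitherIter<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Delegate to whichever iterator this value wraps.
        match self {
            EitherIter::Left(l) => l.next(),
            EitherIter::Right(r) => r.next(),
        }
    }
}

/// Hypothetical dispatcher: words with an existing hyphen go through a
/// custom splitter; everything else goes to the placeholder branch.
fn segments<'a>(word: &'a str) -> impl Iterator<Item = &'a str> {
    if word.contains('-') {
        // Custom path: split after each existing hyphen, keeping it attached.
        EitherIter::Left(word.split_inclusive('-'))
    } else {
        // Placeholder for the library's slice iterator (assumption):
        // here the whole word is yielded as a single slice.
        EitherIter::Right(std::iter::once(word))
    }
}
```

With this sketch, `segments("self-aware")` yields the slices `"self-"` and `"aware"`, and both branches share one return type, so callers do not need to box the iterator.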
Mh, I'll add some module documentation. The function is nominally documented but not all that visible.
Thank you, I'll fix those. (I test code examples manually because Cargo / Rustdoc require them to be laid out in a rather unwieldy fashion, and things do slip through.)
The following example:
yields the following output:
Am I correct in thinking that the proper output should be
["self-", "aware"]?