Incorrect handling of hyphens #16

Closed
baskerville opened this issue Sep 29, 2018 · 3 comments

Comments

@baskerville
Contributor

The following example:

let en_us = Standard::from_embedded(Language::EnglishUS).unwrap();
let hyphenated = en_us.hyphenate("self-aware");
let segments: Vec<&str> = hyphenated.iter().segments().collect();
println!("{:?}", segments);

yields the following output:

["self", "-aware"]

Am I correct in thinking that the proper output should be ["self-", "aware"]?

@tapeinosyne
Owner

tapeinosyne commented Oct 8, 2018

Note that iter().segments() only returns string slices without inserting a hyphen before breaks, meaning that your expected output would become ["self--", "aware"] once marked:

let hyphenated = en_us.hyphenate("self-aware");
let collected: Vec<String> = hyphenated.iter().collect();
// collected would be `vec!["self--", "aware"]`

While this is clearly inconsistent, I take it you meant to say that hyphenate() should recognize the existing hyphen and break after it, with the final output being ["self-", "aware"] when marked, whereas the American English dictionary yields ["self-", "-aware"]. Neither is wrong; rather, the output you suggest implies a preference for this style:

To his dismay, Bob realized the smart toaster had become self-
aware

whereas the TeX patterns lean toward doubling the hyphen:

To his dismay, Bob realized the smart toaster had become self-
-aware

Whether to open new lines with a second hyphen when breaking hyphen-joined compounds is an editorial decision, with some languages having stronger conventions than others. With reference to your example, it should be noted that the American English patterns do not cover hyphens generally; this specific result is produced by a rule to break after the "self" substring regardless of what follows.
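To make the difference between plain segments and marked output concrete, here is a minimal, self-contained sketch. It does not use the hyphenation crate: `segments` and `marked` are hypothetical helpers that take break offsets as plain byte indices, and they only imitate what `iter().segments()` and `iter()` are described as doing above.

```rust
/// Split `word` at the given byte offsets, returning borrowed slices.
/// Assumption: `breaks` is ascending and every offset is a char boundary.
fn segments<'a>(word: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    let mut out = Vec::with_capacity(breaks.len() + 1);
    let mut start = 0;
    for &b in breaks {
        out.push(&word[start..b]);
        start = b;
    }
    out.push(&word[start..]);
    out
}

/// Same split, but every segment except the last gets a trailing hyphen.
fn marked(word: &str, breaks: &[usize]) -> Vec<String> {
    let parts = segments(word, breaks);
    let last = parts.len() - 1;
    parts
        .iter()
        .enumerate()
        .map(|(i, s)| if i < last { format!("{}-", s) } else { s.to_string() })
        .collect()
}

fn main() {
    // Break before the existing hyphen, as in the reported output.
    println!("{:?}", segments("self-aware", &[4])); // ["self", "-aware"]
    println!("{:?}", marked("self-aware", &[4])); // ["self-", "-aware"]

    // Break after the existing hyphen, as in the expected output.
    println!("{:?}", segments("self-aware", &[5])); // ["self-", "aware"]
    println!("{:?}", marked("self-aware", &[5])); // ["self--", "aware"]
}
```

With the break placed before the existing hyphen, marking produces the doubled-hyphen style; with the break placed after it, marking produces the inconsistent "self--".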

It should also be noted that the hyphenation API does not handle existing hyphens on its own, as mentioned in the notes about word segmentation; individual dictionaries may do it, but only as a consequence of what's included in the TeX patterns.

(I had originally written v0.7 to handle hyphens independently, but ultimately decided against it, since diverging conventions apply, and adopting one is ultimately an editorial decision. I'm not strictly against offering a default, mind, but not before the library sees some extended usage.)
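For callers who do want a fixed convention, one option is to split on existing hyphens before (or instead of) consulting a dictionary. The sketch below is only an illustration of the two styles discussed above; `HyphenStyle` and `split_compound` are hypothetical names, not part of the hyphenation API.

```rust
/// The two editorial conventions for breaking at an existing hyphen.
#[derive(Clone, Copy)]
enum HyphenStyle {
    /// "self-" / "aware"
    BreakAfter,
    /// "self-" / "-aware"
    Repeat,
}

/// Break `word` after each existing hyphen that has text on both sides,
/// rendering each piece according to `style`.
fn split_compound(word: &str, style: HyphenStyle) -> Vec<String> {
    // Continuation pieces open with a second hyphen only in `Repeat` style.
    let prefix = |first: bool| match style {
        HyphenStyle::Repeat if !first => "-",
        _ => "",
    };

    let mut pieces: Vec<String> = Vec::new();
    let mut start = 0;
    for (i, c) in word.char_indices() {
        let end = i + c.len_utf8();
        // Skip leading hyphens, trailing hyphens, and repeated hyphens.
        if c == '-' && i > start && end < word.len() {
            let piece = format!("{}{}", prefix(pieces.is_empty()), &word[start..end]);
            pieces.push(piece);
            start = end;
        }
    }
    let tail = format!("{}{}", prefix(pieces.is_empty()), &word[start..]);
    pieces.push(tail);
    pieces
}

fn main() {
    println!("{:?}", split_compound("self-aware", HyphenStyle::BreakAfter));
    println!("{:?}", split_compound("self-aware", HyphenStyle::Repeat));
}
```

Each piece without its hyphens could still be passed to a dictionary for further breaks; that step is left out here.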

@baskerville
Contributor Author

> Note that iter().segments() only returns string slices without inserting a hyphen before breaks, meaning that your expected output would become ["self--", "aware"] once marked:
>
> let hyphenated = en_us.hyphenate("self-aware");
> let collected: Vec<String> = hyphenated.iter().collect();
> // collected would be `vec!["self--", "aware"]`

Thanks, I'm aware of that. In fact, I want to deal with the slices and not with the strings with the added hyphens and I had to dig into the code to realize that .iter().segments() was what I wanted.

Also, it might be worth noting that the current README has broken asserts:

let segments = hyphenated.into_iter();
let collected : Vec<String> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "ation"]);

> While this is clearly inconsistent, I take it you meant to say that hyphenate() should recognize the existing hyphen and break after it, with the final output being ["self-", "aware"] when marked, whereas the American English dictionary yields ["self-", "-aware"].

I knew about the existence of language-specific hyphenation rules, but not about this one.

> It should also be noted that the hyphenation API does not handle existing hyphens on its own, as mentioned in the notes about word segmentation; individual dictionaries may do it, but only as a consequence of what's included in the TeX patterns.
>
> (I had originally written v0.7 to handle hyphens independently, but ultimately decided against it, since diverging conventions apply, and adopting one is ultimately an editorial decision. I'm not strictly against offering a default, mind, but not before the library sees some extended usage.)

Don't worry: I'm using a custom slice iterator when the word I'm hyphenating contains special characters, otherwise I just use hyphenation's slice iterator. The two iterators are unified via Either.
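A rough sketch of that kind of unification, under stated assumptions: the `Either` enum below is a minimal stand-in for the one in the `either` crate, `slices` is a hypothetical dispatcher, and `std::iter::once` merely stands in for the dictionary-backed slice iterator so that the example stays self-contained.

```rust
/// Minimal stand-in for the `either` crate's `Either` type.
enum Either<L, R> {
    Left(L),
    Right(R),
}

/// `Either` is an iterator whenever both sides yield the same item type.
impl<L, R> Iterator for Either<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

/// Hypothetical dispatcher: purely alphabetic words take the "dictionary"
/// path, anything else takes the custom path.
fn slices<'a>(word: &'a str) -> impl Iterator<Item = &'a str> {
    if word.chars().all(char::is_alphabetic) {
        // Stand-in for the dictionary-based slice iterator: yields the
        // whole word as a single slice.
        Either::Left(std::iter::once(word))
    } else {
        // Custom path: break after each hyphen, keeping it on the left slice.
        Either::Right(word.split_inclusive('-'))
    }
}

fn main() {
    let plain: Vec<&str> = slices("toaster").collect();
    let special: Vec<&str> = slices("self-aware").collect();
    println!("{:?} {:?}", plain, special);
}
```

Both branches return different concrete iterator types, and wrapping them in one enum lets the caller treat them as a single `Iterator` without boxing.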

@tapeinosyne
Owner

> Thanks, I'm aware of that. In fact, I want to deal with the slices and not with the strings with the added hyphens and I had to dig into the code to realize that .iter().segments() was what I wanted.

Mh, I'll add some module documentation. The function is nominally documented but not all that visible.

> Also, it might be worth noting that the current README has broken asserts:

Thank you, I'll fix those. (I test code examples manually because Cargo / Rustdoc require them to be laid out in a rather unwieldy fashion, and things do slip through.)


2 participants