Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid trailing whitespace in kobo spans #174

Merged
merged 7 commits into from
Oct 13, 2024
Merged

Conversation

Shiandow
Copy link
Contributor

@Shiandow Shiandow commented Oct 5, 2024

Implement long overdue TODO

Description of change

This seems like a small change but has huge implications for the way text is rendered with optimizeLegibilty and geometricPrecision. Really all it does is move the trailing whitespace to the front of the next group, just like the TODO note suggested ought to be done.

I'm guessing at the exact mechanism, but for unclear reasons whatever justification program Kobo method uses it apparently does not expand or contract the last space in a kobo span. (unless using optimizeSpeed, maybe?) This can causes weird looking text, and in the worst case may even expand the space between letters just to stretch the text far enough. If this is correct it would fit the behaviour observed by jackie_w on the mobileRead forums

This change may create a few pure-whitespace kobo spans when a sentence has some trailing whitespace, which I think could be ignored.

Test results

This seems to fix most issues with optimizeLegibilty. I'm still testing but the improvements I'm seeing are substantial enough that I think it might be best if other people try this out as well. Even if we can't fix everything it improves things a lot.

Only niggle is that the regexp only works for proper sentences, there are still a couple of issues with things like emphasis where a span might end up with trailing whitespace. These occur when the text ends in whitespace mid sentence.

This seems like a small change but has huge implications for the way text is rendered with optimizeLegibilty and geometricPrecision.

Signed-off-by: Shiandow <[email protected]>
Copy link
Contributor

deepsource-io bot commented Oct 5, 2024

Here's the code health analysis summary for commits 327f809..d63e6b4. View details on DeepSource ↗.

Analysis Summary

AnalyzerStatusSummaryLink
DeepSource Shell LogoShell✅ SuccessView Check ↗
DeepSource Python LogoPython❌ Failure
❗ 5 occurences introduced
View Check ↗

💡 If you’re a repository administrator, you can configure the quality gates from the settings.

@Shiandow
Copy link
Contributor Author

Shiandow commented Oct 5, 2024

I've made a small change to also avoid trailing whitespace when the text cuts off midsentence (as it does when a <em> tag or other span is in the middle).

I think this works but I still have some things I'm not entirely sure about. Even though I don't think any would affect the result much.

  • Why limit it to only one quote after a period? It might be incorrect but I do occasionally see "Quotes like 'this.'"
  • Maybe 'split' is not the most logical option, though it does work and ensure that nothing is skipped.
  • Does whitespace need its own span? A sentence with emphasis like this, would start <span>A sentence with<span><span> </span>. I've already confirmed that leaving the whitespace at the end results in problems, but I don't know if we need to wrap it in a span.

@Shiandow Shiandow marked this pull request as ready for review October 5, 2024 19:34
@jgoguen
Copy link
Owner

jgoguen commented Oct 7, 2024

Thanks for this!

For whitespace-only spans, let's not wrap only whitespace in a span.

You can ignore the Lint tasks, those look like something I have to fix. But please fix the tests (test_container.TestContainer.test_spans_added is failing).

@Shiandow
Copy link
Contributor Author

Shiandow commented Oct 8, 2024

In that case it's worth trying to put all leading and trailing whitespace outside of the spans. That will look more logical in the html and if one works I don't see why the other wouldn't.

I'll have to see if the result works equally well in kobo. If it does it shouldn't be too hard to fix the tests (may be worth adding one to check spans don't still end in whitespace somehow).

BTW, I think the lint is just one line that is too long, I'll add the extra newlines so it'll stop complaining.

Some of these were just replicating the implementation, which
seemed less than useful. I've tried to test the important properties.

The PR#106 test needs to be weakened a bit but the weakened test should
still catch the bug.
@Shiandow
Copy link
Contributor Author

Shiandow commented Oct 8, 2024

I haven't managed to get the tests running locally, but I think they should work now. The new version also seems to work at least equally well on an actual kobo, and I like both the result and the way it works better. Can't quite figure out how to make all linter(s?) happy though, I think the code is about as readable as I can make it.

Let me know if you want me to cleanup the commit history.

@jgoguen
Copy link
Owner

jgoguen commented Oct 8, 2024

A lot of what I've set up for tests assumes Linux, though macOS should work just as well. If you're on Windows, I wish you the best of luck getting things running locally without moving into WSL2 territory.

Don't worry about the lint tests, vale seems to not work with no changes made and I can easily clean up black later. The CI tests are the important ones and those look like they're probably just whitespace-related changes.

I'm not fussy about commit history. It's a view of what changed, I don't really care if a PR is one squashed commit or a series of commits.

The githup_106 test is a bit odd, because I'm not sure what changed.

Whitespace should be identical though, even without the lstrip.
Which is easy enough to check.
@Shiandow
Copy link
Contributor Author

Shiandow commented Oct 9, 2024

Yeah I was worried that might be the case. I don't have a unix system handy right now, and while I could reverse engineer it to work for windows I hope to get through it with limited trial and error. Ironically I'm more confident the actual changes are correct than I am about the tests.

Though if the github_106 keeps failing I'll have to figure out what's going on with the non-significant whitespace, which is probably not toing to be possible without the ability to actually test the tests.

@jgoguen jgoguen merged commit 296783a into jgoguen:main Oct 13, 2024
1 of 2 checks passed
@jgoguen
Copy link
Owner

jgoguen commented Oct 13, 2024

I had a chance to take a look at the code in more depth and the test failures are related to how text splitting has changed with this code. I'll fix that up, otherwise this looks good.

@Shiandow
Copy link
Contributor Author

Ok, thanks for that, that helps a lot.

By the way, should you wish to change anything in the future, the current version should be able to handle changes to the regexp. Anything captured will be put in a span and anything not captured will be left outside.

@Shiandow Shiandow deleted the patch-1 branch October 14, 2024 09:33
@callahad
Copy link

I suspect this has caused a regression in #175

@Shiandow
Copy link
Contributor Author

Shiandow commented Nov 27, 2024

I see. Well if that is a problem you could probably just capture the leading whitespace as well. Changing the regexp to

TEXT_SPLIT_RE = re.compile(
    r'(.*?(?:[\.\!\?\:][\'"\u201c\u201d\u2018\u2019\u2026]*(?=\s)|(?=\s*$)))',
    re.UNICODE | re.MULTILINE,
)

ought to do it.

Under no circumstances should you try to capture the trailing whitespace however. The kepub text rendering does not like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants