add normalization to the files/chapters name #288

Open
wants to merge 4 commits into
base: master

BassantAbdelaziz

solve normalization issue

@aerkalov
Owner

aerkalov commented Aug 2, 2023

Thanks for this. I am checking the docs, and they say "All file names within the same directory MUST be unique following Unicode canonical normalization and then full case folding". I am not that good with Unicode and will have to read a bit more about it, but do you know what this "full case folding" they mention is?
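
As a rough illustration of the quoted rule, here is a minimal sketch of a uniqueness check. It assumes NFC as the canonical normalization form and uses Python's built-in `str.casefold()`, which applies full case folding (for example, `ß` folds to `ss`). The helper names `fold_key` and `find_collisions` are hypothetical and are not part of ebooklib.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Comparison key: canonical normalization, then full case folding."""
    # Assumption: NFC is used as the canonical normalization form.
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names):
    """Return pairs of names that share the same comparison key."""
    seen = {}
    collisions = []
    for name in names:
        key = fold_key(name)
        if key in seen:
            collisions.append((seen[key], name))
        else:
            seen[key] = name
    return collisions


# "Straße" and "STRASSE" differ as strings but match after full case folding.
print(find_collisions(["Straße.xhtml", "STRASSE.XHTML", "ch2.xhtml"]))
```

Simple lowercasing would miss pairs like the one above, which is the practical difference between full case folding and `str.lower()`.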

@BassantAbdelaziz
Author

@aerkalov Thank you for your interest and reply. Allow me to explain the reasons behind the changes I made to the code.

I used the ebooklib library to process Arabic EPUBs: extracting information from the OPF file (such as the spine, manifest, and publisher name) and reading the content of each chapter. However, I ran into an issue with a file/chapter name, نهائي_الخبر_الرشيد. The library requires that the file name used to access an item in the EPUB archive match the actual file name stored in the archive.

The error I faced was due to the presence of certain Arabic characters that required normalization, such as 'ئ' and 'ئ', to ensure consistency in the file names. Therefore, I implemented normalization for Arabic letters to handle these characters appropriately.

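In Unicode, Arabic letters that carry marks such as Hamza and Madda can be encoded in more than one way: as a single precomposed code point, or as a base letter followed by a combining mark. The two encodings look identical but compare as different strings, which leads to inconsistencies in file names. The normalization process converts these characters to their base forms with specific diacritics, so that every file name uses one standardized representation.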
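
To make the two representations concrete, here is a small sketch using Python's standard `unicodedata` module. The choice of the letter Yeh with Hamza above is an illustrative assumption based on the characters mentioned earlier, and `same_name` is a hypothetical helper, not code from this PR.

```python
import unicodedata

precomposed = "\u0626"       # ARABIC LETTER YEH WITH HAMZA ABOVE (one code point)
decomposed = "\u064A\u0654"  # ARABIC LETTER YEH + ARABIC HAMZA ABOVE (two code points)


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(precomposed == decomposed)                                    # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)      # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)      # True
print(same_name(precomposed, decomposed))                           # True
```

Either form works for comparison as long as both sides are normalized the same way; NFD yields the base letter plus combining mark, NFC yields the precomposed letter.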

By normalizing the file names, I was able to resolve the error encountered while accessing items in the EPUB archive. This solution ensures that the specified file name in the code matches the actual file name in the archive, thus enabling smooth processing of Arabic EPUBs with accurate and consistent file names.
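
The lookup problem described above can be sketched with the standard `zipfile` module alone. This is an assumption-laden illustration, not the actual change in this PR and not ebooklib's API: `read_normalized` is a hypothetical helper, NFC is an assumed default, and the archive is built in memory only for the demonstration.

```python
import io
import unicodedata
import zipfile


def read_normalized(archive: zipfile.ZipFile, requested: str, form: str = "NFC") -> bytes:
    """Read the member whose name equals `requested` after normalization."""
    wanted = unicodedata.normalize(form, requested)
    for actual in archive.namelist():
        if unicodedata.normalize(form, actual) == wanted:
            # Use the name exactly as stored, since zipfile matches names exactly.
            return archive.read(actual)
    raise KeyError(requested)


# Build a tiny archive whose member name uses the decomposed spelling.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("OEBPS/\u064A\u0654.xhtml", "<html/>")

# Request the same file using the precomposed spelling.
with zipfile.ZipFile(buffer) as zf:
    data = read_normalized(zf, "OEBPS/\u0626.xhtml")

print(data)
```

A plain `zf.read("OEBPS/\u0626.xhtml")` on this archive raises `KeyError`, because the stored name and the requested name are different code point sequences.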

