add normalization to the files/chapters name #288

BassantAbdelaziz · 2023-08-01T13:21:04Z

solve normalization issue

aerkalov · 2023-08-02T22:33:28Z

Thanks for this. I am just checking the docs, and they say "All file names within the same directory MUST be unique following Unicode canonical normalization and then full case folding". I am not that good with Unicode and will have to read a bit more about it, but do you know what this "full case folding" they are talking about would be?

BassantAbdelaziz · 2023-08-03T07:09:32Z

@aerkalov Thank you for your interest and reply. Allow me to explain the reasons behind the changes I made to the code.

I used the ebooklib library to process Arabic EPUBs: extracting essential information from the OPF file (the spine, the manifest, the publisher name) and reading the content of each chapter. However, I ran into an issue with a file/chapter name, which was نهائي_الخبر_الرشيد. The library requires that the file name used to access an item in the EPUB archive match the actual file name stored in the archive.

The error I faced was due to the presence of certain Arabic characters that required normalization, such as 'ئ' and 'ئ', to ensure consistency in the file names. Therefore, I implemented normalization for Arabic letters to handle these characters appropriately.

In Arabic, characters that carry diacritics such as Hamza and Madda can be represented in more than one way, which can lead to inconsistencies in file names. Normalization converts these characters to a consistent form (base letters with specific diacritics), so that the file names are standardized.
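To make the point about multiple representations concrete, here is a minimal sketch using Python's standard `unicodedata` module. The code points are an illustrative assumption (YEH WITH HAMZA ABOVE in precomposed and decomposed form); the exact characters involved in the failing file name are not confirmed in this thread.

```python
import unicodedata

# Two encodings of the same visible letter (assumed example, not taken from the PR).
precomposed = "\u0626"        # ARABIC LETTER YEH WITH HAMZA ABOVE, one code point
decomposed = "\u064A\u0654"   # ARABIC LETTER YEH + combining HAMZA ABOVE

# The raw strings compare unequal even though they render the same.
raw_equal = precomposed == decomposed

# After canonical normalization (either composed or decomposed form) they match.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)
```

Either form works for comparison as long as both sides of the comparison use the same one.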

By normalizing the file names, I was able to resolve the error encountered while accessing items in the EPUB archive. This solution ensures that the specified file name in the code matches the actual file name in the archive, thus enabling smooth processing of Arabic EPUBs with accurate and consistent file names.
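On the "full case folding" question: in Python, `str.casefold()` applies Unicode full case folding, which is more aggressive than `lower()` (for example, it maps "ß" to "ss"). The sketch below combines it with canonical normalization to build a comparison key. The helper names `name_key`, `find_member`, and `has_collisions` are hypothetical and are not part of ebooklib or of this PR; NFC is an assumed choice of canonical form, and the check compares whole paths rather than names within one directory.

```python
import unicodedata


def name_key(name: str) -> str:
    """Comparison key: canonical normalization (NFC assumed), then full case folding."""
    return unicodedata.normalize("NFC", name).casefold()


def find_member(archive_names, wanted):
    """Return the stored name that matches `wanted` under name_key, or None."""
    key = name_key(wanted)
    for stored in archive_names:
        if name_key(stored) == key:
            return stored
    return None


def has_collisions(archive_names):
    """True if two names become identical after normalization and case folding."""
    keys = [name_key(n) for n in archive_names]
    return len(keys) != len(set(keys))


# Stored name uses the decomposed form; the requested name uses the precomposed form.
stored_names = ["OEBPS/\u0646\u0647\u0627\u064A\u0654\u064A.xhtml"]
requested = "OEBPS/\u0646\u0647\u0627\u0626\u064A.xhtml"

match = find_member(stored_names, requested)
collision = has_collisions(["Stra\u00DFe.xhtml", "STRASSE.xhtml"])
print(match is not None, collision)
```

With a key like this, a lookup succeeds even when the manifest and the archive encode the same name differently, and the uniqueness rule quoted from the docs can be checked with the same function.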

BassantAbdelaziz added 3 commits August 1, 2023 16:20

add normalization to the files/chapters name

67176ab

add normalization to the files/chapters name

fe84484

use the normalized name

ab5a16c

use urllib.parse.unquote

4cdc1ec
