Japanese characters in metadata on (u)pLaTeX #18
Comments
Handling Japanese characters on (u)pTeX is a big deal, due to the complicated traditional encoding rules. It is almost impossible to support such a complicated conversion rule properly, even for us Japanese, without knowing the historical reasons. So, we recommend you "just leave it to us, without any conversion." A halfway conversion makes things more complicated, so pass it literally to us!
It's fine, one can always move if needed.
Yes, I guess it is this problem, so we will have to solve that first.
Well, the driver used by hyperref if the pdfmanagement is used is "generic", which means it doesn't contain any engine tests. And yes, it forces the unicode option. But the option mostly declares how hyperref writes out strings to the PDF or DVI, so I'm not sure why pTeX cares about it. Side question: which input encoding is used with pTeX? UTF-8 or something else?
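(For reference, this snippet is illustrative and not quoted from the thread: the option in question is hyperref's unicode key, which a user would otherwise set explicitly.)

```latex
% hyperref's unicode option controls how strings are encoded when they
% are written out; the generic driver turns it on unconditionally.
\usepackage[unicode]{hyperref}
% or, equivalently, later in the preamble:
\hypersetup{unicode=true}
```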
The check is done by the pxjahyper package (not by pTeX), as the package knows it cannot work as expected when the strings are written out in unicode.
It depends on users ;-) I think many users are writing in UTF-8 these days, but in some situations people still choose to write in Shift-JIS or EUC-JP for historical reasons. The pTeX engine accepts all of these encodings, by using …
@u-fischer Move to the latex3 repo?
But strings are not written out "in unicode". I mean, everything in the PDF is ASCII. hyperref writes out strings like …
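(The exact string is not shown above; as an illustrative assumption about the typical form: with the unicode option, a title such as "Test" ends up in the PDF as an ASCII-only, octal-escaped UTF-16BE string with a leading byte-order mark.)

```
/Title (\376\377\000T\000e\000s\000t)
```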
@josephwright I think there is already an issue about str-convert in latex3, so this can stay here for now.
Because they are still Unicode, only written in octal, aren't they? The point is (I may be mistaken): if pTeX doesn't support Unicode but Shift-JIS or EUC-JP, it probably translates to that on input, i.e. it is like inputenc translating to some LICR internally and then working with that. But your output isn't Shift-JIS, it is Unicode, and so it chokes ... my rough guess.
@FrankMittelbach but I'm writing that out in specials, so only the driver sees it. And this is dvips or something like that, isn't it?
maybe because of this?
the Japanese char ends up in your octals and so the ToUnicode mapping stops working? (just guessing)
@u-fischer, @aminophen will correct me if I'm wrong, but I think the old model for pTeX was that the specials contained Shift-JIS or whatever in the special and the DVI driver did the conversion to Unicode in the final stage. Given that …
Correct.
"\ looks like Yen in shift-jis" is unrelated. The problem here is that "how the octal should be decoded." (using shift-jis? or euc-jp? or utf-8?). The encoding is pre-determined by (u)pTeX engine (not by a macro layer), so you cannot disable such a encoding conversion. Instead, you have to know "how (u)pTeX engine will encode it" and keep consistency with it. |
I think I can remove the unicode settings from the generic driver, as pdftex now uses unicode by default anyway. It will not solve everything, as the driver doesn't always use hyperref commands for the conversion, so one will have to check if there are more wrong conversions somewhere, but this requires at first that l3str-convert works correctly with Japanese.
Unfortunately, Japanese devs found out that "it is almost impossible to support Japanese characters within the current behavior of pTeX, as long as l3str-convert uses …".

As described in the TUGboat article "Distinguishing 8-bit characters and Japanese characters in (u)pTeX" by H. Kitagawa, pTeX becomes confused between Latin and Japanese character tokens during "stringization". The "stringization" procedure occurs in …

The current behavior is considered unnatural today; however, at the time pTeX was designed and developed, Unicode 1.0 didn't even exist and 8-bit character inputs were rare, so the problem had never been exposed at all for years. Changing the behavior at this stage needs a design change and requires lots of testing, so it would take much time.
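(A minimal probe, not from the thread: it assumes the failure appears as soon as l3str-convert has to work byte-wise on a stringized Japanese character; the exact error depends on the engine and input encoding.)

```latex
% Minimal probe of the byte-level conversion path on (u)pLaTeX.
% l3str-convert stringizes its input and then processes it byte by byte,
% which is where a Japanese character token from (u)pTeX can be mis-handled.
\ExplSyntaxOn
\str_set_convert:Nnnn \l_tmpa_str { あ } { utf8 } { utf16 }
\str_show:N \l_tmpa_str
\ExplSyntaxOff
```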
The question is if and how one can sanitize the input first. That means, if you get an input like …, can you convert …?
Well, I would say that while one can look for macro solutions, engine changes are really needed here ;-) But we quite understand that large changes aren't done fast.
I don't think one can sanitize such an input easily, sorry. Theoretically it may be possible to delete all Japanese characters first and then restore them, as a "temporary workaround" to cope with the current (u)pTeX behavior. However, such a manipulation would require a deep understanding of how (u)pTeX treats Japanese character tokens. We (Japanese developers) are the only ones who can provide that information, but our human resources for that are limited and would overlap with those needed to examine/improve and test the whole (u)pTeX engine. Therefore, we'd like to concentrate on the improvement of the (u)pTeX engine, rather than devising a temporary workaround. All we can hope is that you do not make the l3str-convert code the default, at least for (u)pLaTeX, until we manage to release an "improved" version of (u)pTeX in TL2022 or something.
I don't think that there is a large problem, even if a pTeX user wants to try the pdfmanagement code. It uses l3str-convert code in various places, but mostly for text for which a sensible user would use ASCII, like filenames or destination names. Bookmarks still use the hyperref commands, and if we implement something new here it wouldn't be difficult to add an option to fall back to the older code. So from a practical point of view, what remains is your example at the start: handling pdftitle and pdfauthor. Can you provide some code which shows how you would write something into the Info dictionary with the primitive \pdfinfo?
It's true that a sensible user would use ASCII for filenames or destinations, but we don't know all the places where l3 code is already used or will be used in the future, so I cannot tell whether your point is safe enough.
Something like \usepackage[strconvert=2e]{hyperref} would suffice; if you could provide a way to simply fall back to the old code, then we can extend pxjahyper to enable the strconvert=2e option for pTeX. In this scenario, you will not need any knowledge about Japanese tokens, as the old code does no harm for us.
I only meant that it is safe enough for the near future. I don't think that it is safe long term: non-ASCII is used increasingly in places where traditionally only ASCII was used, like file names, command names, label names, URLs, verbatim content like code listings and so on. And that means that solutions that work OK if you only want to print non-ASCII are no longer sufficient, and this doesn't refer only to pTeX: 8-bit file encodings, for example, can get problematic too. Imho it is quite important that the updated pTeX engines are made available as fast as possible so that tests can be done with them.
I know, but pTeX would not be able to fully support non-ASCII by design, even after the "improvement" of the engine is done. Actually, it has been noted that pLaTeX cannot process some Latin documents due to the incompatible design of pTeX regarding reading bytes. So, we don't hope for full support for non-ASCII; all we need is that there is no regression compared to the current behavior. OTOH, upTeX has the full potential, as it has an enhanced design of kcatcode storage compared to pTeX.
We'll look into it. Fingers crossed...
(I am the author of the pxjahyper package.) Indeed we do think that l3str-convert (as well as other expl3 features) must support e-(u)pTeX in the future (unless we totally abandon e-pTeX and/or e-upTeX), and we have already started to act on that. The survey of critical issues is the first step. Another thing to note: the "old way" of hyperref + pxjahyper works fine, but it heavily depends on hyperref's inputenc-fontenc conversion chain. That conversion will handle non-CJK non-ASCII letters in the same way as pdfTeX. Thus I think that the most reasonable way to support e-(u)pTeX for the present would be to use …
I would suggest that for (u)pLaTeX you do for now something like this if the pdfmanagement is detected:
and similar for the author, subject and keywords keys. |
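(The suggested code is not shown above. As a rough, hypothetical sketch only, assuming the pdfmanagement interface accepts a ready-made PDF string for Info entries, such a fallback could pass the title through untouched and leave the conversion to dvipdfmx via the pdf:tounicode CMap set up by pxjahyper.)

```latex
% Hypothetical sketch, not the code from the thread: write the title into
% the Info dictionary as-is and let the dvipdfmx ToUnicode CMap do the
% encoding conversion at the driver level.
\ExplSyntaxOn
\pdfmanagement_add:nnn { Info } { Title } { (日本語のタイトル) }
\ExplSyntaxOff
```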
I'm not sure this is the right place, sorry.
After loading \RequirePackage{pdfmanagement-testphase}, adding Japanese characters into PDF metadata does not work at all on (u)pLaTeX + dvipdfmx.

upLaTeX + dvipdfmx
Nowadays we use the following syntax.
If we add
then an error happens:
It seems that "l3str-convert" does not support Japanese characters. (Maybe similar to latex3/latex3#939; you should consider better handling of Japanese tokens; just passing them as-is, literally untouched, is OK for us.)
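(The code blocks are not shown above; the following is a hypothetical minimal example of the kind of upLaTeX + dvipdfmx document that triggers the problem. The class, title text, and key choices are assumptions, not the original report.)

```latex
% Hypothetical minimal example (assumed, not the original report):
% pdfmanagement-testphase is loaded before the class, then hyperref and
% pxjahyper are used and a Japanese pdftitle/pdfauthor is set.
\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata{uncompress}% key choice is arbitrary here
\documentclass[dvipdfmx]{ujarticle}
\usepackage{hyperref}
\usepackage{pxjahyper}
\hypersetup{pdftitle={日本語のタイトル}, pdfauthor={著者名}}
\begin{document}
テスト
\end{document}
```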
pLaTeX + dvipdfmx
Nowadays we use the following syntax.
If we add
then an error happens:
The error itself is natural enough, since the pLaTeX engine cannot handle Unicode at all. Therefore, "hyperref" should not enable "unicode" mode on pLaTeX.
FYI: what the "pxjahyper" package does
When using (u)pLaTeX, we need to give the correct "encoding conversion rule" to dvipdfmx by using the pdf:tounicode special. The rule is provided by a ToUnicode CMap developed by Adobe or other contributors (for upLaTeX "UTF8-UTF16" or "UTF8-UCS2"; for pLaTeX "EUC-UCS2" or "90ms-RKSJ-UCS2"), but we need to select the correct one depending on which engine is running (pLaTeX or upLaTeX) and which encoding is used (mostly Shift-JIS on win32 / EUC-JP on Unix). The "pxjahyper" package does this automatically.
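(For illustration only; this is an assumption about what pxjahyper effectively arranges at the special level, not code quoted from the report. The CMap shown is the one for pLaTeX with Shift-JIS input; it must be swapped as described above.)

```latex
% Roughly what ends up in the DVI: first tell dvipdfmx which ToUnicode
% CMap to apply, then write the Info dictionary with the literal bytes.
\special{pdf:tounicode 90ms-RKSJ-UCS2}
\special{pdf:docinfo << /Title (日本語のタイトル) /Author (著者名) >>}
```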