Emphasis and East Asian text #208

ikedas · 2017-06-25T03:33:07Z

Discussions:

There are three commits:

Change on code,
Proposed change on spec,
Additional test cases (maybe insufficient).

I realise that the change introduces some ambiguity, but I don't think it is actually a problem.

Rule 6:

__foo、__bar__、baz__
.
<p><strong>foo、</strong>bar<strong>、baz</strong></p>

is not

<p><strong>foo、<strong>bar</strong>、baz</strong></p>

Rule 7:

**〔**foo〕
.
<p><strong>〔</strong>foo〕</p>

is not

<p>**〔**foo〕</p>

jgm · 2017-06-27T08:54:16Z

Thanks for doing this!

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

It would also make the code slightly more efficient (one test rather than two -- though perhaps the compiler is smart enough to optimize away this difference).

What do you think?

@kivikakk

ikedas · 2017-06-27T14:19:45Z

Thanks for the comment.

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

I thought the same at first, but such a modification could not handle many cases that use underscores (_). In any case, for EA writers it is a fact that EA punctuation needs to be handled differently from Western punctuation.

Another point is that some punctuation marks are shared between EA and Western text, e.g. “ and ”. They cannot be excluded.

jgm · 2017-06-27T21:00:33Z

+++ IKEDA Soji [Jun 27 17 07:19 ]:

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters. This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian). I thought the same at first, but such modification could not handle many cases using underscores (_). Anyway, for EA writers it is real that EA punctuations should be handled in different way from Western ones.

Can you give a specific example of a case where you think what I suggest wouldn't work? I think I can do it in a way that is logically equivalent to yours, but simpler both in the spec and the program.

Another point is that some punctuations are shared among EA and Western, e.g. “, ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character' to include these but exclude east-asian-only punctuation.

ikedas · 2017-06-28T03:03:11Z

Can you give a specific example of a case where you think
what I suggest wouldn't work? I think I can do it in a way
that is logically equivalent to yours, but simpler both
in the spec and the program.

OK, here they are. In the following texts, 「, 」 and 。 are EA punctuation marks.

Example 1

猫は*「のどか」*という。

猫は_「のどか」_という。

Current master:

<p>猫は*「のどか」*という。</p>
<p>猫は_「のどか」_という。</p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は_「のどか」_という。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は<em>「のどか」</em>という。</p>

Example 2

猫は*「のどか」*という。犬は*名がない*。

猫は_「のどか」_という。犬は_名がない_。

Current master:

<p>猫は*「のどか」<em>という。犬は</em>名がない*。</p>
<p>猫は_「のどか」<em>という。犬は_名がない</em>。</p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
<p>猫は_「のどか」_という。犬は_名がない_。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
<p>猫は<em>「のどか」</em>という。犬は_名がない_。</p>

Another point is that some punctuations are shared among EA and
Western, e.g. “, ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character'
to include these but exclude east-asian-only punctuation.

Excluding these from Western punctuation will not affect Western text, because a space before or after punctuation is usual in Western text (␣ means space).

The␣cat␣is␣named␣*“Nodoka”*.

On the other hand, including them in EA punctuation will help with formatting EA text, because spaces before or after punctuation are unnatural in EA text.

猫は*“のどか”*という。

猫は␣*“のどか”*␣という。            --- unnatural

So I think it is better for them to belong to EA punctuation.

kivikakk · 2018-09-05T02:42:35Z

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

ikedas · 2018-09-05T03:25:04Z

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

Of course I agree. Please let me know if there is anything I should do.
I'll re-push my commits.

tamlok · 2018-09-30T04:01:52Z

Hi,

Any updates on this PR?

I think lots of projects are waiting for the update in upstream. :)

Thanks!

jgm · 2018-09-30T05:59:12Z

Excluding these from Western punctuations will not affect Western text, because space before/after punctuation is ordinary in Western texts (␣ means space).

Not always. Examples:

the Marines’ slogan—“semper fi”—is well known.
he uttered his usual greeting (“hello”).
‘“hello” is longer than “hi”,’ she noted.

ikedas · 2018-09-30T06:05:16Z

@jgm, what matters in this PR is the interaction between punctuation and emphasis. Are your examples affected (I haven’t confirmed)?

jgm · 2018-09-30T06:26:07Z

My point was just that there might be unexpected consequences to treating these characters like non-punctuation, and that it isn't the case that they're never flanked by punctuation characters. It's hard to survey ahead of time all the cases that might arise, but here's one for concreteness:

He stammered, “*hello, I was...*”

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

ikedas · 2018-09-30T07:40:12Z

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

My PR does not treat LEFT/RIGHT DOUBLE QUOTATION MARK as non-punctuation; it treats them as EA punctuation. In fact, even with my modification applied:

$ build/src/cmark 
He stammered, “*hello, I was...*”
<p>He stammered, “<em>hello, I was...</em>”</p>
$

jgm · 2018-10-01T17:17:26Z

Sorry for the misunderstanding.

left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
                   (!cmark_utf8proc_is_punctuation(after_char) ||
                    cmark_utf8proc_is_eastasian_punctuation(after_char) ||
                    cmark_utf8proc_is_space(before_char) ||
                    cmark_utf8proc_is_punctuation(before_char));
right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
                  (!cmark_utf8proc_is_punctuation(before_char) ||
                   cmark_utf8proc_is_eastasian_punctuation(before_char) ||
                   cmark_utf8proc_is_space(after_char) ||
                   cmark_utf8proc_is_punctuation(after_char));

Simplifying a bit (EDIT: sorry, first version was completely wrong):

Left flanking:

after char is non-space, AND
one of the following:
- after char is EA punctuation or non-punctuation
- before char is space or punctuation

Right flanking:

before char is non-space, AND
one of the following:
- before char is EA punctuation or non-punctuation
- after char is space or punctuation

The effect of this part of the rule is to make it strictly easier to count as left-flanking and right-flanking, in the cases where a left-flanking run is followed by EA punctuation or a right-flanking run is preceded by EA punctuation. So there won't be examples of the sort I was trying to give, where your rule fails to count something as left- or right-flanking that the original rule does.
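A minimal Python sketch of the simplified rule above, next to the existing rule, may make the difference easier to see. The helper names (`is_space`, `is_punctuation`, `is_ea_punctuation`, `flanking`) are illustrative and are not cmark's API. The East Asian test is an assumption: it counts punctuation whose `east_asian_width` is W, F, H or A as EA punctuation, which may not match the table the PR actually uses.

```python
import string
import unicodedata

EA_WIDTHS = {"W", "F", "H", "A"}  # ASSUMPTION: not taken from the PR


def is_space(ch: str) -> bool:
    # Approximates cmark's whitespace test; pass "\n" for a line boundary.
    return ch.isspace()


def is_punctuation(ch: str) -> bool:
    # ASCII punctuation plus any character in a Unicode P* category.
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def is_ea_punctuation(ch: str) -> bool:
    # Hypothetical stand-in for cmark_utf8proc_is_eastasian_punctuation.
    return is_punctuation(ch) and unicodedata.east_asian_width(ch) in EA_WIDTHS


def flanking(before: str, after: str, ea_aware: bool = False):
    """Return (left_flanking, right_flanking) for a delimiter run."""
    left = (not is_space(after)) and (
        not is_punctuation(after)
        or (ea_aware and is_ea_punctuation(after))
        or is_space(before)
        or is_punctuation(before)
    )
    right = (not is_space(before)) and (
        not is_punctuation(before)
        or (ea_aware and is_ea_punctuation(before))
        or is_space(after)
        or is_punctuation(after)
    )
    return left, right


# Opening "*" in 猫は*「のどか」*という。 (before は, after 「)
print(flanking("は", "「"), flanking("は", "「", ea_aware=True))
# Closing "*" in the same sentence (before 」, after と)
print(flanking("」", "と"), flanking("」", "と", ea_aware=True))
```

In both Japanese cases the EA-aware variant reports the run as left- and right-flanking at once, while the existing rule reports only one of the two.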

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

But, just to make a general comment, one thing I dislike about the proposed change is that it makes an already fairly complicated rule, which I could (barely) keep in my head, even more complicated and hard to think about. That is the reason I've found it difficult to get convinced that this change should be made. It's not by itself a reason to reject the change, but I haven't yet been convinced that the change won't have unanticipated consequences.

jgm · 2018-10-01T17:35:10Z

Here's an (admittedly artificial) example where we'd see a difference, if I'm not mistaken:

*“*there*”*

With the proposed rule, the second * can close emphasis and so we'd get

<em>“</em>there<em>”</em>

whereas currently we get

<em>“<em>there</em>”</em>

Unless I've made a mistake in thinking about it...

jgm · 2018-10-01T17:50:10Z

Another case:

*He said, **“*hello*”**.*

ikedas · 2018-10-02T06:49:37Z

I'll investigate your simplified rule afterward (but I want to confirm: It is equivalent to my rule, isn't it?).

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

What is the reason for the "unique flankingness" requirement? To me, flankingness looks like it was introduced only to describe the behavior of the parser (without considering EA context).

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

It is natural that modifying the rules will change behavior. We have to modify the rules if they can't handle text as we expect.

I can't decide whether the changes this brings to existing texts are acceptable. There seem to be the following options.

Below, an "ambiguous punctuation" character is a punctuation character whose east_asian_width property is "A" and which can be used in both East Asian and Western contexts, including: ¡, ¿, –, —, ‘, ’, “, ”.

- Reject all changes in this PR. --- Obviously uncomfortable for East Asian writers.
- Treat ambiguous punctuation as non-East Asian punctuation. --- A bit uncomfortable for East Asian writers.
- Add an option (at compile time or at runtime) to treat ambiguous punctuation as either East Asian or non-East Asian punctuation, according to the user's choice.
- Treat ambiguous punctuation as East Asian punctuation. --- More or less uncomfortable for Western writers.
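Whether a character falls into this ambiguous class can be checked against the Unicode database. The following is a small sketch using Python's standard `unicodedata` module; the helper name `is_ambiguous_punctuation` is hypothetical and only mirrors the definition given above.

```python
import unicodedata


def is_ambiguous_punctuation(ch: str) -> bool:
    # Punctuation (any Unicode P* category) with East Asian width "A".
    return (
        unicodedata.category(ch).startswith("P")
        and unicodedata.east_asian_width(ch) == "A"
    )


for ch in "¡¿–—‘’“”。.":
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
        is_ambiguous_punctuation(ch),
    )
```

The last two characters are included for contrast: 。 is East Asian only and the ASCII full stop is Western only, so neither is in the ambiguous class.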

ikedas · 2018-10-02T07:10:04Z

Another case:

*He said, **“*hello*”**.*

I'll add corresponding examples with East Asian context.

For example,

*他說，**“*你好*”**。*

will be handled properly by current master (Note: ， is not comma + space but an EA punctuation mark), as:

<p><em>他說，<strong>“<em>你好</em>”</strong>。</em></p>

However, the example above is a lucky case. The sentence is probably still understandable without ，. Removing it,

*他說**“*你好*”**。*

will be rendered with current master as:

<p><em>他說</em>*“<em>你好</em>”**。*</p>

I think this result is hard for writers to accept.

As a workaround, we might, for example, recommend that writers use markup such as:

*他說 **“*你好*”**。*

This will be rendered as:

<p><em>他說 <strong>“<em>你好</em>”</strong>。</em></p>

The result is readable if readers ignore the ugly space. However, it may not be easy to justify forcing writers to insert unusual spaces that do not appear in the plain text without markup.

Note: My PR will not solve all problems with current master: it cannot handle markup in an East Asian context that is as complex as what works in a Western context. In fact, since the example above is slightly complex, it will be rendered with my PR as:

<p><em>他說**“</em>你好<em>”**。</em></p>

However, from the point of view of East Asian writers, it improves the current behavior considerably.

jgm · 2018-10-02T16:07:10Z

Yes, my simplified rephrasing was meant to be equivalent to your proposal. (Just to help me think about it more clearly.)

Thinking outside of the box a bit: instead of having two distinct classes of punctuation characters, would it work to treat East Asian characters in general (including both EA punctuation and EA non-punctuation characters) as equivalent to punctuation for determining flankingness and can-open/can-close?

That is: the rules would all be the same as they are, except that "punctuation" would be interpreted as including Western punctuation characters plus ALL EA characters. (Obviously, one might want a better name for this broad class than "punctuation," but that's a detail.)

This would keep the simpler logic of the current rules, and it would guarantee that nothing changes in the interpretation of Western texts.

cangyuyao · 2019-01-18T16:11:52Z

Just wondering, is there any progress on this?

All CJK projects based on CommonMark have been stuck on this for years.

mity · 2019-05-30T19:16:48Z

Maybe this issue can be seen better from a different perspective. At least I have always found using the left-flanking and right-flanking terms confusing and I always easily got lost in them when thinking about some particular complicated input example.

Eventually I started to use in my head an alternative wording which (I believe) is 100%-equivalent to the current spec's wording. It may be spelled as follows:

Left score and right score of the delimiter run determine whether the run may or may not open/close an emphasis. The scores are computed as follows:

If the preceding character is Unicode whitespace, set the left score to 0.
If the preceding character is Unicode punctuation, set the left score to 1.
If the preceding character is anything else, set the left score to 2.

If the subsequent character is Unicode whitespace, set the right score to 0.
If the subsequent character is Unicode punctuation, set the right score to 1.
If the subsequent character is anything else, set the right score to 2.

If left score == 2 and right score == 2, and the delimiter run is _-based, then reset both scores to zero.

The delimiter run can open an emphasis iff left score <= right score and right score > 0.
The delimiter run can close an emphasis iff left score >= right score and left score > 0.

(If you prefer code, MD4C uses this alternative wording internally.)
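The scoring above translates almost line for line into code. The following Python sketch is an illustration only and is not MD4C's implementation; the function names are made up here, `str.isspace` is used as an approximation of "Unicode whitespace", and "Unicode punctuation" is taken to mean ASCII punctuation plus the Unicode P* categories.

```python
import string
import unicodedata


def score(ch: str) -> int:
    """0 for whitespace, 1 for punctuation, 2 for anything else."""
    if ch.isspace():
        return 0
    if ch in string.punctuation or unicodedata.category(ch).startswith("P"):
        return 1
    return 2


def can_open_close(before: str, after: str, delim: str):
    """Return (can_open, can_close) for a run of `*` or `_`."""
    left, right = score(before), score(after)
    if left == 2 and right == 2 and delim == "_":
        left = right = 0  # intraword underscores neither open nor close
    can_open = left <= right and right > 0
    can_close = left >= right and left > 0
    return can_open, can_close


print(can_open_close(" ", "f", "*"))   # run at the start of a word
print(can_open_close("o", "b", "_"))   # intraword underscore
print(can_open_close("は", "「", "*"))  # opening run in 猫は*「のどか」*という。
```

The last call shows why the Japanese example fails today: with a letter before the run and punctuation after it, the run can close but cannot open.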

I post this because it might be easier to come up with a solution in this wording, if we just add more rules to the score calculations above. Imho, it could perhaps even solve the issue with the ambiguous punctuation noted in earlier comments. E.g. something like

If the preceding character is EA-punctuation and the subsequent character is any EA-character, then reset right score to zero. (I.e. this makes the run be treated as if there were punctuation before it and whitespace after it in the current implementation.)
If the subsequent character is EA-punctuation and the preceding character is any EA-character, then reset left score to zero. (I.e. this makes the run be treated as if there were punctuation after it and whitespace before it in the current implementation.)

At least, it can be easily seen this wouldn't change anything for western text, and the people who (unlike me) understand EA languages and their needs may play more safely as long as they propose rules which require EA-characters on both sides of the run. Divide et impera.
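The two extra rules can be tried out on top of the score calculation with a short Python sketch. It rests on assumptions that the comment above leaves open: "EA-character" is taken to mean `east_asian_width` W, F or H, and "EA-punctuation" to mean punctuation with width W, F, H or A. All function names are hypothetical.

```python
import string
import unicodedata


def is_punct(ch: str) -> bool:
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def is_ea_char(ch: str) -> bool:
    # ASSUMPTION: wide, fullwidth and halfwidth characters count as EA.
    return unicodedata.east_asian_width(ch) in {"W", "F", "H"}


def is_ea_punct(ch: str) -> bool:
    # ASSUMPTION: ambiguous-width punctuation such as “ ” is included.
    return is_punct(ch) and unicodedata.east_asian_width(ch) in {"W", "F", "H", "A"}


def score(ch: str) -> int:
    return 0 if ch.isspace() else 1 if is_punct(ch) else 2


def can_open_close(before: str, after: str, delim: str, ea_rules: bool = False):
    """Return (can_open, can_close); ea_rules enables the two extra rules."""
    left, right = score(before), score(after)
    if left == 2 and right == 2 and delim == "_":
        left = right = 0
    if ea_rules:
        if is_ea_punct(before) and is_ea_char(after):
            right = 0  # behave as if whitespace followed the run
        if is_ea_punct(after) and is_ea_char(before):
            left = 0   # behave as if whitespace preceded the run
    return (left <= right and right > 0, left >= right and left > 0)


for before, after in [("は", "「"), ("」", "と")]:
    for delim in "*_":
        print(before, delim, after,
              can_open_close(before, after, delim),
              can_open_close(before, after, delim, ea_rules=True))
```

Because both extra rules need an EA character on one side of the run, a run surrounded only by Western characters keeps the scores it has today under these definitions.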

spencer246 · 2020-08-21T04:02:06Z

Although this PR works for Japanese and Chinese text (please note that Korean text uses "Western" punctuation marks), it does not solve a related but slightly different issue in Korean text reported here (github/javascript-tutorial, #2040).

Koreans expect *스크립트(script)*라고 to be rendered as <em>스크립트(script)</em>라고. Since Korean text uses "Western" punctuation marks, neither the current CommonMark spec nor this PR renders the above Korean text "correctly."

This Korean-text issue may be resolved by adding one more condition to @jgm's simple rule in this comment:

Right flanking:

before char is non-space, AND
one of the following:
- before char is EA punctuation or non-punctuation
- after char is space or punctuation or any EA character,

although it will break nested emphases more severely.
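The effect of the extra condition on the Korean example can be sketched in Python. This is only an illustration: the helper names are hypothetical, "any EA character" is assumed to mean `east_asian_width` W, F or H (which covers Hangul syllables), and EA punctuation is assumed to be punctuation with width W, F, H or A.

```python
import string
import unicodedata


def is_punct(ch: str) -> bool:
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def is_ea_punct(ch: str) -> bool:
    # ASSUMPTION: same EA punctuation test as in the simplified rule.
    return is_punct(ch) and unicodedata.east_asian_width(ch) in {"W", "F", "H", "A"}


def is_ea_char(ch: str) -> bool:
    # ASSUMPTION: wide, fullwidth and halfwidth characters count as EA.
    return unicodedata.east_asian_width(ch) in {"W", "F", "H"}


def right_flanking(before: str, after: str, korean_fix: bool = False) -> bool:
    return (not before.isspace()) and (
        not is_punct(before)
        or is_ea_punct(before)
        or after.isspace()
        or is_punct(after)
        or (korean_fix and is_ea_char(after))  # the added condition
    )


# Closing "*" in *스크립트(script)*라고 (before ")", after "라")
print(right_flanking(")", "라"))                   # without the added condition
print(right_flanking(")", "라", korean_fix=True))  # with the added condition
```

Since a `*` run can close emphasis only when it is right-flanking, the added condition is what lets the closing `*` after `)` take effect in this sketch.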

By the way, I think a better way to solve CJK-related emphasis issues is to introduce a new syntax ~_, _~, ~*, and *~ originally suggested by Prof. John MacFarlane for intra-word emphasis. However, his suggestion is equally applicable to any CJK-related emphasis issues arising from the lack of whitespace.

spencer246 · 2020-08-21T04:30:25Z

It seems that the issue on emphasizing Korean texts has not been reported before.

I posted this issue in https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491 as a comment.

ikedas · 2022-07-17T07:22:02Z

Sorry I haven't had time to think over this issue.
I have one idea I would like to try and will post it later (perhaps in months).

kivikakk mentioned this pull request Jan 24, 2018

Quoted terms are not detected as emphasis, in Japanese text github/cmark-gfm#80

Closed

piroor mentioned this pull request Jan 24, 2018

tdiary-style-markdown: *...*で強調表示にならないことがある (*...* is sometimes not rendered as emphasis) clear-code/statistics#112

Closed

ikedas added 4 commits September 5, 2018 13:28

Emphasis and East Asian text: code changes.

70c9af0

Emphasis and East Asian text: Proposed changes to spec.

d529d6c

Emphasis and East Asian text: Adding test cases.

ff2a079

Copyedit on proposed spec.

9a24044

ikedas force-pushed the eastasian_emph branch from 489926e to 9a24044 Compare September 5, 2018 04:42

cinty8b mentioned this pull request Sep 30, 2018

Markdown-it 渲染斜体有问题 (Markdown-it has a problem rendering italics) vnotex/vnote#429

Closed

jgm mentioned this pull request Sep 30, 2018

Punctiation regexp is incomplete commonmark/commonmark.js#108

Closed
