TOC: Anchor link written in Japanese does not work #1118


Closed
juu7g opened this issue Mar 22, 2021 · 5 comments · Fixed by #1121
Labels: bug (Bug report), extension (Related to one or more of the included extensions), needs-confirmation (The alleged behavior needs to be confirmed)

Comments


juu7g commented Mar 22, 2021

I'm using the TOC extension with slugify_unicode for Japanese, and I'm using anchor links.
In some cases a Japanese anchor link does not work, namely when the text contains
dakuon (voiced sounds, for example 'ba') or handakuon (semi-voiced sounds, for example 'pa').
I think this is because the characters in the generated ID differ from the characters in the header.

Sample Markdown:

```markdown
[TOC]
[anchor link to プログラム](#プログラム)  
[anchor link to ぷろぐらむ](#ぷろぐらむ)  

##プログラム

##ぷろぐらむ
```

Generated html:

```html
<div class="toc">
<ul>
<li><a href="#フロクラム">プログラム</a></li>
<li><a href="#ふろくらむ">ぷろぐらむ</a></li>
</ul>
</div>
<p><a href="#プログラム">anchor link to プログラム</a><br />
<a href="#ぷろぐらむ">anchor link to ぷろぐらむ</a>  </p>
<h2 id="フロクラム">プログラム</h2>
<h2 id="ふろくらむ">ぷろぐらむ</h2>
```

The result I expect is:

```html
<div class="toc">
<ul>
<li><a href="#プログラム">プログラム</a></li>
<li><a href="#ぷろぐらむ">ぷろぐらむ</a></li>
</ul>
</div>
<p><a href="#プログラム">anchor link to プログラム</a><br />
<a href="#ぷろぐらむ">anchor link to ぷろぐらむ</a>  </p>
<h2 id="プログラム">プログラム</h2>
<h2 id="ぷろぐらむ">ぷろぐらむ</h2>
```

As far as I can tell, this depends on the arguments passed to the unicodedata.normalize() method.
In other words, I think we need to change the first argument from "NFKD" to "NFKC".
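A minimal illustration of the difference described here, using only plain Python and the standard library (this is not Python-Markdown code): NFKD decomposes katakana with dakuten/handakuten into a base character plus a combining mark, while NFKC keeps the composed form unchanged.

```python
import unicodedata

text = "プログラム"
nfkd = unicodedata.normalize("NFKD", text)  # プ and グ are split into フ/ク plus combining (han)dakuten marks
nfkc = unicodedata.normalize("NFKC", text)  # composed form, identical to the input

print(len(text), len(nfkd), len(nfkc))  # 5 7 5
print(nfkd == text, nfkc == text)       # False True
```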

Reference: Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code | by Xu LIANG | Towards Data Science

I'm not familiar with Unicode, and this is my first post.
Please investigate.

waylan (Member) commented Mar 22, 2021

First of all, I'm not familiar enough with Japanese to know what is correct. However, even in English there is occasionally a change in a character when generating a slug, so yes, you sometimes need to account for that when manually creating your own links to those generated slugs.

For all I know, converting プ to フ in a slug might be the correct behavior. Or maybe it is clearly not correct. I have no way of knowing personally. But if we change to a different method, could that cause a similar error in another language? I'm not sure.

We need to be very careful here about what changes we make. For example, if we make a change, will that cause pre-existing links to slugs in existing documents to be wrong when those documents are re-rendered with the new version of Markdown?

waylan added the bug, extension and needs-confirmation labels on Mar 22, 2021
mitya57 (Collaborator) commented Mar 22, 2021

Some historical investigation. I think the initial purpose of unicodedata.normalize('NFKD', value).encode('ascii', 'ignore') was converting Extended Latin characters to ASCII, e.g. café to cafe.
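For reference, the effect of that pipeline on the café example can be reproduced with plain Python (standard library only):

```python
import unicodedata

value = "café"
decomposed = unicodedata.normalize("NFKD", value)  # 'cafe\u0301': é becomes e + combining acute accent
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_only)  # 'cafe' – the combining accent cannot be encoded as ASCII and is dropped
```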

#970 added support for Unicode IDs, but it left the first part of that code unchanged, so the normalization also happens when slugify_unicode is in use.

Interesting that that PR adds a test for Japanese! It checks that the Japanese characters get changed in exactly this way, so probably the author of #970 @Hsn723 thought that this is the expected behavior.

If we just replace NFKD with NFKC in all cases, it will break the old behavior of ASCII-fying Extended Latin. But I think it should be safe to do that (or to disable normalization altogether) when slugify_unicode is in use. After all, it's a relatively new option, so changing it won't hurt many users.
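A quick check of that claim with plain Python: swapping NFKD for NFKC before the ASCII step would drop accented characters entirely instead of stripping just the accent.

```python
import unicodedata

value = "café"
via_nfkd = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
via_nfkc = unicodedata.normalize("NFKC", value).encode("ascii", "ignore").decode("ascii")

print(via_nfkd)  # 'cafe' – current behavior: the accent is decomposed, then dropped
print(via_nfkc)  # 'caf'  – é stays composed, so the ASCII step drops the whole character
```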

waylan (Member) commented Mar 22, 2021

@mitya57 thanks for doing the research on this. I think your proposal makes sense. If I passed café into slugify_unicode I would expect to get back café, not cafe. Therefore, disabling normalization altogether makes sense to me.

We provide two options: (1) normalize to ASCII and (2) preserve Unicode as-is (only normalizing whitespace). If users want other behavior, they can provide their own function or use a third-party one.

mitya57 (Collaborator) commented Mar 22, 2021

> @mitya57 thanks for doing the research on this. I think your proposal makes sense. If I passed café into slugify_unicode I would expect to get back café, not cafe

It already works like that. To replace é with e, one needs both NFKD normalization and the .encode('ascii', 'ignore') thing: the first part will decompose U+00E9 é into two characters (U+0065 e and U+0301 COMBINING ACUTE ACCENT), and the second part will remove the second character because it can't be encoded in ASCII. If you use slugify_unicode, the second part won't happen.

What I meant is that I see why normalization is needed for ASCII slugs: it is an essential part of making them ASCII. But I don't see a need for normalization when we use Unicode slugs.

waylan (Member) commented Mar 22, 2021

Thanks for reminding me how the normalization works (I forget those details sometimes). In any event, I agree, we don't need normalization for unicode slugs.
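For illustration only, a hypothetical helper sketching the approach agreed on above (normalize only when producing ASCII slugs); this is not the actual Python-Markdown slugify code or the exact change that was merged:

```python
import unicodedata


def make_slug(text, separator="-", keep_unicode=False):
    """Hypothetical sketch: apply NFKD normalization only for ASCII slugs."""
    if not keep_unicode:
        # ASCII slugs need NFKD so the ascii/ignore step can strip accents,
        # e.g. 'café' -> 'cafe'.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    # Unicode slugs are left as-is, so 'プログラム' keeps its original characters.
    return separator.join(text.lower().split())
```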

mitya57 added a commit to mitya57/markdown that referenced this issue Mar 24, 2021
Update the existing test and add a new one to make sure that the
behavior of default slugify function has not changed.

Fixes Python-Markdown#1118.
waylan pushed a commit that referenced this issue Mar 24, 2021
Update the existing test and add a new one to make sure that the
behavior of default slugify function has not changed.

Fixes #1118.