TOC: Anchor link written in Japanese does not work #1118


Closed
juu7g opened this issue Mar 22, 2021 · 5 comments · Fixed by #1121
Labels: bug (Bug report), extension (Related to one or more of the included extensions), needs-confirmation (The alleged behavior needs to be confirmed)

Comments


juu7g commented Mar 22, 2021

I'm using the TOC extension with slugify_unicode for Japanese, and I'm using anchor links.
In some cases a Japanese anchor link does not work, namely when the text contains
dakuon (voiced sounds, for example 'ba') or handakuon (semi-voiced sounds, for example 'pa').
I think this is because the characters in the generated ID differ from the characters in the header.

Sample Markdown:

```markdown
[TOC]
[anchor link to プログラム](#プログラム)  
[anchor link to ぷろぐらむ](#ぷろぐらむ)  

##プログラム

##ぷろぐらむ
```

Generated html:

```html
<div class="toc">
<ul>
<li><a href="#フロクラム">プログラム</a></li>
<li><a href="#ふろくらむ">ぷろぐらむ</a></li>
</ul>
</div>
<p><a href="#プログラム">anchor link to プログラム</a><br />
<a href="#ぷろぐらむ">anchor link to ぷろぐらむ</a>  </p>
<h2 id="フロクラム">プログラム</h2>
<h2 id="ふろくらむ">ぷろぐらむ</h2>
```

The result I expect is:

```html
<div class="toc">
<ul>
<li><a href="#プログラム">プログラム</a></li>
<li><a href="#ぷろぐらむ">ぷろぐらむ</a></li>
</ul>
</div>
<p><a href="#プログラム">anchor link to プログラム</a><br />
<a href="#ぷろぐらむ">anchor link to ぷろぐらむ</a>  </p>
<h2 id="プログラム">プログラム</h2>
<h2 id="ぷろぐらむ">ぷろぐらむ</h2>
```

As far as I can tell, this depends on the arguments passed to the unicodedata.normalize() method.
In other words, I think we need to change the first argument from "NFKD" to "NFKC".
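A minimal illustration of the difference described here, using only plain Python and the standard library (this is not Python-Markdown code): NFKD decomposes katakana with dakuten/handakuten into a base character plus a combining mark, while NFKC keeps the composed form unchanged.

```python
import unicodedata

text = "プログラム"
nfkd = unicodedata.normalize("NFKD", text)  # プ and グ are split into フ/ク plus combining (han)dakuten marks
nfkc = unicodedata.normalize("NFKC", text)  # composed form, identical to the input

print(len(text), len(nfkd), len(nfkc))  # 5 7 5
print(nfkd == text, nfkc == text)       # False True
```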

Reference: Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code | by Xu LIANG | Towards Data Science

I'm not familiar with Unicode, and this is my first post.
Please investigate.

waylan (Member) commented Mar 22, 2021

First of all, I'm not familiar enough with Japanese to know what is correct. However, even in English there is occasionally a change in a character when generating a slug, so yes, you sometimes need to account for that when manually creating your own links to those generated slugs.

For all I know, converting プ to フ in a slug might be the correct behavior. Or maybe it is clearly not correct. I have no way of knowing personally. But if we change to a different method, could that cause a similar error in another language? I'm not sure.

We need to be very careful here about what changes we make. For example, if we make a change, will that cause pre-existing links to slugs in existing documents to be wrong when those documents are re-rendered with the new version of Markdown?

waylan added the bug, extension and needs-confirmation labels on Mar 22, 2021
mitya57 (Collaborator) commented Mar 22, 2021

Some historical investigation. I think the initial purpose of unicodedata.normalize('NFKD', value).encode('ascii', 'ignore') was converting Extended Latin characters to ASCII, e.g. café to cafe.
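For reference, the effect of that pipeline on the café example can be reproduced with plain Python (standard library only):

```python
import unicodedata

value = "café"
decomposed = unicodedata.normalize("NFKD", value)  # 'cafe\u0301': é becomes e + combining acute accent
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_only)  # 'cafe' – the combining accent cannot be encoded as ASCII and is dropped
```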

#970 added support for Unicode IDs, but it left the first part of that code unchanged, so the normalization also happens when slugify_unicode is in use.

Interesting that that PR adds a test for Japanese! It checks that the Japanese characters get changed in exactly this way, so probably the author of #970 @Hsn723 thought that this is the expected behavior.

If we just replace NFKD with NFKC in all cases, it will break the old behavior of ASCII-fying Extended Latin. But I think it should be safe to do that (or to disable normalization altogether) when slugify_unicode is in use. After all, it's a relatively new option, so changing it won't hurt many users.
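A quick check of that claim with plain Python: swapping NFKD for NFKC before the ASCII step would drop accented characters entirely instead of stripping just the accent.

```python
import unicodedata

value = "café"
via_nfkd = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
via_nfkc = unicodedata.normalize("NFKC", value).encode("ascii", "ignore").decode("ascii")

print(via_nfkd)  # 'cafe' – current behavior: the accent is decomposed, then dropped
print(via_nfkc)  # 'caf'  – é stays composed, so the ASCII step drops the whole character
```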

waylan (Member) commented Mar 22, 2021

@mitya57 thanks for doing the research on this. I think your proposal makes sense. If I passed café into slugify_unicode I would expect to get back café, not cafe. Therefore, disabling normalization altogether makes sense to me.

We provide two options: (1) normalize to ASCII and (2) preserve Unicode as-is (only normalizing whitespace). If users want other behavior, they can provide their own function or use a third-party one.

mitya57 (Collaborator) commented Mar 22, 2021

> @mitya57 thanks for doing the research on this. I think your proposal makes sense. If I passed café into slugify_unicode I would expect to get back café, not cafe

It already works like that. To replace é with e, one needs both NFKD normalization and the .encode('ascii', 'ignore') thing: the first part will decompose U+00E9 é into two characters (U+0065 e and U+0301 COMBINING ACUTE ACCENT), and the second part will remove the second character because it can't be encoded in ASCII. If you use slugify_unicode, the second part won't happen.

What I meant is that I see why normalization is needed for ASCII slugs: it is an essential part of making them ASCII. But I don't see a need for normalization when we use Unicode slugs.

waylan (Member) commented Mar 22, 2021

Thanks for reminding me how the normalization works (I forget those details sometimes). In any event, I agree, we don't need normalization for unicode slugs.
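For illustration only, a hypothetical helper sketching the approach agreed on above (normalize only when producing ASCII slugs); this is not the actual Python-Markdown slugify code or the exact change that was merged:

```python
import unicodedata


def make_slug(text, separator="-", keep_unicode=False):
    """Hypothetical sketch: apply NFKD normalization only for ASCII slugs."""
    if not keep_unicode:
        # ASCII slugs need NFKD so the ascii/ignore step can strip accents,
        # e.g. 'café' -> 'cafe'.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    # Unicode slugs are left as-is, so 'プログラム' keeps its original characters.
    return separator.join(text.lower().split())
```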

mitya57 added a commit to mitya57/markdown that referenced this issue Mar 24, 2021
Update the existing test and add a new one to make sure that the
behavior of default slugify function has not changed.

Fixes Python-Markdown#1118.
waylan pushed a commit that referenced this issue Mar 24, 2021
Update the existing test and add a new one to make sure that the
behavior of default slugify function has not changed.

Fixes #1118.