Unclosed script and style tags cause data loss #1036
Comments
Thanks for the report. I didn't check every block-level tag, but of those I did test, this only occurs for `<script>` and `<style>` tags.
The issue is different than I originally expected. It turns out that unless there is a closing `</script>` tag, everything after the opening tag is lost:

>>> import markdown
>>> src = '''
... Some text `<script>` more text.
...
... A separate paragraph.
... '''
>>> markdown.markdown(src)
'<p>Some text `<script></p>'
Oh, that's not good :(.
For comparison, see this example, which includes a closing tag:

>>> src = '''
... Some text `<script>` more text.
...
... A separate paragraph with a closing `</script>` tag.
... '''
>>> markdown.markdown(src)
'<p>Some text <code>&lt;script&gt;</code> more text.</p>\n<p>A separate paragraph with a closing <code>&lt;/script&gt;</code> tag.</p>'

It appears that the underlying parser is withholding what it perceives to be content of the `<script>` tag, including anything that follows it:

>>> markdown.markdown('foo `<script>` bar `<div>`')
'<p>foo `<script></p>'

I'm afraid we may need to override some of the parent class to fix this one. 😬
I went spelunking in the code and it appears that the problem is related to html/parser.py#L345-L346:

    if tag in self.CDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag)

where

    CDATA_CONTENT_ELEMENTS = ("script", "style")

which confirms that this only applies to `<script>` and `<style>` tags. It appears that the primary difference is in this block:

    if end and i < n and not self.cdata_elem:
        if self.convert_charrefs and not self.cdata_elem:
            self.handle_data(unescape(rawdata[i:n]))
        else:
            self.handle_data(rawdata[i:n])
        i = self.updatepos(i, n)
    self.rawdata = rawdata[i:]

I suspect that this is an upstream bug. After all, the nested check for `not self.cdata_elem` is redundant, because the outer condition has already required it.
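The withholding described above can be observed with the standard library parser alone, without Markdown in the picture. The following is a minimal sketch: `Collector` and `collect_text` are hypothetical names introduced for illustration, and because the result for the unclosed case depends on the Python version in use, it is printed rather than stated.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(source):
    """Feed `source` to a fresh parser, close it, and return the text seen."""
    parser = Collector()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


# With a closing tag, the script content is reported as data.
closed = collect_text("before <script>a < b</script> after")

# Without a closing tag, the outcome varies by Python version, so it is
# only printed here for comparison.
unclosed = collect_text("before <script>a < b")

print(repr(closed))
print(repr(unclosed))
```

Comparing the two printed values on a given interpreter shows whether the text after an unclosed `<script>` tag reaches `handle_data` at all.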
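There are two related issues here:

1. When a `<script>` or `<style>` tag is never closed, the content after it is lost.
2. Everything after an opening `<script>` or `<style>` tag is treated as the raw content of that tag until a closing tag is found.

Note that the current behavior in issue 2 is desired when the tag is part of an HTML block. However, when we have standalone tags in code spans (for example, `<script>` in backticks), it is not, and cdata mode should be avoided there. I expect that 'fix' will have the effect of avoiding issue 1, except when the document contains an actual HTML block with a missing end tag. We don't guarantee good output from invalid HTML, but we generally try to avoid data loss. So, in the end, issue 1 needs to be fixed as well.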
I opened an issue upstream at https://bugs.python.org/issue41989 related to the data loss, and a PR has been submitted at python/cpython#22658.
Provides a workaround for https://bugs.python.org/issue41989. Related to Python-Markdown#1036.
* Ensure unclosed script tags are parsed correctly by providing a workaround for https://bugs.python.org/issue41989.
* Avoid cdata_mode outside of HTML blocks, such as in inline code spans.
Fixes #1036.
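One way such a workaround could look is a subclass whose `close()` emits whatever input the parent left unprocessed. This is only a sketch under stated assumptions, not the project's actual change: `FlushingParser` is a hypothetical name, and it relies on `rawdata` and `cdata_elem`, which are internal attributes of the standard library parser rather than documented API.

```python
from html import unescape
from html.parser import HTMLParser


class FlushingParser(HTMLParser):
    """Hypothetical subclass that emits leftover input when closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        super().close()
        # Assumption: `rawdata` holds any input the parent did not consume.
        # It is an internal buffer, not a documented attribute.
        leftover = self.rawdata
        if leftover:
            self.rawdata = ""  # avoid emitting the same text twice
            if self.convert_charrefs and not self.cdata_elem:
                leftover = unescape(leftover)
            self.handle_data(leftover)


parser = FlushingParser()
parser.feed("Some text <script> more text.")
parser.close()
text = "".join(parser.chunks)
print(repr(text))
```

On interpreters where the parent already delivers the remaining text, the buffer is empty after `super().close()` and the override does nothing, so the sketch should be harmless there.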
Markdown 3.3 does not properly generate HTML for `<script>` markdown:
Current output:
Expected output: