<hr> tags break markup processing in 3.3.0+? #1053

dbader · 2020-10-23T23:50:54Z

<hr> elements seem to interrupt markdown text processing:

Markdown Input: '*emphasis1*\n<hr>\n*emphasis2*'
HTML Output (actual): 'emphasis1\n<hr>\n*emphasis2*'
HTML Output (expected): 'emphasis1\n<hr>\nemphasis2'

*emphasis1*\n<hr/>\n*emphasis2* (closing the hr tag) causes *emphasis2* to render correctly.

Naked <hr> tags should be valid HTML: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/hr

Having the md_in_html extension enabled or disabled doesn't seem to make a difference.

Versions

Markdown==3.3.2 (affects 3.3.0 and above)
CPython 3.8.6

Steps to reproduce

mderror.py:

import markdown


def test(md_text):
    html = markdown.markdown(
        md_text,
        # Same result with or without md_in_html
        # extensions=["markdown.extensions.md_in_html"],
    )
    okay = "*emphasis" not in html
    print(f"{'OK ' if okay else 'ERR'} {md_text!r} → {html!r}")


# ERR
test("*emphasis1*\n<hr>\n*emphasis2*")

# OK
test("*emphasis1*\n<hr/>\n*emphasis2*")
test("*emphasis1*\n<hr></hr>\n*emphasis2*")
test("*emphasis1*\n<img>\n*emphasis2*")

Output:

897c854 also fails:

pip install git+https://github.com/Python-Markdown/markdown@897c8541998cdbd0499b04942244f7fdf1191a6c && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

$ pip install Markdown==3.3.2 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

$ pip install Markdown==3.3.1 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

3.3.0 starts failing:

→ pip install Markdown==3.3.0 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

3.2.2 passes all tests:

→ pip install Markdown==3.2.2 && python mderror.py
OK  '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr></hr>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

The text was updated successfully, but these errors were encountered:

facelessuser · 2020-10-23T23:56:32Z

I'll have to test the tip. We have some unreleased fixes as there were some inconsistencies with certain tags.

facelessuser · 2020-10-24T00:21:21Z

Yeah, it looks like new behavior introduced by the new parser. I would think <hr> would get handled similar to  , but it isn't.

On the one hand, as it is a block item, I'm not sure it should render inside a paragraph. But on the other hand, it seems to treat it as a block (which it is) and just keep searching for an end tag. It doesn't find the end, but assumes all the trailing text is inside the <hr> and since blocks are not parsed for Markdown, the emphasis isn't parsed.

Interestingly, if we add markdown="1" to it, we get data loss:

OK  '*emphasis1*\n<hr markdown="1">\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr />'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

Seems we need to handle unclosed tags a bit better:

To not have data loss
Maybe to recognize and handle tags that shouldn't have content.

@waylan thoughts?

facelessuser · 2020-10-24T00:46:34Z

Here is a potential fix. Interested to hear your opinion on approach @waylan or if you have a different suggestion:

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index a2137c7..4bfc6c9 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -36,9 +36,10 @@ class HTMLExtractorExtra(HTMLExtractor):
         # Block-level tags which never get their content parsed.
         self.raw_tags = ['canvas', 'math', 'option', 'pre', 'script', 'style', 'textarea']
         # Block-level tags in which the content gets parsed as blocks
-        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
         super().__init__(md, *args, **kwargs)
 
+        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
+
     def reset(self):
         """Reset this instance.  Loses all unprocessed data."""
         self.mdstack = []  # When markdown=1, stack contains a list of tags
@@ -119,6 +120,9 @@ class HTMLExtractorExtra(HTMLExtractor):
                 else:
                     self.handle_data(text)
 
+        if tag in self.empty_tags:
+            self.handle_endtag(tag)
+
     def handle_endtag(self, tag):
         if tag in self.block_level_tags:
             if self.inraw:
diff --git a/markdown/htmlparser.py b/markdown/htmlparser.py
index 6776d34..bd75368 100644
--- a/markdown/htmlparser.py
+++ b/markdown/htmlparser.py
@@ -56,6 +56,10 @@ class HTMLExtractor(htmlparser.HTMLParser):
     def __init__(self, md, *args, **kwargs):
         if 'convert_charrefs' not in kwargs:
             kwargs['convert_charrefs'] = False
+
+        # Block tags that should contain no content (self closing)
+        self.empty_tags = ['hr']
+
         # This calls self.reset
         super().__init__(*args, **kwargs)
         self.md = md
@@ -135,6 +139,9 @@ class HTMLExtractor(htmlparser.HTMLParser):
                 # This is presumably a standalone tag in a code span (see #1036).
                 self.clear_cdata_mode()
 
+        if tag in self.empty_tags:
+            self.handle_endtag(tag)
+
     def handle_endtag(self, tag):
         text = self.get_endtag_text(tag)

facelessuser · 2020-10-24T00:57:35Z

Should be noted that this doesn't universally fix potential data loss in md_in_html if ending tag is missing.

facelessuser · 2020-10-24T01:05:13Z

This will have to be tweaked as well as it isn't perfect and doubles <hr> tags, but the general direction is what I'm trying to highlight.

facelessuser · 2020-10-24T01:48:24Z

This is a bit more sane:

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index a2137c7..1fc4ab8 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -36,9 +36,10 @@ class HTMLExtractorExtra(HTMLExtractor):
         # Block-level tags which never get their content parsed.
         self.raw_tags = ['canvas', 'math', 'option', 'pre', 'script', 'style', 'textarea']
         # Block-level tags in which the content gets parsed as blocks
-        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
         super().__init__(md, *args, **kwargs)
 
+        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
+
     def reset(self):
         """Reset this instance.  Loses all unprocessed data."""
         self.mdstack = []  # When markdown=1, stack contains a list of tags
diff --git a/markdown/htmlparser.py b/markdown/htmlparser.py
index 6776d34..2e14038 100644
--- a/markdown/htmlparser.py
+++ b/markdown/htmlparser.py
@@ -56,6 +56,10 @@ class HTMLExtractor(htmlparser.HTMLParser):
     def __init__(self, md, *args, **kwargs):
         if 'convert_charrefs' not in kwargs:
             kwargs['convert_charrefs'] = False
+
+        # Block tags that should contain no content (self closing)
+        self.empty_tags = ['hr']
+
         # This calls self.reset
         super().__init__(*args, **kwargs)
         self.md = md
@@ -264,6 +268,13 @@ class HTMLExtractor(htmlparser.HTMLParser):
         if end.endswith('/>'):
             # XHTML-style empty tag: <span attr="value" />
             self.handle_startendtag(tag, attrs)
+        elif tag in self.empty_tags:
+            if re.match(r'\s*</\s*{}\s*>'.format(tag), self.rawdata[self.line_offset + self.offset + len(self.__starttag_text):]):
+                if tag in self.CDATA_CONTENT_ELEMENTS:
+                    self.set_cdata_mode(tag)
+                self.handle_starttag(tag, attrs)
+            else:
+                self.handle_startendtag(tag, attrs)
         else:
             # *** set cdata_mode first so we can override it in handle_starttag (see #1036) ***
             if tag in self.CDATA_CONTENT_ELEMENTS:

facelessuser · 2020-10-24T01:56:53Z

Still not happy with how the <hr> is still included in the previous . It's probably due to the fact that handle_startendtag was more geared to handling inline tags. I don't have time to investigate that tonight.

waylan · 2020-10-24T02:24:24Z

This was an intentional change in behavior. Previously, we strictly enforced the rule which required that an HTML block must start with a blank line. However, in practice, in most cases, even the reference implementation doesn't follow that rule (strangely it does with <hr>). Therefore, as of version 3.3, Python-Markdown ends a Markdown paragraph (or any other block) upon finding an opening tag (or a self-closing tag) of a block-level element at the start of a line.

I should also note that it is invalid HTML to have an <hr> within a . In fact, browsers will interpret

<p>foo
<hr>
bar
</p>

as

<p>foo
</p><hr>
bar
<p></p>

Note that bar is not even in a  tag as far as the browser is concerned. Therefore, it is nonsensical to output that HTML. The correct way to parse the following Markdown

foo
<hr>
bar

would be

<p>foo</p>
<hr>
<p>bar</p>

And that will provide a better rendering in the browser.

Which means that we do have a bug in that the line after the <hr> is not being wrapped in a . Presumably a blank line needs to be inserted after block-level self-closing tags in the htmlparser.

facelessuser · 2020-10-24T02:32:27Z

I have some work here: #1054. We can discuss changes there and change direction if I'm going in bad direction.

This case:

foo
<hr>
bar

Should render as

<p>foo</p>
<hr>
<p>bar</p>

dbader changed the title ~~<hr> tags rendering differently in 3.3.0+~~ <hr> tags break markup processing in 3.3.0+? Oct 23, 2020

waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code. labels Oct 24, 2020

waylan closed this as completed in 11c9e17 Oct 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

<hr> tags break markup processing in 3.3.0+? #1053

<hr> tags break markup processing in 3.3.0+? #1053

dbader commented Oct 23, 2020

facelessuser commented Oct 23, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

waylan commented Oct 24, 2020 •

edited

Loading

facelessuser commented Oct 24, 2020

<hr> tags break markup processing in 3.3.0+? #1053

<hr> tags break markup processing in 3.3.0+? #1053

Comments

dbader commented Oct 23, 2020

Versions

Steps to reproduce

facelessuser commented Oct 23, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

facelessuser commented Oct 24, 2020

waylan commented Oct 24, 2020 • edited Loading

facelessuser commented Oct 24, 2020

waylan commented Oct 24, 2020 •

edited

Loading