Skip to content

PI (<? ?>) is escaped in error with md_in_html #1070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
waylan opened this issue Nov 19, 2020 · 3 comments · Fixed by #1071
Closed

PI (<? ?>) is escaped in error with md_in_html #1070

waylan opened this issue Nov 19, 2020 · 3 comments · Fixed by #1071
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.

Comments

@waylan
Copy link
Member

waylan commented Nov 19, 2020

In any of the following source documents:

<div markdown=1>

<?php print("foo"); >

</div>
<div markdown=1>
<?php print("foo"); >
</div>
<div markdown=1><?php print("foo"); ></div>

We get the same output:

<div>
<p>&lt;?php print("foo"); &gt;</p>
</div>

Which is correct except for the fact the the opening and closing brackets of the processing instruction (PI) are escaped. We should be getting:

<div>
<p><?php print("foo"); ></p>
</div>

The problem is related to the fact that the extension builds an etree object of raw HTML when markdown=1. However, the PI (<?php print("foo"); >) is simply passed to the etree object as text data (to treebuilder.data). Upon serialization, it is escaped. The obvious fix would be to pass the content of the PI to treebuilder.pi, except that xml.etree.ElementTree.TreeBuilder.pi was only added in PY3.8. What do we do for PY3.6 and PY3.7?

@waylan waylan added bug Bug report. extension Related to one or more of the included extensions. needs-decision A decision needs to be made regarding request. confirmed Confirmed bug report or approved feature request. labels Nov 19, 2020
@waylan
Copy link
Member Author

waylan commented Nov 19, 2020

I just remembered that the workaround for no support for treebuilder.pi was to pass the text of the object to htmlStash, which avoids escaping by the serializer. See here:

def handle_empty_tag(self, data, is_block):
if self.inraw or not self.mdstack:
super().handle_empty_tag(data, is_block)
else:
if self.at_line_start() and is_block:
self.handle_data('\n' + self.md.htmlStash.store(data) + '\n\n')
else:
if self.mdstate and self.mdstate[-1] == "off":
self.handle_data(self.md.htmlStash.store(data))
else:
self.handle_data(data)

However, for some reason, handle_empty_tag is not getting called in the example cases in this issue.

@facelessuser
Copy link
Collaborator

Yes, this is a little bit of why I had a little bit of trouble jumping into the parser. I assumed that things were always going down the appropriate path, but there appear to be some cases, like this, where things don't seem to get handled properly.

Now, I went down the wrong path and tried to cover for this by injecting code that caused other issues instead of getting to the root. While much easier to debug than the old parser, it is still not without some tricky places.

@waylan
Copy link
Member Author

waylan commented Nov 19, 2020

# Monkeypatch HTMLParser to only accept `?>` to close Processing Instructions.
htmlparser.piclose = re.compile(r'\?>')

It took a while, but I finally remembered that we modified the piclose regex to only match ?>. We don't consider it a valid PI to close with >. The parser then does not consider the relevant bit to be a valid PI and falls back to treating it as random data which is enclosed between angle braces.

And, as PHP requires that it be closed with ?> (not just >), that is a perfectly acceptable solution.

In other words, all my example cases here are invalid. With the proper closing (?>), the multi-line examples work just fine. However, we still have an issue with the single line example:

<div markdown=1><?php print("foo"); ?></div>

@waylan waylan changed the title PI (<? >) is escaped in error with md_in_html PI (<? ?>) is escaped in error with md_in_html Nov 19, 2020
@waylan waylan removed the needs-decision A decision needs to be made regarding request. label Nov 19, 2020
waylan added a commit to waylan/markdown that referenced this issue Nov 19, 2020
Empty tags do not have a `mardkown` attribute set on them. Therefore,
there is no need to check the mdstack to determine behavior. If we
are in any md_in_html state (regardless of block, span, etc) the
behavior is the same. Fixes Python-Markdown#1070.
waylan added a commit that referenced this issue Nov 19, 2020
Empty tags do not have a `mardkown` attribute set on them. Therefore,
there is no need to check the mdstack to determine behavior. If we
are in any md_in_html state (regardless of block, span, etc) the
behavior is the same. Fixes #1070.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants