Raw HTML is throwing an exception #780

Closed
ikus060 opened this issue Jan 27, 2019 · 15 comments
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.

Comments

@ikus060

ikus060 commented Jan 27, 2019

The following Markdown:

<div class="row" markdown="1">
<div class="col-md-6" markdown="1">
**SomeText**
</div>

<div class="col-md-6" markdown="1">

**blod text**  
<small>(<i class="fa fa-arrow-left"></i> small)</small>

<div class="barchart" markdown="1">
* item1
* item2
</div>

more text

</div>
</div>

is failing with this:

Traceback (most recent call last):
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/test_tools.py", line 117, in test
    output = markdown(input, **kwargs)
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/core.py", line 391, in markdown
    return md.convert(text)
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/core.py", line 268, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/blockparser.py", line 92, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/blockparser.py", line 107, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/blockparser.py", line 125, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/extensions/extra.py", line 127, in run
    block = self._process_nests(element, block)
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/extensions/extra.py", line 95, in _process_nests
    block[nest_index[-1][1]:], True)                      # nest
  File "/home/ikus060/workspace/PDSL/markdown.git/markdown/extensions/extra.py", line 101, in run
    tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range

@waylan
Member

waylan commented Jan 27, 2019

I am seeing the same error:

>>> import markdown
>>> markdown.__version__
'3.1.dev0'
>>> markdown.markdown(src, extensions=['extra'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "markdown/core.py", line 391, in markdown
    return md.convert(text)
  File "markdown/core.py", line 268, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "markdown/blockparser.py", line 92, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "markdown/blockparser.py", line 107, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "markdown/blockparser.py", line 125, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "markdown/extensions/extra.py", line 127, in run
    block = self._process_nests(element, block)
  File "markdown/extensions/extra.py", line 95, in _process_nests
    block[nest_index[-1][1]:], True)                      # nest
  File "markdown/extensions/extra.py", line 101, in run
    tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range

@ikus060
Author

ikus060 commented Jan 27, 2019

Super! I tried to debug the code a bit to understand how raw HTML handling works, but I can't make sense of it. It doesn't work the way I would expect.

@waylan
Member

waylan commented Jan 27, 2019

This appears to be the minimum document which triggers the error:

<div markdown="1">

<div markdown="1">

<div markdown="1">
</div>

</div>
</div>

Note that there needs to be at least three levels of nesting and the blank lines are necessary. Remove any one of them and the error disappears.
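To see why the blank lines matter, here is a minimal sketch that only mirrors the `text.split('\n\n')` call visible in the traceback (`blockparser.py`, `parseChunk`). It is an illustration under that assumption, not the real parser: the text the block parser actually receives may already have been altered by earlier preprocessing, and the helper name `split_into_blocks` is hypothetical.

```python
# The minimal document from the comment above.
MINIMAL_DOC = """<div markdown="1">

<div markdown="1">

<div markdown="1">
</div>

</div>
</div>"""


def split_into_blocks(text):
    # Hypothetical helper: blocks are separated by blank lines,
    # mirroring the split call shown in the traceback.
    return text.split("\n\n")


blocks = split_into_blocks(MINIMAL_DOC)
for index, block in enumerate(blocks):
    print(index, repr(block))
```

Each blank line starts a new block, so removing one merges two blocks and changes which opening and closing tags are handled together. That is consistent with the observation that removing any blank line makes the error disappear, though it does not by itself explain the bug.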

@waylan waylan added bug Bug report. extension Related to one or more of the included extensions. confirmed Confirmed bug report or approved feature request. labels Jan 27, 2019
@facelessuser
Collaborator

Ugh, I feel like we fixed an issue similar to this not too long ago. I remember it being a pain to track down.

@ikus060
Author

ikus060 commented Jan 30, 2019

Any update on getting this fixed? Or is there a workaround?

@waylan
Member

waylan commented Jan 30, 2019

@ikus060 I believe in my experiments I got it working by removing all of the empty lines. Or at least there was some non-obvious combination which avoided the bug. I didn't save it, as I assumed the provided example wasn't real content anyway and probably wouldn't help you work around the problem in your actual document. If you experiment with removing blank lines, you may find a workaround.
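For anyone who wants to try that experiment, a rough sketch of stripping every blank line from a document before conversion might look like the following. The helper name `strip_blank_lines` is hypothetical, and whether the result avoids the bug is not confirmed here. Blank lines are significant in Markdown (they separate paragraphs), so this is only suitable for experimenting, not as a general fix.

```python
# The minimal failing document reported in this thread.
MINIMAL_DOC = """<div markdown="1">

<div markdown="1">

<div markdown="1">
</div>

</div>
</div>"""


def strip_blank_lines(text):
    # Hypothetical helper: keep only lines that contain
    # something other than whitespace.
    return "\n".join(line for line in text.splitlines() if line.strip())


compact = strip_blank_lines(MINIMAL_DOC)
print(compact)
```

The compacted text can then be passed to the converter in place of the original to check whether the exception still occurs.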

As far as getting this fixed, all I did was confirm that the bug exists and find the minimum document which triggers it. I have no idea what is causing it, and that part of the code is less than ideal. There is a reason it is generally considered bad form to implement an HTML parser with regex. I suspect we are more likely to replace the entire raw HTML handling code than to fix this specific bug.

As a reminder, we work on this in our spare time as volunteers. Recently all I have had time for is managing the bug tracker. I haven't worked on any code in months and don't foresee that changing anytime soon. Of course, I can't speak for the other devs. If someone provides a PR, I'll do my best to review it. We should probably backport the fix to 3.0 as 3.1 is not quite ready, IIRC.

@facelessuser
Collaborator

I haven't had time to look at it yet. I'm not looking forward to it either 🙂. I've long said that the raw HTML parsing needs an overhaul, as @waylan has mentioned. I can confirm that removing or adding empty lines sometimes fixes these issues.

I do plan on looking at this at some point though, but I can't commit to when.

@ikus060
Author

ikus060 commented Jan 30, 2019

Thanks for the feedback. That makes things clear. ;)

@waylan waylan changed the title from "Raw HTML is trowing an exception" to "Raw HTML is throwing an exception" Feb 7, 2019
@waylan waylan added this to the Version 3.1 milestone Feb 7, 2019
@waylan
Member

waylan commented Feb 7, 2019

FYI, this is the only blocking issue currently open in the 3.1 milestone. When we get this resolved, we can release 3.1 (assuming no additional blocking issues are reported in the meantime). Of course, if other commits are made prior to this being fixed, they will be in the release as well, but I'm not inclined to wait for anything else once this issue is fixed.

@facelessuser
Collaborator

Well, this gives me some motivation to try and track this down then. Maybe I'll get to this over the weekend...

@facelessuser
Collaborator

It appears that we get an indexing error because the algorithm is flawed. Looking at this example:

<div markdown="1">

<div markdown="1">

<div markdown="1">
</div>

</div>
</div>

We start in the state tag_counter = -1.

So we come into extra and increment the tag_counter once for the first <div>. tag_counter = 0

Then we find two nested regions and recursively attempt to handle them in extra's nested handling function. But the two regions overlap: (1, 7) and (3, 7). While processing the first region, we end up coming through extra twice, once in the nested loop and once indirectly from processing the blocks (is it doing this because of the overlapping regions?). At this point we've incremented twice, since we've passed through extra twice: tag_counter = 2.

There are only three tags, and tag_counter is used to index into the tag_data array, so a count of 2 refers to the third element. We shouldn't increment again, but we still haven't processed the last nested region from the nested function, so we call extra again recursively, which causes another increment: tag_counter = 3. Since we only have data for three tag regions, we are out of bounds.
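The counter arithmetic described above can be reproduced with a toy model. This is only a sketch of the description, not the extension's code: `ToyProcessor`, its `run` method, and the tag labels are hypothetical stand-ins, and the real processor does far more than index a list.

```python
class ToyProcessor:
    """Toy stand-in for the counter logic; not the real extension."""

    def __init__(self, tag_data):
        self.tag_data = tag_data
        self.tag_counter = -1  # starting state described above

    def run(self):
        # Every pass increments the counter, then indexes into tag_data.
        self.tag_counter += 1
        return self.tag_data[self.tag_counter]


proc = ToyProcessor(["outer div", "middle div", "inner div"])

# Three passes consume all three entries; the counter ends at 2.
results = [proc.run() for _ in range(3)]

try:
    proc.run()  # fourth pass: the counter becomes 3, past the last entry
    overflowed = False
except IndexError:
    overflowed = True

print(results, proc.tag_counter, overflowed)
```

With three entries and four passes, the last lookup fails with the same kind of `IndexError` as in the traceback, which matches the explanation that one extra pass through the processor is enough to run off the end of the tag data.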

I'm not yet sure how to fix this, but this appears to be the problem. I still need to get a handle on the specifics of why it is doing all of this, and better understand what it should do. I don't think just preventing the last call would fix this, because then things don't get parsed properly. There's something more fundamentally flawed with this algorithm, but I'm going to have to unravel this to try and understand how to properly fix it.

@waylan
Member

waylan commented Feb 13, 2019

> There's something more fundamentally flawed with this algorithm, but I'm going to have to unravel this to try and understand how to properly fix it.

@facelessuser I'm assuming you are talking about the MarkdownInHtmlProcessor in the extra extension. Note that that was all implemented in #260 and #310 by @ryneeverett. Perhaps @ryneeverett could provide some input on this issue?

@facelessuser
Collaborator

Maybe, I haven't touched it since my initial debug session. I'll probably dig deeper this weekend, but any additional info would be helpful.

@waylan waylan removed this from the Version 3.1 milestone Mar 26, 2019
@waylan
Member

waylan commented Jun 30, 2020

I just confirmed this is still an issue after moving the markdown=1 functionality to its own extension.

>>> import markdown
>>> s = '''
... <div markdown="1">
...
... <div markdown="1">
...
... <div markdown="1">
... </div>
...
... </div>
... </div>
... '''
>>> markdown.markdown(s, extensions=['md_in_html'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\code\md\markdown\core.py", line 386, in markdown
    return md.convert(text)
  File "C:\code\md\markdown\core.py", line 263, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "C:\code\md\markdown\blockparser.py", line 90, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "C:\code\md\markdown\blockparser.py", line 105, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "C:\code\md\markdown\blockparser.py", line 123, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "C:\code\md\markdown\extensions\md_in_html.py", line 78, in run
    block = self._process_nests(element, block)
  File "C:\code\md\markdown\extensions\md_in_html.py", line 45, in _process_nests
    self.run(element, block[nest_index[-1][0]:nest_index[-1][1]],  # last
  File "C:\code\md\markdown\extensions\md_in_html.py", line 52, in run
    tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range

@facelessuser
Collaborator

Yeah, I never got around to digging into this deeper. I didn't imagine moving the extension was going to fix it, just make it easier to only enable raw HTML. Frankly, I forgot this issue was open.
