Raw HTML is throwing an exception #780
Comments
I am seeing the same error:
|
Super! I tried debugging the code a bit to understand how raw HTML works, but I can't make sense of it. It doesn't work the way I would expect. |
This appears to be the minimum document which triggers the error:
Note that there needs to be at least three levels of nesting and the blank lines are necessary. Remove any one of them and the error disappears. |
Ugh, I feel like we fixed an issue similar to this not too far back. I remember it being a pain to track down. |
Any update on getting this fixed? Or is there a workaround? |
@ikus060 I believe in my experiments I got it working by removing all of the empty lines. Or at least there was some non-obvious combination which avoided the bug. I didn't save it, as I assumed the provided example wasn't real content anyway and probably wouldn't help you work around the problem in your actual document. If you want to experiment with removing blank lines, you may find a workaround. As far as getting this fixed, all I did was confirm the bug exists and find the minimum document which triggers it. I have no idea what is causing it, and that part of the code is less than ideal. There is a reason it is generally considered bad form to implement an HTML parser with regex. I suspect we are more likely to replace the entire raw HTML handling code than to fix this specific bug. As a reminder, we work on this in our spare time as volunteers. Recently all I have had time for is managing the bug tracker. I haven't worked on any code in months and don't foresee that changing anytime soon. Of course, I can't speak for the other devs. If someone provides a PR, I'll do my best to review it. We should probably backport the fix to 3.0 as 3.1 is not quite ready, IIRC. |
I haven't had time to look at it yet. I'm not looking forward to it either 🙂. I've long said that the raw HTML parsing needs an overhaul, which @waylan has mentioned. I can confirm that removing or adding empty lines sometimes fixes these issues. I do plan on looking at this at some point, but I can't commit to when. |
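The blank-line workaround mentioned in the two comments above can be tried mechanically on a copy of a document. A minimal sketch, using only the standard library: `remove_blank_lines` is a hypothetical helper, not part of Python-Markdown, and because blank lines separate paragraphs in Markdown, the result may render differently from the original.

```python
def remove_blank_lines(text):
    """Return text with every empty or whitespace-only line dropped."""
    return '\n'.join(line for line in text.splitlines() if line.strip())


# Three nested divs separated by blank lines, as in the report.
doc = (
    '<div markdown="1">\n\n'
    '<div markdown="1">\n\n'
    '<div markdown="1">\n'
    '</div>\n\n'
    '</div>\n'
    '</div>\n'
)

compact = remove_blank_lines(doc)
print(compact)
```

Whether a given removal avoids the error is not guaranteed; the thread only reports that some combination of removed or added blank lines did.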
Thanks for the feedback. That makes things clear. ;) |
FYI, this is the only blocking issue currently open in the 3.1 milestone. When we get this resolved, we can release 3.1 (assuming no additional blocking issues are reported in the meantime). Of course, if other commits are made prior to this being fixed, they will be in the release as well, but I'm not inclined to wait for anything else once this issue is fixed. |
Well, this gives me some motivation to try and track this down then. Maybe I'll get to this over the weekend... |
It appears we get an indexing error because the algorithm is flawed. Now looking at this example:
We start in the state of So we come into And we find two nested regions, and recursively attempt to handle those in There are only three tags, and I'm not yet sure how to fix this, but this appears to be the problem. I still need to get a handle on the specifics of why it is doing all of this, and understand better what it should do. I don't think just preventing the last call would fix this, because then things don't get parsed properly. There's something more fundamentally flawed with this algorithm, but I'm going to have to unravel this to understand how to properly fix it. |
@facelessuser I'm assuming you are talking about the MarkdownInHtmlProcessor in the extra extension. Note that that was all implemented in #260 and #310 by @ryneeverett. Perhaps @ryneeverett could provide some input on this issue? |
Maybe, I haven't touched it since my initial debug session. I'll probably dig deeper this weekend, but any additional info would be helpful. |
I just confirmed this is still an issue after moving the extension:
>>> s = '''
... <div markdown="1">
...
... <div markdown="1">
...
... <div markdown="1">
... </div>
...
... </div>
... </div>
... '''
>>> markdown.markdown(s, extensions=['md_in_html'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\code\md\markdown\core.py", line 386, in markdown
return md.convert(text)
File "C:\code\md\markdown\core.py", line 263, in convert
root = self.parser.parseDocument(self.lines).getroot()
File "C:\code\md\markdown\blockparser.py", line 90, in parseDocument
self.parseChunk(self.root, '\n'.join(lines))
File "C:\code\md\markdown\blockparser.py", line 105, in parseChunk
self.parseBlocks(parent, text.split('\n\n'))
File "C:\code\md\markdown\blockparser.py", line 123, in parseBlocks
if processor.run(parent, blocks) is not False:
File "C:\code\md\markdown\extensions\md_in_html.py", line 78, in run
block = self._process_nests(element, block)
File "C:\code\md\markdown\extensions\md_in_html.py", line 45, in _process_nests
self.run(element, block[nest_index[-1][0]:nest_index[-1][1]], # last
File "C:\code\md\markdown\extensions\md_in_html.py", line 52, in run
tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range |
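The traceback above shows `parseChunk` calling `text.split('\n\n')`, so the block parser works on blank-line-separated blocks. A minimal sketch of how the reproducing document would be divided under that rule alone: `split_blocks` is a hypothetical helper that mimics only that one step, and it ignores any preprocessing Python-Markdown applies before the block parser runs.

```python
def split_blocks(text):
    """Split text into blocks on blank lines, as text.split('\\n\\n') would."""
    # Assumption: leading and trailing newlines are trimmed first.
    return text.strip('\n').split('\n\n')


doc = '''
<div markdown="1">

<div markdown="1">

<div markdown="1">
</div>

</div>
</div>
'''

blocks = split_blocks(doc)
for index, block in enumerate(blocks):
    print(index, repr(block))
```

Under this assumption the three opening tags land in three separate blocks and the last two closing tags share a fourth, which may be why adding or removing a single blank line changes the behaviour.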
Yeah, I never got around to digging into this deeper. I didn't imagine moving the extension was going to fix it, just make it easier to only enable raw HTML. Frankly, I forgot this issue was open.
The following markdown:
is failing with this: