Voting

: max(zero, seven)?
(Example: nine)

The Note You're Voting On

emanueledelgrande at email dot it
15 years ago
The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):

$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";

The performances of a preg_match or preg_match_all function call over an avarage (x)HTML document are quite fast and may drive you to chose this way instead of classic DOM object methods, which have a lot of limits and are usually poor in performance with their workarounds, too.
I post a sample application in a brief function (easy to be turned into OOP), which returns an array of objects:

<?php
// test function:
function parse($html) {
// I have split the pattern in two lines not to have long lines alerts by the PHP.net form:
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
"(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
$elements = array();

foreach (
$matches[0] as $key => $match) {
$elements[] = (object)array(
'node' => $match[0],
'offset' => $match[1],
'tagname' => $matches[1][$key][0],
'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
'omittag' => ($matches[4][$key][1] > -1), // boolean
'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
);
}
return
$elements;
}

// random html nodes as example:
$html = <<<EOD
<div id="airport">
<div geo:position="1.234324,3.455546" class="index">
<!-- comment test:
<div class="index_top" />
-->
<div class="element decorator">
<ul class="lister">
<li onclick="javascript:item.showAttribute('desc');">
<h3 class="outline">
<a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
</h3>
<div class="description">Sample description</div>
</li>
</ul>
</div>
<div class="clean-line"></div>
</div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;

// application:
$elements = parse($html);

if (
count($elements) > 0) {
echo
"Elements found: <b>".count($elements)."</b><br />";

foreach (
$elements as $element) {
echo
"<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
Tagname: <tt>"
.$element->tagname."</tt><br />
Attributes: <tt>"
.$element->attributes."</tt><br />
Omittag: <tt>"
.($element->omittag ? 'true' : 'false')."</tt><br />
Inner HTML: <pre>"
.htmlentities($element->inner_html)."</pre></p>";
}
}
?>

<< Back to user notes page

To Top