#53655: Improve speed of DOMNode::C14N() on large XML documents #12278


Closed
wants to merge 4 commits into from


@nielsdos nielsdos commented Sep 22, 2023

https://bugs.php.net/bug.php?id=53655

The XPath query is in accordance with the spec [1]. However, we can do it in a
simpler way. We can use a custom callback function instead of a linear
search in XPath to check whether a node is visible. Note that comment nodes
are already handled internally by libxml2, so we do not need to
differentiate between node types. The callback does an upwards
traversal of the tree until the root of the canonicalization is reached.
In practice, this speeds up the application considerably.

[1] https://www.w3.org/TR/2001/REC-xml-c14n-20010315 section 2.1
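
To illustrate the upwards traversal described above, here is a minimal sketch, not the code from this PR. The `node` struct and the `is_visible` function are hypothetical stand-ins: the real callback works on libxml2's `xmlNodePtr` and is passed to libxml2's canonicalization routine, which is not shown here.

```c
#include <stddef.h>

/* Hypothetical stand-in for a libxml2 node: only the parent link matters here. */
typedef struct node {
    struct node *parent;
    const char *name;
} node;

/* Returns 1 if `n` is `root` or lies inside the subtree rooted at `root`,
 * and 0 otherwise. The walk goes from `n` towards the document root, so the
 * cost depends on the depth of `n`, not on the number of nodes in the document. */
static int is_visible(const node *root, const node *n)
{
    for (; n != NULL; n = n->parent) {
        if (n == root) {
            return 1;
        }
    }
    return 0;
}
```

The check needs no precomputed node list, so there is nothing to search linearly for each visited node.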

This can easily make processing 100 times faster for a large document. I generated some random XML documents with https://codebeautify.org/generate-random-xml: https://gist.github.com/nielsdos/369813d1b1c5c146a6fd7992b8ddbc28

file.xml: before -> after:
random.xml: 0.159s -> 0.004s
large.xml: 1.256s -> 0.008s

There's another speed-up I could do by replacing the linear search with a lookup in a HashTable. That's orthogonal to this change and a smaller time save, but it matters for the cases that do use a nodeset. Probably something to do as a follow-up.
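
As a rough illustration of that follow-up idea, the sketch below replaces a linear membership scan with a hashed lookup over node pointers. Everything in it is an assumption for illustration: `ptr_set` and its functions are hypothetical, and php-src's own HashTable API is not used here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pointer set: open addressing with linear probing.
 * NULL marks an empty slot, so NULL itself cannot be stored. */
typedef struct {
    const void **slots;
    size_t capacity; /* always a power of two */
} ptr_set;

static size_t ptr_hash(const void *p, size_t capacity)
{
    uintptr_t x = (uintptr_t)p;
    x ^= x >> 16;      /* mix bits, since aligned pointers share low zero bits */
    x *= 0x9E3779B1u;
    x ^= x >> 13;
    return (size_t)x & (capacity - 1);
}

/* Sizes the table for at most `n` pointers at a load factor of at most 50%.
 * Returns 0 on success, -1 on allocation failure. */
static int ptr_set_init(ptr_set *s, size_t n)
{
    size_t cap = 8;
    while (cap < 2 * n) {
        cap *= 2;
    }
    s->slots = calloc(cap, sizeof *s->slots);
    s->capacity = cap;
    return s->slots ? 0 : -1;
}

/* Inserts non-NULL `p`; inserting the same pointer twice is a no-op. */
static void ptr_set_add(ptr_set *s, const void *p)
{
    size_t i = ptr_hash(p, s->capacity);
    while (s->slots[i] != NULL && s->slots[i] != p) {
        i = (i + 1) & (s->capacity - 1);
    }
    s->slots[i] = p;
}

/* Expected constant-time membership test, versus a scan of the whole nodeset. */
static int ptr_set_contains(const ptr_set *s, const void *p)
{
    size_t i = ptr_hash(p, s->capacity);
    while (s->slots[i] != NULL) {
        if (s->slots[i] == p) {
            return 1;
        }
        i = (i + 1) & (s->capacity - 1);
    }
    return 0;
}

static void ptr_set_free(ptr_set *s)
{
    free((void *)s->slots);
    s->slots = NULL;
    s->capacity = 0;
}
```

The set would be filled once from the nodeset and then queried once per visited node, which is where the time is saved compared with scanning the nodeset each time.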


@Girgias Girgias left a comment


Looks sensible

@nielsdos nielsdos closed this in 5d68d61 Sep 23, 2023