Skip to content

Metaphone performance improvement #10501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 5, 2023
Merged

Metaphone performance improvement #10501

merged 5 commits into from
Feb 5, 2023

Conversation

nielsdos
Copy link
Member

@nielsdos nielsdos commented Feb 4, 2023

I saw a report here: https://gitlab.alpinelinux.org/alpine/aports/-/issues/14381#note_276471 that metaphone under Alpine was way slower than under Debian. Making less calls into libc should improve performance on both Alpine and other distros.

All inputs for ENCODE() are already uppercase, so there's no need to
spend time uppercasing them again.
If it's a zero-terminator check, or an isalpha() check there's no need
to convert it to uppercase first.
@nielsdos nielsdos changed the title Metaphone Metaphone cleanup Feb 4, 2023
@devnexen
Copy link
Member

devnexen commented Feb 4, 2023

Looks fine but did you measure any improvement ?

@nielsdos
Copy link
Member Author

nielsdos commented Feb 4, 2023

I did measure an improvement over a simple test case (executing the metaphone test in php 10K times) and the amount of instructions executed dropped from 851,037,969 to 699,677,969, so a 17.7% perf increase.
The profile still clearly shows toupper to be executed many many times redundantly, so I think I can improve this further. I'll look into improving this more later today :)

We don't have to re-read letters, and re-uppercase them if we already
did it once. By caching these results we gain performance.
Furthermore, we can avoid fetching and uppercasing in some conditions by
first checking what we already had: e.g. if a condition depends on both
Prev_Letter and After_Next_Letter, but we already have Prev_Letter
cached we can place that first to avoid a fetch+toupper of the
"after next letter".
@nielsdos
Copy link
Member Author

nielsdos commented Feb 4, 2023

I pushed an additional performance improvement.
Final performance numbers: started at 851,037,969 instructions and got it down to 557,080,277, which is a 34.54% improvement in performance over master.
This is without sacrificing too much readability of the code imo.

Main ideas were to cache the read and uppercased result and in some cases reorder the condition in an if check to avoid fetching and uppercasing if the condition couldn't be fulfilled anyway.

@nielsdos nielsdos changed the title Metaphone cleanup Metaphone performance improvement Feb 4, 2023
@devnexen devnexen requested a review from Girgias February 4, 2023 15:34
Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

It may also make sense to implement Double Metaphone (released in 2000) or an old version of Metaphone 3 which exists under a BSD licence (https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/Metaphone3.java) to improve the accuracy of this function.

@Girgias Girgias merged commit c9cbe52 into php:master Feb 5, 2023
@andypost
Copy link
Contributor

andypost commented Feb 5, 2023

Will it be backported to 8.2?

@Girgias
Copy link
Member

Girgias commented Feb 5, 2023

Will it be backported to 8.2?

No.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants