Reduce overheads of Regex.Replace with a text replacement string #85564


Merged (2 commits) on Apr 30, 2023

stephentoub
Member

If the replacement string doesn't contain any backreferences, we can reduce the overheads involved in processing the replacement. Rather than storing a list of ReadOnlyMemory<char> segments for every portion of the original string or replacement, we can just store a list of (int offset, int count) pairs; if the offset is non-negative it refers to the original string, and if it's negative, it means to use the whole replacement. We can also avoid evaluating the rules each time, and since we're not storing string references into the arrays, we don't need to clear the arrays before returning them to the pool. This is all primarily helpful when there are lots of matches found in the input.

```csharp
using System.Net.Http;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Tests
{
    private static readonly Regex s_vowels = new Regex("[aeiou]", RegexOptions.Compiled);
    private static readonly string s_input = new HttpClient().GetStringAsync(@"https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    [Benchmark]
    public string RemoveVowels() => s_vowels.Replace(s_input, "");
}
```

| Method | Toolchain | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|
| RemoveVowels | \main\corerun.exe | 253.9 ms | 4.08 ms | 5.59 ms | 1.00 | 22.13 MB | 1.00 |
| RemoveVowels | \pr\corerun.exe | 202.2 ms | 1.29 ms | 1.01 ms | 0.79 | 22.13 MB | 1.00 |
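
To make the segment encoding described above concrete, here is a minimal sketch of the idea. It is not the code from this PR: `LiteralReplaceSketch`, `ReplaceLiteral`, and `Add` are hypothetical names, and the sketch assumes .NET 7 or later (for `Regex.EnumerateMatches`), a left-to-right regex, and a replacement that is treated as plain text. Each recorded segment is an `(offset, count)` pair, where a non-negative offset points into the input and a negative offset stands for the whole replacement string.

```csharp
using System;
using System.Buffers;
using System.Text.RegularExpressions;

public static class LiteralReplaceSketch
{
    // Hypothetical helper (not the runtime's implementation): replaces every
    // match with `replacement` taken literally, with no $1-style substitutions.
    public static string ReplaceLiteral(Regex regex, string input, string replacement)
    {
        var pool = ArrayPool<(int Offset, int Count)>.Shared;
        (int Offset, int Count)[] segments = pool.Rent(16);
        int n = 0, total = 0, prevEnd = 0;
        bool anyMatch = false;
        try
        {
            foreach (ValueMatch m in regex.EnumerateMatches(input))
            {
                anyMatch = true;
                if (m.Index > prevEnd)
                {
                    // Non-negative offset: a slice of the original input.
                    Add(ref segments, ref n, prevEnd, m.Index - prevEnd);
                    total += m.Index - prevEnd;
                }
                if (replacement.Length > 0)
                {
                    // Negative offset: "use the whole replacement string".
                    Add(ref segments, ref n, -1, 0);
                    total += replacement.Length;
                }
                prevEnd = m.Index + m.Length;
            }
            if (!anyMatch)
            {
                return input;
            }
            if (prevEnd < input.Length)
            {
                Add(ref segments, ref n, prevEnd, input.Length - prevEnd);
                total += input.Length - prevEnd;
            }
            // Stitch the result together in a single pass over the segments.
            return string.Create(total, (Input: input, Replacement: replacement, Segments: segments, Count: n),
                static (dest, s) =>
                {
                    int pos = 0;
                    for (int i = 0; i < s.Count; i++)
                    {
                        (int offset, int count) = s.Segments[i];
                        ReadOnlySpan<char> src = offset >= 0
                            ? s.Input.AsSpan(offset, count)
                            : s.Replacement.AsSpan();
                        src.CopyTo(dest.Slice(pos));
                        pos += src.Length;
                    }
                });
        }
        finally
        {
            // The array holds only ints, so it can go back to the pool
            // without clearing (clearArray defaults to false).
            pool.Return(segments);
        }
    }

    // Appends one segment, growing the rented array when it is full.
    private static void Add(ref (int Offset, int Count)[] segments, ref int n, int offset, int count)
    {
        if (n == segments.Length)
        {
            var pool = ArrayPool<(int Offset, int Count)>.Shared;
            var bigger = pool.Rent(segments.Length * 2);
            Array.Copy(segments, bigger, n);
            pool.Return(segments);
            segments = bigger;
        }
        segments[n++] = (offset, count);
    }
}
```

For example, `LiteralReplaceSketch.ReplaceLiteral(new Regex("[aeiou]"), "hello world", "")` yields `"hll wrld"`. Because the segments are plain integer pairs rather than `ReadOnlyMemory<char>` values, the pooled array never keeps a string alive, which is why the clear-on-return step can be skipped.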

@ghost

ghost commented Apr 30, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: stephentoub
Assignees: -
Labels: area-System.Text.RegularExpressions, tenet-performance

Milestone: 8.0.0

Member

@MihaZupan MihaZupan left a comment

Nice!

@stephentoub stephentoub merged commit 0bcab6d into dotnet:main Apr 30, 2023
@stephentoub stephentoub deleted the regexreplaceperf branch April 30, 2023 22:59
@joperezr
Member

joperezr commented May 1, 2023

Nice! LGTM

@ghost ghost locked as resolved and limited conversation to collaborators May 31, 2023