path: root/src/backend/regex/README
author     Tom Lane    2016-09-05 21:06:29 +0000
committer  Tom Lane    2016-09-05 21:06:29 +0000
commit     c54159d44ceaba26ceda9fea1804f0de122a8f30 (patch)
tree       eaab96027f054cf5ff864d0745e446e8b8e13544 /src/backend/regex/README
parent     f80049f76a32858601510eaaef19ab8160e4c9b3 (diff)
Make locale-dependent regex character classes work for large char codes.
Previously, we failed to recognize Unicode characters above U+7FF as being members of locale-dependent character classes such as [[:alpha:]]. (Actually, the same problem occurs for large pg_wchar values in any multibyte encoding, but UTF8 is the only case people have actually complained about.) It's impractical to get Spencer's original code to handle character classes or ranges containing many thousands of characters, because it insists on considering each member character individually at regex compile time, whether or not the character will ever be of interest at run time.

To fix, choose a cutoff point MAX_SIMPLE_CHR below which we process characters individually as before, and deal with entire ranges or classes as single entities above that. We can actually make things cheaper than before for chars below the cutoff, because the color map can now be a simple linear array for those chars, rather than the multilevel tree structure Spencer designed. It's more expensive than before for chars above the cutoff, because we must do a binary search in a list of high chars and char ranges used in the regex pattern, plus call iswalpha() and friends for each locale-dependent character class used in the pattern. However, multibyte encodings are normally designed to give smaller codes to popular characters, so that we can expect that the slow path will be taken relatively infrequently. In any case, the speed penalty appears minor except when we have to apply iswalpha() etc. to high character codes at runtime --- and the previous coding gave wrong answers for those cases, so whether it was faster is moot.

Tom Lane, reviewed by Heikki Linnakangas

Discussion: <[email protected]>
Diffstat (limited to 'src/backend/regex/README')
-rw-r--r--    src/backend/regex/README    96
1 file changed, 67 insertions, 29 deletions
diff --git a/src/backend/regex/README b/src/backend/regex/README
index 6c9f48315e3..f08aab69e37 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -27,13 +27,14 @@ and similarly additional source files rege_*.c that are #include'd in
regexec. This was done to avoid exposing internal symbols globally;
all functions not meant to be part of the library API are static.
-(Actually the above is a lie in one respect: there is one more global
-symbol, pg_set_regex_collation in regcomp. It is not meant to be part of
-the API, but it has to be global because both regcomp and regexec call it.
-It'd be better to get rid of that, as well as the static variables it
-sets, in favor of keeping the needed locale state in the regex structs.
-We have not done this yet for lack of a design for how to add
-application-specific state to the structs.)
+(Actually the above is a lie in one respect: there are two more global
+symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are
+not meant to be part of the API, but they have to be global because both
+regcomp and regexec call them. It'd be better to get rid of
+pg_set_regex_collation, as well as the static variables it sets, in favor of
+keeping the needed locale state in the regex structs. We have not done this
+yet for lack of a design for how to add application-specific state to the
+structs.)
What's where in src/backend/regex/:
@@ -274,28 +275,65 @@ colors:
an existing color has to be subdivided.
The last two of these are handled with the "struct colordesc" array and
-the "colorchain" links in NFA arc structs. The color map proper (that
-is, the per-character lookup array) is handled as a multi-level tree,
-with each tree level indexed by one byte of a character's value. The
-code arranges to not have more than one copy of bottom-level tree pages
-that are all-the-same-color.
-
-Unfortunately, this design does not seem terribly efficient for common
-cases such as a tree in which all Unicode letters are colored the same,
-because there aren't that many places where we get a whole page all the
-same color, except at the end of the map. (It also strikes me that given
-PG's current restrictions on the range of Unicode values, we could use a
-3-level rather than 4-level tree; but there's not provision for that in
-regguts.h at the moment.)
-
-A bigger problem is that it just doesn't seem very reasonable to have to
-consider each Unicode letter separately at regex parse time for a regex
-such as "\w"; more than likely, a huge percentage of those codes will
-never be seen at runtime. We need to fix things so that locale-based
-character classes are somehow processed "symbolically" without making a
-full expansion of their contents at parse time. This would mean that we'd
-have to be ready to call iswalpha() at runtime, but if that only happens
-for high-code-value characters, it shouldn't be a big performance hit.
+the "colorchain" links in NFA arc structs.
+
+Ideally, we'd do the first two operations using a simple linear array
+storing the current color assignment for each character code.
+Unfortunately, that's not terribly workable for large charsets such as
+Unicode. Our solution is to divide the color map into two parts. A simple
+linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
+chosen large enough to include all popular characters (so that the
+significantly-slower code paths about to be described are seldom invoked).
+Characters above that need be considered at compile time only if they
+appear explicitly in the regex pattern. We store each such mentioned
+character or character range as an entry in the "colormaprange" array in
+the colormap. (Overlapping ranges are split into unique subranges, so that
+each range in the finished list needs only a single color that describes
+all its characters.) When mapping a character above MAX_SIMPLE_CHR to a
+color at runtime, we search this list of ranges explicitly.
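+
+As a rough illustration, the runtime lookup is shaped like the function
+below.  This is only a sketch: the struct, field, and macro names are
+simplified stand-ins rather than the actual colormap definitions, and it
+ignores the locale-class columns described next.
+
+    typedef unsigned int chr;          /* stand-in for the regex chr type */
+    typedef short color;               /* stand-in for the color type */
+    #define MAX_SIMPLE_CHR 0x7FF       /* illustrative cutoff value only */
+    #define WHITE 0                    /* color of uncolored characters */
+
+    struct hirange { chr cmin, cmax; color co; };   /* inclusive range */
+
+    struct sketch_colormap {
+        color lomap[MAX_SIMPLE_CHR + 1];   /* linear array for low chrs */
+        struct hirange *ranges;            /* sorted, disjoint high ranges */
+        int nranges;
+    };
+
+    static color
+    sketch_getcolor(const struct sketch_colormap *cm, chr c)
+    {
+        if (c <= MAX_SIMPLE_CHR)
+            return cm->lomap[c];           /* O(1) array lookup */
+
+        /* binary search the ranges explicitly mentioned in the pattern */
+        int lo = 0, hi = cm->nranges - 1;
+        while (lo <= hi)
+        {
+            int mid = (lo + hi) / 2;
+            if (c < cm->ranges[mid].cmin)
+                hi = mid - 1;
+            else if (c > cm->ranges[mid].cmax)
+                lo = mid + 1;
+            else
+                return cm->ranges[mid].co;
+        }
+        return WHITE;                      /* high chr not in the pattern */
+    }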
+
+That's still not quite enough, though, because of locale-dependent
+character classes such as [[:alpha:]]. In Unicode locales these classes
+may have thousands of entries that are above MAX_SIMPLE_CHR, and we
+certainly don't want to be searching large colormaprange arrays at runtime.
+Nor do we even want to spend the time to initialize cvec structures that
+exhaustively describe all of those characters. Our solution is to compute
+exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
+For characters above that, we apply the <ctype.h> or <wctype.h> lookup
+functions at runtime for each locale-dependent character class used in the
+regex pattern, constructing a bitmap that describes which classes the
+runtime character belongs to. The per-character-range data structure
+mentioned above actually holds, for each range, a separate color entry
+for each possible combination of character class properties. That is,
+the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
+whose rows correspond to high characters or character ranges that are
+explicitly mentioned in the regex pattern, and whose columns correspond
+to sets of the locale-dependent character classes that are used in the
+regex.
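+
+Schematically, the column index for a high character is a small bitmap
+assembled from the <wctype.h> tests for just those classes the pattern
+actually uses.  (Again a simplified sketch with stand-in names, not the
+real colormap code; only two classes are shown here.)
+
+    #include <wctype.h>
+
+    /* bits are assigned only to classes actually used in the pattern */
+    #define USED_ALPHA 0x01
+    #define USED_DIGIT 0x02
+
+    static int
+    sketch_class_column(wint_t c, int classes_used)
+    {
+        int col = 0;
+
+        if ((classes_used & USED_ALPHA) && iswalpha(c))
+            col |= USED_ALPHA;
+        if ((classes_used & USED_DIGIT) && iswdigit(c))
+            col |= USED_DIGIT;
+        return col;        /* selects a column within the chr's range row */
+    }
+
+The character's color is then the 2-D array entry selected by the row for
+the range containing it and the column computed above.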
+
+As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
+(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
+a high color map with three rows. One row is for the single character
+U+1234 (represented as a single-element range), one is for the range
+U+1D100..U+1D1FF, and the other row represents all remaining high
+characters. The color map has two columns, one for characters that
+satisfy iswalnum() and one for those that don't.
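+
+Laid out as a table, that map looks roughly like this (the color names
+are arbitrary placeholders; in practice some cells may share colors or be
+dead entries, as discussed below):
+
+                                   iswalnum()      not iswalnum()
+    U+1234 (singleton range)       color A         color B
+    U+1D100..U+1D1FF               color C         color D
+    all other high characters      color E         color F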
+
+We build this color map in parallel with scanning the regex. Each time
+we detect a new explicit high character (or range) or a locale-dependent
+character class, we split existing entry(s) in the high color map so that
+characters we need to be able to distinguish will have distinct entries
+that can be given separate colors. Often, though, single entries in the
+high color map will represent very large sets of characters.
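+
+For a single existing range, the split against a newly mentioned range
+[a, b] looks roughly like this (an illustrative sketch only, omitting the
+bookkeeping for the per-column color entries of the new pieces):
+
+    typedef unsigned int chr;              /* stand-in for the chr type */
+    struct srange { chr cmin, cmax; };     /* inclusive range */
+
+    /*
+     * Split "old" against [a, b]; returns how many disjoint pieces were
+     * produced (1..3), each lying wholly inside or wholly outside [a, b].
+     */
+    static int
+    sketch_split(struct srange old, chr a, chr b, struct srange out[3])
+    {
+        int n = 0;
+
+        if (old.cmax < a || old.cmin > b)
+        {
+            out[n++] = old;                /* no overlap: keep unchanged */
+            return n;
+        }
+        if (old.cmin < a)                  /* piece below the new range */
+            out[n++] = (struct srange) {old.cmin, a - 1};
+        out[n++] = (struct srange) {       /* the overlapping piece */
+            old.cmin > a ? old.cmin : a,
+            old.cmax < b ? old.cmax : b
+        };
+        if (old.cmax > b)                  /* piece above the new range */
+            out[n++] = (struct srange) {b + 1, old.cmax};
+        return n;
+    }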
+
+If there are both explicit high characters/ranges and locale-dependent
+character classes, we may have entries in the high color map array that
+have non-WHITE colors but don't actually represent any real characters.
+(For example, in a row representing a singleton range, only one of the
+columns could possibly be a live entry; it's the one matching the actual
+locale properties for that single character.) We don't currently make
+any effort to reclaim such colors. In principle it could be done, but
+it's not clear that it's worth the trouble.
Detailed semantics of an NFA