David Mitchell [Mon, 28 Apr 2025 12:25:09 +0000 (13:25 +0100)]
Deparse: fix 3-arg substr_left deparsing
v5.41.7-43-gcdbed2a40e introduced the OP_SUBSTR_LEFT op,
which is an optimised version of OP_SUBSTR for when the offset
is zero and the replacement string is missing or ''.
Unfortunately the deparsing for this OP missed out the closing
parenthesis when the replacement string wasn't present; i.e.:
Karl Williamson [Tue, 22 Apr 2025 17:00:27 +0000 (11:00 -0600)]
locale.c: Add debug for MB_CUR_MAX
The latest MacOS release has locales that set this wrongly. This isn't
the first time this has been a problem. Add a debugging line that can
easily be enabled to check for problems that may arise in the field.
Karl Williamson [Thu, 30 Jan 2025 11:26:23 +0000 (04:26 -0700)]
sv.h: SvPVx now equivalent to SvPV, et al.
These macros suffixed with 'x' are guaranteed to evaluate their
arguments just once. Prior to this commit, they used PL_Sv to
accomplish that. But since 1ef9039bccb in 5.37, the macros they call
only evaluate their arguments once, so the PL_Sv is superfluous.
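To illustrate the single-evaluation point, here is a minimal sketch. It is not the sv.h code: toy_sv, TOY_PV_NAIVE, TOY_PV and fetch_sv are hypothetical names, and backing the macro with an inline function is just one assumed way to get single evaluation.

```c
#include <stddef.h>

/* Hypothetical stand-in for an SV: a string buffer plus its length. */
typedef struct {
    const char *pv;
    size_t      len;
} toy_sv;

/* Naive macro form: 'sv' appears twice in the expansion, so an
 * argument with side effects runs twice. */
#define TOY_PV_NAIVE(sv, lenvar) ((lenvar) = (sv)->len, (sv)->pv)

/* Function-backed form: the argument expression is evaluated once, at
 * the call site, before the function body runs. */
static inline const char *toy_pv_helper(const toy_sv *sv, size_t *lenp)
{
    *lenp = sv->len;
    return sv->pv;
}
#define TOY_PV(sv, lenvar) toy_pv_helper((sv), &(lenvar))

/* An argument expression with a visible side effect. */
static int fetch_count = 0;
static toy_sv the_sv = { "hello", 5 };

static toy_sv *fetch_sv(void)
{
    fetch_count++;
    return &the_sv;
}

/* Count how many times the naive macro evaluates its argument. */
int evals_naive(void)
{
    size_t len;
    const char *p;

    fetch_count = 0;
    p = TOY_PV_NAIVE(fetch_sv(), len);
    (void) p;
    (void) len;
    return fetch_count;
}

/* Count how many times the function-backed macro evaluates it. */
int evals_function_backed(void)
{
    size_t len;
    const char *p;

    fetch_count = 0;
    p = TOY_PV(fetch_sv(), len);
    (void) p;
    (void) len;
    return fetch_count;
}
```

Once the underlying macro behaves like TOY_PV, a separate 'x' variant that copies its argument into a temporary first has nothing left to protect against.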
Tony Cook [Mon, 9 Dec 2024 05:24:19 +0000 (16:24 +1100)]
cygwin: give cygperl*.dll an explicit base address
Cygwin's fork emulation doesn't handle overlapping addresses
between different DLLs, since it tries to lay out the address space
of the child process to match the parent process, but if there's an
address conflict between DLLs, Windows may load those DLLs at
different addresses.
To avoid having to manually assign addresses to each DLL, since
around 5.10 we've used --enable-auto-image-base to assign load
addresses for cygperl*.dll and dynamic extension DLLs and this
has mostly worked well, but as perl has gotten larger and
cygperl*.dll has grown, we've had two cases where there's overlap
between the address space for cygperl*.dll and some extension
DLL, see #22695 and #22104.
This problem occurs because:
- cygperl*.dll is large, and with -DDEBUGGING or some other option
that increases binary size, even larger, occupying more than one of
the "slots" that the automatic image base code in ld can assign
the DLL to.
- unlike the extension DLLs, the name of cygperl*.dll changes with
every release, so we roll the dice each release on whether there
will be a conflict between cygperl*.dll and some other DLL.
Previously I've added an entry to perldelta and updated the CI
workflow to work around the conflict; this change should prevent
that particular conflict.
The addresses I've chosen here are "just" (for large values of
"just") below the base address range used by automatic address
space selection.
For 64-bit this was done by inspection, examining the output of
"rebase -i" on the extension DLLs and looking at the source of ld,
in particular:
since I don't have a 32-bit cygwin install any more, since cygwin
no longer ship it and it commonly had the fork address conflicts
discussed above.
I would have liked to make the load address configurable via
-Dcygperl_base or similar, but I didn't see a way to get the
base address to pass from Configure through to Makefile.SH.
Karl Williamson [Tue, 22 Apr 2025 14:24:36 +0000 (08:24 -0600)]
locale.c: Don't do asymmetric back out on failure
This fixes #23519
When something goes wrong doing locale-aware string collation, the code
attempts to carry on as well as can be expected. Prior to this commit
the backout code was asymmetric, trying to undo things that had not been
done. This happened when the failure was early on.
In the case of this ticket, the platform has a defective locale that was
detectable before getting very far along.
The solution adopted here is to jump to a different label for those
early failures that does less backout than for later failures.
This entry was missed because I generated the MCL entries before the unicode
patch was actually applied. Therefore, the cpan release of Module::CoreList
5.20250420 will be incorrect, but all subsequent releases (and blead itself,
from 5.41.12 or 5.42.0 onward) will be correct.
Karl Williamson [Sat, 19 Apr 2025 02:38:36 +0000 (20:38 -0600)]
mk_invlists: Restore generating EBCDIC
This had been turned off in this branch to speed up compilation, and
hence development. The code mostly changed in this branch is the same
as in ASCII anyway. It could have become an issue only if someone tries
to bisect on an EBCDIC machine, which I don't believe has happened, if
ever, in decades.
This includes updates to a few perl files that need to know the
current Unicode version, and regenerating perl files that depend on the
Unicode data.
Karl Williamson [Thu, 17 Apr 2025 16:12:12 +0000 (10:12 -0600)]
mk_invlists: Include cells in calculating column widths
This program generates tables for the Break properties that are somewhat
human readable. Before this commit, just the heading line for a column
determined its width. This commit factors in the maximum width of any
cell in the column as well. It used to be that this required a separate
pass, and so wasn't done. But now that separate pass is required anyway
for other reasons, and it is simple to add to it this check.
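The width calculation amounts to taking a maximum over the heading and every cell in the column. A minimal sketch follows; mk_invlists itself is a Perl program, and column_widths and its row-major cell layout are assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Set widths[c] to the longest of the heading and every cell in
 * column c.  'cells' holds nrows * ncols strings in row-major order. */
void column_widths(const char *const *headings,
                   const char *const *cells,
                   size_t nrows, size_t ncols,
                   size_t *widths)
{
    size_t r, c;

    /* Start from the heading line, which used to be the only input. */
    for (c = 0; c < ncols; c++)
        widths[c] = strlen(headings[c]);

    /* Widen any column that has a cell longer than its heading. */
    for (r = 0; r < nrows; r++) {
        for (c = 0; c < ncols; c++) {
            size_t w = strlen(cells[r * ncols + c]);
            if (w > widths[c])
                widths[c] = w;
        }
    }
}
```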
Karl Williamson [Thu, 17 Apr 2025 22:12:30 +0000 (16:12 -0600)]
mk_invlists: Restore calculation of new keywords, etc
Now we are ready to use a new Unicode version, we have to regenerate
everything. This was turned off earlier in this branch temporarily
until now so as to speed up the testing, as it was known these values
wouldn't change until now.
Karl Williamson [Sun, 20 Apr 2025 11:09:13 +0000 (05:09 -0600)]
UCD.t: Skip test which fails on 32 bit words
In Unicode 15.1, the ideograph U+4EAC now has a numeric value, and that
value is 10 quadrillion (1e+16). This is the first instance in Unicode
of an integer not fitting in a 32 bit word, as this requires 54 bits.
One of the tests in UCD.t requires round-trip equality in converting
from string to number and back; skip it for this case and any future
similar ones.
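The size claim can be checked with a few lines of C. This is only a sketch; bits_needed and fits_in_32 are hypothetical helpers, not anything in UCD.t or the perl source.

```c
#include <stdint.h>

/* Number of bits needed to represent v as an unsigned integer
 * (0 is treated as needing 0 bits). */
int bits_needed(uint64_t v)
{
    int n = 0;

    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* True if v is unchanged after being truncated to a 32-bit word. */
int fits_in_32(uint64_t v)
{
    return (uint64_t) (uint32_t) v == v;
}
```

For 10 quadrillion, bits_needed(10000000000000000ULL) gives 54, since 2**53 < 1e16 < 2**54, and fits_in_32() is false.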
I find it interesting that U+4EAC is listed as having the meaning
"capital city".
Karl Williamson [Wed, 16 Apr 2025 22:00:03 +0000 (16:00 -0600)]
mk_invlists: Remove hard-coded numbers
A couple of commits ago, the last necessarily-hard-coded DFA enum
besides 0 and 1 was removed. This allows for all the rest to be
assigned by using the value of an incrementing variable.
This makes it easy to add DFAs in the middle of existing ones, as will
happen as future Unicode releases come our way.
Karl Williamson [Wed, 16 Apr 2025 04:52:31 +0000 (22:52 -0600)]
mk_invlists: Look for a DFA optimization possibility
If both branches of an if/else lead to the same result, skip the test and
set the result unconditionally. That's what this commit does for DFAs
that get the same value if they succeed as when they don't.
There is one current case where the DFA can return an anomalous result,
so it can't be optimized out. Add a field to the hash entry defining
that entry, so it doesn't get optimized.
Karl Williamson [Wed, 16 Apr 2025 21:47:20 +0000 (15:47 -0600)]
mk_invlists: Remove a temporary work-around
This code was due to a few commits ago having reversed the ordering the
Unicode rules are applied in. After updating to use a generalized DFA
scheme, it is no longer needed.
Karl Williamson [Fri, 18 Apr 2025 09:21:02 +0000 (03:21 -0600)]
mk_invlists: Use new DFA scheme for horizontal white space
Perl doesn't follow the Unicode standard with regard to its treatment of
white space, in particular sequences of horizontal white space. Unicode
allows "tailoring" of its rules for local situations, and Perl
traditionally with \B has treated all sequences of white space as a
single unit. Unicode originally considered each space in a sequence of
them as a separate unit. A perl program would want them all a single
unit. Unicode eventually came round to our way of thinking, but not
entirely, as comments unaffected by this commit indicate.
The DFA for this situation does not fit in with the new stackable DFA
scheme, and would start failing tests a few commits later as the shim
code is removed. Convert to the new scheme, which allows us to call the
functions that affect a single cell twice with effect. The order is
immaterial, but one call installs a default behavior, and the other a DFA
that ends up being executed first to override that behavior in certain
(rare) situations.
Karl Williamson [Wed, 16 Apr 2025 15:28:19 +0000 (09:28 -0600)]
mk_invlists: Generalize to stack DFAs for break properties
The Unicode breaking algorithms are supposed to be implemented by
executing DFAs in priority order, stopping at the first one that
succeeds. (In many cases a DFA isn't needed, and we can unconditionally
say that there is or isn't a break at a given position simply by looking
at the characters on either side of it.)
But it was a significant amount of work to get from where perl started
to be able to do that. And it hasn't been necessary until now. In most
cases, a single DFA suffices, and where not, a more complicated single
DFA took care of the stacking.
But this has become untenable in Unicode 15.1, so I ended up doing the
work to implement their algorithm. The result is more, but simpler
DFAs, and it becomes easier to add new ones, as they don't have to
interact with other ones. The stacking does that for them.
This commit implements a separate DFA table beyond the x,y lookup table.
If the decision that this is a breakable position requires a DFA, the
x,y contents are an index into this separate table, which contains the
DFA to follow. The first element gives the case statement number to use
to execute the DFA. The second element gives the value to return if the
DFA succeeds. If it fails, the code adds 2 to the index to get the next
thing to try.
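A minimal sketch of walking such a table of pairs. The names, the table contents, and the use of an always-succeeding entry to end each stack are assumptions for illustration; the generated tables and the regexec.c code differ in detail.

```c
#include <stdbool.h>

/* Hypothetical case numbers; TOY_DFA_ALWAYS ends a stack. */
enum { TOY_DFA_ALWAYS = 0, TOY_DFA_PREV_IS_A, TOY_DFA_NEXT_IS_B };

/* Pairs of (case number, value to return on success), laid out in
 * priority order.  Contents are made up for this example. */
static const int toy_dfa_table[] = {
    TOY_DFA_PREV_IS_A, 0,   /* index 0: highest priority, no break */
    TOY_DFA_NEXT_IS_B, 1,   /* index 2: next in line, break */
    TOY_DFA_ALWAYS,    0    /* index 4: default, no break */
};

/* Stand-in for the surrounding context a real DFA would examine. */
typedef struct {
    bool prev_is_a;
    bool next_is_b;
} toy_context;

/* 'index' is what the x,y lookup table cell held.  Try each DFA in
 * turn; on failure step forward by 2 to the next pair. */
int toy_run_stacked_dfas(int index, const toy_context *ctx)
{
    for (;; index += 2) {
        bool succeeded;

        switch (toy_dfa_table[index]) {
        case TOY_DFA_PREV_IS_A:
            succeeded = ctx->prev_is_a;
            break;
        case TOY_DFA_NEXT_IS_B:
            succeeded = ctx->next_is_b;
            break;
        default:                /* TOY_DFA_ALWAYS */
            succeeded = true;
            break;
        }

        if (succeeded)
            return toy_dfa_table[index + 1];
    }
}
```

Each DFA only has to answer its own question; the priority ordering lives entirely in the table layout, which is why adding a DFA does not disturb the existing ones.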
Karl Williamson [Tue, 15 Apr 2025 13:39:28 +0000 (07:39 -0600)]
mk_invlists: Use new mktables enhancements
Now mk_invlists no longer has to know what the details are of properties
that have been split into more, smaller equivalence classes. mktables
handles that and provides the information in new hashes.
Karl Williamson [Mon, 7 Apr 2025 18:35:35 +0000 (12:35 -0600)]
mktables: Consolidate code into a single function
Some properties in Unicode essentially form equivalence classes for all
possible code points.
For example, Unicode publishes the Line Break (LB) property, where each
possible code point is given a type, like Alphabetic, or Opening
Parenthesis. All code points that act as alphabetics have the AL
equivalence class. All that act like Opening Parentheses have the OP
class.
Unicode also publishes rules as to whether it is permissible to break
between code points of any two types. For the Line Break property, you wouldn't
break a line between two alphabetics or between an opening parenthesis
and an alphabetic, but you could between a Space and almost any other
type or between a closing parenthesis and many types.
Perl uses these properties to implement the \b{lb} etc regular
expression constructs. It uses a two-dimensional array where the value
in the cell [x,y] tells whether a break is permissible between
characters of type x and characters of type y. (Some cases can't be
decided with this simple lookup, because knowing the surrounding context
is necessary to make a decision. Those are implemented as DFAs in
regexec.c.)
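The [x,y] lookup can be sketched as below. The class subset and the cell values are illustrative only; they are not copied from UAX #14 or from the generated tables, and toy_break_allowed is a hypothetical name.

```c
/* A tiny hypothetical subset of Line Break classes. */
enum toy_lb_class { TOY_AL, TOY_OP, TOY_CP, TOY_SP, TOY_LB_COUNT };

/* toy_break_table[x][y] is 1 if a break is allowed between a
 * character of class x and a following character of class y. */
static const unsigned char toy_break_table[TOY_LB_COUNT][TOY_LB_COUNT] = {
    /*            AL OP CP SP */
    /* AL */    { 0, 0, 0, 0 },
    /* OP */    { 0, 0, 0, 0 },
    /* CP */    { 0, 1, 0, 0 },
    /* SP */    { 1, 1, 0, 0 }
};

/* One array access answers the question for the simple cases. */
int toy_break_allowed(enum toy_lb_class before, enum toy_lb_class after)
{
    return toy_break_table[before][after];
}
```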
Unicode used to publish such an array for the Line Break property, and
still publishes some non-normative .html files that contain similar
information. But to really know what to do, one has to read documents
UAX#14 and UAX#29 that contain textual descriptions of the rules. These
change each new release, and are the major pain in upgrading to a new
release.
In recent releases, Unicode has mostly stopped creating new equivalence
classes as it has refined the rules for the boundary conditions. For
example, the line boundary conditions are very different for East Asian
(EA) characters than for the Western scripts. Effectively there are thus
two sets of rules. But instead of creating new equivalence classes that
reflect this reality, Unicode has chosen to just document it in those
two UAX documents. I don't know the motivation for this.
But perl wants that table to divvy up all the possible boundary
conditions, so it can continue to use the array to make most of the
decisions, so mktables splits the equivalence classes that Unicode
provides into new ones that reflect what the UAXes say. At first, I
thought this was a one-off matter, so wrote a few lines to handle a
special case; then when the next release came out, added a few more for
another one, etc. But Unicode 15.1 and 16.0 continue the trend, so it's
become an effort.
This commit consolidates the previous one-off code snippets into one
generalized function. It should be able to handle future instances
without having to craft something new each time.
It also creates a new data structure that mk_invlists.pl can look at so
that it doesn't have to repeat the logic found here, as it currently
does.
Karl Williamson [Fri, 18 Apr 2025 02:43:32 +0000 (20:43 -0600)]
regexec.c: Skip CM and ZWJ in look behind in LB parsing
The Unicode standard says that these two characters are to be ignored
for the purposes of determining if there is a Line Break just before
certain characters. That is, you have to back up in the parse string
past all adjacent ones of these, and then examine it.
This applies to any lower priority rule than LB9. This commit fixes two
cases that didn't do that.
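The look-behind described above can be sketched as follows, working on an array of already-classified characters. The enum and toy_class_before are hypothetical; regexec.c works on the actual string and its own class values.

```c
#include <stddef.h>

/* Hypothetical class values; TOY_LBK_EDGE means "start of string". */
enum toy_lbk { TOY_LBK_AL, TOY_LBK_CM, TOY_LBK_ZWJ, TOY_LBK_SP,
               TOY_LBK_EDGE };

/* Return the class of the nearest character before position 'pos'
 * that is neither CM nor ZWJ, or TOY_LBK_EDGE if there is none. */
enum toy_lbk toy_class_before(const enum toy_lbk *classes, size_t pos)
{
    while (pos > 0) {
        enum toy_lbk c = classes[--pos];

        if (c != TOY_LBK_CM && c != TOY_LBK_ZWJ)
            return c;
    }
    return TOY_LBK_EDGE;
}
```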
Karl Williamson [Fri, 18 Apr 2025 02:39:08 +0000 (20:39 -0600)]
regexec.c: Change static function API
Sometimes this functionality is needed to also skip over certain
intervening classes of characters while backing up in the parse string.
This commit creates two macros to call the modified underlying function
with a boolean flag. The names of the macros make it easy to know
what's happening.
Karl Williamson [Tue, 15 Apr 2025 15:40:10 +0000 (09:40 -0600)]
regexec.c: Change static function API
This makes it clearer to use. Instead of having a boolean flag to
change the behavior, there are now two macros that call the underlying
function, and their names reflect the desired behavior.
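The shape of such an API might look like this sketch. All names are hypothetical, and '^' merely stands in for a character that is to be skipped; the real function and macros in regexec.c are different.

```c
#include <stdbool.h>

/* Underlying function: back up from 'pos' to the previous character
 * of interest and return its index, or -1 if there is none.  When
 * 'skip_marks' is true, '^' characters are passed over. */
static int toy_backup_(const char *s, int pos, bool skip_marks)
{
    while (--pos >= 0) {
        if (!(skip_marks && s[pos] == '^'))
            return pos;
    }
    return -1;
}

/* Callers use these; the name says which behavior they get, so no
 * bare true/false appears at the call site. */
#define toy_backup_one(s, pos)            toy_backup_((s), (pos), false)
#define toy_backup_skipping_marks(s, pos) toy_backup_((s), (pos), true)
```

A call such as toy_backup_skipping_marks(s, pos) documents itself, where toy_backup_(s, pos, true) would send the reader off to look up what the flag means.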
Karl Williamson [Tue, 15 Apr 2025 13:09:51 +0000 (07:09 -0600)]
mk_invlists: Remove some special cases
These were added to compensate for reversing the order of handling the
break property rules. This commit hides the need for that in one place
per table, except for a second place for Line Break.
The only changes to the tables occur in the garbage row and column which
aren't actually accessed, so those changes are harmless.
It is a temporary commit. A few commits from now, this will be removed.
Karl Williamson [Sat, 12 Apr 2025 23:24:50 +0000 (17:24 -0600)]
mk_invlists: Remove obsolete function
This function was used when the previous scheme of applying the rules in
reverse order needed to be overridden in a few cases by prohibiting
changes to existing seemingly lower priority values. Now there's
no lower priority value in the cell that we would need to preserve.
Karl Williamson [Mon, 14 Apr 2025 10:43:08 +0000 (04:43 -0600)]
mk_invlists: Reverse order of break property rules
Before this commit, the rules for populating the tables for break
properties were laid out in reverse order, so that the lowest priority
rule was executed first. It filled a cell, which then would be
overwritten by any higher priority rule that applied to it.
This reverse order made it harder to compare the rules with the text of
the Unicode rules these are trying to implement.
This commit changes things to have the rules in the same order as
Unicode lists them.
The previous scheme had certain advantages that this has to make up for
by using temporary code to override what would otherwise have gone into
the cells. This code will no longer be needed in a few commits when a
general purpose stacking DFA scheme is implemented.
As a result of this temporary code, only two cells in one property
change in this complete reversal. They change to using a
DFA which ends up returning the same results as the original
unconditional value.
Karl Williamson [Fri, 28 Mar 2025 09:13:20 +0000 (03:13 -0600)]
mk_invlists/regexec.c: Generate and use macros
With this commit, mk_invlists.pl now generates #define macros isFOO that
regexec.c now uses to determine if a character is in a particular line
breaking class. Previously, x == foo was used. This change insulates
the code from having to worry about when classes get changed to be
combinations.
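A sketch of why the macros help. The enum values and macro names are made up for this example and are not the generated ones: once a class is split, a plain equality test silently misses part of it, while the macro can be regenerated to cover every component.

```c
/* Hypothetical table classes after OP has been split in two. */
enum toy_lbm { TOY_LBM_AL, TOY_LBM_OP, TOY_LBM_EA_OP, TOY_LBM_CP };

/* In the spirit of a generated isFOO macro: "OP" in the rules now
 * means either component of the split. */
#define isTOY_OP(x) ((x) == TOY_LBM_OP || (x) == TOY_LBM_EA_OP)
#define isTOY_AL(x) ((x) == TOY_LBM_AL)

/* Caller written against the macro; unaffected by the split. */
int toy_is_opening(enum toy_lbm x)
{
    return isTOY_OP(x);
}

/* Caller written the old way; wrong for TOY_LBM_EA_OP. */
int toy_is_opening_old(enum toy_lbm x)
{
    return x == TOY_LBM_OP;
}
```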
Karl Williamson [Thu, 27 Mar 2025 13:17:58 +0000 (07:17 -0600)]
mk_invlists: Improve DFA names
This commit now imposes more structure on the names.
The names are sort of pseudo code that lays out what the DFA is to do.
The most significant change is to standardize what has been done in
recent commits with newly added DFAs. And that is to use the string
'_v_' in the name, where the tip of the 'v' points to the position in
the input string being processed to which this rule applies.
Karl Williamson [Sat, 12 Apr 2025 23:32:46 +0000 (17:32 -0600)]
mk_invlists: Use 'for' statement modifier
This significantly cuts down on the verbiage, and makes the rules in
this file more closely match the text from which they are derived in UAX
14 and UAX 29.
Karl Williamson [Thu, 27 Mar 2025 00:37:45 +0000 (18:37 -0600)]
mk_invlists: Use abbreviations for Line Break
Unicode UAX #14 gives rules for the Line Break property using the short
names for them. Prior to this commit, we mostly used the full names for
the classes in this property. This commit changes to use the short
names. This makes it easier to compare the code here with the UAX text.
The abbreviations aren't always straightforward, so it was easy to go
astray.
Karl Williamson [Wed, 9 Apr 2025 10:18:57 +0000 (04:18 -0600)]
mk_invlists: Use new set subtraction ability
This allows the removal of some combinatorial complexity, thus showing
a bug in which the combination of PO to EOP had not been added when it
should have been.
Currently, mktables splits the Line Break OP and CP classes into East
Asian ones, and the remainders. The extra combinations occurred because
the code here needed to take every existing OP and add an East_Asian
(EA_OP) equivalent; same with CP. It's easy to miss one, and I did.
This commit allows this split to be hidden from most places in
mk_invlists.
Karl Williamson [Sun, 30 Mar 2025 21:52:23 +0000 (15:52 -0600)]
mk_invlists: Use new split capability with AHLetter
The description in UAX #29 of Unicode's Word Break property uses two
convenience macros to simplify some of their rules.
The split capability introduced several commits ago allows this program
to follow along, making the rules here more closely aligned to the text
in UAX 29, hence simpler.
This commit creates one macro, AHLetter; the next commit does the other
macro.
The names of the DFAs involving this macro are changed to correspond.
Karl Williamson [Sat, 12 Apr 2025 19:39:41 +0000 (13:39 -0600)]
mk_invlists: Use new split capability with ExtPict
\p{Extended_Pictographic} is not fully implemented yet because unlike
other properties, it can match a string instead of a single character.
And it is kind of a kludge here. The 14.0 release was analyzed by me and
the rules here were customized based on that analysis. For example, in
the Line Break property, a clause was added by Unicode to Rule LB30b
that required taking the intersection of this property and all the
Unassigned code points. It turns out that everything in that
intersection had the Line Break class of Ideographic, so I modified
mktables to split the Ideographic class into two components, the
elements of the intersection went into the long-named
"Unassigned_Extended_Pictographic_Ideographic" and plain Ideographic was
left with the remainder. To match all of Ideographic you have to
specify both classes. By using the new split capability, this can be
done effectively as a macro expansion, and the special cases can be
removed from the code. This commit does this.
Similarly, both the Word Break and Grapheme Cluster Break properties
have somewhat different interactions with Extended_Pictographic that
this commit smooths over.
This situation is brittle. A new release of Unicode might change things
so that Ideographic isn't the only LB class in the intersection
mentioned above, so the customization has to be checked in every
release. A few commits later in this branch, this will be automated,
and no longer a concern.
Karl Williamson [Fri, 18 Apr 2025 01:05:43 +0000 (19:05 -0600)]
mk_invlists: Use new split capability with ALetter
ALetter also contains the class ExtPict_LE. Prior to this commit, there
had to be a rule for each ALetter doing the same thing with ExtPict_LE.
But the new splits capability allows ALetter to expand automatically to
both.
This uncovers a bug. There should have been a rule
WB5 ALetter x ExtPict_LE
which was missing.
Karl Williamson [Wed, 26 Mar 2025 11:34:45 +0000 (05:34 -0600)]
mk_invlists: Add effectively macro expansions
Unicode's Word Break rules have shortcut names that really mean multiple
ones. For example, AHLetter means either ALetter or Hebrew_Letter.
This commit allows "macros" to be defined like this so that the
statements in this file more closely resemble those of the Unicode text.
More importantly, Unicode's rules in recent times need subdivided
equivalence classes, such as Alphabetics that are also East Asian. What
has been done so far is when that happened, extra rules were added that
were all possible combinations of these subdivisions. It is easy to
miss a combination; and it turns out there are bugs. This new
capability allows us to say that an Alphabetic (ALetter) is a
combination of plain ALetters plus East Asian letters, and the code
generates all the combinations automatically. This makes the text
cleaner and safer.
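The expansion is a cross product over the components of each name. mk_invlists is Perl, so the following C is only a sketch of the idea; the enum, the arrays and the function names are assumptions, with the component names borrowed from the commit messages in this branch.

```c
#include <stddef.h>

/* Hypothetical table classes after ALetter has been subdivided. */
enum { TOY_WB_ALETTER_PLAIN, TOY_WB_EXTPICT_LE, TOY_WB_HEBREW_LETTER,
       TOY_WB_OTHER, TOY_WB_COUNT };

/* "Macros": one rule name standing for several table classes. */
static const int toy_ALetter[]  = { TOY_WB_ALETTER_PLAIN,
                                    TOY_WB_EXTPICT_LE };
static const int toy_AHLetter[] = { TOY_WB_ALETTER_PLAIN,
                                    TOY_WB_EXTPICT_LE,
                                    TOY_WB_HEBREW_LETTER };

#define TOY_COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Store v in every cell [b][a] for b in 'before' and a in 'after',
 * so no combination can be forgotten. */
void toy_set_rule(unsigned char t[TOY_WB_COUNT][TOY_WB_COUNT],
                  const int *before, size_t nbefore,
                  const int *after, size_t nafter,
                  unsigned char v)
{
    size_t i, j;

    for (i = 0; i < nbefore; i++)
        for (j = 0; j < nafter; j++)
            t[before[i]][after[j]] = v;
}

/* Example: default every cell to "break", then write one rule of the
 * form "AHLetter x AHLetter" (no break) with a single call. */
void toy_apply_example_rules(unsigned char t[TOY_WB_COUNT][TOY_WB_COUNT])
{
    size_t i, j;

    for (i = 0; i < TOY_WB_COUNT; i++)
        for (j = 0; j < TOY_WB_COUNT; j++)
            t[i][j] = 1;

    toy_set_rule(t, toy_AHLetter, TOY_COUNT_OF(toy_AHLetter),
                    toy_AHLetter, TOY_COUNT_OF(toy_AHLetter), 0);
}
```

One call fills nine cells here; written out by hand that would be nine separate rules, any one of which could be missed.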
Karl Williamson [Sun, 23 Mar 2025 12:37:23 +0000 (06:37 -0600)]
mk_invlists: Add ability to specify a complement of list
And use it in one instance.
Previous commits have added the ability to pass multiple items simply to
the functions that work on rows and columns. This now gives the ability
to complement the set of the multiple items passed.
Karl Williamson [Sun, 23 Mar 2025 12:19:32 +0000 (06:19 -0600)]
mk_invlists: Add ability to specify an entire row simply
Instead of having to loop through all the cells of a row or column, this
commit uses '*' to represent the whole thing. This is more in keeping
with the text of the Unicode rules which just leaves things blank if it
means everything.
Karl Williamson [Wed, 9 Apr 2025 12:47:54 +0000 (06:47 -0600)]
mk_invlists: Set values in unused table cells to 0
These cells exist so that code is less likely to need to be changed when
a new Unicode release comes along. Currently it doesn't matter at all
what is in those cells, because they are never read. But future commits
will want to make sure they don't refer to DFAs that are obsolete,
references to which could be undefined symbols that would abort the
compilation.
The choice of 0 or 1 to put in the cells was arbitrary; I know of no
reason to prefer one or the other.
Karl Williamson [Sun, 23 Mar 2025 10:29:39 +0000 (04:29 -0600)]
mk_invlists: Set and get break table values with functions
Previously, we would just set an individual element directly. This
changes most of those to use function calls instead. This has two main
benefits. The function can change what's being done without having to
change many lines; and these sets had a lot of visual noise with sigils
and hash references. The result is a lot easier to read.
The next few commits will continue this process.
Note that the generated tables are unchanged by this commit. It has no
effect on runtime processing. That will be true of the next commits as
well.
It became obvious in doing this that the rule for Perl_Tailored_HSpace
does not belong in the 3's, but comes immediately before that.
Arbitrarily use '2z'.