Skip to content

The behavior of mb_strcut in mbstring has been changed in PHP8.1 #9535

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pakutoma opened this issue Sep 14, 2022 · 5 comments
Closed

The behavior of mb_strcut in mbstring has been changed in PHP8.1 #9535

pakutoma opened this issue Sep 14, 2022 · 5 comments

Comments

@pakutoma
Copy link
Contributor

Description

When using the mb_strcut function to cut out ISO-2022-JP encoded strings, depending on the number of bytes specified, the behavior is different in PHP 8.1 than before.

The following code:

<?php
$input = 'あaいb';
$bytes_length = 10;
$encoding = "ISO-2022-JP";
$converted_str = mb_convert_encoding($input, $encoding, mb_internal_encoding());
$cut_str = mb_strcut($converted_str, 0, $bytes_length, $encoding);
$reconverted_str = mb_convert_encoding($cut_str, mb_internal_encoding(), $encoding);
var_dump($reconverted_str);

Resulted in this output:

string(5) "あa?"

But I expected this output instead:

string(4) "あa"

This behavior has changed since PHP 8.1.
3v4l: https://3v4l.org/FaVWR

I believe the behavior up to PHP 8.0 is correct.
This is because PHP 8.1 and later output contains a "?", which is not supposed to be there.

PHP Version

PHP 8.1.10

Operating System

No response

@cmb69
Copy link
Member

cmb69 commented Sep 14, 2022

Although the behavior is not documented exactly, this looks like a bug. We likely want to truncate partial characters (code points) at the end. This apparently works nicely for UTF-8, see e.g. https://3v4l.org/oZgCU.

NathanFreeman added a commit to NathanFreeman/php-src that referenced this issue Sep 16, 2022
NathanFreeman added a commit to NathanFreeman/php-src that referenced this issue Sep 19, 2022
@youkidearitai
Copy link
Contributor

youkidearitai commented Sep 24, 2022

Hi, I have a question. mb_strcut is unaffected of mb_substitute_character that manual written by below.

This setting affects mb_convert_encoding(), mb_convert_variables(), mb_output_handler(), and mb_send_mail().

However, there are other function affected by mb_substitute_character (e.g.: mb_convert_case https://3v4l.org/J7kVC )

For this issue, the 10th byte of 'あaいb' is 0x1b, meaning the escape sequence, then converted ?. (3v4l: https://3v4l.org/Intbd )
I'm curious why mb_substitute_character doesn't work.

If this other subject, then I open new feature request.

@NathanFreeman
Copy link
Contributor

@youkidearitai Hi.
I think then converted ? is a valid character now so mb_substitute_character doesn't work.

@youkidearitai
Copy link
Contributor

@NathanFreeman Thank you for reply.
My question was "Why mb_strcut is doesn't work of mb_substitute_character". But I understand that mb_substitute_character originally (since PHP 4.3.0 https://3v4l.org/s0gbn#v430) doesn't work.

Although the question remains why mb_convert_case is affected by mb_substitute_character which isn't in the manual(3v4l: https://3v4l.org/2lqgg#v430), this is another issue, thank you for your time.

@alexdowad
Copy link
Contributor

@youkidearitai Can you read C code? I would be happy to point you to the lines in the source code which would answer your question. Or do you just want a high-level answer which does not require delving into the details of the implementation?

@pakutoma Thanks for reporting this issue. It is nice to see that @NathanFreeman is working on a fix. I will try to guide him so he can complete this successfully.

There are some very tricky issues involved with the implementation of mb_strcut, but I think we can work them out.

Just for information, the unexpected change in behavior which you have kindly reported was (indirectly) caused by 4299e2d.

NathanFreeman added a commit to NathanFreeman/php-src that referenced this issue Oct 10, 2022
find the position of illegal subcharacters and modify the value of
device.pos

add ((encoder->status && ((encoder->status & 0xF) || (encoder->status == 0x11))) || encoder->cache) condition.

fix test

fix test

delay initialization parameter illegal_substchar

optimize code
alexdowad added a commit that referenced this issue Nov 13, 2022
* PHP-8.1:
  [ci skip] NEWS
  Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
alexdowad added a commit that referenced this issue Nov 13, 2022
* PHP-8.2:
  [ci skip] NEWS
  Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
alexdowad added a commit that referenced this issue Nov 14, 2022
Some of the legacy text encodings which were used in this regression
test are deprecated in PHP-8.2+. The deprecation warnings break the
expected output. Since using these encodings in mbstring is now
deprecated, I think there is little point in keeping them in this test.
So they are now removed from it.

Further, in 219fff3, I made a change to avoid a situation where the
legacy UTF7-IMAP conversion code gets stuck in a wrong state when its
attempt to emit a character fails. When a Base64-encoded section of
input ended with -, the previous code would FIRST emit a character if
necessary (using the CK or "check" macro, which causes the function to
return immediately if the downstream filter function returns an error
code), and THEN update its own state to indicate that it is now in
ASCII rather than Base64 mode.

If the downstream filter function returned an error code, the CK macro
would then cause the UTF7-IMAP filter function to return immediately
WITHOUT setting its own state to indicate that the Base64-encoded
section was done.

I fixed this by updating the filter state as needed BEFORE calling CK...
but I missed updating the filter state in the case where the Base64
section ends normally and there is no need to emit anything.

Again, in 6d525a4, I modified the legacy conversion code for
ISO-2022-KR to try to comply more closely with the RFC for this
text encoding. The RFC states that before any occurrence of 'Shift In'
or 'Shift Out' codes in a ISO-2022-KR string, a special escape
sequence must appear at least ONCE, at the beginning of a line.
The previous code did not comply with this requirement. I made it
comply by always emitting this escape sequence at the beginning of
the first line.

Since mb_strcut (wrongly) determines when it has consumed enough of
the input string by looking at the length of its output in bytes, this
extra escape sequence makes mb_strcut consume 4 bytes less of an
ISO-2022-KR string than would otherwise be the case. When this
strange behavior of mb_strcut is fixed, this test will have to be
adjusted to restore the previous expected outputs for ISO-2022-KR.
alexdowad added a commit that referenced this issue Nov 14, 2022
* PHP-8.2:
  Fix regression test for GH-9535 on PHP-8.2+
TimWolla added a commit to TimWolla/php-src that referenced this issue Nov 15, 2022
* master: (274 commits)
  Cache UTF-8-validity status of strings in GC flags
  Escape the role attribute of namespaced classes (php#9952)
  Fix phpGH-9932: Discards further characters for session name.
  Fix phpGH-9890: OpenSSL legacy providers not available on Windows
  Fix regression test for phpGH-9535 on PHP-8.2+
  Fix memory leak
  Introduce TEST_FPM_EXTENSION_DIR for FPM tests with shared extensions
  [ci skip] NEWS
  Fix phpGH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
  [ci skip] NEWS
  [ci skip] NEWS
  Fix phpGH-9298: remove all registered signal handlers in pcntl RSHUTDOWN
  Fix phpGH-9923: Add the `SIGINFO` constant in pcntl for system supporting it.
  Skip tests if extension or SAPI is not included. (php#9939)
  Remove unused PHP 8.1 BC layer in JIT (php#9937)
  [skip ci] Skip function JIT in nightly for ASAN
  [skip ci] Backport XFAIL of failing test
  Disable opcache file_cache for observer preloading test
  No more need to cater to mime_magic extension
  [ci skip] Fix phpGH-9918: License information for xxHash is not included in README.REDIST.BINS file
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
@youkidearitai @alexdowad @cmb69 @pakutoma @NathanFreeman and others