Commit d3933e0

Fix regression test for GH-9535 on PHP-8.2+
Some of the legacy text encodings used in this regression test are deprecated in PHP-8.2+, and the deprecation warnings break the expected output. Since using these encodings in mbstring is now deprecated, I think there is little point in keeping them in this test, so they are now removed from it.

Further, in 219fff3, I made a change to avoid a situation where the legacy UTF7-IMAP conversion code gets stuck in a wrong state when its attempt to emit a character fails. When a Base64-encoded section of input ended with -, the previous code would FIRST emit a character if necessary (using the CK or "check" macro, which causes the function to return immediately if the downstream filter function returns an error code), and THEN update its own state to indicate that it is now in ASCII rather than Base64 mode. If the downstream filter function returned an error code, the CK macro would cause the UTF7-IMAP filter function to return immediately WITHOUT setting its own state to indicate that the Base64-encoded section was done. I fixed this by updating the filter state as needed BEFORE calling CK... but I missed updating the filter state in the case where the Base64 section ends normally and there is no need to emit anything. This commit adds the missing state update.

Again, in 6d525a4, I modified the legacy conversion code for ISO-2022-KR to comply more closely with the RFC for this text encoding. The RFC states that before any occurrence of 'Shift In' or 'Shift Out' codes in an ISO-2022-KR string, a special escape sequence must appear at least ONCE, at the beginning of a line. The previous code did not comply with this requirement. I made it comply by always emitting this escape sequence at the beginning of the first line. Since mb_strcut (wrongly) determines when it has consumed enough of the input string by looking at the length of its output in bytes, this extra escape sequence makes mb_strcut consume 4 bytes fewer of an ISO-2022-KR string than would otherwise be the case.

When this strange behavior of mb_strcut is fixed, this test will have to be adjusted to restore the previous expected outputs for ISO-2022-KR.
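
The byte arithmetic behind the changed ISO-2022-KR expectations can be modelled in a few lines of C. This is only a rough sketch under stated assumptions, not mb_strcut's actual implementation: it assumes pure ASCII input (one output byte per input byte), and the names `ascii_bytes_consumed` and `KR_HEADER_LEN` are hypothetical. The only facts taken from the commit are that the escape sequence 'ESC $ ) C' is 4 bytes long and that the stopping condition looks at the output length.

```c
#include <stddef.h>

/* Length of the ISO-2022-KR escape sequence 'ESC $ ) C' (0x1B 0x24 0x29 0x43). */
#define KR_HEADER_LEN 4

/*
 * Hypothetical model of a cut that stops based on OUTPUT length.
 * in_len      - number of ASCII bytes available in the input
 * byte_limit  - requested cut length in bytes
 * emit_header - nonzero if the converter emits the 4-byte header first
 * Returns how many input bytes fit before the output would exceed byte_limit.
 */
static size_t ascii_bytes_consumed(size_t in_len, size_t byte_limit, int emit_header)
{
    /* The header, if emitted, already counts against the output budget. */
    size_t out_len = emit_header ? KR_HEADER_LEN : 0;
    size_t consumed = 0;

    /* Each ASCII input byte adds exactly one output byte in this model. */
    while (consumed < in_len && out_len + 1 <= byte_limit) {
        out_len++;
        consumed++;
    }
    return consumed;
}
```

Under this model, a limit of 10 bytes admits 10 input bytes without the header and 6 with it, and a limit of 2 bytes admits nothing once the header is counted. That is consistent with the new expected outputs `ISO-2022-KR: AAAAAA` and the empty `ISO-2022-KR:` lines in the test.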
1 parent 6cbc911 commit d3933e0

2 files changed: +24 -39 lines changed

ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c

+3
@@ -147,6 +147,9 @@ int mbfl_filt_conv_utf7imap_wchar(int c, mbfl_convert_filter *filter)
 					 * or it could be that it ended on the first half of a surrogate pair */
 					filter->cache = filter->status = 0;
 					CK((*filter->output_function)(MBFL_BAD_INPUT, filter->data));
+				} else {
+					/* Base64-encoded section properly terminated by - */
+					filter->cache = filter->status = 0;
 				}
 			} else { /* illegal character */
 				filter->cache = filter->status = 0;
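
The ordering issue this hunk addresses can be shown in isolation. The following is a minimal sketch, not the libmbfl code: `toy_filter`, `toy_end_base64`, `TOY_BAD_INPUT` and the two output callbacks are hypothetical stand-ins, and the `CK` macro is written here on the assumption that it returns early on a negative result, as the commit message describes. The point is that the filter state is reset on every path, and always before any call that can return early.

```c
/* Toy stand-in for the converter state; field names mirror the diff. */
typedef struct toy_filter {
    int status;                                 /* nonzero while inside a Base64 section */
    int cache;                                  /* pending bits from the Base64 section */
    int (*output_function)(int c, void *data);  /* downstream filter */
    void *data;
} toy_filter;

/* Hypothetical error marker passed downstream. */
#define TOY_BAD_INPUT (-1)

/* Assumed shape of the "check" macro: return at once if downstream fails. */
#define CK(statement) do { if ((statement) < 0) return -1; } while (0)

/* Handle the '-' that closes a Base64 section. */
static int toy_end_base64(toy_filter *f, int leftover_is_invalid)
{
    if (leftover_is_invalid) {
        /* Reset state BEFORE emitting, so an early return cannot leave
         * the filter stuck in Base64 mode. */
        f->cache = f->status = 0;
        CK(f->output_function(TOY_BAD_INPUT, f->data));
    } else {
        /* Normal termination: nothing to emit, but the state must
         * still be reset. This is the branch the commit adds. */
        f->cache = f->status = 0;
    }
    return 0;
}

/* Downstream callback that always fails. */
static int toy_failing_output(int c, void *data)
{
    (void)c; (void)data;
    return -1;
}

/* Downstream callback that counts calls; data must point to an int. */
static int toy_counting_output(int c, void *data)
{
    (void)c;
    (*(int *)data)++;
    return 0;
}
```

With `toy_failing_output` installed, `toy_end_base64` returns -1 but still leaves `status` and `cache` at 0. With `toy_counting_output` and a clean termination, nothing is emitted and the state is reset as well.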

ext/mbstring/tests/gh9535.phpt

+21-39
@@ -5,9 +5,6 @@ mbstring
 --FILE--
 <?php
 $encodings = [
-    'BASE64',
-    'HTML-ENTITIES',
-    'Quoted-Printable',
     'UTF-16',
     'UTF-16BE',
     'UTF-16LE',
@@ -58,6 +55,8 @@ echo PHP_EOL;
 
 $input = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';
 $bytes_length = 10;
+// For ISO-2022-KR, the initial escape sequence 'ESC $ ) C' will occupy 4 bytes of the output;
+// this will make mb_strcut only pick out 6 'A' characters from the input string and not 10
 foreach($encodings as $encoding) {
     $converted_str = mb_convert_encoding($input, $encoding, mb_internal_encoding());
     $cut_str = mb_strcut($converted_str, 0, $bytes_length, $encoding);
@@ -69,24 +68,22 @@ echo PHP_EOL;
 
 $input = '???';
 $bytes_length = 2;
+// ISO-2022-KR will be affected by the initial escape sequence as stated above
 foreach($encodings as $encoding) {
     $converted_str = mb_convert_encoding($input, $encoding, mb_internal_encoding());
     $cut_str = mb_strcut($converted_str, 0, $bytes_length, $encoding);
     $reconverted_str = mb_convert_encoding($cut_str, mb_internal_encoding(), $encoding);
-    echo $encoding.': '.$reconverted_str.PHP_EOL;
+    echo $encoding.trim(': '.$reconverted_str).PHP_EOL;
 }
 
 echo PHP_EOL;
 
 foreach($encodings as $encoding) {
-    var_dump(mb_strcut($input, 0, $bytes_length, $encoding));
+    echo $encoding.trim(': '.mb_strcut($input, 0, $bytes_length, $encoding)).PHP_EOL;
 }
 
 ?>
---EXPECTF--
-BASE64: 宛如繁
-HTML-ENTITIES: 宛&#22914
-Quoted-Printable: %s
+--EXPECT--
 UTF-16: 宛如繁星般宛如
 UTF-16BE: 宛如繁星般宛如
 UTF-16LE: 宛如繁星般宛如
@@ -101,9 +98,6 @@ CP50220: 宛如繁星
 CP50221: 宛如繁星
 CP50222: 宛如繁星
 
-BASE64: 星のように
-HTML-ENTITIES: 星の&#12
-Quoted-Printable: 星の
 UTF-16: 星のように月のように
 UTF-16BE: 星のように月のように
 UTF-16LE: 星のように月のように
@@ -118,9 +112,6 @@ CP50220: 星のように月の
 CP50221: 星のように月の
 CP50222: 星のように月の
 
-BASE64: %s
-HTML-ENTITIES: あa&
-Quoted-Printable: あa
 UTF-16: あaいb
 UTF-16BE: あaいb
 UTF-16LE: あaいb
@@ -135,9 +126,6 @@ CP50220: あa
 CP50221: あa
 CP50222: あa
 
-BASE64: AAAAAA
-HTML-ENTITIES: AAAAAAAAAA
-Quoted-Printable: AAAAAAAAAA
 UTF-16: AAAAA
 UTF-16BE: AAAAA
 UTF-16LE: AAAAA
@@ -146,15 +134,12 @@ UTF7-IMAP: AAAAAAAAAA
 ISO-2022-JP-MS: AAAAAAAAAA
 GB18030: AAAAAAAAAA
 HZ: AAAAAAAAAA
-ISO-2022-KR: AAAAAAAAAA
+ISO-2022-KR: AAAAAA
 ISO-2022-JP-MOBILE#KDDI: AAAAAAAAAA
 CP50220: AAAAAAAAAA
 CP50221: AAAAAAAAAA
 CP50222: AAAAAAAAAA
 
-BASE64:%s
-HTML-ENTITIES: ??
-Quoted-Printable: ??
 UTF-16: ?
 UTF-16BE: ?
 UTF-16LE: ?
@@ -163,25 +148,22 @@ UTF7-IMAP: ??
 ISO-2022-JP-MS: ??
 GB18030: ??
 HZ: ??
-ISO-2022-KR: ??
+ISO-2022-KR:
 ISO-2022-JP-MOBILE#KDDI: ??
 CP50220: ??
 CP50221: ??
 CP50222: ??
 
-string(0) ""
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
+UTF-16: ??
+UTF-16BE: ??
+UTF-16LE: ??
+UTF-7: ??
+UTF7-IMAP: ??
+ISO-2022-JP-MS: ??
+GB18030: ??
+HZ: ??
+ISO-2022-KR:
+ISO-2022-JP-MOBILE#KDDI: ??
+CP50220: ??
+CP50221: ??
+CP50222: ??
