From: "jeremyevans0 (Jeremy Evans)"
Date: 2022-06-20T22:11:56+00:00
Subject: [ruby-core:109025] [Ruby master Feature#18838] Avoid swallowing regexp escapes in the lexer

Issue #18838 has been updated by jeremyevans0 (Jeremy Evans).

Backport deleted (2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN)
ruby -v deleted (3.0.3)
Subject changed from Regexp#source behaves inconsistently with / to Avoid swallowing regexp escapes in the lexer
Tracker changed from Bug to Feature

In the `/\//` and `%r/\//` cases, the regexp source is transformed from `\/` to `/` in the lexer (`tokadd_string`) before it even hits the parser, let alone the regexp engine. From a regexp perspective, `/\//` and `%r/\//` are treated as `Regexp.new('/')`, and `%r{\/}` as `Regexp.new('\/')`.

Regexp#source should provide the source of the regexp, not necessarily the source as given in the source code. The statement `escape sequences are retained as is` refers to Regexp escape sequences, and the `\` in `/\//` and `%r/\//` is not a regexp escape sequence, but a lexer escape sequence (similar to `%/\//` or `%s/\//`).
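To make the equivalence above concrete, here is a minimal sketch that collects the source of each literal form next to the `Regexp.new` call it is described as matching. The method name `escaped_slash_sources` and the hash keys are made up for illustration, and the results assume a Ruby with the lexer behavior discussed in this thread.

```ruby
# Hypothetical helper (name chosen for this example only): returns the
# Regexp#source of each literal form discussed above, plus the two
# Regexp.new calls they are described as equivalent to.
def escaped_slash_sources
  {
    slash_literal: /\//.source,              # lexer drops the backslash
    percent_slash: %r/\//.source,            # same terminator, same result
    percent_brace: %r{\/}.source,            # "/" is not the terminator here
    new_plain:     Regexp.new('/').source,   # one character: /
    new_escaped:   Regexp.new('\/').source,  # two characters: \ and /
  }
end

escaped_slash_sources.each do |name, source|
  puts "#{name}: #{source.inspect}"
end
```

All five regexps still match a literal `/`; only the string reported by `Regexp#source` differs.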
This issue is not related to `/` specifically; it occurs for most terminators:

`%r,\,,.source # => ","`

Note that in cases where escaping would actually change regexp behavior, the lexer doesn't swallow the escape character:

`%r$\$$.source # => "\\$"`

It's fairly simple to remove this behavior from the lexer just by deleting code:

```diff
diff --git a/parse.y b/parse.y
index 167f064b31..523d5a85b3 100644
--- a/parse.y
+++ b/parse.y
@@ -7130,19 +7130,6 @@ tokadd_mbchar(struct parser_params *p, int c)
     return c;
 }
 
-static inline int
-simple_re_meta(int c)
-{
-    switch (c) {
-      case '$': case '*': case '+': case '.':
-      case '?': case '^': case '|':
-      case ')': case ']': case '}': case '>':
-        return TRUE;
-      default:
-        return FALSE;
-    }
-}
-
 static int
 parser_update_heredoc_indent(struct parser_params *p, int c)
 {
@@ -7277,10 +7264,6 @@ tokadd_string(struct parser_params *p,
             }
         }
 
-        if (c == term && !simple_re_meta(c)) {
-            tokadd(p, c);
-            continue;
-        }
         pushback(p, c);
         if ((c = tokadd_escape(p, enc)) < 0)
             return -1;
```

However, it breaks 3 tests in `test_regexp.rb`: `test_source_unescaped`, `test_source`, and `test_equal`. It also breaks a couple of specs, `Literal Regexps supports escaping characters when used as a terminator` and `Regexp#source will remove escape characters`.

Since the current behavior is clearly by design in the tests and specs, I can safely conclude this is not a bug, or at most, it is a minor documentation bug (I'll update the documentation). Switching to feature request.

I'll add this to the list of tickets to review at the next developer meeting, since while I'm not in favor of making the change, I do think this issue warrants discussion.
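The terminator rule described in this message can be sketched in Ruby as a rough model of the `simple_re_meta` check from the diff. The names `SIMPLE_RE_META`, `predicted_source`, and `OBSERVED` are invented for this example and are not part of Ruby or of `parse.y`; the model covers only a regexp literal whose body is a single escaped terminator.

```ruby
# Characters for which the C helper simple_re_meta in the diff returns TRUE.
SIMPLE_RE_META = ['$', '*', '+', '.', '?', '^', '|', ')', ']', '}', '>'].freeze

# Hypothetical model: predicts Regexp#source for a literal whose body is
# just the escaped terminator, such as %r,\,, or %r$\$$.
# The backslash is kept only when the terminator is a regexp metacharacter.
def predicted_source(term)
  SIMPLE_RE_META.include?(term) ? "\\#{term}" : term
end

# Sources actually produced by the lexer for a few terminators.
OBSERVED = {
  ',' => %r,\,,.source,
  '!' => %r!\!!.source,
  '$' => %r$\$$.source,
  '|' => %r|\||.source,
}.freeze

OBSERVED.each do |term, source|
  puts "#{term}: observed #{source.inspect}, predicted #{predicted_source(term).inspect}"
end
```

With the lexer code in place, the observed and predicted values should agree; with the diff applied, the backslash would presumably be retained for every terminator.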
----------------------------------------
Feature #18838: Avoid swallowing regexp escapes in the lexer
https://bugs.ruby-lang.org/issues/18838#change-98139

* Author: andrykonchin (Andrew Konchin)
* Status: Open
* Priority: Normal
----------------------------------------
According to `Regexp#source` documentation:

```
Returns the original string of the pattern.

    /ab+c/ix.source #=> "ab+c"

Note that escape sequences are retained as is.

    /\x20\+/.source #=> "\\x20\\+"
```

It works well, but an escaped slash (`\/`) is processed differently by different regexp literal forms.

Examples:

```ruby
/\//.source # => "/"
%r/\//.source # => "/"
%r{\/}.source # => "\\/"
```

Expected result: the result is the same in all cases. Moreover, as the documentation states, `escape sequences are retained as is`, so I would say that only `%r{}` works properly.

The issue was reported here https://github.com/oracle/truffleruby/issues/2569.

--
https://bugs.ruby-lang.org/