Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

From: Tim Düsterhus
Date: Sun, 30 Mar 2025 12:42:33 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals

Hi

On 2025-03-27 23:49, Ignace Nyamagana Butera wrote:
> Hi Máté,
>
> > [...] for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example "https", the authority is empty, and the path is "example.com". This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?).
>
> This is not correct according to RFC3986 https://datatracker.ietf.org/doc/html/rfc3986#section-3
>
> *When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*
>
> So in your example it should throw a Uri\InvalidUriException 🙂 for RFC3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes. This is also one of the many reasons why, at least for RFC3986, the path component can never be null, but that's another discussion.
>
> Like I said, having a fromComponents named constructor would allow the "removal" of the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme, but I can understand it being left out of the current implementation; it might be brought back for future improvements.

I just tested this with the implementation and it also appears to not yet be correct:
    var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"
and
    var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
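
For illustration, the section 3 rule quoted above could be checked before the components are recomposed per section 5.3, roughly as follows. This is only a minimal sketch in plain PHP, not the implementation's code: the function name recomposeUri() and its signature are hypothetical, a null authority is assumed to mean "not present", and only the path/authority rule is checked (no per-component validation).

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of the proposed Uri\Rfc3986\Uri API.
// Enforces the path/authority rule from RFC 3986 section 3, then
// recomposes the components as described in section 5.3.
function recomposeUri(
    ?string $scheme,
    ?string $authority, // assumption: null means "authority not present"
    string $path,
    ?string $query = null,
    ?string $fragment = null
): string {
    if ($authority !== null) {
        // Authority present: the path must be empty or begin with "/".
        if ($path !== '' && $path[0] !== '/') {
            throw new InvalidArgumentException(
                'Path must be empty or begin with "/" when an authority is present'
            );
        }
    } elseif (str_starts_with($path, '//')) {
        // Authority absent: a leading "//" would be re-parsed as an authority.
        throw new InvalidArgumentException(
            'Path must not begin with "//" when no authority is present'
        );
    }

    $uri = '';
    if ($scheme !== null) {
        $uri .= $scheme . ':';
    }
    if ($authority !== null) {
        $uri .= '//' . $authority;
    }
    $uri .= $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }

    return $uri;
}
```

Under these assumptions, recomposeUri('https', null, 'example.com') returns "https:example.com" with no host, whereas recomposeUri('https', '', 'example.com') and recomposeUri(null, null, '//foo/bar') both throw instead of turning part of the path into a host.
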
Best regards
Tim Düsterhus
