Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

From: Tim Düsterhus
Date: Tue, 15 Apr 2025 14:20:52 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals
Hi

On 2025-04-13 14:10, Máté Kocsis wrote:
     namespace Uri {
         class InvalidUriException extends \Uri\UriException
         {
         }
     }
     namespace Uri\WhatWg {
         class InvalidUrlException extends \Uri\InvalidUriException {
             /** @var list<UrlValidationError> */
             public readonly array $errors;
         }
     }
(Note the use of Url in the name of the sub-exception.) While this would result in a little more boilerplate, it would make static analysis tools more useful, since the $errors array could be properly typed instead of being just array<mixed>.
OK, this makes sense to me, and I've just implemented it.
Great. Don't forget to adjust the RFC text (that's the more important part :-)).
Lastly, when I changed the RFC so that only those characters were percent-decoded which were "URL code points", I didn't notice that the example you referred to above would become outdated: as "/" is a URL code point, it's currently percent-decoded by getPath(). Unfortunately, I still don't know what the best approach would be.
I see, thank you. I did some tests myself and read the spec. I've also checked https://github.com/whatwg/url/issues/565. Perhaps the correct solution would be to offer only the non-raw methods for WHATWG URL and to not attempt any additional percent-decoding there? My reasoning is that the WHATWG URL is a living standard anyway, so trying to add additional semantics on top will result in sadness. My understanding is also that it is primarily intended for interaction with web browsers or for embedding these URLs into HTML. For access control, e.g. in your framework, the RFC 3986 URI should be used. It's what HTTP uses internally and it supports well-defined normalization. What do you think?
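As a point of reference for the "no additional percent-decoding" idea, here is a minimal sketch using the WHATWG URL class that ships with Node.js and browsers (not the proposed PHP API). The example URL is made up for illustration. It shows that the getter returns percent-encoded bytes untouched, so any decoding is a separate step the caller opts into. By contrast, RFC 3986 normalization (section 6.2.2.2) would decode the unreserved `%41` and `%7E` while keeping `%2F`.

```javascript
// Node.js or any browser: URL is the WHATWG URL class.
const url = new URL('https://example.com/foo%2Fbar/%41%7Ebc');

// Percent-encoded bytes are returned exactly as they were written.
const path = url.pathname;
console.log(path); // /foo%2Fbar/%41%7Ebc

// Decoding is an explicit, separate step chosen by the caller.
const decoded = decodeURIComponent(path);
console.log(decoded); // /foo/bar/A~bc
```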
Please also give an explicit example for %3F in a path. I know that it is reserved from reading RFC 3986, but I think it's a little unintuitive. You can adjust the last example in the component retrieval section to make it show all cases. So:
     $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
     echo $uri->getHost();                           // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getRawHost();                        // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getPath();                           // /foo/bar%3Fbaz
     echo $uri->getRawPath();                        // /foo/bar%3Fbaz
     echo $uri->getQuery();                          // foo=bar%26baz%3Dqux
     echo $uri->getRawQuery();                       // foo=bar%26baz%3Dqux
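For comparison only, the same input can be fed to the WHATWG URL class in Node.js. This is a sketch of the WHATWG side, not of Uri\Rfc3986\Uri, whose expected output is the listing above. It illustrates why `%3F`, `%26` and `%3D` have to stay encoded in the components: decoding them would change where the path ends and how the query splits into pairs.

```javascript
const url = new URL(
  'https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux'
);

// WHATWG lowercases and compresses the IPv6 address.
console.log(url.hostname); // [2001:db8:1::ab9:c0a8:102]

// The encoded "?" stays part of the path and does not start the query.
console.log(url.pathname); // /foo/bar%3Fbaz
console.log(url.search);   // ?foo=bar%26baz%3Dqux

// Only after the query is split into pairs are the values decoded.
const keys = [...url.searchParams.keys()];
console.log(keys);                        // [ 'foo' ]
console.log(url.searchParams.get('foo')); // bar&baz=qux
```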
Why is this behavior unintuitive? I think the already added examples should
Unintuitive probably is not the best word. But I expect users to primarily interact with the path component of a URL (e.g. within their framework’s router). So I think it makes sense to be extra explicit with examples there. As an example, I recently learned that Symfony's router does not support (encoded) slashes within a component:
    #[Route('/test/{message}', name: 'test')]
will work for http://localhost:8000/test/foo, but not for http://localhost:8000/test/foo%2fbar, resulting in:
    No route found for "GET http://localhost:8000/test/foo%2fbar"
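To make the pitfall concrete, here is a minimal JavaScript sketch of a path matcher. It is not Symfony's implementation, and `matchRoute` is a hypothetical helper invented for this illustration. It assumes the router receives the raw path, splits it on "/" first and decodes each segment afterwards. Decoding the whole path before splitting turns `%2f` into an extra segment and the route no longer matches.

```javascript
// Hypothetical helper: matches a raw path against a pattern such as
// '/test/{message}' and returns the decoded parameters, or null.
function matchRoute(pattern, rawPath) {
  const patternParts = pattern.split('/');
  const pathParts = rawPath.split('/'); // split BEFORE decoding
  if (patternParts.length !== pathParts.length) {
    return null;
  }
  const params = {};
  for (let i = 0; i < patternParts.length; i++) {
    const placeholder = /^\{(\w+)\}$/.exec(patternParts[i]);
    if (placeholder) {
      // Decode each segment on its own, so an encoded slash stays inside it.
      params[placeholder[1]] = decodeURIComponent(pathParts[i]);
    } else if (patternParts[i] !== pathParts[i]) {
      return null;
    }
  }
  return params;
}

const rawPath = new URL('http://localhost:8000/test/foo%2fbar').pathname;

// Splitting the raw path keeps the encoded slash inside one segment.
const matched = matchRoute('/test/{message}', rawPath);
console.log(matched); // { message: 'foo/bar' }

// Decoding first yields '/test/foo/bar', which has one segment too many.
const decodedFirst = matchRoute('/test/{message}', decodeURIComponent(rawPath));
console.log(decodedFirst); // null
```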
So if you would just extend the: “Let's have a look at some other tricky example with Uri\Rfc3986\Uri:” to my suggestion, I would be happy :-) Note: I believe there is a small mistake in the example when you last modified it. It says:
    echo $uri->getHost();                           // [2001:0db8:0001:0000:0000:0ab9:C0a8:0102]
Should the 'C' in 'C0a8' also be lowercased?
In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode ? and # as necessary. Will the same happen for Rfc3986? Will the encoding of # also happen for the query-string component? The RFC only mentions the path component.
I think the question for RFC 3986 is answered in the PHP RFC by the following paragraph:
In order to offer consistent behavior with the parsing rules of RFC 3986, withers of Uri\Rfc3986\Uri also only accept properly formatted input, meaning characters that are not allowed to be present in a component must be percent-encoded. Let's see what this means in practice through the following example
Yes, thank you for pointing that out.
Effectively, RFC 3986 has different behavior than what WHATWG does.
Understood, makes sense.
The latter question ("Will the encoding of # also happen for the query-string component?") was supposed to be answered by the RFC, because of this sentence:
WHATWG algorithm automatically percent-encodes characters that fall into the percent-encoding character set of the given component
It may be possible that "the given" part is misleading, but the behavior actually follows the WHATWG spec for all components. In any case, I change a few words to make this clear.
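The per-component encoding can be observed with the WHATWG URL setters in Node.js. This is a small sketch of the spec behavior, not of the proposed PHP withers, and the input strings are made up. Each setter applies the percent-encode set of its own component, so "#" is encoded in the query as well as in the path.

```javascript
const u = new URL('https://example.com/');

// Path setter: "?" and "#" are in the path percent-encode set.
u.pathname = '/a?b#c';
console.log(u.pathname); // /a%3Fb%23c

// Query setter: "#" is in the query percent-encode set, "=" is not.
u.search = 'x=1#y';
console.log(u.search); // ?x=1%23y

// Fragment setter: the space is in the fragment percent-encode set.
u.hash = 'frag ment';
console.log(u.hash); // #frag%20ment

console.log(u.href); // https://example.com/a%3Fb%23c?x=1%23y#frag%20ment
```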
Yes, that makes sense. It's also explained in the “Percent-encoding & decoding” subsection of the “Important concepts” section, but I already forgot about that when I got down to the “Component recomposition” bit. My mistake! :-)
I haven't completely implemented withers yet for RFC 3986 (first and foremost validation is missing), so that's why you experienced this behavior. I would fix this later, but only if the vote succeeds. I've already worked a lot on the implementation without having any promise of the RFC to succeed.
Understood.
My expectation would be [2001:db8:0:0:0:0:0:1] for Rfc3986 and [2001:db8::1] for WhatWg. I have also tested the behavior of withHost() when leaving out the square brackets. Rfc3986 correctly throws an Exception, but WhatWg silently does nothing:
     $url = 'https://example.com/foo/path';
     var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
results in
     string(28) "https://example.com/foo/path"
This looks like the result of WHATWG's host setter algorithm (https://url.spec.whatwg.org/#dom-url-hostname). After debugging the behavior, I noticed that "new Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')" only fails when trying to parse the port after the first ":" character. However, the setter algorithm obviously doesn't reach this point, since it only tries to parse the host, and then it stops (because of the state override). So I'm not sure this gotcha can be cured. I tried to reproduce the problem in Chrome, but I realized that the URL properties are not validated at all when they are set ('url.hostname = "2001:db8:0:0:0:0:0:1";' will change the hostname no problem)...
I just tested it with node.js:
    URL {
      href: 'https://example.com/foo/path',
      origin: 'https://example.com',
      protocol: 'https:',
      username: '',
      password: '',
      host: 'example.com',
      hostname: 'example.com',
      port: '',
      pathname: '/foo/path',
      search: '',
      searchParams: URLSearchParams {},
      hash: ''
    }
u.hostname = '2001:db8:0:0:0:0:0:1'
    '2001:db8:0:0:0:0:0:1'
u
    URL {
      href: 'https://example.com/foo/path',
      origin: 'https://example.com',
      protocol: 'https:',
      username: '',
      password: '',
      host: 'example.com',
      hostname: 'example.com',
      port: '',
      pathname: '/foo/path',
      search: '',
      searchParams: URLSearchParams {},
      hash: ''
    }
u.toString()
    'https://example.com/foo/path'
u.hostname = '[2001:db8:0:0:0:0:0:1]'
    '[2001:db8:0:0:0:0:0:1]'
u
    URL {
      href: 'https://[2001:db8::1]/foo/path',
      origin: 'https://[2001:db8::1]',
      protocol: 'https:',
      username: '',
      password: '',
      host: '[2001:db8::1]',
      hostname: '[2001:db8::1]',
      port: '',
      pathname: '/foo/path',
      search: '',
      searchParams: URLSearchParams {},
      hash: ''
    }
u.toString()
    'https://[2001:db8::1]/foo/path'
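A caller who prefers an error over the silent no-op shown in the session above could wrap the setter. The following is only a sketch: `withHostnameStrict` is a hypothetical helper, not part of Node.js or the WHATWG spec, and it assumes an authority-based scheme such as https. It parses the candidate host inside a throwaway URL first, then checks that the setter really produced that host.

```javascript
// Hypothetical helper, not part of any library.
function withHostnameStrict(url, hostname) {
  // Parsing the candidate inside a full URL throws a TypeError when it is
  // not a valid host on its own (e.g. an unbracketed IPv6 address).
  const expected = new URL(`${url.protocol}//${hostname}/`).hostname;

  const copy = new URL(url.href); // leave the original untouched
  copy.hostname = hostname;       // may be silently ignored by the setter

  if (copy.hostname !== expected) {
    throw new TypeError(`hostname setter ignored ${JSON.stringify(hostname)}`);
  }
  return copy;
}

const base = new URL('https://example.com/foo/path');

const ok = withHostnameStrict(base, '[2001:db8:0:0:0:0:0:1]');
console.log(ok.href); // https://[2001:db8::1]/foo/path

let rejected = false;
try {
  withHostnameStrict(base, '2001:db8:0:0:0:0:0:1');
} catch (e) {
  rejected = e instanceof TypeError;
}
console.log(rejected); // true
```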
So it indeed seems to be a limitation of the WHATWG specification, and your PHP implementation is consistent with node.js. That is a good thing, and when a user stumbles upon this, we can point them towards node.js / the spec. Not great, but this is workable!

Best regards
Tim Düsterhus
