We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. do the same as web browsers) are the main things.
It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what https://github.com/scrapy/w3lib does, which Scrapy currently uses. But it's still not quite compliant.
Also, surprisingly, on some crawls URL parsing can take an amount of CPU time comparable to HTML parsing.
can_ada dev here. Scrapy is a fantastic project, we used it extensively at 360pi (now Numerator), making trillions of requests. Let me know if I can help :)
I wasn't aware of this. I'd seen those URLs before, but only for Chinese domains, and thought it was Chinese-specific.
It's interesting because I just went down an apparent rabbit hole implementing byte-level encoding for using language models with Unicode. There, each byte in a Unicode character is mapped to a printable character somewhere in the range 255 < ord(x) < 511 (I don't remember the exact upper bound, but the point is that each byte is mapped to another printable Unicode character).
To expand on the sibling comments: This encoding (called Punycode) works by combining the character to encode (é) and the position the character should be in (the 7th position out of a possible 7) into a single number. é is 233, there are 7 possible positions, and it is in position 6 (0-indexed) so that single number is 233 * 7 + 6 = 1637. This is then encoded via a fairly complex variable-length encoding scheme into the letters "fsa".
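For anyone who wants to poke at this, Python's standard library ships both a raw Punycode codec and an IDNA (2003) codec; a minimal sketch, using the well-worn "bücher" and "münchen" examples rather than anything from this thread:

```python
# Raw Punycode: the ASCII characters come first, then a delimiter, then a
# variable-length encoding of each non-ASCII character together with its position.
print("bücher".encode("punycode"))  # b'bcher-kva'

# The IDNA codec applies the same encoding per label and adds the ACE prefix,
# which is the form that actually ends up in DNS.
print("münchen".encode("idna"))     # b'xn--mnchen-3ya'
```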
I was wondering: can that clash with a "normal" domain registered as "xn--....."? Apparently there is another specific rule in RFC 5891 which says "The Unicode string MUST NOT contain "--" (two consecutive hyphens) in the third and fourth character positions" [0].
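The position check behind that rule is simple enough to sketch (labels are made up for illustration):

```python
def has_reserved_hyphens(label: str) -> bool:
    # RFC 5891 section 4.2.3.1: hyphens in the 3rd and 4th character positions
    # are disallowed for U-labels, which reserves the "xn--" pattern for A-labels.
    return len(label) >= 4 and label[2:4] == "--"

print(has_reserved_hyphens("xn--anything"))  # True: only real A-labels may look like this
print(has_reserved_hyphens("ab--cd"))        # True: would be rejected at registration
print(has_reserved_hyphens("a-b-c"))         # False
```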
Also, if I were forced to represent Unicode as ASCII, Punycode is not the obvious encoding to pick - it's pretty confusing. But I don't know much about how and why it was chosen, so I assume there's a good reason.
IDNs and Punycode were basically bolted-on extensions to DNS, added after DNS was already widely deployed. Because there was no "proper" extension mechanism available, it was a design requirement that they be implementable "on top" of standard DNS without changing any of the underlying components. So I think most of the DNS infrastructure can be (and is) still completely unaware that IDNs and Punycode exist.
Actually, I wonder what happens if you take a "normal" (i.e. non-IDN, ascii-only) domain and encode it as Punycode. Should the encoded and non-encoded domains be considered identical or separate? (for purposes of DNS resolutions, origin separation, etc)
Identical would be more intuitive and would match the behavior of domain names with non-ascii characters - on the other hand, this would require reworking of ALL non-punycode-aware DNS software, which I'm doubtful is possible.
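You can poke at what a conforming implementation does using the stdlib codecs (which implement IDNA 2003); a minimal sketch:

```python
# ToASCII leaves an all-ASCII label untouched (no "xn--" prefix is added), so a
# conforming IDNA implementation never produces an encoded twin of a plain
# ASCII name in the first place.
print("example".encode("idna"))      # b'example'

# The raw Punycode codec, by contrast, always appends its delimiter.
print("example".encode("punycode"))  # b'example-'
```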
IDNA 2003 is a particular annoyance of mine: the IDNA 2003 algorithm didn't encode the German 'ß' character, or rather encoded it wrongly, through overeager use of Unicode normalisation in the nameprep step. Then the browser makers stood still for a long time and didn't upgrade to IDNA 2008, which fixed that bug among other things. The WHATWG, in its self-appointed role as stenographer of the browser cartel, didn't change its weird URL spec. But that seems to have changed in recent years. Of course, the original sin of IDNA was making it client-side. :/
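The ß complaint is easy to reproduce, since Python's built-in codec implements exactly the IDNA 2003 behaviour described above:

```python
# IDNA 2003's nameprep step folds 'ß' to "ss" before encoding, so the label
# collapses into plain ASCII instead of getting its own "xn--" A-label.
print("straße".encode("idna"))  # b'strasse'

# The third-party "idna" package implements IDNA 2008, which keeps 'ß' distinct
# and produces a separate "xn--" A-label for the same input.
```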
Regarding the quality of Google search results - I copied this comment verbatim into GPT 3.5, Claude 1, and Mistral small (the lowest quality LLMs from each provider available through Kagi) and each one explained Punycode encoding.
In case anyone else is confused as to why the domain in the example provided needs to be unicode (compared to the filename which is obvious): it's because the hyphen is the shorter '‑' char, which is extended ASCII 226 not the standard '-' (which would be ASCII 45).
The first character you pasted is U+2011 (8209 in decimal); it does not appear in the document and cannot be ASCII, as it is beyond code point 127/0x7F. Also, U+2011 is meant to be a non-breaking hyphen.
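For reference, the two characters being discussed:

```python
import unicodedata

for ch in ("-", "\u2011"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+002D HYPHEN-MINUS        (the plain ASCII hyphen, decimal 45)
# U+2011 NON-BREAKING HYPHEN (decimal 8209, well outside ASCII)
```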
Honestly to hell with the WHATWG's weird pseudo-standard. Backslashes? Five forward slashes? No one is sending those URLs around, and if they are, they should fix it. The only thing that needs to deal with that kind of malformed URL is the browser's address bar. If browsers want to do fixups on broken URLs at that point, they should feel free to, but it shouldn't pollute an entire spec.
(It's not even a real standard -- it's a "living standard", which means it just changes randomly and there's no way to actually say "yes, you're compliant with that".)
It's always been extremely funny to me how arguments like this and from the curl author go. "Yes, I had to change curl away from strictly accepting two slashes, for web compatibility. But `while(slash) { advance_parser() }`? That's completely unreasonable! `while(slash && ++i <= 3)` is so much better, and works almost as well!" Ok, whatever...
I don't think there's anything new and interesting in that FAQ. It describes what they mean by a Living Standard, but it doesn't mean they're not running a treadmill operation.
They seem to be making reference to things like the RFC system, but those get updated too.
Pretty URLs are the most unnecessary thing ever invented: they don't provide anything that non-pretty URLs can't, and millions of parsers have to process them on every request. What a waste.
A URL is composed of <protocol>://<protocol-specific-params>
Boom, you have now parsed a URL.
Wanna parse HTTP query params into host, port, and resource? Speak appropriately and ask how to parse an HTTP request. Wanna parse a resource? Get into those semantics; be precise.
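Taken literally, that proposal is a one-line split; a minimal sketch (URL made up for illustration), which also shows how much is left unparsed afterwards:

```python
# Everything after "://" is still an opaque blob: userinfo, host, port, path,
# query and fragment all remain tangled together.
scheme, _, rest = "https://user@example.org:8080/a?b=c#d".partition("://")
print(scheme)  # https
print(rest)    # user@example.org:8080/a?b=c#d
```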
This is intriguing to me for the performance and correctness reasons, but also if it makes the result more dev-friendly than the urllib.parse tuple-ish-object result thing.
I posted this elsewhere in the thread, but there absolutely is prior art here. Check out Yarl (a urllib.parse wrapper with a nicer interface) and Hyperlink (green field, immutable OO style, focused on correctness). Both on PyPI for many years now.
Hard to imagine the tradeoff of using a third party binary library developed this year vs just using urllib.parse being worth it. Is this solving a real problem?
According to itself, it's solving the issue of parser-differential vulnerabilities: urllib.parse is ad hoc and pretty crummy, and the headliner function urlparse is literally the one you should not use under any circumstances; it follows RFC 1808 (maybe, anyway), which was deprecated by RFC 2396 25 years ago.
The odds that any other parser uses the same broken semantics are basically nil.
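A concrete illustration of such a differential (the URL is made up, and urllib stands in for "any non-WHATWG parser"): the WHATWG algorithm treats a backslash like a forward slash in http(s) URLs, so a browser sees the host below as example.com, while urllib keeps the backslash inside the authority and, since it splits userinfo at the last "@", reports a different host entirely:

```python
from urllib.parse import urlsplit

# The backslash is just another character in the netloc here, and everything
# before the last "@" is treated as userinfo.
print(urlsplit("http://example.com\\@evil.com/").hostname)  # evil.com
```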
I agree that the stdlib parser is a mess, but as an observation: replacing one use of it with a (better!) implementation introduces a potential parser differential where one didn’t exist before. I’ve seen this issue crop up multiple times in real Python codebases, where a well-intentioned developer adds a differential by incrementally replacing the old, bad implementation.
That’s the perverse nature of “wrong but ubiquitous” parsers: unless you’re confident that your replacement is complete, you can make the situation worse, not better.
> unless you’re confident that your replacement is complete
And that any 3rd party libs you use also don't ever call the stdlib parser internally because you do not want to debug why a URL works through some code paths but not others.
Turns out that url parsing is a cross-cutting concern like logging where libs should defer to the calling code's implementation but the Python devs couldn't have known that when this module was written.
It seems unlikely that this C++ library written by a solo dev is somehow more secure than the Python standard library would be for such a security-sensitive task.
Hi, can_ada (but not ada!) dev here. Ada is over 20k lines of well-tested and fuzzed source by 25+ developers, along with an accompanying research paper. It is the parser used in node.js and parses billions of URLs a day.
can_ada is simply 60 lines of glue and packaging that make it available to Python with low overhead.
Ah, that makes more sense -- it might be a good idea to integrate with the upstream library as a submodule rather than lifting the actual .cpp/.h files into the bindings repo. That way people know the upstream C++ code is from a much more active project.
Despite my snarky comments, thank you for contributing to the python ecosystem, this does seem like a cool project for high performance URL parsing!
I guess you are right that there are 2 commits from a different dev, so it is technically not a solo project. I still wouldn't ever use this in production code.
Ada was developed at the end of 2022 and has been included in Node.js since March 2023. Since then, Ada has powered Node.js, Cloudflare Workers, Redpanda, ClickHouse, and many other projects.
The URL parsing in httpx follows rfc3986, which is not the same as the WHATWG URL Living Standard.
rfc3986 may reject URLs that browsers accept, or handle them in a different way. The WHATWG URL Living Standard tries to put real browser behavior on paper, so it's a much better standard if you need to parse URLs extracted from real-world web pages.
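One toy example of how differently the two can treat the same string (URL made up; urllib is used here only as a convenient non-WHATWG parser to compare against): U+0020 SPACE is a forbidden host code point in the WHATWG algorithm, so a WHATWG parser fails on the URL below, while urllib's non-validating split reports a host containing a space:

```python
from urllib.parse import urlsplit

# No validation of the host: the space just rides along in the netloc.
print(urlsplit("http://exa mple.com/index.html").hostname)  # 'exa mple.com'
```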
No it doesn’t, absolutely not. It’s ironic that you say this after the post you’re commenting on spells out quite explicitly why things “in base” are hard to change and adapt.
Another post in this thread was downvoted and flagged (really?) for claiming that URL parsing isn't difficult. The linked article claims that "Parsing URLs correctly is surprisingly hard." As a software tester, I'm very willing to believe that, but I don't know that the article really made the case.
I did find a paper describing some vulnerabilities in popular URL parsing libraries, including urllib and urllib3. Blog post here:
I didn't look into this in detail at the time, but the report's summary of CVE-2021-45046 is that the parser that validated a URL behaved differently from a separate parser used to fetch the URL, so a URL like
jndi:ldap://127.0.0.1#.evilhost.com:1389/a
is validated as 127.0.0.1, which may be whitelisted, but fetched from evilhost.com, which probably isn't.
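The same class of differential is easy to demonstrate in Python (purely an illustration of the pattern; the jndi: prefix is dropped, and this is not the actual Java code path from that CVE):

```python
from urllib.parse import urlsplit

url = "ldap://127.0.0.1#.evilhost.com:1389/a"

# A fragment-aware parser stops the authority at "#"...
print(urlsplit(url).hostname)                   # 127.0.0.1

# ...while a naive host extraction does not, and sees a very different "host".
print(url.split("://", 1)[1].split("/", 1)[0])  # 127.0.0.1#.evilhost.com:1389
```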
I'd like to have 100 developers each write a URL parser, and see how many bugs per implementation we can find. I'd guess an average in the double digits.
I did write a URL parser (including converting relative URLs into absolute) in C (I also wrote a simple HTTP client, and other protocols). However, it is only intended for use with a limited set of URI schemes (including "hashed" and "jar", both of which are unusual in the way they are handled).
Now we can find the bugs in that one, and then, if other people mention theirs too and we find the bugs in those, we can see how accurate your guess is.
However, there are other considerations. For one thing, WHATWG is not the only specification of the working of URLs, so not everyone will comply anyways. And, some features might be necessary or not necessary in specific applications.
Writing a new parser in C++ is a mistake IMO. At the very least, you need to write a fuzzer. At best, you should be using one of the many memory safe languages available to you.
I retract my criticism if this project is just for fun.
Edit: downvoters, do you disagree?
Edit2: OK, I may have judged a bit prematurely. Ada itself has fuzzers and tests. They're just not exported to the can_ada project.
Of course there is always room for new projects, but it still feels weird to act as if this is the first time anybody has ever tried this. It seems like a lot of people are under this same mistaken impression, at least according to the sample of HN users who commented in this thread.
The article makes it sound like the only parser for URLs in the entire Python ecosystem is urllib.parse, regardless of which spec it supports. Hyperlink and Yarl are absolutely prior art here IMO, and at least deserve a mention in an article like this.
Ada / can_ada look very promising!