We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. do the same as web browsers) are the main things.
It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what https://github.com/scrapy/w3lib does, which Scrapy currently uses. But it's still not quite compliant.
Also, surprisingly, on some crawls URL parsing can take an amount of CPU time comparable to HTML parsing.
can_ada dev here. Scrapy is a fantastic project, we used it extensively at 360pi (now Numerator), making trillions of requests. Let me know if I can help :)
I wasn't aware of this. I'd seen those URLs before, but only for Chinese domains, and thought it was Chinese-specific.
It's interesting because I just went down an apparent rabbit hole implementing byte-level encoding for using language models with Unicode. There, each byte in a Unicode character is mapped to a printable character somewhere in the range 255 < ord(x) < 511 (I don't remember the exact upper bound, but the point is that each byte is mapped to another printable Unicode character).
To expand on the sibling comments: This encoding (called Punycode) works by combining the character to encode (é) and the position the character should be in (the 7th position out of a possible 7) into a single number. é is 233, there are 7 possible positions, and it is in position 6 (0-indexed) so that single number is 233 * 7 + 6 = 1637. This is then encoded via a fairly complex variable-length encoding scheme into the letters "fsa".
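For anyone who wants to poke at this, Python's standard library ships both a raw Punycode codec and an IDNA (2003) codec; a minimal sketch, using the well-worn "bücher" and "münchen" examples rather than anything from this thread:

```python
# Raw Punycode: the ASCII characters come first, then a delimiter, then a
# variable-length encoding of each non-ASCII character together with its position.
print("bücher".encode("punycode"))  # b'bcher-kva'

# The IDNA codec applies the same encoding per label and adds the ACE prefix,
# which is the form that actually ends up in DNS.
print("münchen".encode("idna"))     # b'xn--mnchen-3ya'
```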
I was wondering: can that clash with a "normal" domain registered as "xn--....."? Apparently there is another specific rule in RFC 5891 which says "The Unicode string MUST NOT contain "--" (two consecutive hyphens) in the third and fourth character positions" [0].
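The position check behind that rule is simple enough to sketch (labels are made up for illustration):

```python
def has_reserved_hyphens(label: str) -> bool:
    # RFC 5891 section 4.2.3.1: hyphens in the 3rd and 4th character positions
    # are disallowed for U-labels, which reserves the "xn--" pattern for A-labels.
    return len(label) >= 4 and label[2:4] == "--"

print(has_reserved_hyphens("xn--anything"))  # True: only real A-labels may look like this
print(has_reserved_hyphens("ab--cd"))        # True: would be rejected at registration
print(has_reserved_hyphens("a-b-c"))         # False
```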
Also, if I were forced to represent Unicode as ASCII, Punycode is not the obvious encoding to pick - it's pretty confusing. But I don't know much about how and why it was chosen, so I assume there's a good reason.
IDNs and Punycode were basically bolted-on extensions to DNS, added after DNS was already widely deployed. Because there was no "proper" extension mechanism available, it was a design requirement that they be implementable "on top" of standard DNS without changing any of the underlying components. So I think most of the DNS infrastructure can be (and is) still completely unaware that IDNs and Punycode exist.
Actually, I wonder what happens if you take a "normal" (i.e. non-IDN, ascii-only) domain and encode it as Punycode. Should the encoded and non-encoded domains be considered identical or separate? (for purposes of DNS resolutions, origin separation, etc)
Identical would be more intuitive and would match the behavior of domain names with non-ascii characters - on the other hand, this would require reworking of ALL non-punycode-aware DNS software, which I'm doubtful is possible.
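You can poke at what a conforming implementation does using the stdlib codecs (which implement IDNA 2003); a minimal sketch:

```python
# ToASCII leaves an all-ASCII label untouched (no "xn--" prefix is added), so a
# conforming IDNA implementation never produces an encoded twin of a plain
# ASCII name in the first place.
print("example".encode("idna"))      # b'example'

# The raw Punycode codec, by contrast, always appends its delimiter.
print("example".encode("punycode"))  # b'example-'
```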
IDNA 2003 is a particular annoyance of mine: the IDNA 2003 algorithm didn't encode the German 'ß' character, or rather encoded it wrongly, through overeager use of Unicode normalisation in the nameprep step. Then the browser makers stood still for a long time and didn't upgrade to IDNA 2008, which fixed that bug among other things. The WHATWG, in its self-appointed role as stenographer of the browser cartel, didn't change its weird URL spec. But that seems to have changed in recent years. Of course, the original sin of IDNA was making it client-side. :/
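The ß complaint is easy to reproduce, since Python's built-in codec implements exactly the IDNA 2003 behaviour described above:

```python
# IDNA 2003's nameprep step folds 'ß' to "ss" before encoding, so the label
# collapses into plain ASCII instead of getting its own "xn--" A-label.
print("straße".encode("idna"))  # b'strasse'

# The third-party "idna" package implements IDNA 2008, which keeps 'ß' distinct
# and produces a separate "xn--" A-label for the same input.
```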
Regarding the quality of Google search results - I copied this comment verbatim into GPT 3.5, Claude 1, and Mistral small (the lowest quality LLMs from each provider available through Kagi) and each one explained Punycode encoding.
In case anyone else is confused as to why the domain in the example provided needs to be unicode (compared to the filename which is obvious): it's because the hyphen is the shorter '‑' char, which is extended ASCII 226 not the standard '-' (which would be ASCII 45).
The first character you pasted is U+2011 (8209 in decimal); it does not appear in the document and cannot be ASCII, as it is beyond code point 127/0x7F. Also, U+2011 is meant to be a non-breaking hyphen.
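For reference, the two characters being discussed:

```python
import unicodedata

for ch in ("-", "\u2011"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+002D HYPHEN-MINUS        (the plain ASCII hyphen, decimal 45)
# U+2011 NON-BREAKING HYPHEN (decimal 8209, well outside ASCII)
```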
Honestly to hell with the WHATWG's weird pseudo-standard. Backslashes? Five forward slashes? No one is sending those URLs around, and if they are, they should fix it. The only thing that needs to deal with that kind of malformed URL is the browser's address bar. If browsers want to do fixups on broken URLs at that point, they should feel free to, but it shouldn't pollute an entire spec.
(It's not even a real standard -- it's a "living standard", which means it just changes randomly and there's no way to actually say "yes, you're compliant with that".)
It's always been extremely funny to me how arguments like this and from the curl author go. "Yes, I had to change curl away from strictly accepting two slashes, for web compatibility. But `while(slash) { advance_parser() }`? That's completely unreasonable! `while(slash && ++i <= 3)` is so much better, and works almost as well!" Ok, whatever...
I don't think there's anything new and interesting in that FAQ. It describes what they mean by a Living Standard, but it doesn't mean they're not running a treadmill operation.
They seem to be making reference to things like the RFC system, but those get updated too.
Pretty URLs are the most unnecessary thing ever invented: they don't provide anything that non-pretty URLs can't, and millions of parsers have to process them on every request. What a waste.
A URL is composed of <protocol>://<protocol-specific-params>
Boom, you have now parsed a URL.
Wanna parse HTTP query params into host, port, and resource? Speak appropriately and ask how to parse an HTTP request. Wanna parse a resource? Get into those semantics; be precise.
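Taken literally, that proposal is a one-line split; a minimal sketch (URL made up for illustration), which also shows how much is left unparsed afterwards:

```python
# Everything after "://" is still an opaque blob: userinfo, host, port, path,
# query and fragment all remain tangled together.
scheme, _, rest = "https://user@example.org:8080/a?b=c#d".partition("://")
print(scheme)  # https
print(rest)    # user@example.org:8080/a?b=c#d
```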
This is intriguing to me for the performance and correctness reasons, but also if it makes the result more dev-friendly than the urllib.parse tuple-ish-object result thing.
I posted this elsewhere in the thread, but there absolutely is prior art here. Check out Yarl (a urllib.parse wrapper with a nicer interface) and Hyperlink (green field, immutable OO style, focused on correctness). Both on PyPI for many years now.
Hard to imagine the tradeoff of using a third party binary library developed this year vs just using urllib.parse being worth it. Is this solving a real problem?
According to itself, it's solving the issue of parser-differential vulnerabilities: urllib.parse is ad hoc and pretty crummy, and the headliner function urlparse is literally the one you should not use under any circumstances; it follows RFC 1808 (maybe, anyway), which was deprecated by RFC 2396 25 years ago.
The odds that any other parser uses the same broken semantics are basically nil.
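A concrete illustration of such a differential (the URL is made up, and urllib stands in for "any non-WHATWG parser"): the WHATWG algorithm treats a backslash like a forward slash in http(s) URLs, so a browser sees the host below as example.com, while urllib keeps the backslash inside the authority and, since it splits userinfo at the last "@", reports a different host entirely:

```python
from urllib.parse import urlsplit

# The backslash is just another character in the netloc here, and everything
# before the last "@" is treated as userinfo.
print(urlsplit("http://example.com\\@evil.com/").hostname)  # evil.com
```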
I agree that the stdlib parser is a mess, but as an observation: replacing one use of it with a (better!) implementation introduces a potential parser differential where one didn’t exist before. I’ve seen this issue crop up multiple times in real Python codebases, where a well-intentioned developer adds a differential by incrementally replacing the old, bad implementation.
That’s the perverse nature of “wrong but ubiquitous” parsers: unless you’re confident that your replacement is complete, you can make the situation worse, not better.
> unless you’re confident that your replacement is complete
And that any 3rd party libs you use also don't ever call the stdlib parser internally because you do not want to debug why a URL works through some code paths but not others.
Turns out that url parsing is a cross-cutting concern like logging where libs should defer to the calling code's implementation but the Python devs couldn't have known that when this module was written.
It seems unlikely that this C++ library written by a solo dev is somehow more secure than the Python standard library would be for such a security-sensitive task.
Hi, can_ada (but not ada!) dev here. Ada is over 20k lines of well-tested and fuzzed source by 25+ developers, along with an accompanying research paper. It is the parser used in node.js and parses billions of URLs a day.
can_ada is simply 60 lines of glue and packaging that make it available to Python with low overhead.
Ah, that makes more sense -- it might be a good idea to integrate with the upstream library as a submodule rather than lifting the actual .cpp/.h files into the bindings repo. That way people know the upstream C++ code is from a much more active project.
Despite my snarky comments, thank you for contributing to the python ecosystem, this does seem like a cool project for high performance URL parsing!
I guess you are right that there are 2 commits from a different dev, so it is technically not a solo project. I still wouldn't ever use this in production code.
Ada was developed at the end of 2022 and has been included in Node.js since March 2023. Since then, Ada has powered Node.js, Cloudflare Workers, Redpanda, ClickHouse, and many other projects.
The URL parsing in httpx follows rfc3986, which is not the same as the WHATWG URL Living Standard.
rfc3986 may reject URLs that browsers accept, or handle them in a different way. The WHATWG URL Living Standard tries to put real browser behavior on paper, so it's a much better standard if you need to parse URLs extracted from real-world web pages.
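One toy example of how differently the two can treat the same string (URL made up; urllib is used here only as a convenient non-WHATWG parser to compare against): U+0020 SPACE is a forbidden host code point in the WHATWG algorithm, so a WHATWG parser fails on the URL below, while urllib's non-validating split reports a host containing a space:

```python
from urllib.parse import urlsplit

# No validation of the host: the space just rides along in the netloc.
print(urlsplit("http://exa mple.com/index.html").hostname)  # 'exa mple.com'
```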
No it doesn’t, absolutely not. It’s ironic that you say this after the post you’re commenting on spells out quite explicitly why things “in base” are hard to change and adapt.
Another post in this thread was downvoted and flagged (really?) for claiming that URL parsing isn't difficult. The linked article claims that "Parsing URLs correctly is surprisingly hard." As a software tester, I'm very willing to believe that, but I don't know that the article really made the case.
I did find a paper describing some vulnerabilities in popular URL parsing libraries, including urllib and urllib3. Blog post here:
I didn't look into this in detail at the time, but the report's summary of CVE-2021-45046 is that the parser that validated a URL behaved differently from a separate parser used to fetch the URL, so a URL like
jndi:ldap://127.0.0.1#.evilhost.com:1389/a
is validated as 127.0.0.1, which may be whitelisted, but fetched from evilhost.com, which probably isn't.
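The same class of differential is easy to demonstrate in Python (purely an illustration of the pattern; the jndi: prefix is dropped, and this is not the actual Java code path from that CVE):

```python
from urllib.parse import urlsplit

url = "ldap://127.0.0.1#.evilhost.com:1389/a"

# A fragment-aware parser stops the authority at "#"...
print(urlsplit(url).hostname)                   # 127.0.0.1

# ...while a naive host extraction does not, and sees a very different "host".
print(url.split("://", 1)[1].split("/", 1)[0])  # 127.0.0.1#.evilhost.com:1389
```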
I'd like to have 100 developers each write a URL parser, and see how many bugs per implementation we can find. I'd guess an average in the double digits.
I did write a URL parser (including converting relative URLs into absolute) in C (I also wrote a simple HTTP client, and other protocols). However, it is only intended for use with a limited set of URI schemes (including "hashed" and "jar", both of which are unusual in the way they are handled).
Now we can find the bugs in that one, and then, if other people mention theirs too and we find the bugs in those, we can see how accurate your guess is.
However, there are other considerations. For one thing, WHATWG is not the only specification of the working of URLs, so not everyone will comply anyways. And, some features might be necessary or not necessary in specific applications.
Writing a new parser in C++ is a mistake IMO. At the very least, you need to write a fuzzer. At best, you should be using one of the many memory safe languages available to you.
I retract my criticism if this project is just for fun.
Edit: downvoters, do you disagree?
Edit2: OK, I may have judged a bit prematurely. Ada itself has fuzzers and tests. They're just not exported to the can_ada project.
Of course there is always room for new projects, but it still feels weird to act as if this is the first time anybody has ever tried this. It seems like a lot of people are under this same mistaken impression, at least according to the sample of HN users who commented in this thread.
The article makes it sound like the only parser for URLs in the entire Python ecosystem is urllib.parse, regardless of which spec it supports. Hyperlink and Yarl are absolutely prior art here IMO, and at least deserve a mention in an article like this.
Ada / can_ada look very promising!