[dupe] Phishing with Unicode Domains (xn--80ak6aa92e.com)
65 points by tvvocold on April 20, 2017 | 37 comments




Gosh that’s old. The original paper was from 2001, the Shmoo group wrote about it in 2005 - https://blogs.oracle.com/yakshaving/entry/so_not_funny_shmoo... - and Joi Ito and I were able to register Veriѕign.com then, to highlight how their greedy mismanagement of .com made this possible.

Three possible solutions, not mutually exclusive:

* Make the browsers catch it - Chrome just shows what looks like apple.com here, and shouldn't.

* Whitelist, at the registry level, which character sets are allowed to be mixed - it should only be possible to mix Cyrillic-Latin homoglyphs with Cyrillic non-homoglyphs (rough sketch after this list).

* Don’t allow IDNs on gTLDs - if you want “écriture”, get écriture.fr, not .com
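
A rough sketch of that registry-level check in Python - the homoglyph set below is a small illustrative subset, not the full Unicode confusables list:

    import unicodedata

    # Illustrative subset of Cyrillic letters that render nearly
    # identically to Latin ones (the real confusables list is longer).
    CYRILLIC_HOMOGLYPHS = set("аеорсѕхуіјӏ")

    def scripts_in(label):
        # Crude script detection via the character's Unicode name.
        return {unicodedata.name(ch).split()[0]
                for ch in label if ch.isalpha()}

    def registry_should_reject(label):
        # Reject any script mixing outright, and reject all-Cyrillic
        # labels built entirely from Latin look-alikes (the apple trick).
        if len(scripts_in(label)) > 1:
            return True
        letters = [ch for ch in label if ch.isalpha()]
        return bool(letters) and all(ch in CYRILLIC_HOMOGLYPHS
                                     for ch in letters)

    print(registry_should_reject("аррӏе"))  # True - all Latin look-alikes
    print(registry_should_reject("борщ"))   # False - plainly Cyrillic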

Obviously, the registries have a conflict of interest here, and won't let the second and third happen on .com because they would cut into Verisign's revenue.

See also https://en.wikipedia.org/wiki/IDN_homograph_attack


I'm not quite sure I understand your second suggestion. Isn't that what the article claims to circumvent? I read that as 'Firefox/Chrome already do not render these characters, unless they're all from the same subset' - and the 'apple.com' here is presented as completely Cyrillic (I don't know that alphabet and might've missed your point or misread the article).


As far as I understand it, the actual solution is that registries aren't supposed to let you buy a domain that looks like an existing domain. But that's broken?

And this domain isn't made of mixed scripts - it's entirely Cyrillic.
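
You can check this in Python - the punycode decodes to a label made entirely of Cyrillic characters:

    import unicodedata

    label = b"xn--80ak6aa92e".decode("idna")
    print(label)  # renders like "apple"
    for ch in label:
        # every name starts with CYRILLIC - no script mixing at all
        print(ch, unicodedata.name(ch))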


Update your Chrome


There are no updates to Chrome for iPad in the App Store, and it looks like this: http://i.imgur.com/JjR5evp.jpg


Do Unicode URLs actually provide any real value? Every web user must already be used to typing Latin characters, because so many major websites use them, so nobody would be excluded by that. Whereas any non-Latin character is going to be nearly impossible for most of the world to enter.

Chinese is a particularly hard case: most old people can't type the characters even though they can type Latin letters. That's because you have to deliberately invest time to learn an input method, a non-trivial endeavor that takes weeks of effort, and old people just aren't going to go back to school for that.


> Do unicode URLs actually provide any real value?

Yes. Not everybody speaks English.

>Every web user must be already used to typing Latin characters because so many major websites use them.

s/web user/existing web user/

Unicode domains are one more piece required for the net to be as inclusive as possible.


On the other hand, Unicode domains may lead to balkanization of the net. How would you even type in something like борщ.рф (and before you ask, you can easily translate its contents using Google Translate after entering the URL)? And everyone, just everyone in Russia is already capable of typing in stuff in ASCII. So the upside is small and diminishing (more people learn English over time and that's a beautiful thing), and the downside is the reversal of the unification effect that the Internet had. I'm pretty sure it's not an obvious choice.

I should also add that the general attitude of "not everybody speaks English so we should adapt our tech to reduce the need for English" seems to imply a privilege of already knowing English. It is true that not everyone speaks English at this moment, but the right solution would be to teach everyone English as it expands horizons immensely, not to balkanize the world. Languages are not equal and English is the single most useful one. One can argue that e.g. Russian is just as good as English, but it's just not true. The amount of information available in English is immeasurably higher than in any other national language, and one needs to have had the privilege of knowing English for some time (or to be a native speaker) to be able to forget that fact.


> How would you even type in something like борщ.рф

By clicking on the link in my search engine. Though of course, either that page is in a language I can easily type (so why then use that URL?), or otherwise I would not have searched for that term to begin with, whether I typed it in the URL bar or in the search field.

If I'm clicking a link on a page in a script I can read, then the point is moot too.

> (more people learn English over time and that's a beautiful thing)

I don't know. Giving access to people who don't (yet) speak English seems to me a nobler goal than forcing people to learn English and the Latin script.

If they want to learn English, that's fine. But forcing them to is exclusionary.

Yes, there's a lot more content available in English, and in the end that's what made me learn it (honestly, the sole reason I started to learn English was to be able to play the talkie version of "Indiana Jones and the Fate of Atlantis"), but this was my decision. I wasn't forced into it.


Yes, knowing English is great. Everyone should learn English.

But declaring English the lingua franca means that people from other cultures will be at a perennial disadvantage vis-à-vis native speakers, or that their native languages will be neglected.

There's an obvious benefit to everyone speaking English, but there's also a non-obvious cost to the homogenisation of cultures. Already there are languages dying (and dead), and every language that dies takes with it a potentially different and meaningful way to look at the world.

In any case, I'd leave such decisions to those actually affected by them. People tend to react in surprising ways to other cultures telling them what they're worth (cf. "Balkanisation").


Yes. It's very odd to need to adjust your own language to fit into ASCII.

I can easily continue replacing "þ" and "ð" with "th" and continue removing diacritics, but it feels like being robbed of an aspect of my language.


Maybe just limit e.g. Danish characters to .dk, Icelandic characters to .is, Chinese characters to .cn, etc., and leave .com ASCII?


Google is going to do just that in Chrome. Of course, making this change means that some existing domains will stop displaying correctly, which may or may not be OK with their owners.

The other option would be to do this based on the browser's current language preference configuration. But that means that unless your domain is in ASCII, you can never be sure how it's going to be rendered in your customers' browsers, which would make IDN domains second-class compared to ASCII domains.


> A particularly terrible language is Chinese where most old people can't type the characters even though they can type Latin letters. That's because you have to deliberately invest time to sit down and learn an input method which is a non-trivial endeavor that takes weeks of effort and old people just aren't going to go back to school for that.

Is that your personal experience? Most input methods I know work by (1) knowing the Pinyin romanization of the word you want to type, (2) typing it in, and (3) selecting the appropriate characters from a long list of candidates.

If they know the characters and Latin letters, then the only roadblock I can think of would be not knowing Pinyin. That shouldn't take weeks to learn if you already know Chinese, it's more like learning a very simple alphabet.
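
For illustration, the core of such an input method is just a lookup from romanized syllables to candidate characters - this is a toy sketch with a tiny, hypothetical dictionary (real IMEs add frequency ranking, multi-syllable prediction, and so on):

    # Toy Pinyin input method: type a syllable, pick a candidate.
    PINYIN_TABLE = {
        "zhong": ["中", "种", "重", "钟"],
        "wen":   ["文", "问", "温", "稳"],
    }

    def candidates(syllable):
        return PINYIN_TABLE.get(syllable, [])

    for i, ch in enumerate(candidates("zhong"), 1):
        print(i, ch)  # user picks a number, e.g. 1 -> 中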


I see it as a very basic token of respect for other cultures; the actual costs - some technical effort and a scammer here and there - don't even register in absolute terms.


>Do unicode URLs actually provide any real value?

No. They are very, very rarely used, even in countries that do not use the Latin alphabet. Given the risk that phishing poses, I don't think they're worth it.


But they're very rarely used because support for them hasn't been good. Support is still unpredictable; I run into failing URL parsers constantly.


Ouch. This is a good one.

Whilst it's easy to say "just show punycode always", people who use the web in other languages lose a lot of functionality because of it.

I could imagine a solution: collect a list of homoglyphs, and when the browser suspects an overlap, search for similarly spelt sites and warn the user that the site may be an imitation - while also converting the URL in the address bar to punycode.
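
A minimal sketch of that idea in Python - the confusables map and the "well-known sites" list are tiny illustrative stand-ins for real data:

    # Map every character to a canonical "skeleton" and warn when a
    # visited label's skeleton collides with a well-known site's.
    CONFUSABLES = {"а": "a", "е": "e", "р": "p", "ӏ": "l",
                   "о": "o", "с": "c"}

    def skeleton(label):
        return "".join(CONFUSABLES.get(ch, ch) for ch in label)

    WELL_KNOWN = {"apple", "google", "paypal"}

    def looks_like_imitation(label):
        sk = skeleton(label)
        return sk != label and sk in WELL_KNOWN

    print(looks_like_imitation("аррӏе"))  # True: skeleton is "apple"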

What other ideas are there?


How about an icon next to the URL bar that displays and allows a user to select their preferred Unicode block(s)? If a character in the URL falls outside those blocks, then the URL is highlighted in red or some warning is displayed.
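
The check itself is roughly trivial - a sketch in Python, assuming a user preference of Basic Latin only (ranges are illustrative):

    # Flag any character outside the user's chosen Unicode ranges.
    ALLOWED_RANGES = [
        (0x0000, 0x007F),  # Basic Latin
        # (0x0400, 0x04FF),  # Cyrillic - enable if that's your script
    ]

    def flag_unexpected(url):
        return [ch for ch in url
                if not any(lo <= ord(ch) <= hi
                           for lo, hi in ALLOWED_RANGES)]

    bad = flag_unexpected("аррӏе.com")
    if bad:
        print("warning: unexpected characters:", bad)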


Safari just displays it as "https://www.xn--80ak6aa92e.com" whereas Chrome (Version 57.0.2987.133 (64-bit)) displays it as the author intended.


Same with Pale Moon: it just displays the punycode instead of the Unicode characters.

Firefox 52.0.2 here displays it as apple.com. So I updated Firefox (which failed until I reran it) and then... it still displays it as apple.com (Firefox 53.0).


OK, I read the linked article now, and apparently you have to change a setting in Firefox in order to see the URL in punycode, as Mozilla decided it is "not a bug".

E.g. open about:config and set "network.IDN_show_punycode" to true to avoid this trap.


Chrome on iPad fails miserably: http://i.imgur.com/JjR5evp.jpg


Heh, and updating Chrome to Version 58.0.3029.81 (64-bit) fixes the issue there.



Chrome 58, rolled out yesterday, fixes the issue.


Except the fix seems to be simply to show the punycode URL.

That's not a fix, that's a workaround.

EDIT: This led me to read up on how various browsers handle non-ASCII letters, which in turn helped me discover that apparently no browser supports the German sharp-s ("ß"): it gets auto-expanded to "ss", although domains containing the sharp-s can be registered separately from "ss" domains - effectively allowing people to register domains that can't be accessed in any browser without explicitly using the unreadable punycode representation.
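
For what it's worth, the two IDNA generations disagree on ß, and you can see it in Python: the built-in "idna" codec implements IDNA 2003 (whose nameprep step folds ß to "ss"), while the third-party idna package, if you have it installed, implements IDNA 2008 and keeps ß distinct:

    # IDNA 2003 (Python's built-in codec): ß is folded to "ss".
    print("straße.de".encode("idna"))    # b'strasse.de'

    # IDNA 2008 (third-party "idna" package): ß survives and gets
    # its own punycode form.
    import idna
    print(idna.encode("straße.de"))      # b'xn--strae-oqa.de'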

EDIT2: It seems the fix is more fine-grained than just showing punycode for everything. So it's still a workaround (punycode URLs are not fit for human consumption, so this still actively punishes confusing domains even if they're not intentionally malicious), but it affects fewer domains than I initially feared.


It was already fixed back when domain names had to be plain ASCII.

It was West-centric, yes, but it allowed for unique and legible ASCII identifiers. And it encouraged non-ASCII languages to create a unique (or mostly unique) Latin representation of their scripts - which is, in general, a good thing. It encouraged unification, using ASCII as the common ground.

Allowing Unicode characters opened a new Pandora's box, creating a situation that is unsolvable - either we keep the new names, making almost every string of characters potentially ambiguous, or we return to the state where ASCII-only names are the only ones usable.

Also, differentiating between ASCII and non-ASCII names doesn't solve the problem. Imagine if the legitimate address is already in a non-ASCII script.


In what universe is ASCII "common ground"? And in what universe are a few scammers here and there a "Pandora's box"?

Some people in this thread seem almost eager to throw out any attempt at respecting cultures other than their own, using the earliest convenient excuse.


> In what universe is ASCII "common ground"?

Excluding EBCDIC, which has the same characters, can you name a major character set that doesn't start with a carbon copy of ASCII? Shift JIS starts with ASCII. Big5 starts with ASCII. Every code page starts with ASCII. Unicode, of course, starts with ASCII. Look at just about any (physical) keyboard for any language and it will support ASCII.


I can't think of a real fix though; you'd have to disambiguate the Unicode itself for that.


What would you consider a proper fix?


Unless they did something ridiculously clever, they just made IDN domains unusable. That means legitimate IDN domains are as affected as malicious ones, punishing non-ASCII languages.

A proper fix would keep the domain name human-readable but differentiate between the ASCII and homoglyph versions.

How? Not my job to figure that out. If you want a random idea: the homoglyphs could be rendered differently (i.e. make the font disambiguate them). That's probably not a perfect solution but I'm not getting paid to do this.


The fix is https://chromium.googlesource.com/chromium/src/+/08cb718ba7c... :

> Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф).

That's neither "ridiculously clever", nor will it make (non-nefarious) IDN domains unusable.
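
Assuming I'm reading the commit message right, the logic amounts to something like this Python sketch - the look-alike set is an illustrative subset of Chromium's actual list, and the TLD is assumed to arrive in punycode form:

    # Flag a label only when it is made entirely of Cyrillic letters
    # that look like Latin ones AND the TLD is not itself an IDN.
    LATIN_LOOKALIKE_CYRILLIC = set("аеорсѕхуіјӏ")

    def is_blocked(label, tld):
        if tld.startswith("xn--"):  # IDN TLD such as рф: check is off
            return False
        return bool(label) and all(ch in LATIN_LOOKALIKE_CYRILLIC
                                   for ch in label)

    print(is_blocked("аррӏе", "com"))       # True: show punycode instead
    print(is_blocked("аррӏе", "xn--p1ai"))  # False: рф is an IDN TLD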


Except that this assumes there are no legitimate IDN domains on non-IDN TLDs. Considering how few IDN TLDs there are, I would wager that most IDN domains don't live on these TLDs.

However, it seems they don't flat-out block all IDN domains, only labels made entirely of homoglyphs. IIUC they also don't block domains containing Cyrillic homoglyphs alongside other Cyrillic characters.

This seems somewhat reasonable. I still think rendering Cyrillic in a way that makes alphabet mismatches more obvious would be a better and more future-proof solution.


That fix is an improvement, but in general I think it is better to whitelist stuff instead. Unicode is huge and complicated.




