Hacker News

Your argument uses a website whose code was written by English speakers. There would still be ASCII, but verbose element names like "vector-page-toolbar-container" would definitely be shorter in both UTF-8 and UTF-16 if they weren't written in English.


<div class=..> is still <div class=..>, no matter the language of the author. As is "background-color: #fff" in CSS. In UTF-16 all that ASCII is almost double the size, so you'd have to replace a lot of identifiers with ones made of 2- or 3-byte characters to make up the difference.

Plus many identifiers come from libraries, and when creating their own identifiers many people use either full English or partial English no matter what language (it was a huge mistake to not use English for many identifiers in my first programming job, as you will invariably end up with a mishmash of two languages).

But it is easy enough to verify this with some actual websites: https://www.rakuten.co.jp is 330K in UTF-8 and 625K in UTF-16, https://ameblo.jp is 104K in UTF-8 and 187K in UTF-16, baidu.com is 360K in UTF-8 and 717K in UTF-16, sina.com.cn: 455K, 854K, daum.net: 666K, 1.2M.
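Those measurements are easy to reproduce. A minimal sketch (the helper name and the sample snippet are mine, not from any of the sites above): ASCII is 1 byte in UTF-8 and 2 in UTF-16, while most CJK is 3 bytes in UTF-8 and 2 in UTF-16, so the ratio of markup to text decides which encoding wins.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for `text`, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# A tiny markup-heavy sample: 31 ASCII characters of tags, 2 CJK ones.
sample = '<div class="post"><p>回答</p></div>'
print(encoded_sizes(sample))  # → (37, 66): UTF-8 wins easily
```

To check a live site, you could fetch its HTML (e.g. with urllib.request), decode it, and pass the text through the same helper.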

And all of that is only the HTML document; if we added up the CSS – where there's almost no opportunity to use non-ASCII outside of class and ID names – and the JavaScript – where the file size is usually dominated by React or jQuery or whatnot – things would skew even further in favour of UTF-8.

I'm sure there are examples where a page served as UTF-16 is smaller, such as CJK pages with very little markup (something as lean as HN), but that is not the common case, even for websites written exclusively in CJK languages for CJK-speaking users. Someone who does not speak a word of English will still save many bytes of data every day with UTF-8. There's a reason all those websites are served as UTF-8 and not UTF-16.

But for the sake of the argument, let's replace all class="...", id="..", and data-event-name=".." values with strings of the same length consisting of "回". That grows the file size from 118K to 151K, and ... it's still larger in UTF-16, at 207K. We could keep replacing more and eventually UTF-16 might win, maybe. But you have to use a lot of CJK. Let's use a random excerpt:

        <button
            id="回回回回回回回回回回回回回回回回回回回回回回回回"
            tabindex="-1"
            data-event-name="回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回"
That has 91 ASCII (7-bit) characters and 60 multibyte ones (this includes indentation, which may not be reproduced 100% accurately here). If we do the math this is:

  UTF-8    91×1 + 60×3 = 271 bytes
  UTF-16   91×2 + 60×2 = 302 bytes
UTF-8 still wins.
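You can rebuild the excerpt and let Python count the bytes (the exact indentation and trailing newline are my assumption, per the caveat above):

```python
# Rebuild the excerpt: 91 ASCII characters (including newlines)
# and 60 copies of 回 (U+56DE: 3 bytes in UTF-8, 2 in UTF-16).
excerpt = (
    '        <button\n'
    '            id="' + "回" * 24 + '"\n'
    '            tabindex="-1"\n'
    '            data-event-name="' + "回" * 36 + '"\n'
)

ascii_chars = sum(1 for c in excerpt if ord(c) < 128)
print(ascii_chars, len(excerpt) - ascii_chars)  # 91 60
print(len(excerpt.encode("utf-8")))             # 91×1 + 60×3 = 271
print(len(excerpt.encode("utf-16-le")))         # 91×2 + 60×2 = 302
```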

And to repeat, there are certainly cases where UTF-16 is smaller. Markdown documents and other plain-text files are an obvious one, but HTML is rarely one of them.

But imagine actually checking things before making a claim...


Also, HTTP tends to use gzip in-flight, and modern office document file formats also use compression, making any potential space savings for UTF-16 CJK text completely negligible in these common cases. UTF-8 and UTF-16 are basically the same size after compression[0].
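A quick way to see the compression point (the sample text is mine; real pages behave similarly):

```python
import gzip

# Mostly-CJK text: UTF-16 is clearly smaller uncompressed (2 vs 3
# bytes per character), but gzip erases almost all of the difference.
text = "圧縮後の大きさはほぼ同じです。" * 500
u8, u16 = text.encode("utf-8"), text.encode("utf-16-le")
g8, g16 = gzip.compress(u8), gzip.compress(u16)

print(len(u8), len(u16))   # uncompressed: 22500 vs 15000 bytes
print(len(g8), len(g16))   # compressed: both tiny and close in size
```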

[0] http://utf8everywhere.org/#asian


> But for the sake of the argument, let's replace all class="...", id="..", and data-event-name=".." with strings of the same length consisting of "回". That grows the filesize from 118K to 151

You're missing the point entirely: the number of characters you used is enough for 2 or 3 sentences. This was not an example constructed in good faith.


How is it "bad faith" when I spent the time actually converting and checking the entire document?



