One rather insidious one I'd never considered before is that lowercase 'n' is one bit flip away from '.'. So you can also bitsquat on things like, say, wwwngoogle.com or mailngoogle.com.
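You can check the distance yourself in a couple of lines (a quick sketch, not from the original comment):

```python
# 'n' (0x6E) and '.' (0x2E) differ only in bit 6 (0x40)
diff = ord('n') ^ ord('.')
print(hex(diff))  # 0x40

# Flipping that one bit in the first '.' of a hostname yields
# the bitsquat neighbour:
name = bytearray(b"www.google.com")
name[name.index(b".")] ^= 0x40
print(name.decode())  # wwwngoogle.com
```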

A researcher brought this to my attention years ago with a particular set of domains I won't name. What was most interesting to me is just how frequently bit errors must happen. According to the research, they'd basically received thousands of emails destined for the correct domain. Really makes you think.




Reminds me of a former colleague of mine who was one of the principal developers behind an Android app. They could really go to town on making it crash-free, since Android offers a phone-home function when a crash occurs.

At some point, he made the observation that 1-2% of crashes occur not because of a programmer error or anything, but "chance"; bit flips and the like, in either the app's memory or the phone's services.

So while on an individual basis it's rare, and in practice nobody will really notice an issue, statistically bit flips are significant enough to need attention.


I had a project where we were reaching for the 'cosmic ray' idea. But it turned out to be a supercap that did not hold enough charge for the NAND during power-off, combined with the way NAND ages. It worked when 'new', but as the device aged and saw use, the NAND would slowly take extra time to write a page. Eventually the write time was just long enough that it would mostly write everything, but it would also sometimes randomly flip bits here and there depending on where it was in the write cycle.


Isn't that the reason ECC memory was created?


Yep, and it's super annoying that it's not standard in all devices. I have it on my NAS, but not my PC.


I don't think bit errors happen nearly as often as people on HN seem to think. Sure, they happen, but they're relatively rare, especially on good-quality servers. We have ECC/hashing all over the place looking for them at various layers. I bet it's 1 in a trillion emails that gets a bit error in the addressing, if not rarer. I have written embedded code (FPGA) with hashing that ran for YEARS, and I could probably count on one hand the number of times we've had to address any issue (always failed hardware), and it was caught every time, unless SHA512 has failed us.


> relatively rare, especially on good quality servers

Good quality servers are the top of the reliability standings. The bottom... the bottom is very far down. Very far.

I'm a hardware guy now. I'm still amazed anything ever works at all. The worst are things like DDR SDRAM that work well enough that people think they can get away without error detection. At least modern MLC NAND flash is so bad that it obviously doesn't work at all without ECC. Better that than silent failures.

The first Guild Wars game actually added a Prime95-style iteration before sending a bug report/crash dump back. An astonishing number of clients failed this test; if I'm remembering the conversation I had correctly, something like half of all oddball crashes got filtered out by that test. Half! (To be fair, overclocked cheap gaming machines may actually be the bottom tier of reliability, but the point stands.)


> I don't think bit errors happen nearly as often as people seem to think on HN.

Back in the days of spinning disks, I had a coworker who had previously worked on the SQL Server test team.

One thing they tracked was bit errors. He said they were surprisingly common to encounter (and, of course, then had to be accounted for).

Or, to put it another way, there is a reason modern file systems are paranoid.

Also, way back when I managed a huge distributed test cluster, we ran ~3 million tests a week. A non-trivial number of test failures (a few hundred) were spurious network failures where the test package didn't get copied to the test client correctly; we detected that "something went wrong" and marked the test as failed. I know in theory Ethernet promises the data gets there without issue, but in reality, at scale, it didn't. Flaky hardware, a bad cable, a failing switch, a couple of machines with unreliable RAM? Who knows, but it happens.

> I could probably count on one hand the number of times we've had to address any issue (always failed hardware)

Another product I worked on, the bus timings for the DRAM were slightly off. And I do mean slightly, I wasn't directly involved but I think it was something stupid like being 1 cycle short on waiting for something.

We had amazing logging. As soon as we had a few hundred thousand units out, we got ~10 stack traces come in from crash dumps, always in the same place. My team spent a month going over every line of code related to that place in memory.

A new build goes out, more failures come in, again a tiny, tiny percentage; you'd only notice this at scale. We tracked it down: a pointer was being corrupted, and it was stored in the same memory address we had been looking at last time.

A bit was being flipped. We only noticed it because it had hit a pointer. Most of the DRAM, the vast, vast majority, was capable of operating 1 cycle out of spec, but some chips weren't.

We only detected and fixed this bit-flip issue because the principal engineer on that product insisted that we investigate and fix EVERY single product crash. He insisted on 100% reliability.

Do you think every firmware team for every single component of every single motherboard has that same dedication to quality? Because if not, errors are going to slip through, and they won't even be noticed, not to mention investigated.


Cool writeup. I think some groups or orgs experience the opposite of survivorship bias: even if an occurrence is rare, at their scale they get to see it many times.

> Do you think every firmware team for every single component of every single motherboard has that same dedication to quality?

The answer is, of course, no. Build quality and work conditions vary wildly.


But what about consumer hardware that might be failing? Say, your typical 7-10 year old PC still trucking along, serving as a Facebook-and-Word machine. Genuinely curious how common bit flips are in the wild.


For this bit flip to work, it would have to happen before the name was handed down to the actual resolver, since on the wire this flip would just cause an invalid parse of the DNS label.


This "attack" is against the ordinary user, so yes, it happens before or at the time of the DNS query. I'd expect (probably incorrectly) the server-side stuff, like recursive resolvers, to have ECC.

So imagine you are sending a mail to jim@mail.example.com. You see it in the address field, and it looks correct. You click send, your client resolves mailnexample.com instead, gets an address, and delivers the mail, while you are none the wiser.

This is particularly bad for third level domains, which are more common than you'd expect.


Since you have experience, maybe you can help explain something that was mentioned:

> In fact, out of the 32 valid domain names that are 1-bitflip away from windows.com

"windows" is 7 characters long, so how are there 32 1-bitflip combinations?

Also, one of the ones mentioned was "windo7s". At first that didn't look like a single bit flip to me, but "w" is 01110111 (0x77) and "7" is 00110111 (0x37), which differ in exactly one bit (0x40), so that one checks out.
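The 32 count can be reproduced by brute force. Here's a quick sketch (mine, not from the article), assuming a flipped character only counts when it's a lowercase letter, digit, or hyphen; a flip of bit 0x20 just toggles case and maps to the same case-insensitive domain, so uppercase is excluded:

```python
import string

# Assumption: "valid" hostname characters are lowercase letters,
# digits, and hyphens. Case flips (bit 0x20) don't create a new
# domain, so uppercase is deliberately left out of the set.
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitflip_variants(label: str):
    """Yield every distinct valid label one bit flip away from `label`."""
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID and flipped != ch:
                yield label[:i] + flipped + label[i + 1:]

variants = sorted(set(bitflip_variants("windows")))
print(len(variants))          # 32
print("windo7s" in variants)  # True
```

Each of the 7 characters has 8 candidate flips, but most land on control characters, punctuation, or the uppercase twin, which whittles 56 candidates down to 32.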


In theory, this should have lower efficacy than bit flips elsewhere in the domain. Periods aren’t actually sent in DNS queries, so you’re decreasing the number of places in memory where the necessary bit flip could occur. The equation changes slightly when you’re trying to intercept HTTPS traffic, though, since a bit flip in a DNS query wouldn’t help you anyway; it’d have to happen elsewhere.
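For reference, a DNS name goes on the wire as length-prefixed labels, so the 0x2E '.' byte never actually appears in the query. A minimal sketch of the encoding (my own illustration, ignoring compression pointers):

```python
def encode_qname(name: str) -> bytes:
    """Encode a hostname as an uncompressed DNS wire-format QNAME:
    each label is prefixed by its length byte, terminated by 0x00."""
    out = b""
    for label in name.split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

wire = encode_qname("mail.example.com")
print(wire)           # b'\x04mail\x07example\x03com\x00'
print(b"." in wire)   # False
```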


Presumably this shouldn't be a problem for most 'cloud' providers like AWS and Google who run ECC memory, right?


It happens on the system making the initial DNS request: so, for example, your PC.



