I'd love to hear more of this debugging story!

__turbobrew__ · 2024-07-04T02:50:50 1720061450

Containers are offered block storage by creating a loopback device with a backing file on the kubelet’s file system. We noticed that on some very heavily utilized nodes that iowait was using 60% of all the available cores on the node.

I first confirmed that nvme drives were healthy according to SMART, I then worked up the stack and used BCC tools to look at block io latency. Block io latency was quite low for the NVME drives (microseconds) but was hundreds of milliseconds for the loopback block devices.

This lead me to believe that something was wrong with the loopback devices and not the underlying NVMEs. I used cachestat/cachetop and found that the page cache miss rate was very high and that we were thrashing the page cache constantly paging in and out data. From there I inspected the loopback devices using losetup and found that direct io was disabled and the sector size of the loopback device did not match the sector size of the backing filesystem.

I modified the loopback devices to use the same sector size as the block size of the underlying file system and enabled direct io. Instantly, the majority of the page cache was freed, iowait went way down, and io throughout went way up.

Without BCC tools I would have never been able to figure this out.

Double caching loopback devices is quite the footgun.

Another interesting thing we hit is that our version of losetup would happily fail to enable direct io but still give you a loopback device, this has since been fixed: https://github.com/util-linux/util-linux/commit/d53346ed082d...

jauntywundrkind · 2024-07-04T04:55:03 1720068903

There's also either Composefs or Puzzlefs, both of which attempt to let the page cache work across containers!

https://github.com/containers/composefs https://github.com/project-machine/puzzlefs

FooBarWidget · 2024-07-04T10:23:19 1720088599

Which container runtime are you using? As far as I know both Docker and containerd use overlay filesystems instead of loopback devices.

And how did you know that tweaking the sector size to equal the underlying filesystem's block size would prevent double caching? Where can one get this sort of knowledge?

__turbobrew__ · 2024-07-04T14:53:25 1720104805

The loopback devices came from a CSI which creates a backing file on the kubelet’s filesystem and mounts it into the container as a block device. We use containerd.

I knew that enabling direct io would most likely disable double caching because that is literally the point of enabling direct io on a loopback device. Initially I just tried enabling direct io on the loopback devices, but that failed with a cryptic “invalid argument” error. After some more research I found that direct IO needs the sector size to match the filesystems block size in some cases to work.

M_bara · 2024-07-04T17:28:05 1720114085

We had something similar about 10 years ago where I worked. Customer instances were backed via loopback devices to local disks. We didn’t think of this - face palm - on the loop back devices. What we ended up doing was writing a small daemon to posix fadvise the kernel to skip the page cache… your solution is way simpler and more elegant… hats off to you