
Linux containers feel like a very weak imitation of what they could be under an environment like Plan9 imo.

Linux lacks a lot of core abstraction properties that would make containers elegant to implement under something like the Plan9 model, at least.

Cool project inspired partially by Linux containers: https://doc.9gridchan.org/guides/spawngrid



> Linux lacks a lot of core abstraction properties

No, it's worse than that: It has too many of them, leading to a mess of special cases that you have to deal with. What happens when you have a socket in your file system, and you export it over NFS?

Lacking abstraction properties is fixable -- you can add them. But removing them, especially if they're widely used, is incredibly hard.


Making good abstractions is hard. On Unix I sometimes wish I could unwrap the stream abstraction, but nevertheless I think it is one of the few abstractions that have really stood the test of time.

Why wouldn't a socket exported over NFS just work seamlessly?


Because it's a kernel data structure that merely has a name in the file system. The remote machine doesn't have access to it.


That's true for regular files as well.


No, the remote machine has access to file structures via NFS, which is complicated enough.

NFS doesn't have protocol-level special cases for forwarding operations on local sockets, handling setsockopt(), or the various socket ioctls -- which, mind you, are often machine-specific: the data sent in an ioctl is ABI-dependent. I'm not even sure how you'd do that sort of thing, since NFS is a stateless protocol.

And then you would need to repeat the exercise for these special cases for all of the other special types of node, like /dev. What does it even mean to mmap the video memory of a video card on a remote machine?

And then you'd need to fix the assumptions of all the software that assumes local semantics ("the connection doesn't drop, and poll() is always accurate").

On top of that, you'd need to run on a treadmill to add support for new ioctls.

Do you really want to implement the server side of a protocol that can handle all of the complexity of all you can do on all file systems, with all ioctls, across all node types? How many implementations providing resources via this protocol do you think would exist?


What does it mean to mmap a file on an NFS server? Isn't it a connection drop when a local process dies, too? What happens when a disk is suddenly removed?

> On top of that, you'd need to run on a treadmill to add support for new ioctls.

Absolutely, it'd be a lot of work. So it's a better idea to not implement many of these things and instead simply return an error.


> What does it mean to mmap a file on an NFS server?

It means you have issues around synchronization and performance, if you use it as anything other than a private read only mapping.

And some things are just impossible, like a shared memory ringbuffer. Which is exactly what you do with the memory you mmap from a video card: submit commands to the command ringbuffer.

> So it's a better idea to not implement many of these things and instead simply return an error.

And now you need to start writing multiple code paths in user code, testing which calls work and which don't, one of which will be broken due to lack of testing. And when you guess wrong at the needed operations, software often goes off the rails instead of failing gracefully. Failure modes like blindly retrying forever, or assuming the wrong state of the system and destroying data.

Too many complicated abstractions break the ability to do interesting things with a system. It's death by a thousand edge cases.

On plan 9, you have 9p.

https://9p.io/magic/man2html/5/0intro

That, and process creation/control/namespace management, are the only ways to do anything with the system. There are few edge cases. Implementing a complete, correct server is a matter of hours, not weeks.


> like a [remote] shared memory ringbuffer.

Technically just as possible, only very slow... Performance is abstracted out by the VFS. You need to stay sane through other measures, like having your software configured right, etc.

> And now you need to start writing multiple code paths in user code

I don't think the number of paths is increased. Any software should handle calls that fail - if only by bailing out. That's acceptable for any operation that just can't complete due to failed assumptions - whether it's about file permissions or that the resource must be "performant" / not on an NFS share, etc.

> 9p.

> Now what is the point? How's that different or better? They actually are much more into sharing resources over the network... which means fewer possible assumptions about availability/reliability/performance. I doubt they can make the shared ringbuffer work better.


> Technically just as possible, only very slow... Performance is abstracted out by the VFS.

How would you go about implementing the CPU cache coherency that allows you to do the cross machine compare and swap?

> I don't think the number of paths is increased. Any software should handle calls that fail - if only by bailing out.

If the software works fine without making a call, then you can just skip the extra work in the first place. Delete the call, and the checks around if the call fails. And if the call is important somehow, you need to find some workaround, or some alternative implementation, which is by definition never going to be very well tested.

> Now what is the point? How's that different or better? They actually are much more into sharing resources over the network... which means less possible assumptions about availability/reliability/performance. I doubt they can make the shared ringbuffer work better.

The tools to make a shared ringbuffer that depends on cache coherent operations simply aren't there -- it's not something you can write with those tools.

And that's the point: The tools needed simply don't work across the network. Instead of trying to patch broken abstractions, adding millions of lines of complexity to support things that aren't going to work anyways (and if they do work, they'd work poorly) pick a set of abstractions that work well everywhere, and skip the feature tests and guesswork.

Primitives that work everywhere, implement them uniformly, and stop special casing broken or inappropriate tools.

And then, it's a day of work to implement a 9p server, and everything works with it. So I can serve git as a file based API, DNS as a file API, fonts as a file based API, doom resources as a file API, or even json hierarchies as a file API, and not worry about whether my tools will run into an edge case. I can export any resource this way, and not need special handling anywhere.

Plan 9 doesn't have VNC; it has 'mount' and 'bind', which shuffle around which `/dev/draw` your programs write to, and which `/dev/mouse` and `/dev/kbd` they read from.

Plan 9 doesn't have NAT; it has 'mount' and 'bind', which shuffle around which machine's network stack your programs talk to.

Plan 9 doesn't have SSH proxying that applications need to know about: It has sshnet, which is a file server that provides a network stack that looks just like any other network stack.

From parsimony comes flexibility. You're not dragging around a manacle of complexity.


> How would you go about implementing the CPU cache coherency that allows you to do the cross machine compare and swap?

build it into the protocol!

And so on...

> The tools to make a shared ringbuffer that depends on cache coherent operations simply aren't there -- it's not something you can write with those tools. And that's the point: The tools needed simply don't work across the network.

Ok. In theory, we just need to build access to the tool in the network protocol and have the network server execute the magic on the remote machine.

Of course, one needs a way to map e.g. a CAS operation to a network request. I don't think today's CPUs let us do that.

> Delete the call, and the checks around if the call fails.

    FILE *f = fopen(filepath, "rb");
    if (f == NULL)
        fatal("Failed to open file %s!\n", filepath);
There. I wouldn't remove a line, and I've magically handled whatever error condition it was, regardless if I've thought about network transparency issues or not.

> Primitives that work everywhere, implement them uniformly, and stop special casing broken or inappropriate tools.

I've never seriously looked at 9p, but the page you linked strongly suggests to me that it's more abstraction if anything (your initial statement was that that's bad), and vaguer as a consequence. More like HTTP, and I don't think of HTTP as a universal solution -- it's rather a bandaid to glue things together with minimal introspection (HTTP verbs, status codes...). And the fact that it tries to be universal also means it doesn't match some problems very well, so people just sidestep it (I'm not a web person, but I've heard of major services that always return HTTP 200 and use HTTP merely as a transport for their custom RPC mechanism or whatever).

> Plan 9 doesn't have VNC ... NAT ... SSH

Great. I get it. 9p is a basic transport method that gives some introspection for free if you can model your problem domain as an object hierarchy. But it's far from a free solution for any problem. It might save you some parsing in some cases, but it doesn't compress your VNC stream, for example. Nor does it define the primitives of problem domains it can't know about.


> build it in the protocol!

You don't have access to the inter-processor cache snooping in software. This is CPU interconnect internal shit, and you actually need access to the local memory bus for correctness. mmap in its full glory is only really worth having if you can share pages from the buffer cache.

And even if you did, and you turn a ten nanosecond operation into a ten millisecond operation, counting the network packets you send (a factor of a million overhead), without the assumption that all the peers are reliable and never fail, the abstraction still breaks. And if you assume all your peers are reliable in a distributed system, you're wrong. Damned either way.

> I've never seriously looked at 9p, but the page you linked strongly suggests to me that it's more abstraction if anything

No, it's a single abstraction, instead of dozens that step on each other's toes.

> Great. I get it. 9p is a basic transport method that gives some introspection for free

What introspection? It's just namespaces and transparent passthrough. Unless you're talking about file names.


Yes, I realized the CPU issue and already updated my comment. Technically we would need a way to catch the CAS operation and convert it into a network request - like for example segfaults can be handled and converted into a disk load.

And also we'd need to extend all the cache coherency stuff over the network.

> And if you assume all your peers are reliable in a distributed system, you're wrong. Damned either way.

Technically you have reliability issues with all the components inside a single system as well. They are just more reliable. But I'm sure I have seen hard disks fail, etc.

--

Ok, let me think about that abstraction stuff. Thanks.


I'd also argue that if you need to turn a cache snoop into a network round trip (or several?), your abstraction is just as broken as if it returned the wrong value; it's unusably slow :)


Yes, that's why I'm still having trouble understanding the fuss about the VFS abstraction as used in 9p etc. ;-) I've always been glad to know when I was not on an NFS or sshfs mount (mostly for reasons that you can't design away, i.e. network reliability issues). So why bother abstracting out that knowledge even more?


When you come at it from the perspective that remote resources are the norm, and you assemble a computer from resources scattered across the network, local access becomes the weird special case. Generally, you're running a file server and terminal as separate nodes, with an auth server somewhere else.

And if you're actually using the knowledge that some files are local, bugs and assumptions creep in, and now your software stops working when you're running on someone else's network.

It's about transparently providing the same interface to everyone, and making that interface simple enough that implementing it is easy, so it actually gets implemented, and the interface actually gets used.

Then, if you want to interpose, remap, analyze, manipulate, redirect, or sandbox it, you can do that without much trouble. The special cases are rare and can be reasoned about.

Reasoning about your system in full frees you to do a huge amount.


To be fair that is a general UNIX issue.


Linux containers feel like weak imitations of jails and zones found in FreeBSD and Solaris/illumos, for that matter.


Except the user interface for them on FreeBSD, compared to e.g. Docker, is atrocious.


Yeah zones especially, I agree.


Why not Jails? It was pretty much the first to provide this in Unix land. I guess the advantage of Docker / containers was / is the nice TUI for developers as well as Hub. We never had a ‘jailshub’ in FreeBSD.


Somehow I think Tru64 and HP-UX vaults had it first.




