How to change symlinks atomically (2005)

Animats · on Aug 21, 2016

Ah, yes, atomic file system operations in UNIX-like systems.

May not be available for your operating system variant. May not be available for your file system. Not applicable to some remote file systems. May not function properly in some virtual machines. Consult your storage area network vendor for additional information.

wfunction · on Aug 21, 2016

Yeah, it's another one of the many things that the Windows kernel gets right but *nix programmers never care to give it credit for. `FILE_LINK_INFORMATION` in Windows has a `ReplaceIfExists` flag just for this purpose.

rincebrain · on Aug 21, 2016

I'm not sure which is wrong, the person writing or the documentation, but rename() claims to be atomic with few exceptions even when overwriting an existing item.[1]

The only reason I suspect things do unlink-then-rename is that (evidently) rename() doesn't promise anything about what happens to the "src" item if rename() fails, just that something will still be in "dest".[1]

But I don't know why the commands do something different, and don't want to try reading the source of either GNU or BSD mv to find out why.

[1] - http://pubs.opengroup.org/onlinepubs/009695399/functions/ren...

AnonymousPlanet · on Aug 22, 2016

It is one thing to get it right from a design perspective, but another to get it right from an implementation and common usage perspective (see the sibling threads to this comment).

If it doesn't exist in practice or does something different in a thousand edge cases, it's rightfully dismissed as nonexistent.

pif · on Aug 21, 2016

It's funny how you name Windows kernel in a comment about symbolic links, considering that it _misses_ symbolic links.

ordinary · on Aug 21, 2016

https://en.wikipedia.org/wiki/NTFS_symbolic_link

geocar · on Aug 21, 2016

These things are called symbolic links, but are not what UNIX calls a symbolic link. The name is unfortunate and perhaps cannot be helped.

icebraining · on Aug 21, 2016

What's the difference? Because MS claims it has created them "to function just like UNIX links", and I'm not seeing how they're wrong.

geocar · on Aug 21, 2016

On UNIX, anyone can create a symbolic link, and a symbolic link can contain arbitrary text, while Windows requires admin privileges and that the target be a valid file path.

The new WSL does have symbolic links, but these are only available to Linux programs.
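To illustrate the "arbitrary text" point on the UNIX side, here is a minimal sketch that stores a string that is not a path in a symlink and reads it back. The function name `roundtrip_symlink_text` and its parameters are made up for this example; only `symlink()` and `readlink()` are real POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: create a symlink called `name` whose contents are
 * `text` (which need not be a valid path), then read the contents back.
 * Returns the number of bytes read, or -1 on error. readlink() does not
 * NUL-terminate, so the terminator is added here. */
ssize_t roundtrip_symlink_text(const char *text, const char *name,
                               char *buf, size_t bufsize)
{
    if (bufsize == 0)
        return -1;
    if (symlink(text, name) != 0)
        return -1;

    ssize_t n = readlink(name, buf, bufsize - 1);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

The link created this way dangles if you try to follow it, but the kernel stores and returns the text unchanged.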

icebraining · on Aug 21, 2016

The admin-privilege requirement is just the default security policy; you can configure it to allow regular users.

geocar · on Aug 21, 2016

That is really bad advice.

An application should not change the default security policy.

icebraining · on Aug 21, 2016

I'm not saying it should. I'm saying it's not a property of the symlink feature. Some Linux distro could implement the same policy, but symlinks would still be symlinks.

geocar · on Aug 21, 2016

> I'm not saying it should. I'm saying it's not a property of the symlink feature.

When someone uses the term "symbolic link" to mean something different than someone else using the term "symbolic link", we can assume they:

(a) do not understand what UNIX is calling a symbolic link, either because they have never heard the term, did not research it properly, or have some kind of learning disability

(b) are attempting to intentionally confuse users

We do not assume that "symbolic link ``features''" now means something else: We need justification to do that, and we don't have it:

A "link" is what POSIX calls a directory entry (§3.130), and a "symbolic link" is simply a directory entry (link) without a file associated with it.

When we are thinking about this clearly we can see what Microsoft did wrong, but we're under no obligation to support them in their definition because clearly their definition is less useful than the POSIX one.

> Some Linux distro could implement the same policy,

Users would revolt; they did revolt.

SELinux was popularized by a number of Linux distributions, and included a configuration that did exactly this, however people just turned it off because breaking applications pisses people off.

> but symlinks would still be symlinks.

No.

Just because they use the same words does not make them the same thing. You can use α-conversion to fix this without anyone else's help: call Microsoft's "symbolic link" foo and UNIX's symbolic link bar, and then you can see the sentence "foo would still be bar" for the nonsense that it is.

lisivka · on Aug 21, 2016

No, it cannot be disabled via a policy in Linux, because it is regular FS operation, not a filter.

icebraining · on Aug 21, 2016

Sure it could, SELinux policies can operate on regular FS operations just fine.

lisivka · on Aug 21, 2016

Windows symbolic links to files are distinct from Windows symbolic links to directories. The default security settings in Windows Vista/Windows 7 disallow non-elevated administrators and all non-administrators from creating symbolic links. Symbolic links do not work at boot.

marcosdumay · on Aug 21, 2016

You know, that fixes the "May not be available for your operating system variant." disclaimer.

All the others still apply, and the GP forgot to tell you to check your HD model's documentation.

CyberShadow · on Aug 21, 2016

It goes much further than that:

https://en.wikipedia.org/wiki/Transactional_NTFS

josefbacik · on Aug 21, 2016

This is just an accident of how some file systems are implemented and isn't actually guaranteed. If you did this on xfs you could still end up with a 0-length symlink if you crashed at just the right time.

caf · on Aug 21, 2016

The article is talking about atomic with respect to processes running at the same time as the operation, not with respect to a system crash.

jjnoakes · on Aug 21, 2016

xfs doesn't guarantee atomic renames in the same directory?

I thought that requirement was from POSIX.

Does xfs not conform to POSIX?

adrianratnapala · on Aug 21, 2016

I think what josefbacik means is that there is no guarantee that the original symlink under its temp name has actually been written to disk. After the `rename()`, there is still no guarantee.

I think this is true in many filesystems, not just `xfs`.

If your atomicity requirement is only that the file never disappears from the POV of an external process, then the OP's method is sufficient. If you want crash-proofing as well, then you will need an fsync() -- preferably on the tempfile BEFORE the rename().

geocar · on Aug 21, 2016

No, you want to fsync() on the directory, not the "tempfile", and after the operation, not before. Consider:

            d=open(".",O_RDONLY), unlink("t");
    /* 1 */ symlink("new","t");
    /* 2 */ rename("t", "link");
    /* 3 */ fsync(d);
    /* 4 */ close(d);

Crashing, at (1) nothing has happened yet, (2) we might have "t" or we might not, (3) we might have "t", or not, and we might have "link" pointing to "new" or "old", but we can't have "link" pointing to anything else (or empty), and finally at (4) the change cannot be reverted.

You can insert a second fsync() where you suggest at point (2), but all this will guarantee is that we will have "t" in the directory because the symlink contents are part of the directory they live in. This might be useful for some applications, but the cost of two disk writes is high enough it may be worth redesigning your application.
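For anyone who wants to try the sequence above, here is a fuller sketch with error handling. It is an illustration under assumptions, not a definitive implementation: the function name `replace_symlink_durably` and its parameters are invented here, and it uses the `*at()` variants relative to the directory descriptor instead of the plain calls in the snippet. The numbered comments mirror the labels (1) to (4) used above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: make the symlink `name` inside directory `dir`
 * point at `target`, going through the scratch name `tmp` in the same
 * directory, then fsync() the directory. Returns 0 on success, -1 on
 * error with errno set by the failing call. */
int replace_symlink_durably(const char *dir, const char *target,
                            const char *name, const char *tmp)
{
    int d = open(dir, O_RDONLY | O_DIRECTORY);
    if (d < 0)
        return -1;

    int rc = -1;
    unlinkat(d, tmp, 0);                      /* drop a stale temporary, if any */
    if (symlinkat(target, d, tmp) == 0) {     /* (1) new link under the temp name */
        if (renameat(d, tmp, d, name) == 0) { /* (2) switch over in one step */
            if (fsync(d) == 0)                /* (3) ask for the directory to be persisted */
                rc = 0;
        } else {
            int saved = errno;
            unlinkat(d, tmp, 0);              /* do not leave the temporary behind */
            errno = saved;
        }
    }

    int saved = errno;
    close(d);                                 /* (4) */
    errno = saved;
    return rc;
}
```

Whether the directory fsync() is enough to persist the link contents on a given filesystem is exactly what this subthread is arguing about, so treat step (3) as the approach described above rather than a guarantee.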

adrianratnapala · on Aug 21, 2016

If you crash at (3) you can -- at least in principle -- have "link" pointing to garbage (most likely an empty file). That is, the dirent points to the new inode, but the actual link-text got lost with the crash.

Now on modern filesystems, a non-huge symlink will be stored in the inode itself and presumably enjoys some sort of atomicity. But there is nothing in the standard about that.

geocar · on Aug 21, 2016

> If you crash at (3) you can -- at least in principle -- have "link" pointing to garbage (most likely an empty file).

No, I don't think you can, bugs notwithstanding. A "link" (§3.130) is what POSIX calls a directory entry.

> But there is nothing in the standard about that.

The "standard" (POSIX) doesn't talk much about crashing, however if mkdir("a") could destroy "b" – even during a system crash (§3.387), then users would complain.

josefbacik · on Aug 21, 2016

The rename is atomic; the data being in the file is the problem. There is no guarantee unless you fsync.

trav4225 · on Aug 21, 2016

Sorry, would you mind clarifying? "The data being in the file"?

The way I understand the article's proposal is this:

1. Create new symlink pointing to desired file (assumed to already exist in a stable state).

2. Move new symlink over old symlink.
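The two steps can be sketched in C as follows. This is a minimal illustration, not the article's code: the name `replace_symlink` and the caller-supplied scratch name `tmp_path` are assumptions, and the scratch name must live on the same filesystem as the link (normally the same directory) for `rename()` to work.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: point `link_path` at `target` without a window in
 * which `link_path` does not exist. Returns 0 on success, -1 on error. */
int replace_symlink(const char *target, const char *link_path,
                    const char *tmp_path)
{
    /* Clear any stale temporary left by an earlier failed run. */
    unlink(tmp_path);

    /* Step 1: create the new symlink under the temporary name. */
    if (symlink(target, tmp_path) != 0)
        return -1;

    /* Step 2: rename() replaces the old link in a single operation. */
    if (rename(tmp_path, link_path) != 0) {
        int saved = errno;
        unlink(tmp_path);   /* clean up the temporary on failure */
        errno = saved;
        return -1;
    }
    return 0;
}
```

This only covers atomicity as seen by other running processes; it makes no attempt at crash safety.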

asdfaoeu · on Aug 21, 2016

Symlinks are just special files whose contents are the link target. His argument is that it is not atomic when considering a server crash. But I don't think that matters anyway.

geocar · on Aug 21, 2016

Symlinks are not files, they do not have inodes and you cannot open them to fsync their contents.

They exist as directory entries with a small in-directory content only, thus syncing the directory they exist in is sufficient to persist them.

icebraining · on Aug 21, 2016

Symlinks do have inodes, and their content (the link information) will be stored in that inode structure, if it fits, not in the directory.

See e.g. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Symb...
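A quick way to check this on your own system is to compare what `lstat()` reports for the link itself with what `stat()` reports for its target. The sketch below is only an illustration; the function name `symlink_has_own_inode` is made up, and what the numbers mean is up to the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if the symlink at `link_path` reports a
 * different (device, file serial number) pair than its target, 0 if the
 * pair is the same, and -1 on error, if the path is not a symlink, or if
 * the link dangles. */
int symlink_has_own_inode(const char *link_path)
{
    struct stat link_st, target_st;

    if (lstat(link_path, &link_st) != 0)    /* the link itself */
        return -1;
    if (!S_ISLNK(link_st.st_mode))
        return -1;
    if (stat(link_path, &target_st) != 0)   /* what the link points to */
        return -1;

    printf("link st_ino=%llu, target st_ino=%llu\n",
           (unsigned long long)link_st.st_ino,
           (unsigned long long)target_st.st_ino);

    return link_st.st_dev != target_st.st_dev ||
           link_st.st_ino != target_st.st_ino;
}
```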

geocar · on Aug 21, 2016

Sorry, you're right, however this is an implementation detail, and nothing to do with POSIX which did not define any use for the `d_ino` field for a symbolic link[1]. I think this is made unnecessarily difficult by Linux/ext4 calling something an "inode" that is not what UNIX traditionally called an inode (or what POSIX calls a "file serial number").

I think it is more useful to think of the symbolic link as "inside" the directory it is in, because this tells the programmer what to do (and what to sync: the directory).

[1]: http://pubs.opengroup.org/onlinepubs/009695399/functions/rea...

yuubi · on Aug 21, 2016

An inode, traditionally, is the per-file data structure on disk (or in memory) that contains the user/group IDs, file type (plain/directory/device/etc; not jpeg/png/executable/etc), permissions, file size, and the location of the content (either a list of block numbers, or the address of a block containing the list if it wouldn't all fit into the fixed-size inode), among other things. These structures were identified by a number. This is an implementation detail not specified by POSIX.

A directory, historically (on systems with 14- or 30-character file name limits like we had in the 1990s, at least 4.2 or 4.3 BSD-derived), consisted of a file containing a list of (16-bit inode number, file name) structs, thus making a nice round 16 or 32 bytes per directory entry, evenly dividing a 512-byte block. On some older systems you could see this structure by opening and reading a directory with open and read (I think every system everywhere prohibited writes to a directory because modifying a directory would allow you to access files in unreadable directories, uncleanly delete any file given its inode number, etc). The only special thing about a directory on disk was that its inode said it was a dir, so the filesystem code would trust it as a source of inode numbers and otherwise use it as a dir.

Some of the 14-byte-name systems supported symlinks. As there was no space in the directory entry for anything other than an inode number, the link contents would have to be accessed through an inode. (I think I've heard of systems that could store very small file contents like typical symlink target names in the inode instead of allocating a data block, but can't name one).

The number you see in ls -i, which is sometimes called just the "inode" (and apparently also called the "file serial number"), is the number in the directory entry that's used to find the inode. I guess someone somewhere had or intended to have a system that could store symlinks in the directory instead of consuming an inode.

caf · on Aug 21, 2016

The owning uid/gid has always been stored in the inode rather than the directory entry. Did symlinks traditionally have identical ownership to the containing directory?

geocar · on Aug 21, 2016

Yes. IIRC this occurred sometime around SUSv2 (maybe a little earlier) when many filesystem implementations were storing the contents of the symbolic link physically outside of the directory listing.

josefbacik · on Aug 21, 2016

You are confusing hardlinks with symlinks. They are similar but not the same. Symlinks in Linux most certainly get their own inode and can contain data blocks.

Rapzid · on Aug 21, 2016

Per the documentation this seems to be the case. However, I'd be curious to know if the implementation of the symlink syscall does or does not fsync the new file while in kernel space...

The fact that it does or does not should actually be in the documentation IMHO; probably a good enhancement request.

geocar · on Aug 21, 2016

Symlinks are directory entries, not files, so you need to sync the directory that contains the link.

POSIX does (for some strange reason) permit a symbolic link to have an "inode" (well, d_ino), but there is no way to open this "inode", and no UNIX implementations to my knowledge do this.

josefbacik · on Aug 21, 2016

Symlinks are implemented as files in Linux, we write the path into the file that you are pointing to, so there most certainly is data that must be fsync()'ed if you want it to be persistent.

Edit: If you want to fsync the symlink you can do an open with nofollow iirc (on my phone so I can't check) so you get the actual symlink and not its target.

geocar · on Aug 21, 2016

Yes, but Linux recognises the need to sync the file object when you fsync() the directory that it is in. There is no way to fsync() the symlink directly.
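If you want to test the "open with nofollow" idea yourself, a small sketch follows. As far as I know, on Linux a plain `open()` with `O_NOFOLLOW` on a symlink fails with `ELOOP` instead of returning a descriptor for the link, which is consistent with there being nothing to hand to `fsync()`; other systems may report a different error. The function name `open_nofollow_errno` is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` read-only without following a
 * symlink in the final path component. Returns 0 if the open succeeded,
 * otherwise the errno value reported by open(). */
int open_nofollow_errno(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOFOLLOW);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```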

Rapzid · on Aug 21, 2016

I'm not sure if this is picking nits, but symlinks have their own inode number.. This ought to classify them as a file.

HOWEVER, looking into it, it seems that the target may actually be stored in the inode. This means the symlink, though a file, has no contents and thus requires no fsync. Does this sound about right?

geocar · on Aug 21, 2016

> I'm not sure if this is picking nits, but symlinks have their own inode number..

No they don't.

First of all, POSIX doesn't use the term "inode number" but "file serial number" which is silly wankery since there's no way to access a file using its "file serial number". They might as well say it was "implementation defined".

Secondly, POSIX used to specify[1] that their inode was "unspecified" because UNIX systems don't store symbolic links in separate inodes, but in the directory that contains them. Now, POSIX specifies symbolic links do have a "file serial number"[2], which could be implemented as a separate file block (confusingly Linux calls this an "inode" which has nothing to do with what UNIX called an inode -- a better term might be "virtual inode" but Linux uses that for something else entirely...)

To this end, I think the only sensible interpretation is the original one: UNIX symbolic links don't have inodes, and POSIX symbolic links might as well have a "hash" of the file contents in the "file serial number" since you can't do anything with it anyway.

[1]: http://pubs.opengroup.org/onlinepubs/009695399/functions/rea...

[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/re...

> HOWEVER, looking into it, it seems that the target may actually be stored in the inode. This means the symlink, though a file, has no contents and thus requires no fsync. Does this sound about right?

You have to fsync() the directory that contains the symlink.

Rapzid · on Aug 21, 2016

I see what you're saying, I believe. I also think we are possibly talking about different things (me Linux and you POSIX). AFAICT my symlinks have unique inodes on Linux...

geocar · on Aug 21, 2016

Yes.

What UNIX called an inode is unfortunately not what Linux now calls an inode. That thing might have better names, but we are stuck with it.

I don't know if it is useful to say "Linux inode" because that can refer to both what "ext4" calls an inode and what the Linux kernel refers to as a "virtual inode".

What POSIX called a "file serial number" (and goes in the `d_ino` field) is also not what Linux calls an inode. I sometimes think this is okay to refer to as a "POSIX inode" because POSIX doesn't otherwise use the term, and up until SUS6 the POSIX file serial number was compatible with a UNIX inode.

However I'm not trying to nit: Trying to develop a simplified mental model is essential to determining what to do (and what to sync!). From an application point of view, knowing that a symbolic link has a name (and contents) and lives in a directory, can help guide us.

caf · on Aug 21, 2016

> What UNIX called an inode is unfortunately not what Linux now calls an inode. That thing might have better names, but we are stuck with it.

How so? Here's a v6 on-disk inode: https://github.com/hephaex/unix-v6/blob/master/ino.h

and here's an ext4 on-disk inode: http://lxr.free-electrons.com/source/fs/ext4/ext4.h#L704

The latter is considerably larger, to be sure, but still contains all of the fields from the former (many of them with the same name!).

geocar · on Aug 21, 2016

That may be, but Linux also uses the term "inode" to refer to what the BSDs refer to as a "vnode", which also have similar names.

caf · on Aug 22, 2016

UNIX v6 used the term "inode" to refer to that as well - it was BSD that changed the name.

UNIX called them "inode" and "permanent inode"; Linux calls them "in-memory inode" and "on-disk inode". The BSDs use "vnode" and "inode".

userbinator · on Aug 21, 2016

As much as the Win32 API is criticised for many of its functions containing often-unused parameters, symlink() having a flag parameter to control whether to overwrite the original link's contents if it exists would've made this much easier.

adrianratnapala · on Aug 21, 2016

You would need the extra parameter + the guarantee that the symlink syscall did the overwrite atomically. And I don't think proliferating such guarantees throughout the API is going to end well.

In Unix we have this convention that `rename()` is our atomicity Swiss Army knife. I guess it makes it easier for implementers to only have to make their strong guarantees for one system call (+ a few related friends like `renameat`).

Now you could argue that `rename()` is a bit too under-powered for this job, and maybe we want transactions or something. But NTFS tried that too and deprecated them.

What I would like to see in Unix is better support for anonymous files and directories. You could use this with things like `renameat()` so that the temporary never touches the filesystem and is automatically cleaned up if the rename fails.

JoshTriplett · on Aug 21, 2016

> What I would like to see in Unix is better support for anonymous files and directories. You could use this with things like `renameat()` so that the temporary never touches the filesystem and is automatically cleaned up if the rename fails.

Exactly. This approach makes the various calls orthogonal. Various syscalls create anonymous reference-counted inodes on a filesystem (which disappear when all references to them do), and syscalls like linkat or renameat install such an inode (referenced by file descriptor) into the filesystem with a name.

O_TMPFILE allows creating an anonymous file inode. Ideally, a syscall would exist to do the same for an anonymous symlink inode.

adrianratnapala · on Aug 21, 2016

The most general thing -- involving the fewest new interfaces, even if the implementation is a big deal -- would be an anonymous tmpfs, where applications could stage a sort of poor man's transaction.

The hard bit is that for it to be useful, you would need atomic renames from the TMPFS to some other arbitrary filesystem.

I guess abolishing the EXDEV error would be possible where the target filesystem has journaling, but it sounds like a big job to me.

cyphar · on Aug 21, 2016

On GNU/Linux there's also the (badly designed) memfd_create() syscall which allows you to create an anonymous inode even if / is entirely read-only and you don't have mounting permission. While you could do it with a user+mount namespace and sending the fd over a UNIX socket, sometimes making a syscall is a better idea. :P

geocar · on Aug 21, 2016

memfd_create() does not create an inode. It only creates a file descriptor.

With a file descriptor you can create multiple mappings to the same physical address -- something otherwise impossible on UNIX or Linux (although mach allows you to vm_remap[1] which is often sufficient).

[1]: http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/vm_remap...

JoshTriplett · on Aug 21, 2016

> memfd_create() does not create an inode. It only creates a file descriptor.

It creates both; memfd_create (and timerfd_create, signalfd, eventfd, userfaultfd, epoll, and others) use a kernel subsystem called "anon_inode" to create an in-memory inode to back the file descriptor.

JoshTriplett · on Aug 21, 2016

(Minor correction: memfd_create creates an in-memory inode without using anon_inode, while the various other syscalls use anon_inode.)

cyphar · on Aug 21, 2016

It creates a full inode using shmem[1], which eventually goes to ramfs[2]. It's a real inode in every sense of the word.

[1]: http://lxr.free-electrons.com/source/mm/shmem.c?v=4.2#L2949

[2]: http://lxr.free-electrons.com/source/fs/ramfs/inode.c?v=4.2#...
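From userspace you can at least see that the descriptor is backed by something that answers `fstat()` like a regular file. A minimal Linux-only sketch, assuming a glibc recent enough to provide the `memfd_create()` wrapper; the function name `memfd_stat` and the label `"example"` are arbitrary choices for illustration.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous memfd, fstat() it into *st,
 * and close it again. Returns 0 on success, -1 on error. */
int memfd_stat(struct stat *st)
{
    /* The first argument is only a label shown under /proc/self/fd. */
    int fd = memfd_create("example", MFD_CLOEXEC);
    if (fd < 0)
        return -1;

    int rc = fstat(fd, st);
    close(fd);
    return rc;
}
```

After a successful call, `S_ISREG(st.st_mode)` should hold and `st.st_ino` carries whatever number the kernel assigned to the backing object.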

geocar · on Aug 21, 2016

I have been indelicate: What Linux calls a "struct inode" is not what UNIX traditionally called an inode[1], and I meant the latter without perhaps enough specificity.

[1]: https://en.wikipedia.org/wiki/Inode

JoshTriplett · on Aug 21, 2016

What don't you like about the design of memfd_create?

cyphar · on Aug 21, 2016

The fact that the first argument (the "name") is only used as a debugging interface so that the /proc/self/fd/... "symlinks" have the name you set. I don't agree that this is useful for any real debugging, and constructing your syscalls so that they "make debugging easier" is not something that I really understand the motivations behind -- your ABI should be consistent and logical, not based on what you think a good debugging interface should be.

JoshTriplett · on Aug 21, 2016

Ah, fair enough; I agree completely. It should only have taken a flags argument.

ashitlerferad · on Aug 21, 2016

Some related articles:

https://yakking.branchable.com/posts/moving-files-1-copying/

https://yakking.branchable.com/posts/moving-files-2-sparsene...

https://yakking.branchable.com/posts/moving-files-3-faster/

johnwheeler · on Aug 21, 2016

As a Python/web developer who's never written a line of production C/C++ code in his 16-year career, it's always humbling to stumble into these threads on HN.

Rapzid · on Aug 21, 2016

I highly, highly recommend getting in the habit of consulting OS syscall documentation for stuff. The behaviour of the syscall determines most of the behaviour for this stuff in every language (with a few exceptions like stdlib buffering, etc). The Linux syscall docs are pretty easy to consume even if you don't write C. "The Linux Programming Interface" is an awesome book based on the man pages (by the main man page maintainer). Not sure what exists for Windows in the same vein book-wise, but technet or something surely has what you NEED.

johnwheeler · on Aug 21, 2016

Thank you. I will check it out. I've been looking into the `select` call a little because I'm working with websockets, but I'm not sure if that's related to syscall. It's amazing what you can get away with not knowing with computers nowadays. It doesn't necessarily mean you should strive for ignorance, of course.