Flags as a system call API design pattern

February 12, 2014

This article was contributed by Michael Kerrisk.

The renameat2() system call recently proposed by Miklos Szeredi is a fresh reminder of a category of failures in the design of kernel-user-space APIs that has a long history on Linux (and, going even further back, Unix). A closer look at that history yields a lesson that should be kept in mind for all future system calls added to the kernel.

The renameat2() system call is an extension of renameat() which, in turn, is an extension of the ancient rename() system call. All of these system calls perform the same general task: manipulating directory entries to give an existing file a new name on the same filesystem. The original rename() system call took just two arguments: the old pathname and the new pathname. renameat() added two arguments, one associated with each pathname argument. Each of the new arguments can be a file descriptor that refers to a directory: if the corresponding pathname argument is relative, then it is interpreted relative to the associated directory file descriptor, rather than the current working directory (as is done by rename()).

renameat() was one of a raft of thirteen new system calls added to Linux in kernel 2.6.16 to perform various operations on files. The twofold purpose of the directory file descriptor argument is elaborated in the openat(2) manual page:

to avoid race conditions that could occur with the corresponding traditional system calls if one of the directory components in a (relative) pathname was changed at the same time as the system call, and
to allow the implementation of per-thread "current working directories" via directory file descriptors.
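
As a minimal sketch of the first purpose, the following shows a rename performed relative to a directory file descriptor. The helper name rename_in_dir() is hypothetical; only open(), renameat(), and close() are real calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>   /* open(), O_DIRECTORY */
#include <stdio.h>   /* renameat() */
#include <unistd.h>  /* close() */

/*
 * Rename oldname to newname, both interpreted relative to the directory
 * dirpath. rename_in_dir() is a hypothetical helper, not a standard call.
 * Returns 0 on success, -1 on error with errno set by the failing call.
 */
int rename_in_dir(const char *dirpath, const char *oldname,
                  const char *newname)
{
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* Relative names are resolved against dirfd rather than the current
     * working directory, so the result does not change if a component
     * of dirpath is renamed after the open() above. */
    int ret = renameat(dirfd, oldname, dirfd, newname);

    int saved_errno = errno;    /* close() may overwrite errno */
    close(dirfd);
    errno = saved_errno;
    return ret;
}
```

Passing AT_FDCWD in place of dirfd would make renameat() resolve relative names against the current working directory, just as rename() does.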

The next step, renameat2(), extends the functionality of renameat() to support a new use case: atomically swapping two existing pathnames. Although that use case is related to the earlier system calls, it was necessary to define a new system call for one simple reason: renameat() lacked a mechanism for the kernel to support (and the caller to request) variations in its behavior. In other words, it lacked the kind of flags bit-mask argument that is provided by system calls such as clone(), fcntl(), mremap(), and open(), all of which allow a varying number of arguments, depending on the bits specified in the flags argument.

renameat2() implements the new "swap" functionality and adds a new flags argument whose bits can be used to select variations in behavior of the system call. The first of these bits is RENAME_EXCHANGE, which selects the "swap" functionality; without that flag, renameat2() behaves like renameat(). The addition of the flags argument hopefully forestalls the need to one day create a renameat3() system call to add other new functionality. And indeed, Andy Lutomirski soon observed that another flag could be added: RENAME_NOREPLACE, to prevent a rename operation from overwriting an existing file. Formerly, the only race-free way of preventing an existing file from being clobbered was to use link() (which fails if the target pathname exists) to create the new name, followed by unlink() to remove the old name.
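
The traditional link()-plus-unlink() technique can be sketched as follows. The helper name rename_noreplace() is hypothetical. Unlike a rename with RENAME_NOREPLACE, this approach is not a single atomic step (both names exist for a moment), and it cannot be used on directories.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>  /* link(), unlink() */

/*
 * Give oldpath the name newpath without ever replacing an existing file.
 * rename_noreplace() is a hypothetical helper name.
 * Returns 0 on success; returns -1 with errno set to EEXIST if newpath
 * already exists (or to another value for other failures).
 */
int rename_noreplace(const char *oldpath, const char *newpath)
{
    /* link() refuses to overwrite: the existence check and the creation
     * of the new name happen as one step inside the kernel. */
    if (link(oldpath, newpath) == -1)
        return -1;

    /* Both names now refer to the same file; remove the old one. */
    return unlink(oldpath);
}
```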

Mistakes repeated

There is, of course, a sense of déjà vu about the renameat2() story, since the reason that the earlier renameat() system call was required was that rename() lacked the extensibility that would have been allowed by a flags argument. Consideration of this example prompts one to ask: "How many times have we made that particular mistake?" The answer turns out to be "quite a few."

One does not need to go far to find some other examples. Returning to the thirteen "directory file descriptor" system calls that were added in Linux 2.6.16, we find that, with no particular rhyme or reason, four of the new system calls (fchownat(), fstatat(), linkat(), and unlinkat()) added a flags argument that was not present in the traditional call, while eight others (faccessat(), fchmodat(), futimesat(), mkdirat(), mknodat(), readlinkat(), renameat(), and symlinkat()) did not. (The remaining call, openat(), retained the flags argument that was already present in open().)

Of the new calls that did not include a flags argument, one, futimesat(), was soon superseded by a new call that did have a flags argument (utimensat(), added in Linux 2.6.22), and renameat() seems poised to suffer the same fate. One is left wondering: would any of the remaining calls also have benefited from the inclusion of a flags argument? Studying this set of functions further, it is soon evident that the answer is "yes", in at least three cases.

The first case is the faccessat() system call. This system call lacks a flags argument, but the GNU C Library (glibc) wrapper function adds one. If bits are specified in that argument, then the wrapper function instead uses the fstatat() system call to determine file access permissions. It seems clear that the lack of a flags argument was realized too late, and the design problem was subsequently papered over in glibc. (The implementer of the "directory file descriptor" system calls was the then glibc maintainer.)

The second case is the fchmodat() system call. Like the faccessat() system call, it lacks a flags argument, but the glibc wrapper adds one. That wrapper function allows for an AT_SYMLINK_NOFOLLOW flag. However, the flag is not currently supported, because the kernel doesn't provide the necessary support. Clearly, the glibc wrapper function was written to allow for the possibility of an fchmodat2() system call in the future.

The third case is the readlinkat() system call. To understand why this system call would have benefited from a flags argument, we need to consider three of the system calls that were added in Linux 2.6.16 that do permit a flags argument: fchownat(), fstatat(), and linkat(). Those system calls added the AT_EMPTY_PATH flag in Linux 2.6.39. If this flag is specified in the call, and the pathname argument is an empty string, then the call instead operates on the open file referred to by the "directory file descriptor" argument (and in this case, that argument can refer to file types other than directories). This allows these system calls to provide functionality analogous to that provided by fchown() and fstat() in the traditional Unix API. (There is no "flink()" in the traditional API.)

Strictly speaking, the AT_EMPTY_PATH functionality could have been supported without the use of a flag: if the pathname argument was an empty string, then these calls could have assumed that they are to operate on the file descriptor argument. However, the requirement to use a flag serves the dual purposes of documenting the programmer's intent and preventing accidents that might occur if the pathname argument was unintentionally specified as an empty string.
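
A short sketch of the AT_EMPTY_PATH behavior described above, using fstatat(). The helper name stat_by_fd() is hypothetical; on glibc, AT_EMPTY_PATH is exposed only when _GNU_SOURCE is defined.

```c
#define _GNU_SOURCE      /* exposes AT_EMPTY_PATH in <fcntl.h> */
#include <fcntl.h>       /* AT_EMPTY_PATH */
#include <sys/stat.h>    /* fstatat(), struct stat */

/*
 * Stat the open file that fd refers to, using the "directory file
 * descriptor" call rather than fstat(). stat_by_fd() is a hypothetical
 * helper name. Returns 0 on success, -1 on error with errno set.
 */
int stat_by_fd(int fd, struct stat *st)
{
    /* An empty pathname on its own is rejected with ENOENT; the flag
     * states explicitly that fd itself is the intended target. */
    return fstatat(fd, "", st, AT_EMPTY_PATH);
}
```

Because the empty string is an error unless the flag is present, a pathname that is empty by accident produces a failure instead of silently operating on the descriptor.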

The "operate on a file descriptor" functionality also turned out to be useful for readlinkat(), which likewise added that functionality in Linux 2.6.39. However, readlinkat() does not have a flags argument; the call simply operates on the file descriptor if the pathname argument is an empty string, and thus does not have the benefits that the AT_EMPTY_PATH flag confers on the other system calls. Thus readlinkat() is another system call where a flags argument would have been desirable.

In summary, then, of the eight "directory file descriptor" system calls that lacked a flags argument, this lack has turned out to be a mistake in at least five cases.

Of course, Linux developers were not the first to make this kind of design error. Long before Linux appeared, there was wait() without flags and then wait3() with flags. And Linux has gone on to fix some instances of this design error in APIs inherited from Unix, adding, for example, dup3() as a successor to dup2(), and pipe2() as the successor to pipe() (both new system calls added in kernel 2.6.27).

Latter-day missing-flags examples

But, given the lessons of history, we've managed to repeat the mistake far too many times in Linux-specific system calls. As well as the directory file descriptor examples mentioned above, here are some other examples:

Original system call       Successor
epoll_create() (2.6.0)     epoll_create1() (2.6.27)
eventfd() (2.6.22)         eventfd2() (2.6.27)
inotify_init() (2.6.13)    inotify_init1() (2.6.27)
signalfd() (2.6.22)        signalfd4() (2.6.27)

The realization that certain system calls might need a flags argument sometimes comes in waves, as developers realize that multiple related APIs may need such an argument; one such wave occurred in Linux 2.6.16, when four of the "directory file descriptor" system calls added a flags argument.

As can be seen from the other examples shown just above, another such wave occurred in kernel 2.6.27, when a total of six new system calls were added. All of these new calls, as well as accept4(), which was added for the same reasons in Linux 2.6.28, return new file descriptors. The main reason for the addition of the new calls was to allow the caller the option of requesting that the close-on-exec flag be set on the new file descriptor at the time it is created, rather than in a separate step using the fcntl(F_SETFD) operation. This allows user-space applications to avoid certain race conditions when using the traditional counterparts of these system calls in multithreaded applications. Those races could occur when one thread tried to create a file descriptor and use fcntl(F_SETFD) to set its close-on-exec flag at the same time as another thread happened to perform a fork() plus execve(). (The socket() and socketpair() system calls also added this new functionality in 2.6.27. However, somewhat bizarrely, this was done by jamming bit flags into the high bytes of these calls' socket type argument, rather than creating new system calls with a flags argument.)
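
The difference between the two approaches can be sketched with epoll_create() and its successor epoll_create1(). The two helper names are hypothetical; the calls and the EPOLL_CLOEXEC flag are the real ones.

```c
#include <fcntl.h>       /* fcntl(), F_GETFD, F_SETFD, FD_CLOEXEC */
#include <sys/epoll.h>   /* epoll_create(), epoll_create1(), EPOLL_CLOEXEC */
#include <unistd.h>      /* close() */

/* Two-step approach: another thread can fork() and execve() between
 * epoll_create() and fcntl(), leaking the descriptor into the new program.
 * Returns the new descriptor, or -1 on error. */
int epoll_cloexec_two_step(void)
{
    int fd = epoll_create(1);   /* size is ignored but must be positive */
    if (fd == -1)
        return -1;

    int fdflags = fcntl(fd, F_GETFD);
    if (fdflags == -1 || fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* One-step approach: the descriptor is created with close-on-exec already
 * set, so there is no window in which it can leak.
 * Returns the new descriptor, or -1 on error. */
int epoll_cloexec_one_step(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}
```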

Turning to more recent Linux development history, we see that a number of new system calls added since kernel 2.6.28 have all included a flags argument, including fanotify_init(), fanotify_mark(), open_by_handle_at(), and name_to_handle_at(). However, in all of those cases, the flags argument was required at the outset, so no decision about future-proofing this aspect of the API was required.

On the other hand, there have been some misses or near misses for other system calls. The syncfs() system call added in Linux 2.6.39 does not have a flags argument, although one wonders whether some filesystem developer might have taken advantage of such a flag, if it existed, to allow the caller to vary the manner in which a filesystem is synced to disk. And the finit_module() system call added in Linux 3.8 only got a flags argument after some last minute prompting; once added, the flag proved immediately useful.

The conclusion from this oft-repeated pattern of creating new incarnations of system calls that add a flags argument is that a suitable question to ask during the design of every new system call is: "Is there a reason not to include a flags argument in the API?" Asking the question that way should more often lead developers to follow the wise example of the process_vm_readv() and process_vm_writev() system calls added in Linux 3.2. The developers of those system calls included a (currently unused) flags argument on the suspicion that it may prove useful in the future. History suggests that they'll one day be proved right.



Flags as a system call API design pattern

Posted Feb 13, 2014 8:16 UTC (Thu) by blackwood (guest, #44174) [Link] (10 responses)

So step 1 is to add a flags parameter everywhere, then step 2 is to have a testcase to check that the kernel indeed rejects still unused bits with -EINVAL. Since otherwise some userspace piece _will_ put random gunk in there, rendering your shiny new flags parameter immediately useless. At least that's been my experience with driver-private command submission interfaces for gpu drivers.

So nowadays this is one of the iron rules I have when adding new ioctls. We're not yet at the "actually bother to document the ioctl" stage because this is drm and we need to protect our claim to fame ;-)
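
The rule in this comment (refuse any still-unused bit with -EINVAL) can be sketched as below. Everything here is hypothetical: the EX_ flag names and ex_check_flags() belong to an imaginary interface, not to any real kernel API.

```c
#include <errno.h>

/* Hypothetical flag bits for an imaginary call. */
#define EX_FLAG_A      0x1u
#define EX_FLAG_B      0x2u
#define EX_KNOWN_FLAGS (EX_FLAG_A | EX_FLAG_B)

/*
 * Validate a flags argument, kernel style: return 0 if every set bit is
 * a defined flag, or -EINVAL if any undefined bit is set.
 */
int ex_check_flags(unsigned int flags)
{
    if (flags & ~EX_KNOWN_FLAGS)
        return -EINVAL;
    return 0;
}
```

Since a caller that passes garbage fails from the first release onward, the undefined bits remain free to be given a meaning later.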

Flags as a system call API design pattern

Posted Feb 13, 2014 9:17 UTC (Thu) by kugel (subscriber, #70540) [Link] (1 responses)

I agree, step 2 is equally important. This should be in some syscall cook book :)

Flags as a system call API design pattern

Posted Feb 13, 2014 17:26 UTC (Thu) by meuh (guest, #22042) [Link]

That could be the article "Botching up ioctls", by Daniel Vetter:

http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

Flags as a system call API design pattern

Posted Feb 13, 2014 10:14 UTC (Thu) by paulj (subscriber, #341) [Link] (7 responses)

Step 3: you need some way to distinguish between mandatory flags and optional flags. So then you have to consider how to deal with that. E.g., yet another variant of the syscall, e.g. to have 2 different arguments for each set of flags?

See the O_TMPFILE open fun in: https://lwn.net/Articles/558940/

Flags as a system call API design pattern

Posted Feb 13, 2014 14:35 UTC (Thu) by kugel (subscriber, #70540) [Link] (1 responses)

Nah, O_TMPFILE is a showcase why step 2 is important and that it hasn't been done for open().

Optional flags (those that you would like to be ignored if unknown or unsupported for whatever reason) can be handled in user space, for example by retrying the syscall without the flag, without adding more measures to the syscall interface.

Flags as a system call API design pattern

Posted Feb 13, 2014 15:08 UTC (Thu) by paulj (subscriber, #341) [Link]

The problem with open() and O_TMPFILE was that the kernel treated unknown flags as optional, but an application using O_TMPFILE would want it as *mandatory*. Such an application would have no way to tell whether O_TMPFILE actually was honoured, because the open() would succeed, regardless of whether the kernel recognised that flag (e.g. if you run the application on an older kernel). You can't test and retry, because open generally didn't treat unknown flags as a failure.

Generally, to be able to introduce new mandatory flags, while allowing optional flags, you need either to distinguish between mandatory flags and optional in the API in some way, or you need some other way to allow the application to feature-test at runtime (but what if it forgets to do this, and then gets run on an old kernel?).

Otherwise, you need to rely on being able to find an API usage-specific hack that happens to work, as was done for open/O_TMPFILE, by also setting some other unrelated flags that *would* together cause an error on older kernels. ;) These kinds of API-specific hacks might not always be available.

Flags as a system call API design pattern

Posted Feb 14, 2014 18:28 UTC (Fri) by giraffedata (guest, #1954) [Link] (4 responses)

> So then you have to consider how to deal with that. E.g., yet another variant of the syscall, e.g. to have 2 different arguments for each set of flags?

I'm not sure what issue this describes, but what I do when I design an interface with extra flags for forward compatibility is I add a word of flag space and declare the first half to be for mandatory flags and the second half to be for optional flags. The recipient rejects any nonzero reserved bits in the first half and ignores any reserved bits in the second half.
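
The split-word scheme described in this comment might look like the sketch below. All names are hypothetical, and treating the upper 16 bits as the "first half" is an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout: upper 16 bits are mandatory flags, lower 16 bits
 * are optional flags. */
#define HALF_MANDATORY_MASK   0xffff0000u
#define HALF_OPTIONAL_MASK    0x0000ffffu

#define HALF_MAND_STRICT      0x00010000u  /* a defined mandatory flag */
#define HALF_OPT_HINT         0x00000001u  /* a defined optional flag  */

#define HALF_KNOWN_MANDATORY  (HALF_MAND_STRICT)
#define HALF_KNOWN_OPTIONAL   (HALF_OPT_HINT)

/*
 * Returns -EINVAL if an undefined mandatory bit is set. Otherwise stores
 * the flags this recipient understands in *accepted and returns 0;
 * undefined optional bits are silently dropped.
 */
int half_filter_flags(uint32_t flags, uint32_t *accepted)
{
    if (flags & HALF_MANDATORY_MASK & ~HALF_KNOWN_MANDATORY)
        return -EINVAL;

    *accepted = flags & (HALF_KNOWN_MANDATORY | HALF_KNOWN_OPTIONAL);
    return 0;
}
```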

The mandatory/optional flag forward compatibility issue hasn't received much attention, but it's really just a special case of a larger compatibility validation issue. Imagine a web server written by someone who knows only Firefox and tested only with Firefox. The server detects at run time that the browser is "Iceweasel." The author never heard of Iceweasel. Should the program send the Firefox-oriented data to Iceweasel and assume it is smart enough to emulate Firefox, or tell the user it doesn't know how to drive Iceweasel and avoid a possible disaster?

I know storage servers that refuse to use a SCSI disk drive of a model number not in a list with which the server is known to work. And SCSI is a standard carefully designed to make that never necessary. These designers, working on a system call processor, might refuse to recognize any unknown flag as optional.

Flags as a system call API design pattern

Posted Feb 14, 2014 20:01 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (3 responses)

> The recipient rejects any nonzero reserved bits in the first half and ignores any reserved bits in the second half.

While I appreciate the elegance of this approach, it does have a major flaw: since the recipient ignores anything it doesn't recognize in the second half, senders are free to put whatever random data they want there. Later, when new optional flags are defined, these applications break.

This has happened multiple times in the Linux userspace APIs, and since breaking previously-working user applications isn't allowed no matter how they abuse the APIs, you effectively can't redefine any bit you've previously ignored. If you don't require a specific value for unused bits, you won't be able to use them in any later versions. Better to just reject unrecognized bits and leave userspace to implement a fallback when the syscall fails.

> Imagine a web server written by someone who knows only Firefox and tested only with Firefox. The server detects at run time that the browser is "Iceweasel." The author never heard of Iceweasel. Should the program send the Firefox-oriented data to Iceweasel and assume it is smart enough to emulate Firefox, or tell the user it doesn't know how to drive Iceweasel and avoid a possible disaster?

To answer that you would need a protocol specification. Doing this properly requires senders and receivers to work from the same spec. If you're just inferring one possible spec from the way you've seen Firefox behave then you can make up whatever arbitrary rules you want, so long as Firefox passes them, though it's safest to bail out early rather than continue after seeing something unexpected.

Normally, of course, you'd write your web server to the HTTP specification, not a particular browser, and a browser reporting itself as "Iceweasel" is still acting within the spec and thus not giving you any reason to error out.

Flags as a system call API design pattern

Posted Feb 14, 2014 20:34 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

How do you reject unrecognised flags while still allowing for optional flags?

That isn't really optional then. Rather, using the API becomes potentially a hand-shaking process ("let me try see if the kernel knows this new flag.. Hmm, no. What about this one ..." etc.). Better then to have a single call that lets the application query for the accepted flags once.

In network protocols too, specifying unused flags as "Must Be Zero" has meant that later, when people wanted to use them, they often effectively could not (sometimes it is not possible to fall-back, there may be no opportunity to probe for supported flags). MBZ bits often end up being completely useless and wasted.

Flags as a system call API design pattern

Posted Feb 15, 2014 11:41 UTC (Sat) by khim (subscriber, #9252) [Link]

> How do you reject unrecognised flags while still allowing for optional flags?

Optional flags do not exist, period. There are only “flags you don't care about” and “flags you do care about”. Think FUTEX_PRIVATE. It was added as very much an “optional” flag to make pthreads faster. For a pthreads implementation it's an “optional” flag. But for something like NaCl that same flag is very much a mandatory flag because its use prevents information leaks.

> Better then to have a single call that lets the application query for the accepted flags once.

Why? “Let me try see if the kernel knows this new flag” is very simple and cheap if you do it right (take a look at glibc; it contains dozens of such cases).
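
A probe-and-fall-back sequence of the kind this comment has in mind might be sketched as follows, using pipe2() as the example. The helper pipe_cloexec() and its cached have_pipe2 variable are hypothetical; this is not a copy of any glibc code.

```c
#define _GNU_SOURCE      /* exposes pipe2() in <unistd.h> */
#include <errno.h>
#include <fcntl.h>       /* fcntl(), F_SETFD, FD_CLOEXEC, O_CLOEXEC */
#include <unistd.h>      /* pipe(), pipe2(), close() */

/*
 * Create a pipe with close-on-exec set on both ends, preferring the
 * one-step call and falling back to two steps on kernels that lack it.
 * Returns 0 on success, -1 on error with errno set.
 */
int pipe_cloexec(int fds[2])
{
    /* 1 = untested or known to work, 0 = known to be unsupported.
     * Unsynchronized: concurrent callers may at worst repeat the probe. */
    static int have_pipe2 = 1;

    if (have_pipe2) {
        if (pipe2(fds, O_CLOEXEC) == 0)
            return 0;
        if (errno != ENOSYS && errno != EINVAL)
            return -1;      /* a genuine error, not a missing feature */
        have_pipe2 = 0;     /* remember the probe result */
    }

    /* Fallback: two steps, with the race window that pipe2() avoids. */
    if (pipe(fds) == -1)
        return -1;
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        int saved_errno = errno;
        close(fds[0]);
        close(fds[1]);
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

The probe costs one failed system call on an old kernel, after which the cached result skips it.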

Flags as a system call API design pattern

Posted Feb 14, 2014 22:42 UTC (Fri) by giraffedata (guest, #1954) [Link]

> This has happened multiple times in the Linux userspace APIs, and since breaking previously-working user applications isn't allowed no matter how they abuse the APIs, you effectively can't redefine any bit you've previously ignored.

Has this really happened with fields that are documented as "reserved for future use - must be zero" and someone put random garbage in there?

It isn't really true that you can't break previously-working user applications with new kernel code. There are a few cases of API abuse becoming the standard that make the news because the abuse was so widespread as to be worth tolerating, but I'm sure there are thousands of instances where some application bug that was innocuous in Linux N expressed itself in Linux N+1 and everyone agreed breaking the application was appropriate.

The widespread abuses usually were somewhat deliberate - it saved someone significant effort or seemed to be legal. In contrast, failing to initialize memory is more likely to be in the rare and unforgiven category.

Flags as a system call API design pattern

Posted Feb 13, 2014 12:49 UTC (Thu) by sorokin (guest, #88478) [Link] (1 responses)

I wonder why the function for atomically swapping two existing files is renameat2() with a special flag and not swapat().

Using flags to completely change a function's semantics is a bad thing, I suppose.

Flags as a system call API design pattern

Posted Feb 13, 2014 14:25 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

What is renameat2's behavior with paths that reside on different filesystems? I'd assume it fails due to the atomicity guarantees it can no longer make. Since rename already has some restrictions in that regard, swapat sounds, to me, like it might not care by default (and would need a flag for atomicity).

As for behavior change based on flags, one coworker was working with a tool which output to stdout/stderr by default, was silent with -E (no argument), but -EE took an argument for where to write the output (but just the output data from the conversion, not logging), so there's some insanity out there. I hope the kernel avoids such...behavior for a single syscall (outside of *ctl calls).

Flags as a system call API design pattern

Posted Feb 13, 2014 14:31 UTC (Thu) by jezuch (subscriber, #52988) [Link]

I look at the escalating list of combined filesystem operations which do more and more things atomically, and I think that they surely must have considered (and rejected?) filesystem transactions? :)