A new system call restart mechanism

[Posted December 10, 2002 by corbet]

System calls often have to wait for things - I/O completion, availability of a resource, or simply for a timeout to expire, for example. Normally the process making the system call becomes unblocked at the appropriate time, and the call completes its work and returns to user space. What happens, though, if a signal is queued for the process while it is waiting? In that case, the system call needs to abort its work and allow the actual delivery of the signal. For this reason, kernel code which sleeps tends to follow the sleep with a test like:

    if (signal_pending(current))
        return -ERESTARTSYS;

After the signal has been handled, the system call will be restarted (from the beginning), and the user-space application need not deal with "interrupted system call" errors. For cases where restarting is not appropriate, a -EINTR return status will cause a (post-signal) return to user space without restarting the system call.
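For contrast, here is roughly what the -EINTR case looks like from user space: the call fails with errno set to EINTR and the application has to retry on its own. This is only a sketch; read_retry() is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: call read() again whenever it fails with EINTR,
 * i.e. whenever the kernel chose not to restart the call itself.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);

    return n;   /* bytes read, 0 at end of file, or -1 on a real error */
}
```

When the kernel returns -ERESTARTSYS and restarts the call on its own, a loop like this is unnecessary.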

In general, this mechanism works reasonably well. But, what about cases where the system call should not just be restarted from the beginning? The case which raised that question is the nanosleep() system call, which puts the process to sleep for a (potentially) short time. By the POSIX standard, nanosleep() should not return early as a result of a signal if the process has no handler for that signal. So the call should be restarted. The problem is that the argument to nanosleep() tells how long the process wants to sleep - not when it wants to wake up. When the call is restarted, it must take into account how long the process had slept before the signal, and how long it took to deal with the signal, and adjust the sleep time accordingly. In other words, it should save the absolute time when the process wanted to wake up, and the restarted call should sleep until that time (or just return if the time has already passed). But there is no easy place for a system call to save that sort of information.
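The arithmetic behind "save the absolute wakeup time" can be sketched in user-space C. The function names deadline_from() and remaining_until() are made up for illustration, and the code assumes normalized timespec values; it is not the kernel's implementation.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Absolute wake-up time: time of the original call plus the requested sleep. */
struct timespec deadline_from(struct timespec now, struct timespec req)
{
    struct timespec d;

    assert(now.tv_nsec >= 0 && now.tv_nsec < NSEC_PER_SEC);
    assert(req.tv_nsec >= 0 && req.tv_nsec < NSEC_PER_SEC);

    d.tv_sec = now.tv_sec + req.tv_sec;
    d.tv_nsec = now.tv_nsec + req.tv_nsec;
    if (d.tv_nsec >= NSEC_PER_SEC) {    /* carry into the seconds field */
        d.tv_sec++;
        d.tv_nsec -= NSEC_PER_SEC;
    }
    return d;
}

/* Time left to sleep on restart; zero if the deadline has already passed. */
struct timespec remaining_until(struct timespec deadline, struct timespec now)
{
    struct timespec r = { 0, 0 };

    if (now.tv_sec > deadline.tv_sec ||
        (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec))
        return r;                       /* nothing left: just return */

    r.tv_sec = deadline.tv_sec - now.tv_sec;
    r.tv_nsec = deadline.tv_nsec - now.tv_nsec;
    if (r.tv_nsec < 0) {                /* borrow from the seconds field */
        r.tv_sec--;
        r.tv_nsec += NSEC_PER_SEC;
    }
    return r;
}
```

Because the deadline is computed once, time spent sleeping before the signal and time spent handling it are both accounted for automatically when the remaining time is recomputed.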

To solve this problem, Linus added a new mechanism to the 2.5.51 kernel, based on work by George Anzinger. This mechanism allows interrupted system calls to specify a different function to run when the call is restarted, along with information to be passed to that function.

Specifically, the thread_info structure now includes a restart_block structure. A system call needing different restart behavior can put a restart handler function into that structure, along with some arguments for that function. Then, if interrupted, the system call should return -ERESTART_RESTARTBLOCK. After the signal is dispatched, and if the process specified no handler for it (and the process is still alive), the function in the restart block will be called, with the block itself as an argument.
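The flow can be modeled in a small user-space simulation. Everything here is an assumption made for illustration: the structure layout, the SKETCH_RESTART value, the simulated clock and the function names are invented, and none of it is the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RESTART (-1)     /* arbitrary stand-in for the restart code */

/* Hypothetical model of a restart block: a handler plus one saved argument. */
struct restart_sketch {
    long (*fn)(struct restart_sketch *);
    unsigned long arg0;         /* here: the absolute wake-up tick */
};

unsigned long now_ticks;        /* simulated clock */
unsigned long signal_tick;      /* tick at which a signal arrives; 0 = none */

long sleep_restart(struct restart_sketch *rb);

/* Sleep until 'wake'; on interruption, record how to resume. */
long sleep_until(struct restart_sketch *rb, unsigned long wake)
{
    assert(rb != NULL);

    if (now_ticks >= wake)
        return 0;                       /* deadline already passed */

    if (signal_tick > now_ticks && signal_tick < wake) {
        now_ticks = signal_tick;        /* woken early by the signal */
        signal_tick = 0;                /* signal consumed */
        rb->fn = sleep_restart;         /* what to run on restart */
        rb->arg0 = wake;                /* saved absolute wake-up time */
        return SKETCH_RESTART;
    }

    now_ticks = wake;                   /* slept the full time */
    return 0;
}

/* Restart handler: called with the block itself as its only argument. */
long sleep_restart(struct restart_sketch *rb)
{
    return sleep_until(rb, rb->arg0);
}

/* Entry point taking a relative duration, as nanosleep() does. */
long sketch_sleep(struct restart_sketch *rb, unsigned long duration)
{
    return sleep_until(rb, now_ticks + duration);
}
```

In this model the caller plays the role of the signal-delivery code: on seeing SKETCH_RESTART it lets the simulated signal handling take some time, then calls rb->fn(rb) instead of re-running sketch_sleep() from the beginning.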

nanosleep(), which is currently the only user of this mechanism, need only save the wakeup time in the restart block, along with pointers to the user arguments. Interrupted sleeps will now be handled properly. It is not clear how many other system calls will make use of the new restart system; in most cases it is better to just return -EINTR in complicated situations. But, for cases where you really need to see the operation through, the new mechanism should help.
