
Futex Reacquainted

# tech, 2021-01-30

To be fair, the futex (Fast Userspace muTEX) syscall has a rather misleading name, in the sense that it simply isn’t a mutex. In the original 2002 paper (Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux), it was invented specifically to implement a fast mutex, but it is capable of much more. I prefer to think of futex as the minimal useful subset of kernel support from which most high-level synchronization constructs can be built.

Minimal Useful Subset

Before futex was born, people in need of a proper mutex (one whose waiters actually block instead of spinning) had to use heavyweight kernel objects such as file locks (fcntl) or System V semaphores. Since system calls are toxic to performance, kernel developers sought to minimize kernel involvement in userspace synchronization by inventing futex.

So essentially this design is a refactor that encapsulates the thread-blocking functionality into a new lightweight syscall, and that blocking can easily be elided when the lock isn’t contended.

Here by “lightweight” I mean that the resources allocated for a futex should be minimal, so that thousands of futexes can live happily on a commodity setup. This requirement implicitly forbids using a file descriptor as the futex handle.

To meet all these requirements, futex provides an interface like this (in its minimal form):

int futex_wake(int *addr);           // wake up a thread blocked on addr
int futex_wait(int *addr, int val);  // block the caller, but only if *addr == val

Without doubt this is a piece of elegance. The idea of using a userspace address as a unique identifier isn’t rare in modern system code (see boost’s thread-local pointer). What’s brilliant is that futex further incorporates CAS-like (compare-and-swap) semantics into this address – the thread is blocked only if *addr still equals val. This turns out to be essential for enforcing the atomicity of the user-kernel transition:

static int a = NO_SIGNAL;
// properly wait for a signal
int cached_a = atomic_load(&a);
while (cached_a != HAS_SIGNAL) {
  futex_wait(&a, cached_a);  // returns immediately if a no longer equals cached_a
  cached_a = atomic_load(&a);
}
// incorrect version: the wakeup can slip in between the load and the wait
if (atomic_load(&a) != HAS_SIGNAL) {
  futex_wait_racy(&a);
}

In the incorrect version, this thread can block indefinitely if the signal arrives at a after the check but before the thread is parked. The conditional loop also defends against spurious wakeups, a discipline that propagates all the way up to how condition variables are used.
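To complete the picture, here is a minimal sketch of the signaling side, reusing the hypothetical names from the snippet above (a, HAS_SIGNAL, futex_wake), followed by the familiar condition-variable loop that the same discipline leads to (mu, cond and predicate are placeholders of mine):

// signaling side: publish the new state first, then wake the waiter
atomic_store(&a, HAS_SIGNAL);
futex_wake(&a);

// the same re-check-in-a-loop idiom, as it resurfaces with condition variables
pthread_mutex_lock(&mu);
while (!predicate)                  // guards against spurious wakeups and
  pthread_cond_wait(&cond, &mu);    // signals consumed by other threads
pthread_mutex_unlock(&mu);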

Technical Details

     ...
+----------+    tid1       tid2       tid3
| spinlock |      |          |          |
|----------|  ~~~~~~~~~  ---------  ~~~~~~~~~
|   key  --+->| addr1 |->| addr2 |->| addr1 |-> ...
+----------+  ~~~~~~~~~  ---------  ~~~~~~~~~
     ...

The kernel uses a fixed-size, separate-chaining, central hash table to store mappings between user-provided addresses and the threads waiting on them (the wait queue). A futex doesn’t occupy any resource until some thread starts blocking on it. What puzzles me a little is why they didn’t bother to maintain private wait queues for individual futexes; I guess the benefit of fewer key comparisons isn’t worth the complexity.
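For reference, the kernel-side structures look roughly like the sketch below (simplified from kernel/futex.c; the field names are approximate, not verbatim):

struct futex_hash_bucket {
  spinlock_t        lock;   // serializes queueing and waking within this bucket
  struct plist_head chain;  // futex_q entries of every waiter hashed here
};

struct futex_q {
  struct plist_node   list; // links this waiter into the bucket's chain
  struct task_struct *task; // the blocked thread
  union futex_key     key;  // futex identity: (mm, address) or (inode, offset)
};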

Unsurprisingly, this table becomes a performance bottleneck as machines scale up, and poses a challenge to system predictability. In a talk by a SUSE engineer, three major issues were identified.

Once the problems are defined, the solutions aren’t far behind. Below is a list of the major optimizations made over the years to mitigate these issues:

  1. FUTEX_PRIVATE_FLAG: fast path for single-process use. (2007 patch: new PRIVATE futexes)

    When a futex is declared private to the current process, which is true in most cases, there is no need for the kernel to generate and maintain a globally unique identifier (see the sketch after this list for how user space opts in).

    This new command managed to avoid: (a) Taking the mmap_sem semaphore, conflicting with other subsystems. (b) Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

    Benchmarks show 20% fewer instructions, 4x throughput in a 4-thread stress test, and 16x with 16 threads.

  2. Lockless get_futex_key() (a series of patches)

  3. Make hash table size adaptive (2014 commit: Increase hash table size for better performance)

    After this patch, the global hash table holds 256 * cpu_num buckets instead of a fixed 256.

  4. Lockless wakeups (2015 commit: Implement lockless wakeups)

    Internally mark that one or more tasks are to be awoken while still holding the bucket lock, then do the actual wakeups outside the critical section.

  5. Use MCS lock as the bucket spinlock (the 32-bit qspinlock was merged in 2015)

    Unlike a ticket spinlock, an MCS lock is acquired by spinning on a private word that is part of a larger linked queue, which avoids a lot of cache-line bouncing and remote-socket access.

  6. Use private hash table for each process (rfc)
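
As a concrete example of the first item, here is a minimal sketch of how user space opts into the private fast path. glibc exposes no futex() wrapper, so the raw syscall is used; the names word, wait_private and wake_private are mine:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static uint32_t word;

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
  // the real syscall takes up to six arguments; the rest are unused here
  return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void wait_private(uint32_t expected)
{
  // FUTEX_WAIT | FUTEX_PRIVATE_FLAG: keyed by (mm, address) only,
  // skipping the mmap_sem / inode bookkeeping described above
  futex(&word, FUTEX_WAIT_PRIVATE, expected);
}

void wake_private(void)
{
  futex(&word, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter via the same fast path
}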

Inside Condition Variable

I was going to take a closer look at how pthread (NPTL) implements condition variables with futexes, and unveil some of the mysteries that prompted me to write this post in the first place.

To look “closer”, I skimmed through the commits to NPTL’s condvar from 2002 to 2016. Even though the code skeleton stayed relatively stable over this period, the minor details kept changing. With the commit messages being chaotic on top of that, it’s almost impossible to reason about those changes.

This leaves me no choice but to read the implementation as it stood in 2016, simplified into the following pseudocode:

/** cv's data members **
 * lock: internal mutex
 * futex: futex word that waiters block on
 * wait_seq: sequence number for waiters
 * wakeup_seq: sequence number for wakeups signaled
 * woken_seq: sequence number for woken threads
 * broadcast_seq: sequence number for broadcasts signaled
 * mutex_ref: reference to the user mutex
 */
fn cond_wait(cv, mutex):
  lock(cv.lock);
  unlock(mutex);
  cv.wait_seq ++;
  cv.futex ++;
  cv.mutex_ref = mutex;
  let wakeup_seq_before = cv.wakeup_seq;
  let broadcast_seq_before = cv.broadcast_seq;
  loop {
    let futexval = cv.futex;
    unlock(cv.lock);
    let ret = FUTEX_WAIT(cv.futex, futexval);
    lock(cv.lock);
    if broadcast_seq_before != cv.broadcast_seq {
      break;
    }
    if cv.wakeup_seq != wakeup_seq_before && cv.wakeup_seq != cv.woken_seq {
      cv.woken_seq ++;
      break;
    }
    if ret == TIMEDOUT {
      cv.wakeup_seq ++;
      cv.woken_seq ++;
      break;
    }
  }
  unlock(cv.lock);
  lock(mutex);

fn cond_signal(cv):
  lock(cv.lock);
  if cv.wait_seq > cv.wakeup_seq {
    cv.wakeup_seq ++;
    cv.futex ++;
    if FUTEX_WAKE_OP(cv.futex, 1, cv.lock, 1, FUTEX_OP_CLEAR_WAKE_IF_GT_ONE).is_err() {
      FUTEX_WAKE(cv.futex, 1);
    }
  }
  unlock(cv.lock);

fn cond_broadcast(cv):
  lock(cv.lock);
  if cv.wait_seq > cv.wakeup_seq {
    cv.wakeup_seq = cv.wait_seq;
    cv.woken_seq = cv.wait_seq;
    cv.futex = cv.wait_seq * 2;
    let futexval = cv.futex;
    cv.broadcast_seq ++;
    unlock(cv.lock);
    if FUTEX_CMP_REQUEUE(cv.futex, 1, ALL, cv.mutex_ref, futexval).is_err() {
      FUTEX_WAKE(cv.futex, ALL);
    }
  } else {
    unlock(cv.lock);
  }

What interests me the most is how the condvar interacts with futexes. Contrary to what I expected, a condvar owns two futexes: the first is cv.lock, used to synchronize internal modifications; the second is cv.futex, used to park threads. Besides these, the condvar also interacts with the user-provided mutex, which has one futex inside.

It is important for the condvar to manage its own lock, because cond_signal can be called without holding the user mutex, as stated by POSIX:

The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().

And a condvar can even be paired with multiple mutexes (though not at the same time); quoting POSIX again:

The effect of using more than one mutex for concurrent pthread_cond_wait() or pthread_cond_timedwait() operations on the same condition variable is undefined; that is, a condition variable becomes bound to a unique mutex when a thread waits on the condition variable, and this (dynamic) binding ends when the wait returns.

It’s also important to notice that multiple options are available to send out wakeups:

cond_signal first attempts FUTEX_WAKE_OP before falling back to the traditional FUTEX_WAKE. This command was introduced to the kernel in 2005 specifically to optimize cond_signal: it can modify and conditionally wake up a second futex. The optimization targets the “hurry up and wait” situation, where “this waiter wakes up and after a few instructions it attempts to acquire the cv internal lock, but that lock is still held by the thread calling pthread_cond_signal”. It works by moving the whole unlock procedure into kernel space:

// fn cond_signal [userspace]                       // fn cond_wait [userspace]
FUTEX_WAKE_OP(cv.futex, 1, cv.lock, 1);             FUTEX_WAIT(cv.futex);
  // [kernel]                                       /* blocked */
  let (key1, key2) = (key(cv.futex), key(cv.lock)); /*         */
  spin_lock(min(key1, key2));                       /*         */
  spin_lock(max(key1, key2));                       /*         */
  let ret = OP_UNLOCK(cv.lock);                     /*         */
  wake(key1, 1);  // ------------------------------>
                                                    lock(cv.lock);  // already unlocked
  if ret {
    wake(key2, 1);  // ---------------------------->  in case the lock is contended
  }
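
For the curious, the raw call behind this diagram looks roughly like the sketch below. The op encoding comes from linux/futex.h; my reading of glibc’s FUTEX_OP_CLEAR_WAKE_IF_GT_ONE is “store 0 into the second futex and, if its old value was greater than 1 (i.e. contended), wake one of its waiters”, so treat that mapping as an assumption:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// "set *lock = 0; if its old value was > 1, also wake a waiter on lock"
#define CLEAR_WAKE_IF_GT_ONE FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1)

static long signal_wake_op(int *futex_word, int *lock)
{
  // wake 1 waiter on futex_word; the 4th argument (nr_wake2 = 1) rides in
  // the slot normally used for the timeout pointer
  return syscall(SYS_futex, futex_word, FUTEX_WAKE_OP, 1, (void *)1UL,
                 lock, CLEAR_WAKE_IF_GT_ONE);
}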

One thing I haven’t figured out, though, is why calling FUTEX_WAKE(1) without holding the lock would be racy; that seems like a straightforward fix to this issue.

Similarly, cond_broadcast tries to use FUTEX_CMP_REQUEUE, which is the kernel’s implementation of “wait morphing”. The wait queue of cv.futex is moved directly onto the user’s mutex, so that when the first woken thread finishes its work (by releasing the mutex), the rest of the threads can be woken up one by one. This way we avoid the “thundering herd” problem, where multiple threads wake up at the same time only to compete for the same mutex.

// fn cond_wait [userspace]
pthread_mutex_unlock_usercnt(mutex, 0);
FUTEX_WAIT(cv.futex);
// blocked //                         // fn cond_broadcast [userspace]
                                      FUTEX_CMP_REQUEUE(cv.futex, cv.mutex_ref);
// not the first in line <------------+-----------> // the first in line, woken
/* queued to mutex */                               lock(mutex);
FUTEX_WAIT(mutex);                                  /* do things with mutex */
/* blocked */                                       unlock(mutex);
/*         */                                         // examine if mutex is contended
/*         */                                         FUTEX_WAKE(mutex);  // -- (1)
// <--------------------------------------------------+
lock(cv.lock);
/* modify internals */
unlock(cv.lock);
pthread_mutex_cond_lock(mutex);  // --- (2)

Notice that instead of the normal mutex functions, the condvar uses pthread_mutex_unlock_usercnt and pthread_mutex_cond_lock, which avoid changing the mutex’s user count. When there are multiple waiters, the mutex is guaranteed to stay in the contended state, so that (1) can happen and (2) won’t park the woken thread again.
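
To make “contended state” concrete, here is a minimal three-state futex mutex (0 = unlocked, 1 = locked, 2 = locked with waiters) in the spirit of Drepper’s “Futexes are tricky” from the reference list; it is a sketch of the general technique, not glibc’s actual code:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } futex_mutex;

static void futex_mutex_lock(futex_mutex *m)
{
  int expected = 0;
  if (atomic_compare_exchange_strong(&m->state, &expected, 1))
    return;  // fast path: 0 -> 1, no syscall
  // slow path: conservatively mark the lock contended, then sleep;
  // an exchange returning 0 means we took the lock while marking it
  while (atomic_exchange(&m->state, 2) != 0)
    syscall(SYS_futex, &m->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
}

static void futex_mutex_unlock(futex_mutex *m)
{
  // only issue the wakeup syscall if someone may actually be waiting
  if (atomic_exchange(&m->state, 0) == 2)
    syscall(SYS_futex, &m->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

In this scheme a woken waiter always re-marks the word as contended, so unlock keeps issuing wakeups as long as waiters remain; that is the kind of invariant the requeued threads rely on.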

What’s More

We’ve seen how futex is driven to its limits by pthread: multiple syscall commands were added to the kernel for exactly that purpose. But in 2016, in order to satisfy a stronger ordering requirement, a ground-up redesign of the condvar was proposed and merged into the library, abandoning these eye-catching optimizations from the decade before. You are welcome to revisit the debates on this matter.

Futex blocks threads, which is powerful yet dangerous. That’s why there is much more complexity behind its innocent looks. To go deeper, I found these kernel writeups to be the most informative:

PI-futex to avoid priority inversion, which relies on RT-mutex (design) internally.

Robust futex to handle threads that terminate while holding a lock.

See Also

Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux - Hubertus Franke & Rusty Russell, 2002 (pdf)

Futexes are tricky - Ulrich Drepper, 2005, revised 2011 (pdf)

Requeue-PI: Making Glibc Condvars PI-Aware - Darren Hart & Dinakar Guniguntalay, 2009 (pdf)

A futex overview and update - Darren Hart, 2009

Priority Inheritance on Condition Variables - Tommaso Cucinotta, 2013

Futex Scaling in the Linux Kernel - Davidlohr Bueso, 2014 (youtube, slides)

An Overview of Kernel Lock Improvements - Davidlohr Bueso & Scott Norton, 2014 (slides.pdf)

In pursuit of faster futexes - Neil Brown, 2016

The path of the private futex - Sebastian A. Siewior, 2016 (slides.pdf)

Futex Cheat Sheet - Lockless Inc.

Mutexes and Condition Variables using Futexes - Lockless Inc.

Condition variable with futex - Rémi Denis-Courmont, 2016