Work-Stealing from First Principles

TL;DR

Work-stealing is the scheduling algorithm behind most serious parallel runtimes, from Tokio and Rayon in Rust to Go's goroutine scheduler and Java's ForkJoinPool. I learned the algorithm by implementing the Chase-Lev variant from scratch in Rust, starting from the load distribution problem, working through the deque design and lock-free synchronization with CAS, and ending at CPU-level memory ordering semantics. This post walks through that process. The implementation is on GitHub.

Work-stealing shows up in most serious parallel runtimes: Tokio, Rayon, Go's goroutine scheduler, Java's ForkJoinPool, Intel TBB. The concept is straightforward (idle workers take tasks from busy ones), but the implementation requires careful coordination across threads, all the way down to CPU memory ordering semantics.

I wanted to understand what that coordination actually looks like at the instruction level, so I implemented the Chase-Lev algorithm from scratch in Rust. This post walks through the ideas you need to stack on top of each other to arrive at a working implementation, starting from the distribution problem, working through the deque design, and ending at CAS and memory ordering.

The distribution problem

Imagine you are a factory manager. To speed up production, you buy five identical machines. Today you have 100 items to process.

The naive approach

You dump all 100 items onto Machine 1 and let the other four sit idle. Machine 1 is working overtime while Machines 2 through 5 collect dust. You paid for five machines but only one is doing anything useful. That is underutilization.

The "fair split" approach

You divide the items evenly: 20 per machine. That works until task durations vary. Some items are simple and take 1 minute, others are complex and take 10 minutes.

Machine 1 gets unlucky. It receives mostly complex items and takes 180 minutes to finish. Meanwhile, Machine 4 got all the easy items and finishes in 20 minutes, then sits idle for 160 minutes waiting for Machine 1 to catch up.

You divided the work evenly by count, but the effort was lopsided. The result is the same: underutilization.

Static vs dynamic distribution

Both approaches above are static distribution. You decide how to split the work before any machine starts processing. The assignment is fixed. Once Machine 4 finishes its batch, it has no way to help Machine 1 with the remaining work.

Static distribution works when every task takes roughly the same amount of time. In practice, task durations are unpredictable. Network requests, database queries, recursive computations: you rarely know in advance how long each one will take.

Dynamic distribution takes a different approach. Instead of assigning all the work upfront, you let workers grab tasks as they become available. When a worker finishes, it takes the next available task. Work-stealing is one specific strategy for dynamic distribution, and a remarkably efficient one: Blumofe and Leiserson proved that randomized work-stealing keeps expected running time within a constant factor of optimal.

The goal: no machine sits idle while work remains. If Machine 4 finishes early, it should grab some of Machine 1's remaining items. The work flows from where there is too much to where there is none.

How work-stealing works

Every worker has its own local queue of tasks. When a worker finishes everything in its queue, instead of sitting idle, it looks at another worker's queue and steals a task from it.


Idle workers steal from busy workers. Every worker stays utilized, and tasks redistribute across the system.

Why each worker gets its own queue

You might wonder: why not use a single shared queue that all workers pull from? That would solve the distribution problem too.

The answer is contention. If all workers pull from one shared queue, they constantly fight over access to it. Every pop requires synchronization. With 16 workers all hammering the same queue, the queue itself becomes the bottleneck.
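To make the contrast concrete, here is roughly what the shared-queue alternative looks like (a sketch for contrast, not code from this post's repo). Every dequeue goes through one mutex, so workers serialize on it:

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

type Task = Box<dyn FnOnce() + Send>;

// One queue, one lock, shared by every worker.
let queue: Arc<Mutex<VecDeque<Task>>> = Arc::new(Mutex::new(VecDeque::new()));

for _ in 0..16 {
    let queue = Arc::clone(&queue);
    thread::spawn(move || {
        loop {
            // Every single pop takes the lock. With 16 workers,
            // most of them are waiting here, not working.
            let task = queue.lock().unwrap().pop_front();
            match task {
                Some(task) => task(),
                None => break,
            }
        }
    });
}

Note that the task runs after the lock guard is dropped, which is as good as this design gets. Even so, every dequeue funnels through the same mutex.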


With work-stealing, each worker has its own queue. In the common case, a worker just pops from its own queue with zero contention. Stealing only happens when a worker runs out of work, which is the uncommon case. Workers only pay synchronization costs during steals.


The thread pool structure

In a work-stealing thread pool, a fixed number of threads each run in a loop. Every thread owns a deque (double-ended queue) that holds its tasks. On each iteration, a thread pops from its own deque. If the deque is empty, it scans other threads' deques for something to steal. If nobody has work, it yields and tries again.

Here is the minimal pseudocode:

let num_of_threads = 4
let deques = create num_of_threads deques, each with capacity 1024

for i in 0..num_of_threads:
    spawn_thread(|| {
        loop {
            if shutdown -> stop

            let task = deques[i].pop()       // try my own deque first
            if task -> task()                 // got work, run it
            else {
                for j in 0..num_of_threads {
                    if i == j -> continue     // skip myself

                    let task = deques[j].steal()
                    if task -> {
                        task()               // stole work, run it
                        break
                    }
                }
            }
        }
    })

Each worker first tries pop() on its own deque (fast, no contention), and only falls back to steal() from others when its own deque is empty.

The work-stealing deque

The deque is the core data structure. It is a double-ended queue with a twist: only one thread (the owner) can access one end, while other threads (the thieves) can only access the other end.

pub struct WorkStealDeque<T> {
    bottom: usize,          // the owner's end
    top: usize,             // the thieves' end
    buffer: Vec<Option<T>>, // circular buffer holding tasks
}

impl<T> WorkStealDeque<T> {
    pub fn new(capacity) -> create empty deque with given capacity
    pub fn push(item)    -> owner adds a task (writes to bottom)
    pub fn pop()         -> owner takes a task (reads from bottom)
    pub fn steal()       -> thief takes a task (reads from top)
}

The owner interacts with the bottom end. To add work, it writes an item at bottom and increments the index. To take work, it decrements bottom and reads from that position. Thieves interact with the opposite end: they read the item at top and increment the index. The owner is the only thread that ever writes to bottom, while any worker can call steal.

Why opposite ends?

This is the most important design decision in the entire algorithm. The owner works from the bottom, thieves steal from the top.

top                              bottom
 |                                 |
 v                                 v
[task1, task2, task3, ..., task99, task100]
 ^                                  ^
 thieves steal here                 owner pushes and pops here

If both the owner and thieves operated on the same end, every operation would require synchronization. They would constantly collide. By using opposite ends, the owner and thieves only conflict when there is exactly one item left in the deque. In all other cases, they access completely different parts of the data structure. The owner can push and pop at full speed without any synchronization overhead in the common case.

This separation is what makes work-stealing lock-free. No thread ever blocks waiting for another, and the system as a whole is always making progress, because threads rarely touch the same memory.

The circular buffer

The deque stores items in a circular buffer. When the bottom index reaches the end of the array, it wraps around using the modulo operator. The bottom and top indices keep growing forever, and modulo maps them back into valid array positions.

buffer capacity = 4

push(A): buffer[0 % 4] = A, bottom = 1
push(B): buffer[1 % 4] = B, bottom = 2
push(C): buffer[2 % 4] = C, bottom = 3
push(D): buffer[3 % 4] = D, bottom = 4

// a thief steals A and B, so top moves to 2
// slots 0 and 1 are now free

push(E): buffer[4 % 4] = buffer[0] = E, bottom = 5   // wraps around!
push(F): buffer[5 % 4] = buffer[1] = F, bottom = 6   // reuses slot 1

A small fixed-size buffer can handle an endless stream of tasks, as long as the number of active tasks at any moment does not exceed the buffer's capacity.
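One common refinement, used by many production deques (though not required by the algorithm and not shown in this post's code), is to make the capacity a power of two so the modulo becomes a cheap bit mask:

let capacity = 1024;      // power of two
let mask = capacity - 1;

// For power-of-two capacities these are equivalent,
// and the mask avoids an integer division on the hot path
assert_eq!(5 % capacity, 5 & mask);
assert_eq!(1027 % capacity, 1027 & mask);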

Compare and swap

Everything above works perfectly in a single-threaded world. The moment multiple threads access the deque at the same time, things break.

Here is what the non-thread-safe version looks like:

pub fn push(item) {
    let index = bottom % buffer.len()
    buffer[index] = Some(item)
    bottom += 1
}

pub fn pop() -> Option<T> {
    if bottom == top -> return None

    bottom -= 1
    let index = bottom % buffer.len()
    buffer[index].take()
}

pub fn steal() -> Option<T> {
    if top >= bottom -> return None

    let index = top % buffer.len()
    top += 1
    buffer[index].take()
}

Looks clean. Now simulate what happens when two thieves try to steal at the same time:

Initial state: top = 0, bottom = 3, buffer = [taskA, taskB, taskC]

Thief 1:                          Thief 2:
  reads top = 0                     reads top = 0
  reads buffer[0] = taskA          reads buffer[0] = taskA
  sets top = 1                      sets top = 1    <-- overwrites!

Both thieves read top = 0 before either one updated it. Both grabbed taskA. When both set top = 1, one overwrites the other. Result: taskA was stolen twice, and taskB was skipped entirely. This will happen in practice. Modern CPUs run billions of operations per second, and multiple threads will absolutely hit the window between reading and writing top.

What can go wrong

The first problem is duplicate steals. Two thieves read the same top value and both take the same task. One task gets executed twice. If that task has side effects (writing to a file, sending a network request), you corrupt data.

The second is lost tasks. When two thieves both write top += 1, the second write overwrites the first. The counter only advances by 1 instead of 2, which means a task gets skipped and sits in the buffer forever, never executed.

Then there is the owner vs thief race on the last item. When only one item remains, top and bottom point to the same slot. The owner calls pop and the thief calls steal at the same time, both reaching for the same item. Without coordination, both could read it, or the item could be corrupted mid-read.

The root cause

All of these problems share the same root cause. The operation "read a value, make a decision, then update the value" consists of multiple steps, and another thread can interfere between them.

Step 1: read top        -> top is 0
                            <- another thread changes top to 1
Step 2: use top          -> we still think top is 0 (stale!)
Step 3: write top += 1  -> we write 1, but it should be 2

We need a way to make "read and conditionally update" happen as one indivisible operation.

What CAS does

CAS (Compare and Swap) is a CPU instruction that does exactly this. It checks if the value at a memory location equals what you expect. If it does, it replaces the value with a new one. If it does not, it leaves the value untouched. Either way, it tells you whether the swap happened. The entire sequence is one atomic operation.

compare_and_swap(location, expected_value, new_value) -> success or failure

If top is still 0, the swap happens and returns success. If another thread already changed top to something else, the swap does not happen and returns failure. The check and the update are one atomic operation.
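In Rust, CAS is exposed as compare_exchange on the atomic types. A single-threaded demonstration of the semantics:

use std::sync::atomic::{AtomicUsize, Ordering};

let top = AtomicUsize::new(0);

// Expect 0, swap in 1: succeeds, and Ok carries the old value
assert_eq!(
    top.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst),
    Ok(0)
);

// Expect 0 again, but top is now 1: fails,
// and Err carries the value actually found
assert_eq!(
    top.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst),
    Err(1)
);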

CAS in action

Here is the same two-thief scenario replayed with CAS:

Thief 1:                              Thief 2:
  reads top = 0                         reads top = 0
  CAS(top: 0 -> 1) succeeds             CAS(top: 0 -> 1) fails, top is already 1
  takes taskA                           backs off, returns empty

Thief 1 wins and takes taskA. Thief 2's CAS fails because top is no longer 0. Thief 2 knows it lost the race and backs off. No duplicates, no lost tasks.

The same principle applies to the owner vs thief race on the last item. When only one item remains, both the owner's pop and the thief's steal use CAS on top. Whoever's CAS succeeds gets the item. The other sees the failure and returns empty.

Lock-free vs lock-based

The alternative to CAS is using a lock (like a mutex). With a lock, only one thread can access the deque at a time. Everyone else waits in line. If 15 threads are waiting for 1 thread to release the lock, you have 15 idle cores. That defeats the entire purpose of parallelism.

With CAS, every thread attempts the operation simultaneously. The losers find out instantly and move on. All cores stay busy. Instead of preventing conflicts (locks), you detect and resolve them (CAS).
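Here is what "detect and resolve" looks like in code: a generic CAS retry loop (a sketch, not the deque itself). A loser reloads the fresh value and tries again instead of blocking:

use std::sync::atomic::{AtomicUsize, Ordering};

fn increment(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange(
            current, current + 1, Ordering::SeqCst, Ordering::Relaxed,
        ) {
            Ok(_) => return,                  // won the race, done
            Err(actual) => current = actual,  // lost: retry with the fresh value
        }
    }
}

The deque's steal is even more forgiving: a losing thief does not retry the same slot, it just returns empty and looks elsewhere.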

Memory ordering

Even with CAS, there is a subtle problem. Modern CPUs and compilers reorder instructions for performance. You write code in a specific order, but the CPU might execute it differently. For single-threaded code, this is invisible because the CPU guarantees the result looks the same. For multi-threaded code, it can be catastrophic.

The visibility problem

Consider the push operation. You expect "write the data" to happen before "publish the new index." But the CPU might reorder them. A thief could see the updated bottom before the data is actually written. The thief reads from the buffer and gets garbage.

Ordering guarantees

To prevent this, we use memory ordering annotations on atomic operations. The simplest is Relaxed: just do the atomic read or write, with no ordering constraints. This works when no other thread depends on the order of surrounding operations. For example, the owner reading its own bottom value, since nobody else writes to bottom.

When ordering matters, we use Release and Acquire as a pair. A Release store guarantees that everything the thread wrote before this point is visible to other threads before they see this write. The push operation uses Release when storing the new bottom value, so the buffer write is visible before the index update.

On the reading side, an Acquire load guarantees that the thread can see everything the writer wrote before their Release store. The steal operation uses Acquire when loading bottom, so it sees the data the owner wrote into the buffer.

Think of Release and Acquire as a handoff:

Owner (push):                      Thief (steal):
  buffer[i] = task    // write
  store bottom (Release) --------->  load bottom (Acquire)
                        handoff       // guaranteed to see buffer[i]
                                      read buffer[i] -> gets the task

Without this handoff, the thief might see the new bottom value but read stale or empty buffer contents. Release/Acquire guarantees that when the thief sees the updated index, all the data behind that index is already there.
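The same handoff, reduced to its smallest runnable example: a flag guarding a value, analogous to bottom guarding the buffer (a standalone sketch, not code from the repo):

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Writer plays the owner in push(): write data, then publish the flag
let writer = thread::spawn(|| {
    DATA.store(42, Ordering::Relaxed);    // the "buffer write"
    READY.store(true, Ordering::Release); // the "index update"
});

// Reader plays the thief in steal(): load the flag, then read the data
let reader = thread::spawn(|| {
    if READY.load(Ordering::Acquire) {
        // If Acquire observed the Release store, the data write
        // is guaranteed to be visible: this can only be 42
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    }
});

writer.join().unwrap();
reader.join().unwrap();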


The Rust implementation

Here is what these ideas look like as real, runnable Rust. The implementation lives in two files: queue.rs (the deque) and pool.rs (the thread pool).
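The pseudocode struct from earlier becomes real Rust with atomic indices and UnsafeCell slots. The exact definition is not shown in this post, so here is a reconstruction consistent with the method bodies below:

use std::cell::UnsafeCell;
use std::sync::atomic::AtomicUsize;

pub struct WorkStealDeque<T> {
    bottom: AtomicUsize,                // the owner's end
    top: AtomicUsize,                   // the thieves' end
    buffer: Vec<UnsafeCell<Option<T>>>, // circular buffer of slots
}

// Safety: a slot is only ever touched by the thread that claimed it
// through the atomic index protocol described in this post.
unsafe impl<T: Send> Sync for WorkStealDeque<T> {}

impl<T> WorkStealDeque<T> {
    pub fn new(capacity: usize) -> Self {
        Self {
            bottom: AtomicUsize::new(0),
            top: AtomicUsize::new(0),
            buffer: (0..capacity).map(|_| UnsafeCell::new(None)).collect(),
        }
    }
}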

push: owner adds work

The owner writes data into the buffer, then publishes the new bottom index with Release ordering. Any thief that later loads this value with Acquire is guaranteed to see the buffer write.

pub fn push(&self, item: T) {
    let bottom = self.bottom.load(Ordering::Relaxed);
    let index = bottom % self.buffer.len();

    unsafe {
        *self.buffer[index].get() = Some(item);
    }

    // Release: guarantees the buffer write above is visible
    // to any thread that later reads this bottom value
    self.bottom.store(bottom + 1, Ordering::Release);
}

steal: thief takes work

A thief loads both indices with Acquire, then uses CAS to claim the slot before touching the item. If the CAS fails because another thief already incremented top, this thief backs off and returns empty without ever reading the slot.

pub fn steal(&self) -> Option<T> {
    // Acquire: pairs with the Release in push(),
    // guarantees we see the data behind these indices
    let top = self.top.load(Ordering::Acquire);
    let bottom = self.bottom.load(Ordering::Acquire);

    if top >= bottom {
        return None;
    }

    // CAS first: only one thief wins the slot. Claiming before
    // reading matters because take() removes the item; if a losing
    // thief read the slot first, a task could be taken and dropped.
    if self
        .top
        .compare_exchange(top, top + 1, Ordering::SeqCst, Ordering::SeqCst)
        .is_err()
    {
        return None;
    }

    let index = top % self.buffer.len();
    unsafe { (*self.buffer[index].get()).take() }
}

pop: owner takes work

The most complex operation. When multiple items remain, there is no conflict with thieves, so the owner just takes the item. When exactly one item is left, both the owner and a thief might reach for it, so the owner claims it with CAS on top before reading it. If the deque was already empty, it restores bottom and returns.

pub fn pop(&self) -> Option<T> {
    let bottom = self.bottom.load(Ordering::Relaxed);

    if bottom == 0 {
        return None;
    }

    // Decrement first: signals to thieves "I'm claiming this slot"
    self.bottom.store(bottom - 1, Ordering::SeqCst);
    // SeqCst here too: the decrement above must be ordered before
    // this read of top, so owner and thieves agree on who saw what
    let top = self.top.load(Ordering::SeqCst);
    let new_bottom = bottom - 1;

    if new_bottom > top {
        // Multiple items remain. No conflict with thieves possible.
        unsafe { (*self.buffer[new_bottom % self.buffer.len()].get()).take() }
    } else if new_bottom == top {
        // Last item. Owner and thief might both reach for it.
        // Claim it with CAS on top; only the winner touches the slot.
        let result = match self.top.compare_exchange(
            top, top + 1, Ordering::SeqCst, Ordering::SeqCst
        ) {
            Ok(_) => unsafe {
                // Owner won: no thief will read this slot now
                (*self.buffer[new_bottom % self.buffer.len()].get()).take()
            },
            Err(_) => None,   // a thief already claimed it
        };
        self.bottom.store(top + 1, Ordering::Relaxed);
        result
    } else {
        // Deque was already empty before we decremented.
        self.bottom.store(top, Ordering::Relaxed);
        None
    }
}

The worker loop

The thread pool spawns workers that each own a deque. Each worker tries its own deque first. If that is empty, it scans other deques looking for something to steal. When there is no work anywhere, it yields the thread and loops back.

for i in 0..num_threads {
    let deques = Arc::clone(&deques);
    let shutdown = Arc::clone(&shutdown);

    thread::spawn(move || {
        loop {
            if shutdown.load(Ordering::Relaxed) {
                return;
            }

            // Try own deque first (fast, no contention)
            if let Some(task) = deques[i].pop() {
                task();
            } else {
                // Own deque empty. Try stealing from others.
                let mut stole = false;
                for j in 0..num_threads {
                    if j == i { continue; }

                    if let Some(task) = deques[j].steal() {
                        task();
                        stole = true;
                        break;
                    }
                }
                if !stole {
                    // Nothing anywhere. Yield and retry.
                    thread::yield_now();
                }
            }
        }
    });
}
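To see the pool redistribute load, you can seed a single worker's deque before the workers start and let the others steal. This is a sketch of the wiring; the repo's actual setup and shutdown surface may differ:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

type Task = Box<dyn FnOnce() + Send>;

let num_threads = 4;
let deques: Arc<Vec<WorkStealDeque<Task>>> =
    Arc::new((0..num_threads).map(|_| WorkStealDeque::new(1024)).collect());
let shutdown = Arc::new(AtomicBool::new(false));

// Dump everything on worker 0, like the unlucky Machine 1.
// This runs before the workers spawn, so the main thread is
// still the only one touching deque 0's bottom end.
for n in 0..100 {
    deques[0].push(Box::new(move || {
        println!("task {n} ran");
    }));
}

// ... spawn the worker loop shown above, then later:
shutdown.store(true, Ordering::Relaxed);

Once the workers start, stealing spreads those 100 tasks across all four threads with no central coordinator.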

Five ideas, one system

Stepping back, here is how all the pieces connect. The problem is load distribution: how do you split work across multiple workers without leaving some idle? The strategy is work-stealing, where each worker has its own deque and idle workers take tasks from busy ones. The deque makes this efficient by letting the owner and thieves operate on opposite ends, so they almost never touch the same memory. CAS resolves the rare races that do happen, without locks. And memory ordering (Release/Acquire) guarantees that when a thief sees updated indices, the data behind those indices is already visible.

These five ideas, layered on top of each other, give you a lock-free work-stealing system. Simple in concept, carefully engineered at every level to avoid the bottlenecks it was designed to solve.

Where you will find this in production

This is the algorithm described by David Chase and Yossi Lev in their 2005 paper "Dynamic Circular Work-Stealing Deque." It forms the foundation of parallel runtimes used in production today:

In Rust, Tokio uses work-stealing across its async task scheduler, and Rayon uses it for data-parallel iterators. Go's goroutine scheduler steals goroutines across OS threads. Java's ForkJoinPool is a direct implementation of the Chase-Lev paper. Intel Threading Building Blocks uses the same approach for its task scheduler.

The full Rust implementation is on GitHub. Clone it, run the tests, read the code alongside this post.

git clone https://github.com/4RSIM3R/worksteal.git
cd worksteal
cargo test
cargo run
