continuation passing style (idea) by pfft

Continuation passing style is a programming idiom possible in functional languages like Scheme or ML which support functions as first-class objects and take care to implement tail recursion efficiently. The idea is to rewrite recursive functions to represent "what to do next" after returning (the continuation) explicitly as a function object.

Consider the classic example of a recursive function, the factorial (all the program examples are in Standard ML):

fun fact 0 = 1
|   fact n = n * fact(n-1)

After the recursive function call fact (n-1) has returned, we need to continue by multiply the return value by n, and then we are done. If we like, we could think of this continuation as a function fn x => n*x that is applied to the return value. In computing say the factorial of 5, the recursive calls stack up, and the function call fact(1) is waited on by a function that takes the return value, multiplies it by 2 and passes it on to a function that multiplies it by 3 and passes it on to...

On a less abstract level, one can consider what data the implementation needs to save away when making the recursive call. It certainly needs to remember the value of n, as well as some notion of where to continue when the function returns. The typical implementation is to push the values of all local variables as well as a return address onto the call stack when making a function call. But consider by contrast the tail recursive variant:

fun ifact (0, a) = a
|   ifact (n, a) = ifact(n-1, n*a)

Here there is nothing to be done once the recursive call has finished - the returned value can be passed on directly. Indeed, programming language implementations take care not to grow the call stack in cases like this. When the final call to ifact(0,a) has finished there is no unwinding through the recursive calls, the result is directly given back to the function that originally called ifact

The idea of the continuation passing style is to represent the continuation "waiting for" a function explicitly as a function object. More specifically we add an extra argument (customarily called k) to each function and instead of directly returning a value, we call k with the return value in a tail call position. All recursive calls are made in tail positions. Recursive calls that would ordinarily be tail calls correspond to passing on k unchanged, while "properly" recursive calls build a new continuation function that does something to its argument and then passes the result to the original continuation. Here is the result of applying this idea to fact:

fun cfact (0, k) = k(1)
|   cfact (n, k) = cfact(n-1, fn x => k(n*x))

In order to find the factorial of 5, one would write cfact(5, fn x=>x), giving the identity function fn x=>x as the second argument. This in turn calls cfact(4, fn x=>k(5*x)), etc, until cfact(0,k) is reached. This finally calls its continuation, which multiplies its argument by 1 and calls the continuation built by cfact(2,k), which multiplies the argument by 2 and ... until the continuation built by cfact(5,k) is called, multiplies its argument and passes it to the identity function, which finally returns it.

Coming from a C background one might expect this to be quite unpractical because the entire sequence makes 10 recursive calls. But note that all of these are in tail positions, so there is in fact no unwinding of the stack to be done once our identity function has returned. All the function calling was made in constant space, and the recursive nature of the function is encoded only in the five function objects that were created. Recall that when an implementation is creating a function object what it must allocate is a closure storing the address of the function code and the value of its free variables, in this case n and k. This is essentially the same as the information that was allocated on the call stack with ordinary recursion, but we can now manipulate it as any other object.

As as another simple example to clarify the idea, here is a function to sum all the leaves in a binary tree of integers, first written in the normal way and then transformed into continuation passing style.

datatype tree = Leaf of int | Branch of (tree*tree)

fun sum(Leaf(n)) = n
|   sum(Branch(t1,t2)) = sum(t1) + sum(t2)

fun csum(Leaf(n), k) = k(n)
|   csum(Branch(t1,t2), k) = csum(t1, fn x => csum(t2, fn y => k(x+y)))

As the reader may have noticed, the translation is fairly mechanical and can be specified by a few rules. This is known as the CPS-transform. Some compilers (following the Rabbit Scheme compiler written at MIT in the 70s) translate the input program into CPS and use that as their internal representation. Having the "return addresses" made explicit in this way makes compilation easier in some ways, and there is in fact a connection between CPS and single static assignment, another popular internal representation. But I will not comment further on this, mainly because I don't know enough about compilers.

So far, it might be clear that it is possible to program in continuation passing style but not why it would be desirable. One reason is that by treating continuations as objects in the language, one might do things like storing them in data structures, "returning" through them more than once, or discard the continuation and immediately abort a deep recursion. Another reason is that it becomes natural to write functions that can return values to two different contexts. For example, many functions can fail, which is often dealt with either by having them return an error value to the caller or by throwing an exception. In continuation passing style, it is natural to have the function take two continuation arguments, one to be called in case of success and one for failure. Finally, for some problems it just seems natural to think explicitly in terms of "possible futures" of a computation. It should be noted that unlike the intermediate code used by compilers, programs written by human programmers seldom do away completely with the call stack; instead they mix explicit continuations and normal recursion as convenient.

The rest of this write-up consists of three examples of how continuations can be used to help express various tasks.

Regular expression matching

This is a rather contrived toy example: the algorithm uses a naïve brute-force search which gives the right result, but is extremely inefficient in some cases. (Matching the expression a*a*a*a*a*...a against the string aaaaa...a, where there are n as, will take time Θ(n!).) To simplify the presentation, it requires the regular expression to be given as a parse tree and the string as a list of characters. For example match_regex(Concat(Star(Char #"a"), Char #"b"), explode "aaab") returns true, because a*b matches aaab. Programs like these are not completely useless: for example it is a good idea to test a more efficient algorithm by making sure that it returns the same answer as the short and "obviously correct" solution in various tricky cases.

The idea here is that in order to match a regular expression e₁e₂ against a string, the program must first match e₁ against the beginning of the string and then e₂ against the rest. In other words, e₁ is executing in a context waiting to see how much will be left over. Now there might be many ways to match e₁, which will consume more or less of the string, so we need some way to backtrack and try a different way of matching it. The solution below uses a function match(r,s,k) which expects a regex r, a string s, and a continuation function k representing the rest of the computation. It will return true if it can match r against a prefix of s such that k(rest) returns true, where rest is the remainder of the string. (One way of looking at this function is to say that it has two continuations: the explicitly manipulated k, and the implicit failure continuation invoked by returning false to the caller). Finally, the function match_regex executes match with a continuation which verifies that all of the string was "used up" by the regex. Note that andalso and orelse are the short-circuiting boolean operators in ML, analogous to && and || in C.

datatype regex = Char of char
               | Choise of (regex*regex)
               | Concat of (regex*regex)
               | Star of regex

fun match(Char a, b::s, k) = (a=b) andalso k(s)
|   match(Char a, [], k) = false
|   match(Concat(r1,r2), s, k) = match(r1, s, fn rest=>match(r2,rest,k))
|   match(Choise(r1,r2), s, k) = match(r1,s,k) orelse match(r2,s,k)
|   match(Star r, s, k) = match(r, s, fn rest=>match(Star r,rest,k)) orelse k(s)

fun match_regex(r,s) = match(r,s,fn rest => rest=[])

Unification and Prolog interpretation

Unification is an problem which crops up at various places in computer science: it is used in automatic theorem-proving for first-order logic, type-inference for ML-like languages, parsing of natural languages using feature grammars, and is the basic method for "function calls" in the Prolog programming language.

A term is defined as a constant, a variable, or a tuple of terms. A substitution is a mapping from variables to terms. The unification problem is to take two terms and find a substitution that makes them equal. For example (x, (3,4)) and (2, y) unify to (2, (3, 4)) by the substitution {x→3, y→(3,4)}. A pair of terms can fail to unify in two different ways, "clashes" and "loops". For example (3, x) and (4, 5) clash, because there is no substitution that can make 3 equal 4. Likewise 42 and (6,7) clash because a constant can never equal a tuple. An example of a loop is (y,y) and (x, (3,x)). A successful unification would need to find a value for x such that x = (3,x), but no finite term can satisfy that (we would in effect be constructing the term (3,(3,(3, ...)))).

The naïve algorithm simply traverses the two terms recursively, and as it goes along it builds up a substitution which is kept in a separate data structure such as a hash table or association list. If, when the recursion finishes, the two terms turned out to be unifiable we can then apply the substitution to one of them to get the unified result. This is slow!

A nicer way is to represent variables as reference cells (boxed pointers) pointing to subterms. If the pointer is non-NULL, we ignore the cell and treat it exactly as if it was just the subterm it contains, but if it is NULL it represents a variable which can be given a value by assigning to it. We can then just walk through the terms once, making assignments to the cells as we go along. (Of course, care must be taken to represent e.g. the term (x, (y,x)) using the same reference cell for both xs, so that an assignment to one is automatically mirrored in the other). This is a bit like the idea in the union-find algorithm. Here is an ML datatype that can be used to represent terms, together with a function to pretty-print them in a Lisp-like human-readable form:

datatype term = Cnst of string
              | Tuple of (term list)
              | Var of term option ref

(* pretty print a single term *)
fun term_to_string (Cnst s) = s
|   term_to_string (Tuple xs) = "(" ^ terms_to_string xs ^ ")"
|   term_to_string (Var(ref(NONE))) = "X"
|   term_to_string (Var(ref(SOME(t)))) = term_to_string t
and
(* pretty-print a list of terms *)
    terms_to_string (t::[]) = term_to_string t
|   terms_to_string (t::ts) = term_to_string t ^ " " ^ terms_to_string ts
|   terms_to_string [] = ""

What makes this relevant to continuation-passing-style is what happens if we discover that two terms don't unify: then we need to backtrack and undo all the assignments we made. Exercise: Think about how to write a function to do this. It is as if we want a standard recursive function for walking down the tree, except it can unwind in two different ways: if the recursive call was successful we want to continue down other branches of the tree, but if it was not we want to undo all the assignments we made.

In a first-order language we might decide to build a list of undo actions to be carried out if the unification fails at any stage. But in light of the above, a more direct solution comes to mind: write the function in continuation-passing style, and explicitly maintain two different continuations waiting for each recursive call.

What follows is an implementation of this idea. occurs is an uninteresting helper function which is used to test for the "loop" case described above (this is the so-called occurs check). The real work is done in the two mutually recursive functions unify_term and unify_list, which do unification on single terms and lists of terms respectively. unify_term takes as argument the two terms to unify, and two extra arguments. The argument fail represent the continuation that waits for a unification failure to unwind and undo all the assignments so far. The argument succ represents the continuation waiting for a successful recursive call. Unlike its twin, the success continuation expects a return value, namely a newly constructed failure continuation. Both chains of waiting continuation have bounded length: succ by the height of the tree and fail by the number of assignments that have been done. All calls in the functions are made in tail-call positions, so the unification functions use constant space on the normal call stack. New continuations are constructed only in two places, marked (* ! *) in the code. These are the places where a non-continuation-style function would want to use proper recursion.

(* does a given free variable x occur somewhere in the term? *)
fun occurs(x, y as Cnst _) = false
|   occurs(x, Tuple ys) = occurs_list(x,ys)
|   occurs(x, Var(ref(SOME term))) = occurs(x, term)
|   occurs(x, y as Var(ref(NONE))) = (x = y)
and
    occurs_list(x, y::ys) = occurs(x,y) orelse occurs_list(x,ys)
|   occurs_list(x, []) = false

(* unify_term(t1,t2,k) returns fail() if t1 does not unify with t2.
   If they do unify, it returns succ(f), where f is a procedure that undoes all
   side effects and then calls fail().
   unify_terms does the same, but unifies lists of terms. *)
fun unify_term(t1, Var(ref(SOME t2)), succ, fail) = unify_term(t1, t2, succ, fail)
|   unify_term(Var(ref(SOME t1)), t2, succ, fail) = unify_term(t1, t2, succ, fail)
|   unify_term(t1 as Var(box as ref(NONE)), t2, succ, fail) =
      if (occurs(t1, t2)) then
        fail()
      else
        (box := SOME t2;
         succ (fn()=>(box := NONE; fail())))  (* ! *)
|   unify_term(t1, t2 as Var(ref(NONE)), succ, fail) = unify_term(t2, t1, succ, fail)
|   unify_term(Tuple l1, Tuple l2, succ, fail) =
      unify_terms(l1, l2, succ, fail)
|   unify_term(t1,t2,succ,fail) = if (t1=t2) then succ fail else fail()
and
    unify_terms(t::ts, s::ss,succ,fail) =
      unify_term(t, s, fn(fail)=>unify_terms(ts,ss,succ,fail), fail)  (* ! *)
|   unify_terms([],[],succ,fail) = succ fail
|   unify_terms(_,_,succ,fail) = fail()

Having written a continuation-passing function, we can wrap it so the user does not have to worry about supplying explicit continuations. The following function unify checks if its two arguments are unifiable. If they are, they are mutated so that both represent the most general unifier, and the function returns true. Otherwise it returns false and leaves the terms unchanged.

fun unify(t1,t2) = unify_term(t1,t2, fn x=>true, fn()=>false)

For example:

- unify(Tuple[Var(ref(NONE)), Tuple[Cnst"3", Cnst"4"]], Tuple[Cnst"2", Var(ref(NONE))]);

> val it = true : bool
- unify(Tuple[Cnst"3", Var(ref(NONE))], Tuple[Cnst"4", Cnst"5"]);
> val it = false : bool
- let val x = Var(ref(NONE)) val y = Var(ref(NONE)) in unify(Tuple[x,x], Tuple[y, Tuple[Cnst"3", y]]) end;
> val it = false : bool

As it turns out, having a working unification procedure we are more than halfway towards a sketch Prolog implementation. This supplies an example of yet another use for explicit continuations.

The semantics of Prolog are simple. A program consists of a "database", a list of rules each on the form Consequent :- Proviso₁, Proviso₂ ... . (the list after :- is allowed to be empty), where Consequent and Proviso_i are terms (possibly containing variables), meaning that the program is allowed to deduce the consequent if it can deduce all the provisos. For example, here is a Prolog database implementing the append predicate; append(L1,L2,L3) can be deduced iff L1 and L2 are two lists that can be appended to form L3 (the syntax [H|L] means the same as H::L in ML.):

append([],L,L).
append([H|L1], L2, [H|L]) :- append(L1,L2,L).

The task of a Prolog interpreter when executing a query, ignoring efficiency concerns, is to do a depth-first search for deductions. It maintains a goal stack, initially containing just the term it was asked to deduce. It then scans through all the rules in the database in order, testing whether the consequent of the rule unifies with the item on the top of the goal stack. If it does, that item is removed and instead the provisos of the rule are pushed, and we start scanning the database from the beginning again. If no rule matches the interpreter must backtrack, i.e. go back and undo an earlier successful unification and instead try the next rule in the database. If we arrive at an empty stack, the term of the original query has been deduced and the variables in it has been updated to reflect the necessary substitutions. Full Prolog also includes some "control predicates" to guide the search, but we will ignore them here.

The Prolog syntax is a bit nonuniform in writing append(X,Y,X) instead of (append, X, Y, Z). This is justified since the first element of tuples are treated specially: to aid indexing of the database they must be plain constants and not variables or tuples. Ignoring that distinction, it seems the database should be represented as a plain list of (conclusion, provisos) pairs, i.e. of type (term * term list) list. There is however an additional wrinkle in that each time a Prolog interpreter makes use of a rule, it needs a fresh copy where the variables are unaffected by previous unifications. We avoid this issue by representing the database as a list of functions of no arguments that constructs rule instances, i.e. of type (unit -> term * term list) list. (High-performance Prolog systems similarly represent the database in the form of pre-compiled snippets of code to build terms). In this representation, the append database above becomes:

val database =
 [ fn()=>
   let val l = ref(NONE) in
     (Tuple[Cnst "append", Cnst "nil", Var l, Var l],
      [])
   end,
   fn()=>
    let val l1 = ref(NONE) val l2 = ref(NONE)
        val h  = ref(NONE) val l  = ref(NONE)
    in
     (Tuple[Cnst "append",
            Tuple[Cnst "cons", Var h, Var l1],
            Var l2,
            Tuple[Cnst "cons", Var h, Var l]],
      [Tuple[Cnst "append", Var l1, Var l2, Var l]])
    end ]

At first we might expect to be able to use an ordinary recursive function to handle queries: we have a single function prove(goals, rules), which is given a list of goals (the goal stack) to prove and a list of rules to use for the top goal. If it can find a solution it returns true, otherwise it undoes all modifications and returns false. The function would work by recursively calling itself for proving subgoals, systematically trying all rules in the database when a recursive call gets stuck.

The problem with that idea is that a Prolog query can "return more than once", using backtracking to find all possible answers. So not only do we need a way to signal failure, we also need the possibility to return several different values to the caller. But the implicit continuation on the call stack is "used up" when you return through it; unlike in Scheme or Prolog, an ML function can return only once.

Explicitly passed continuations don't have that limitation: for example we can stash away the return continuation in some datastructure before "returning through it", i.e. calling it. If someone later calls that stored continuation function, control will pass back to the waiting context just as if we had returned again. Inspired by that idea, our new interface will be as follows: if it could find no derivation, prove(goals, rules, fail) invokes its failure continuation. Otherwise it does the appropriate side-effecting unifications, and returns a function of no arguments which, if called, behaves like prove(goals, rules, fail) except it tries to find the next derivation.

What is the the return type of prove? As described above, it returns a function that returns a function that returns a function that... and the failure continuation is of that type as well. To avoid infinite regress, we introduce a special value No indicating that no more solutions could be found, and a recursive datatype choicepoint. We then also need a function backtrack to invoke these functions. Here I chose to make it do nothing when called on No; an alternative would be to raise an exception.

datatype choicepoint = More of unit->choicepoint | No
fun backtrack (More choice) = choice()
|   backtrack No = No

The actual function to search for derivations can now be easily implemented (though it requires a bit of care to get right):

fun prove (goal::goals, rule::rules, fail) =
  let val (conclusion, provisos) = rule() in
    unify_term(goal, conclusion,
               fn fail=>prove(provisos@goals, database, More(fail)),
               fn   ()=>prove(goal::goals, rules, fail))
  end
|  prove (goal::goals, [], fail) = backtrack fail
|  prove ([], _, fail) = fail  (* Success! Return. *)

fun query (goal) = prove([goal], database, No)

As an example, here is how to search for the three pairs of lists that append to form [a,b]. (In Prolog, this is the query append(X,Y,[a,b]).).

- val sample =
    Tuple[Cnst"append",
          Var(ref(NONE)),
          Var(ref(NONE)),
          Tuple[Cnst"cons", Cnst"a", Tuple[Cnst"cons", Cnst"b", Cnst"nil"]]];

- val (bt, result) = (query(sample), term_to_string(sample));

> val bt = More fn : choicepoint
  val result = "(append nil (cons a (cons b nil)) (cons a (cons b nil)))" : string

- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = More fn : choicepoint
  val result = "(append (cons a nil) (cons b nil) (cons a (cons b nil)))" : string

- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = More fn : choicepoint
  val result = "(append (cons a (cons b nil)) nil (cons a (cons b nil)))" : string

- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = No : choicepoint
  val result = "(append X X (cons a (cons b nil)))" : string

A reader with some prior knowledge of Prolog and ML might want to extend this interpreter to handle explicit failures and the Prolog "not" operator, or (harder and less pretty) the cut operator. While using continuation passing style to implement a Prolog interpreter in this way is a neat functional programming trick, it isn't very practical in terms of performance. Things can get a lot more pure and idealistic than this, however: see the textbook Structure and Interpretation of Computer Programs for an implementation in terms of lazy lists! Practical implementations normally use the Warren abstract machine structure, which squishes all the state into a single stack with pointers between entries.

Kernel threads in Mach

If the previous section seemed taken straight from a functional-programming ivory tower, this final example shows how similar ideas can be applied to grubby low-level systems programming - in C no less! The following is a summary of the paper [1] by Draves et al.

The traditional Unix design maintains a separate stack in kernel space for each kernel thread. When a thread traps into kernel mode, for example due to a system call or an interrupt, the corresponding kernel stack is installed. Since each thread has a separate stack, we can suspend it at any time by just saving the register set. This is useful because we can be deep into the execution of some system call and suddenly realize we need to block, for example because a memory page containing some kernel datastructure is not swapped into physical memory or a file operation has not completed. This is the model that Mach followed up to version 3.0, when it was restructured to use a "explicit continuation" style.

The obvious disadvantage of keeping all these stacks around is that they take up memory. A running system will typically support hundreds of processes, each of which uses at least one kernel thread. Maintaining a 4kB stack for each of these is very bad for the footprint of the kernel, practically guaranteeing a cache miss to pull in the stack each time a program traps into kernel mode.

Furthermore, in the majority of cases maintaining a separate stack to store the state of the thread is overkill. From time to time a thread will recurse deeply into the handling of a syscall only to discover that a necessary kernel datastructure is paged out and has to be waited for, but most suspended threads represent much simpler situations, e.g. a read() system call waiting for data to become available on a pipe, after which it will immediately return to user space. In this case, the state of the thread is very simple, so it seems wasteful to use an entire 4kB stack to represent it.

Given the over-all theme of this article, the obvious thing to do is to represent "what to do next" explicitly, rather than relying on the function call stack. Of course, since Mach is written in C we can't use function objects to store the values of local variables. Instead, we have to make do with plain function pointers, and explicitly save any state that is needed by the continuation to a scratch area in the thread's data structure. (Mach allocates 28 bytes in the thread structure for this purpose; a function that needs to save more state than that has to allocate an additional data structure). The following example, from the paper, demonstrates the idea:

Before:

/* A frequently used system call */
example(arg1, arg2) {
  P1(arg1, arg2);
  if (need_to_block) {
    thread_block();
    P2(arg1);
  } else {
    P3();
  }
  /* return control to user */
  return SUCCESS;
}

After:

example(arg1, arg2) {
  P1(arg1, arg2);
  if (need_to_block) {
    save context in thread;
    thread_block(example_continue);
    /* not reached */
  } else {
    P3();
  }
  /* return control to user */
  thread_syscall_return(SUCCESS);
}

example_continue() {
  recover context from thread;
  P2(recovered_arg1);

  /* return control to user */
  thread_syscall_return(SUCCESS);
}

As in the Standard ML case the transformation is straight-forward, except that here we use two "magic" functions. Firstly, instead of returning directly we now call thread_syscall_return. In the old design, the entry handler for a system call would call example to do the actual work, note the return value, and then do some low-level stuff to ensure that the the CPU is restored to the state it had before the trap into kernel space, with the return value appearing in the right register. In the new design, the entry handler has explicitly save the data necessary when going back to user mode into the thread structure. When thread_syscall_return is called it uses that data to return to user space, and never returns to its kernel-space caller.

Secondly, the implementation of thread_block has changed. In both the old and the new design, it puts the current thread on the wait queue, and then selects a new kernel thread from the queue of runnable threads. The old design then does some low-level magic to save the cpu registers (including the stack pointer) into the old thread structure and restore the new thread's saved registers. In the new design, thread_block only has to store the function pointer for the continuation into the old thread structure, since that is all the state needed to restart the thread later. The old stack is no longer needed and can be freed. Even better, if the thread we are switching to also was suspended using the new continuation system, we can keep using the old stack and just reset the stack pointer to the base of it. This is a big win since the old stack will already be in the cache and the translation look-aside buffer (TLB).

A nice thing about this approach is that it can be mixed freely with the traditional system. If there is a potential blocking point where it would be inconvenient to split the function into explicit continuations, one can simply call the old thread_block there. In that case the stack can of course not be discarded, but that does not matter so much if it is a rare case. Draves et al. found that out of 60 possible blocking points in Mach 3.0, the 6 most common points accounted for 99% of all blocked threads. By changing only those 6 points, they could reduce the average number of kernel stacks in use from 43 for a computer running 43 processes to 1.002.

Restructuring the kernel in this way also has another benefit: by choosing a high-level representation (function pointers) for continuations, rather than a low-level one (stacks in an architecture-dependent format), it becomes much easier to implement optimisations that treats certain often-called continuations as special cases. Draves et al. describes how they were able to optimise the mach_msg IPC primitive.

mach_msg is used to exchange messages between processes in different addresses spaces using so-called ports. Messages can be plain blocks of bytes, but may also include capabilities (roughly equivalent to Unix file descriptors) in which case the kernel needs to look at the content of the message. The user-level visible prototype looks like

mach_msg_return_t   mach_msg
                    (mach_msg_header_t                msg,
                     mach_msg_option_t             option,
                     mach_msg_size_t            send_size,
                     mach_msg_size_t        receive_limit,
                     mach_port_t                     name,
                     mach_msg_timeout_t           timeout,
                     mach_port_t                   notify);

Depending on if option includes the bit MACH_SEND_MSG or MACH_RCV_MSG, the same system call can be used to send a message, receive one, or first send one and then wait for a reply.

In the old design, before the restructuring, the kernel level implementation works roughly as follows. On the sending side, the kernel parses the message and adds it to the message queue of the port. It then looks for a suitable receiving thread. If it finds one, it suspends itself and directly switches context to that thread. (This is faster than just marking the receiving thread as runnable and leaving it to the scheduler to pick it).

The receiving side first checks if there are any outstanding messages on the queue of the port. If not, it blocks until one is added. It then removes the first message from the queue and parses it. At this point there might be some additional processing to do depending on what options were specified in the mach_msg call, but in the common case it can just copy the contents of the message into user space and then return to user level.

The common case, which we want to optimise for, is when there already exists a sever thread blocking to receive a message on the the port when another thread sends a message. In the old design, this means that both the sending and the receiving thread has to parse the same message, acquire the queue mutex, and enqueue/dequeue the message.

In the new design, when the receiver blocks, it checks whether any complicated processing will need to be done on the message after it arrives. If so, it specifies a "slow path" continuation to thread_block which can handle all the corner cases. Often, however, it can use the "fast path" continuation mach_msg_continue, which simply copies the message into user space and returns.

The new design of the sender still first parses the message. Before enqueueing it, however, it checks if there is a suitable blocked receiving thread, and if that thread's continuation is mach_msg_continue. In that case there's no need to actually pass control to mach_mgs_continue: instead the sending thread can do the same work and directly copy the message into the receiving thread's user space buffer. The message is only parsed once, and the message queue is completely bypassed. The sending thread can then suspend itself (with a continuation which will return to user level), and directly switch to the receiving thread which can now also immediately return.

In Draves et al.'s experiment, this new design speeded up the remote procedure call by 14%, most of which is due to not having to switch to a new stack (with the consequent TLB flushes etc).

It should be noted that the "explicit continuation" design, though very cute, works because Mach is a microkernel. Mach provides a very small set of system-calls to user-level processes, and almost all OS functionality is accessed by using the mach_msg call to do remote procedure calls to user-level servers, which are then responsible for things like file systems or network interfaces. Thus localised changes to how mach_msg is treated can give big benefits.

Monolithic designs like Linux, on the other hand, has lots of entry points into the kernel, corresponding to different functionality. There is not 60 potential blocking points but many hundreds, and there is no small set of "hot spots" that particularly benefit from optimisation. Trying to restructure the Linux kernel to use explicit continuations would be a huge, and hugely disruptive, project which would require rewriting a large number of functions for a much smaller gain.

Reference:

[1] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, Randall W. Dean. Using Continuations to Implement Thread Management and Communication in Operating Systems. Technical Report CMU-CS-91-115, Carnegie Mellon University, October 1991. Available electronically at ftp://ftp.cs.cmu.edu/project/mach/doc/published/threadmgnt.ps

Translation Look-Aside Buffer	Functional programming	lie group	Structure and Interpretation of Computer Programs
regular expression	tail recursion	Prolog	John Coltrane
November 13, 2019	January 17, 2019	CPython	union find
call stack	Unification	Continuation	call/cc
door game	microkernel	Mach	ML
Scheme	LISP	C	Linux