Continuation passing style is a programming idiom possible in functional languages like Scheme or ML which support functions
as first-class objects and take care to implement tail recursion
efficiently. The idea is to rewrite recursive functions to represent "what to
do next" after returning (the continuation) explicitly as a function
object.
Consider the classic example of a recursive function, the factorial (all
the program examples are in Standard ML):
fun fact 0 = 1
| fact n = n * fact(n-1)
After the recursive function call fact (n-1) has returned, we
need to continue by multiply the return value by n, and then we
are done. If we like, we could think of this continuation as a function
fn x => n*x that is applied to the return value. In computing say
the factorial of 5, the recursive calls stack up, and the function call
fact(1) is waited on by a function that takes the return value,
multiplies it by 2 and passes it on to a function that multiplies it by 3 and
passes it on to...
On a less abstract level, one can consider what data the implementation
needs to save away when making the recursive call. It certainly needs to
remember the value of n, as well as some notion of where to
continue when the function returns. The typical implementation is to push the
values of all local variables as well as a return address onto the call
stack when making a function call. But consider by contrast the tail recursive
variant:
fun ifact (0, a) = a
| ifact (n, a) = ifact(n-1, n*a)
Here there is nothing to be done once the recursive call has finished - the
returned value can be passed on directly. Indeed, programming language
implementations take care not to grow the call stack in cases like this. When
the final call to ifact(0,a) has finished there is no unwinding
through the recursive calls, the result is directly given back to the function
that originally called ifact
The idea of the continuation passing style is to represent the continuation
"waiting for" a function explicitly as a function object. More specifically we
add an extra argument (customarily called k) to each function and
instead of directly returning a value, we call k with the return
value in a tail call position. All recursive calls are made in tail positions.
Recursive calls that would ordinarily be tail calls correspond to passing on
k unchanged, while "properly" recursive calls build a new
continuation function that does something to its argument and then passes the
result to the original continuation. Here is the result of applying this idea
to fact:
fun cfact (0, k) = k(1)
| cfact (n, k) = cfact(n-1, fn x => k(n*x))
In order to find the factorial of 5, one would write cfact(5, fn
x=>x), giving the identity function fn x=>x as the
second argument. This in turn calls cfact(4, fn x=>k(5*x)), etc,
until cfact(0,k) is reached. This finally calls its
continuation, which multiplies its argument by 1 and calls the continuation
built by cfact(2,k),
which multiplies the argument by 2 and ... until the continuation
built by cfact(5,k) is called, multiplies its argument
and passes it to the identity function, which finally returns it.
Coming from a C background one might expect this to be quite
unpractical because the entire sequence makes 10 recursive calls.
But note that all of these are in tail positions, so there is in
fact no unwinding of the stack to be done once our identity
function has returned. All the function calling was made in
constant space, and the recursive nature of the function is encoded
only in the five function objects that were created. Recall that when
an implementation is creating a function object what it must
allocate is a closure storing the address of the function code
and the value of its free variables, in this case n
and k. This is essentially the same as the information
that was allocated on the call stack with ordinary recursion, but
we can now manipulate it as any other object.
As as another simple example to clarify the idea, here is a
function to sum all the leaves in a binary tree of integers, first
written in the normal way and then transformed into continuation
passing style.
datatype tree = Leaf of int | Branch of (tree*tree)
fun sum(Leaf(n)) = n
| sum(Branch(t1,t2)) = sum(t1) + sum(t2)
fun csum(Leaf(n), k) = k(n)
| csum(Branch(t1,t2), k) = csum(t1, fn x => csum(t2, fn y => k(x+y)))
As the reader may have noticed, the translation is fairly mechanical
and can be specified by a few rules. This is known as the
CPS-transform. Some compilers (following the Rabbit
Scheme compiler written at MIT in the 70s) translate the input
program into CPS and use that as their internal representation.
Having the "return addresses" made explicit in this way makes
compilation easier in some ways, and there is in fact a connection
between CPS and single static assignment, another popular
internal representation. But I will not comment further on this,
mainly because I don't know enough about compilers.
So far, it might be clear that it is possible to program in
continuation passing style but not why it would be
desirable. One reason is that by treating continuations as
objects in the language, one might do things like storing them in data
structures, "returning" through them more than once, or discard the
continuation and immediately abort a deep recursion. Another reason is that
it becomes natural to write functions that can return values to two
different contexts. For example, many functions can fail, which is often
dealt with either by having them return an error value to the caller or by
throwing an exception. In continuation passing style, it is natural to have
the function take two continuation arguments, one to be called in case of
success and one for failure. Finally, for some problems it just seems
natural to think explicitly in terms of "possible futures" of a
computation. It should be noted that unlike the intermediate code used by
compilers, programs written by human programmers seldom do away completely
with the call stack; instead they mix explicit continuations and normal
recursion as convenient.
The rest of this write-up consists of three examples of how continuations can
be used to help express various tasks.
This is a rather contrived toy example: the algorithm uses a
naïve brute-force search which gives the right result, but is extremely
inefficient in some cases. (Matching the expression a*a*a*a*a*...a against the
string aaaaa...a, where there are n as, will take time
Θ(n!).) To simplify the presentation, it
requires the regular expression to be given as a parse tree and the string as a
list of characters. For example match_regex(Concat(Star(Char #"a"), Char
#"b"), explode "aaab") returns true, because a*b matches
aaab. Programs like these
are not completely useless: for example it is a good idea to test a more
efficient algorithm by making sure that it returns the same answer as the short
and "obviously correct" solution in various tricky cases.
The idea here is that in order to match a regular expression
e1e2 against a string, the
program must first match e1 against the beginning of
the string and then e2 against the rest. In other
words, e1 is executing in a context waiting to see
how much will be left over. Now there might be many ways to match
e1, which will consume more or less of the string, so
we need some way to backtrack and try a different way of matching it. The
solution below uses a function match(r,s,k) which expects a regex
r, a string s, and a continuation function k representing the rest of the
computation. It will return true if it can match r against a prefix of s such
that k(rest) returns true, where rest is the remainder of the string. (One way
of looking at this function is to say that it has two continuations: the
explicitly manipulated k, and the implicit failure continuation invoked by
returning false to the caller). Finally, the function
match_regex executes match with a continuation which
verifies that all of the string was "used up" by the regex. Note that
andalso and orelse are the short-circuiting boolean
operators in ML, analogous to && and || in C.
datatype regex = Char of char
| Choise of (regex*regex)
| Concat of (regex*regex)
| Star of regex
fun match(Char a, b::s, k) = (a=b) andalso k(s)
| match(Char a, [], k) = false
| match(Concat(r1,r2), s, k) = match(r1, s, fn rest=>match(r2,rest,k))
| match(Choise(r1,r2), s, k) = match(r1,s,k) orelse match(r2,s,k)
| match(Star r, s, k) = match(r, s, fn rest=>match(Star r,rest,k)) orelse k(s)
fun match_regex(r,s) = match(r,s,fn rest => rest=[])
Unification is an problem which crops up at various places in computer
science: it is used in automatic theorem-proving for first-order logic,
type-inference for ML-like languages, parsing of natural languages using
feature grammars, and is the basic method for "function calls" in the Prolog
programming language.
A term is defined as a constant, a variable, or a tuple of terms. A
substitution is a mapping from variables to terms. The unification
problem is to take two terms and find a substitution that makes them equal. For
example (x, (3,4)) and (2, y) unify to (2, (3, 4)) by the substitution {x→3,
y→(3,4)}. A pair of terms can fail to unify in two different ways,
"clashes" and "loops". For example (3, x) and (4, 5) clash, because there is no
substitution that can make 3 equal 4. Likewise 42 and (6,7) clash because a
constant can never equal a tuple. An example of a loop is (y,y) and (x, (3,x)).
A successful unification would need to find a value for x such that x = (3,x),
but no finite term can satisfy that (we would in effect be constructing
the term (3,(3,(3, ...)))).
The naïve algorithm simply traverses the two terms recursively, and as it
goes along it builds up a substitution which is kept in a separate data
structure such as a hash table or association list. If, when the recursion
finishes, the two terms turned out to be unifiable we can then apply the
substitution to one of them to get the unified result. This is slow!
A nicer way is to represent variables as reference cells (boxed pointers)
pointing to subterms. If the pointer is non-NULL, we ignore the cell and treat
it exactly as if it was just the subterm it contains, but if it is NULL it
represents a variable which can be given a value by assigning to it. We
can then just walk through the terms once, making assignments to the cells as
we go along. (Of course, care must be taken to represent e.g. the term (x,
(y,x)) using the same reference cell for both xs, so that an assignment to one
is automatically mirrored in the other). This is a bit like the
idea in the union-find algorithm. Here is an ML datatype that can be used to
represent terms, together with a function to pretty-print them in a Lisp-like
human-readable form:
datatype term = Cnst of string
| Tuple of (term list)
| Var of term option ref
(* pretty print a single term *)
fun term_to_string (Cnst s) = s
| term_to_string (Tuple xs) = "(" ^ terms_to_string xs ^ ")"
| term_to_string (Var(ref(NONE))) = "X"
| term_to_string (Var(ref(SOME(t)))) = term_to_string t
and
(* pretty-print a list of terms *)
terms_to_string (t::[]) = term_to_string t
| terms_to_string (t::ts) = term_to_string t ^ " " ^ terms_to_string ts
| terms_to_string [] = ""
What makes this relevant to continuation-passing-style is what happens if we
discover that two terms don't unify: then we need to backtrack and
undo all the assignments we made. Exercise: Think about how to write a function to do
this. It is as if we want a standard recursive function for walking down the
tree, except it can unwind in two different ways: if the recursive call was
successful we want to continue down other branches of the tree, but if it was
not we want to undo all the assignments we made.
In a first-order language we might decide to build a list of undo actions to be
carried out if the unification fails at any stage. But in light of the above, a more
direct solution comes to mind: write the function in continuation-passing style, and
explicitly maintain two different continuations waiting for each recursive call.
What follows is an implementation of this idea. occurs is an uninteresting
helper function which is used to test for the "loop" case described above (this
is the so-called occurs check). The real work is done in the two mutually
recursive functions unify_term and unify_list, which do
unification on single terms and lists of terms respectively.
unify_term takes as argument the two terms to unify, and two extra
arguments. The argument fail represent the continuation that waits for
a unification failure to unwind and undo all the assignments so far. The
argument succ represents the continuation waiting for a successful
recursive call. Unlike its twin, the success continuation expects a return
value, namely a newly constructed failure continuation. Both chains of waiting
continuation have bounded length: succ by the height of the tree and fail by
the number of assignments that have been done. All calls in the functions are
made in tail-call positions, so the unification functions use constant space on
the normal call stack. New continuations are constructed only in two places,
marked (* ! *) in the code. These are the places where a
non-continuation-style function would want to use proper recursion.
(* does a given free variable x occur somewhere in the term? *)
fun occurs(x, y as Cnst _) = false
| occurs(x, Tuple ys) = occurs_list(x,ys)
| occurs(x, Var(ref(SOME term))) = occurs(x, term)
| occurs(x, y as Var(ref(NONE))) = (x = y)
and
occurs_list(x, y::ys) = occurs(x,y) orelse occurs_list(x,ys)
| occurs_list(x, []) = false
(* unify_term(t1,t2,k) returns fail() if t1 does not unify with t2.
If they do unify, it returns succ(f), where f is a procedure that undoes all
side effects and then calls fail().
unify_terms does the same, but unifies lists of terms. *)
fun unify_term(t1, Var(ref(SOME t2)), succ, fail) = unify_term(t1, t2, succ, fail)
| unify_term(Var(ref(SOME t1)), t2, succ, fail) = unify_term(t1, t2, succ, fail)
| unify_term(t1 as Var(box as ref(NONE)), t2, succ, fail) =
if (occurs(t1, t2)) then
fail()
else
(box := SOME t2;
succ (fn()=>(box := NONE; fail()))) (* ! *)
| unify_term(t1, t2 as Var(ref(NONE)), succ, fail) = unify_term(t2, t1, succ, fail)
| unify_term(Tuple l1, Tuple l2, succ, fail) =
unify_terms(l1, l2, succ, fail)
| unify_term(t1,t2,succ,fail) = if (t1=t2) then succ fail else fail()
and
unify_terms(t::ts, s::ss,succ,fail) =
unify_term(t, s, fn(fail)=>unify_terms(ts,ss,succ,fail), fail) (* ! *)
| unify_terms([],[],succ,fail) = succ fail
| unify_terms(_,_,succ,fail) = fail()
Having written a continuation-passing function, we can wrap it so the user
does not have to worry about supplying explicit continuations. The following
function unify checks if its two arguments are unifiable. If they are,
they are mutated so that both represent the most general unifier, and the
function returns true. Otherwise it returns false and leaves the terms
unchanged.
fun unify(t1,t2) = unify_term(t1,t2, fn x=>true, fn()=>false)
For example:
- unify(Tuple[Var(ref(NONE)), Tuple[Cnst"3", Cnst"4"]], Tuple[Cnst"2", Var(ref(NONE))]);
> val it = true : bool
- unify(Tuple[Cnst"3", Var(ref(NONE))], Tuple[Cnst"4", Cnst"5"]);
> val it = false : bool
- let val x = Var(ref(NONE)) val y = Var(ref(NONE)) in unify(Tuple[x,x], Tuple[y, Tuple[Cnst"3", y]]) end;
> val it = false : bool
As it turns out, having a working unification procedure we are more than
halfway towards a sketch Prolog implementation. This supplies an
example of yet another use for explicit continuations.
The semantics of Prolog are simple. A program consists of a "database",
a list of rules each on the form Consequent :-
Proviso1,
Proviso2 ... . (the list after
:- is allowed to be empty), where Consequent and
Provisoi are terms (possibly containing variables),
meaning that the program is allowed to deduce the consequent if it can
deduce all the provisos. For example, here is a Prolog database implementing
the append predicate;
append(L1,L2,L3) can
be deduced iff L1 and L2 are two lists that can be appended to
form L3 (the
syntax [H|L] means the same as H::L in ML.):
append([],L,L).
append([H|L1], L2, [H|L]) :- append(L1,L2,L).
The task of a Prolog interpreter when executing a query, ignoring efficiency
concerns, is to do a depth-first search for deductions. It maintains a goal
stack, initially containing just the term it was asked to deduce.
It then scans through all the rules in the database in order, testing whether
the consequent of the rule unifies with the item on the top of the goal stack.
If it does, that item is removed and instead the provisos of the rule are
pushed, and we start scanning the database from the beginning again. If no rule
matches the interpreter must backtrack, i.e. go back and undo an earlier
successful unification and instead try the next rule in the database. If we
arrive at an empty stack, the term of the original query has been deduced and
the variables in it has been updated to reflect the necessary substitutions.
Full Prolog also includes some "control predicates" to guide the search, but we
will ignore them here.
The Prolog syntax is a bit nonuniform in writing append(X,Y,X) instead of
(append, X, Y, Z). This is justified since the first element of tuples are
treated specially: to aid indexing of the database they must be plain
constants and not variables or tuples. Ignoring that distinction, it seems the
database should be represented as a plain list of (conclusion, provisos)
pairs, i.e. of type (term * term list) list. There is however an
additional wrinkle in that each time a Prolog interpreter makes use of a rule,
it needs a fresh copy where the variables are unaffected by previous
unifications. We avoid this issue by representing the database as a list of
functions of no arguments that constructs rule instances, i.e. of type
(unit -> term * term list) list. (High-performance Prolog systems
similarly represent the database in the form of pre-compiled snippets of code
to build terms). In this representation, the append database above
becomes:
val database =
[ fn()=>
let val l = ref(NONE) in
(Tuple[Cnst "append", Cnst "nil", Var l, Var l],
[])
end,
fn()=>
let val l1 = ref(NONE) val l2 = ref(NONE)
val h = ref(NONE) val l = ref(NONE)
in
(Tuple[Cnst "append",
Tuple[Cnst "cons", Var h, Var l1],
Var l2,
Tuple[Cnst "cons", Var h, Var l]],
[Tuple[Cnst "append", Var l1, Var l2, Var l]])
end ]
At first we might expect to be able to use an ordinary recursive function
to handle queries: we have a single function prove(goals, rules),
which is given a list of goals (the goal stack) to prove and a list of rules to
use for the top goal. If it can find a solution it returns true, otherwise it
undoes all modifications and returns false. The function would work by
recursively calling itself for proving subgoals, systematically trying all
rules in the database when a recursive call gets stuck.
The problem with that idea is that a Prolog query can "return more than
once", using backtracking to find all possible answers. So not only do we need
a way to signal failure, we also need the possibility to return several
different values to the caller. But the implicit continuation on the call stack
is "used up" when you return through it; unlike in Scheme or Prolog, an ML
function can return only once.
Explicitly passed continuations don't have that limitation: for example we
can stash away the return continuation in some datastructure before "returning
through it", i.e. calling it. If someone later calls that stored continuation
function, control will pass back to the waiting context just as if we had
returned again. Inspired by that idea, our new interface will be as follows:
if it could find no derivation, prove(goals, rules, fail) invokes its failure continuation.
Otherwise it does the appropriate side-effecting
unifications, and returns a function of no arguments which, if called, behaves
like prove(goals, rules, fail) except it tries to find the next derivation.
What is the the return type of prove? As described above, it returns
a function that returns a function that returns a function that... and the
failure continuation is of that type as well. To avoid infinite regress, we
introduce a special value No indicating that no more solutions could
be found, and a recursive datatype choicepoint. We then also need a function
backtrack to invoke these functions. Here I chose to make it do nothing when called on
No; an alternative would be to raise an exception.
datatype choicepoint = More of unit->choicepoint | No
fun backtrack (More choice) = choice()
| backtrack No = No
The actual function to search for derivations can now be easily implemented
(though it requires a bit of care to get right):
fun prove (goal::goals, rule::rules, fail) =
let val (conclusion, provisos) = rule() in
unify_term(goal, conclusion,
fn fail=>prove(provisos@goals, database, More(fail)),
fn ()=>prove(goal::goals, rules, fail))
end
| prove (goal::goals, [], fail) = backtrack fail
| prove ([], _, fail) = fail (* Success! Return. *)
fun query (goal) = prove([goal], database, No)
As an example, here is how to search for the three pairs of lists that append to form
[a,b]. (In Prolog, this is the query append(X,Y,[a,b]).).
- val sample =
Tuple[Cnst"append",
Var(ref(NONE)),
Var(ref(NONE)),
Tuple[Cnst"cons", Cnst"a", Tuple[Cnst"cons", Cnst"b", Cnst"nil"]]];
- val (bt, result) = (query(sample), term_to_string(sample));
> val bt = More fn : choicepoint
val result = "(append nil (cons a (cons b nil)) (cons a (cons b nil)))" : string
- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = More fn : choicepoint
val result = "(append (cons a nil) (cons b nil) (cons a (cons b nil)))" : string
- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = More fn : choicepoint
val result = "(append (cons a (cons b nil)) nil (cons a (cons b nil)))" : string
- val (bt, result) = (backtrack bt, term_to_string(sample));
> val bt = No : choicepoint
val result = "(append X X (cons a (cons b nil)))" : string
A reader with some prior knowledge of Prolog and ML might want to extend
this interpreter to handle explicit failures and the Prolog "not" operator, or
(harder and less pretty) the cut operator. While using continuation passing
style to implement a Prolog interpreter in this way is a neat functional
programming trick, it isn't very practical in terms of performance. Things
can get a lot more pure and idealistic than this, however: see the textbook
Structure and Interpretation of Computer Programs for an implementation in
terms of lazy lists! Practical implementations normally use
the Warren abstract machine structure, which squishes all the state into a
single stack with pointers between entries.
Kernel threads in Mach
If the previous section seemed taken straight from a functional-programming
ivory tower, this final example shows how similar ideas can be applied to
grubby low-level systems programming - in C no less! The following is a
summary of the paper [1] by Draves et al.
The traditional Unix design maintains a separate stack in kernel space for
each kernel thread. When a thread traps into kernel mode, for example due to
a system call or an interrupt, the corresponding kernel stack is installed.
Since each thread has a separate stack, we can suspend it at any time by just
saving the register set. This is useful because we can be deep into the
execution of some system call and suddenly realize we need to block, for
example because a memory page containing some kernel datastructure is not
swapped into physical memory or a file operation has not completed. This is
the model that Mach followed up to version 3.0, when it was restructured to
use a "explicit continuation" style.
The obvious disadvantage of keeping all these stacks around is that they
take up memory. A running system will typically support hundreds of processes,
each of which uses at least one kernel thread. Maintaining a 4kB stack for
each of these is very bad for the footprint of the kernel, practically
guaranteeing a cache miss to pull in the stack each time a program traps into
kernel mode.
Furthermore, in the majority of cases maintaining a separate stack to store
the state of the thread is overkill. From time to time a thread will recurse
deeply into the handling of a syscall only to discover that a necessary kernel
datastructure is paged out and has to be waited for, but most suspended
threads represent much simpler situations, e.g. a read() system call waiting
for data to become available on a pipe, after which it will immediately return
to user space. In this case, the state of the thread is very simple, so it
seems wasteful to use an entire 4kB stack to represent it.
Given the over-all theme of this article, the obvious thing to do is to
represent "what to do next" explicitly, rather than relying on the function
call stack. Of course, since Mach is written in C we can't use function
objects to store the values of local variables. Instead, we have to make do
with plain function pointers, and explicitly save any state that is needed by
the continuation to a scratch area in the thread's data structure. (Mach
allocates 28 bytes in the thread structure for this purpose; a function that
needs to save more state than that has to allocate an additional data
structure). The following example, from the paper, demonstrates the idea:
Before:
/* A frequently used system call */
example(arg1, arg2) {
P1(arg1, arg2);
if (need_to_block) {
thread_block();
P2(arg1);
} else {
P3();
}
/* return control to user */
return SUCCESS;
}
After:
example(arg1, arg2) {
P1(arg1, arg2);
if (need_to_block) {
save context in thread;
thread_block(example_continue);
/* not reached */
} else {
P3();
}
/* return control to user */
thread_syscall_return(SUCCESS);
}
example_continue() {
recover context from thread;
P2(recovered_arg1);
/* return control to user */
thread_syscall_return(SUCCESS);
}
As in the Standard ML case the transformation is straight-forward, except
that here we use two "magic" functions. Firstly, instead of returning directly
we now call thread_syscall_return. In the old design, the entry
handler for a system call would call example to do the actual work,
note the return value, and then do some low-level stuff to ensure that the the
CPU is restored to the state it had before the trap into kernel space, with
the return value appearing in the right register. In the new design, the entry
handler has explicitly save the data necessary when going back to user mode
into the thread structure. When thread_syscall_return is called it
uses that data to return to user space, and never returns to its kernel-space
caller.
Secondly, the implementation of thread_block has changed. In both
the old and the new design, it puts the current thread on the wait queue,
and then selects a new kernel thread from the queue of runnable threads. The
old design then does some low-level magic to save the cpu registers (including
the stack pointer) into the old thread structure and restore the new thread's
saved registers. In the new design, thread_block only has to store
the function pointer for the continuation into the old thread structure, since
that is all the state needed to restart the thread later. The old stack is no
longer needed and can be freed. Even better, if the thread we are switching
to also was suspended using the new continuation system, we can keep
using the old stack and just reset the stack pointer to the base of it. This
is a big win since the old stack will already be in the cache and the
translation look-aside buffer (TLB).
A nice thing about this approach is that it can be mixed freely with the
traditional system. If there is a potential blocking point where it would be
inconvenient to split the function into explicit continuations, one can simply
call the old thread_block there. In that case the stack can of course
not be discarded, but that does not matter so much if it is a rare case.
Draves et al. found that out of 60 possible blocking points in Mach 3.0, the 6
most common points accounted for 99% of all blocked threads. By changing only
those 6 points, they could reduce the average number of kernel stacks in use
from 43 for a computer running 43 processes to 1.002.
Restructuring the kernel in this way also has another benefit: by
choosing a high-level representation (function pointers) for continuations,
rather than a low-level one (stacks in an architecture-dependent format), it
becomes much easier to implement optimisations that treats certain
often-called continuations as special cases. Draves et al. describes how they
were able to optimise the mach_msg IPC
primitive.
mach_msg is used to exchange messages between processes in
different addresses spaces using so-called ports. Messages can be plain
blocks of bytes, but may also include capabilities (roughly equivalent to
Unix file descriptors) in which case the kernel needs to look at the content
of the message. The user-level visible prototype looks like
mach_msg_return_t mach_msg
(mach_msg_header_t msg,
mach_msg_option_t option,
mach_msg_size_t send_size,
mach_msg_size_t receive_limit,
mach_port_t name,
mach_msg_timeout_t timeout,
mach_port_t notify);
Depending on if option includes the bit MACH_SEND_MSG or
MACH_RCV_MSG, the same system call can be used to send a message, receive
one, or first send one and then wait for a reply.
In the old design, before the restructuring, the kernel level
implementation works roughly as follows. On the sending side, the
kernel parses the message and adds it to the message queue of the port. It
then looks for a suitable receiving thread. If it finds one, it suspends
itself and directly switches context to that thread. (This is faster than
just marking the receiving thread as runnable and leaving it to the scheduler
to pick it).
The receiving side first checks if there are any outstanding messages on
the queue of the port. If not, it blocks until one is added. It then removes
the first message from the queue and parses it. At this point there might be
some additional processing to do depending on what options were specified in
the mach_msg call, but in the common case it can just copy the
contents of the message into user space and then return to user level.
The common case, which we want to optimise for, is when there already
exists a sever thread blocking to receive a message on the the port when
another thread sends a message. In the old design, this means that both the
sending and the receiving thread has to parse the same message, acquire the
queue mutex, and enqueue/dequeue the message.
In the new design, when the receiver blocks, it checks whether any
complicated processing will need to be done on the message after it arrives.
If so, it specifies a "slow path" continuation to thread_block which
can handle all the corner cases. Often, however, it can use the "fast path"
continuation mach_msg_continue, which simply copies the message into
user space and returns.
The new design of the sender still first parses the message. Before
enqueueing it, however, it checks if there is a suitable blocked receiving
thread, and if that thread's continuation is mach_msg_continue. In
that case there's no need to actually pass control to
mach_mgs_continue: instead the sending thread can do the same work
and directly copy the message into the receiving thread's user space buffer.
The message is only parsed once, and the message queue is completely bypassed.
The sending thread can then suspend itself (with a continuation which will
return to user level), and directly switch to the receiving thread which can
now also immediately return.
In Draves et al.'s experiment, this new design speeded up the remote
procedure call by 14%, most of which is due to not having to switch to a new
stack (with the consequent TLB flushes etc).
It should be noted that the "explicit continuation" design, though very
cute, works because Mach is a microkernel. Mach provides a very
small set of system-calls to user-level processes, and almost all OS
functionality is accessed by using the mach_msg call to do remote
procedure calls to user-level servers, which are then responsible for things
like file systems or network interfaces. Thus localised changes to how
mach_msg is treated can give big benefits.
Monolithic designs like Linux, on the other hand, has lots of entry
points into the kernel, corresponding to different functionality. There is not
60 potential blocking points but many hundreds, and there is no small set of
"hot spots" that particularly benefit from optimisation. Trying to restructure
the Linux kernel to use explicit continuations would be a huge, and hugely
disruptive, project which would require rewriting a large number of functions
for a much smaller gain.
Reference:
[1] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, Randall
W. Dean. Using Continuations to Implement Thread Management and
Communication in Operating Systems. Technical Report CMU-CS-91-115,
Carnegie Mellon University, October 1991. Available electronically at
ftp://ftp.cs.cmu.edu/project/mach/doc/published/threadmgnt.ps