MOVE processor

The idea of a MOVE processor can be stated as a Transport-triggered architecture (as opposed to the currently used Operation-triggered architectures in CISC and RISC processors.) Specifically, the only real instruction in the machine language is the MOVE operation. Because of this, MOVE itself is not specified, it is an implied part of every instruction! This becomes useful when certain locations in the microprocessor core are designed as Function Units. Each Function Unit performs one of the tasks possible in a standard ALU, such as ADD, MULT, XOR, etc.

This seems like a logical progression of microprocessor design. First you have CISC, its complex instruction set is amenable to hand-crafted assembly language. The intructions form a much higher level language than the actual operations inside the processor. Microcode hardwired into the silicon looks up each instruction and outputs a set of lower level instructions: for example loading a value from the instruction into the ALU, then moving a value from a register into the ALU, then issuing a command to add those two values, another low-level instruction to load the result into the load-store unit, and finally one last instruction to specify the location in RAM to store the result. In most cases, each of these lower level instructions went through another lookup table, the nanocode. This was responsible for setting up the control signals and timing to actually perform each microinstruction. All of those lookups took clock cycles to complete, regardless of what instruction was issued. And it should be noted that processor design problems often resulted from errors in creating the tables, not the logical blocks of the core.

RISC architectures are the result of research into how most programs actually execute in practice.

Extensive research into patterns of computer usage reveals that general-purpose computers spend up to 80% of their time executing simple instructions such as load, store, and branch. The more complex instructions are used infrequently. On architectures with large, complex instruction sets, the simple, often executed instructions incur a performance penalty by the overhead of additional instruction decoding, the use of microcode, and the longer cycle time resulting from increased functionality. -----Hewlett-Packard's "Precision Architecture and Instruction Reference" manual.

Most of the burden was moved from the hardware onto the compilers. The goal was fairly straightfoward: Make processors that have less complexity, able to perform the simple tasks more directly and quickly. Any complex instruction could be written out as a series of simple instructions by a compiler so the processor gives the same results anyway. RISC processors should be more efficient overall, providing either a reduced cost or improved performance. Even so, instructions could be pushed into the core faster than a single all-purpose ALU could complete them. The ALU could be pipelined more efficiently with RISC, but could also be made faster by breaking it into several parts, each capable of specific operations. Superscalar design allowed the RISC processor to actually complete more than one instruction per clock cycle, with properly tuned code. Note that a strictly RISC design will have minimal instruction decoding via microcode, with those results being tied more closely to the hardware than in a CISC design.

One of the people responsible for the paradigm shift to RISC architectures thought things a little further and came up with the MOVE processor¹. The thinking behind processor design had already shifted toward superscalar design and more focused logic in the core. His proposal was to distribute each part of the ALU into seperately addressible components. Each Function Unit would be responsible for a single operation. To add the value of a register to the previous result in the ADD unit, simply MOVE that value there. There would be a flag in each instruction that specifies whether to execute or set the value for future use. Note that moving values to and from registers is less necessary if there are enough redundant Function Units.

To make the processor complete, branching is necessary. If you've read this far without your eyes glazing over, you can probably guess how it's accomplished - the result of a subtraction could be moved into a BEQ, BNE, BLZ, or what-have-you unit which has already been set to the destination PC value. In other words, if the branch is to be taken, a comparison unit would be like an alias to the PC itself. Setting the conditional unit to an address has no immediate effect. This unit's version of execute is to evaluate the conditional and either set the PC or not.

Each instruction would consist only of a source and destination addresses. Each address is either a register, or a Function Unit. Even with RISC processors there is a somewhat complicated mapping between machine code and the logic that carries out the instruction. With a MOVE processor, the instruction is about as close as one can sanely get to what's hardwired in the core. (See OISC for what happens when you ignore the sanity check.) Also, the instructions themselves should be quite short compared to other designs, even allowing for future developments. Considering the few special cases where an instruction isn't really an instruction (like inline data) it is intuitive that 32-bit instruction size would be overkill. Perhaps it would be well suited to VLIW, even allowing non-byte-aligned instruction sizes.

This may seem to put a heavy burden on compiler design. But it also frees up some things. For example, graph coloring to allocate registers may benefit from the use of Function Units as registers temporarily, eliminating several moves to achieve the same result. Additional units in the core may make context-switching more costly though, having roughly double the amount of state to move to and from the stack. Also, it would be difficult to leave room for growth within the instruction set itself. An upper-limit on the number of Function Units/Registers would be pretty much fixed from the first incarnation of such a processor. There must be other valid reasons why this design philosophy never took off... I would think that having a compiler make presentable, well-behaved VLIW output would eliminate the need for much of the out-of-order execution stuff in current designs, making room for the extra Function Units. I'd love to see what one of these could do with 64 mini-ALUs working in parallel :)

See also: CISC, RISC, DSP, Transmeta, OISC, machine code, VLIW, and compiler theory.

(1) His name escapes me, please /msg me if you know where I can find the reference, I believe he was a professor at Berkeley at the time.

Note: The majority of this writeup is from memory, what's left I reasoned out on-the-fly. If you notice an error, would like to see this simplified/shortened, or prefer a better explanation of something, let me know!

OISC	Transport-triggered architecture	Every tool is a weapon if you hold it right	compiler theory
Peephole optimization	parallel processing	Operation-triggered architecture	Arithmetic Logic Unit
RISC	function unit	CISC vs RISC	shift left (or right) logical
VLIW	DSP	Pentium 4	refactoring
Finite state machine	superscalar	move	NHS
Marker System (A drinker's guide to avoiding Coyote Arm)	perq	F00F	microcode

Sample MOVE instruction

Refining

Caveats