Here's a rule of thumb: for hardware CPU implementations, the most efficient design uses simple instructions that operate exclusively on explicit registers, with memory touched only by loads, stores, and instruction fetches. In a pseudo-machine (such as the JVM), the opposite is true - complex operations with no
registers (e.g., a stack machine) work best.
There are exceptions to both, but OISCs aren't among them. The main one right now is, of course, x86, where the complex instructions get broken down, re-merged, re-ordered, and who knows what else by a massively complex instruction pipelining and interpretation mechanism. I am still not convinced that a RISC cannot do better, but right now none of them do, because no one has any reason
to make them better.
Realistically, reconciling either of them into a single model for both native and interpreted code is probably impossible, even for the embodiment of the abstract principle of The Reconciliation of Opposites fnord (don't ask, I don't understand it and never wanted to be it - thanks for the psychological booby prize, Eris!). Reconciling either of them with an OISC is even more daunting.
As an aside, I favor a completely different model from either, one I came up with almost 20 years ago and called RSVP, though I am not sure whether a hardware implementation would be feasible. Permit me to post the description:
Me, Myself, and I wrote:
The Reduced State Vector Processor (RSVP)
The current emphasis in CPU design on fast instruction performance is, in the opinion of this author, misplaced. The primary overhead in current software is in I/O and interrupt latency, not instruction execution; most systems today are bound to a slow turnaround in context switching, and are usually idle while waiting for user events. Further, most of the activity in modern systems is bound to processes that involve large vector and matrix calculation, particularly those needed for audio and video processing. While most such work is usually offloaded to DSPs and other co-processors, some still must be handled by the CPU, and the CPU ought to be designed to handle such calculations efficiently.
To these ends, I propose a CPU design meant to minimize processor state, maximize parallel processing efficiency and enhance vector calculations. I call this the Reduced State Vector Processor (RSVP) concept.
The basic concept is to eliminate the majority of the registers in the CPU, replacing them with large (>1M) write-through caches. The processor would have separate caches for code, data and stack, and would have at least two separate caches for each of these functions. The intention is that each process would be allocated a separate cache set, and that in the case of a cache miss, the CPU would automatically trigger a task switch to a process which is in another cache set while the missed cache is updated in background.
All operations would be direct-to-memory, with the ALU operating directly on the caches; in effect, the caches would act as an extremely large register set. This allows the CPU to be reduced to a set of four state registers: Instruction Pointer, Stack Pointer, Frame Pointer and Cache Pointer. This minimal state, combined with the fact that all operations are cached, allows for a single-cycle, single-instruction context switch.
As said earlier, the actual calculations are performed directly upon the caches, so the ALU has to be able to operate on any section of the cache as needed. Taking this idea further, it can be seen that, logically, the ALU should be bound to the cache it works on, not the CPU, leaving the CPU to perform only the control functions. ALU operations that require more than one cycle should trigger a context switch, just as a cache miss would, allowing the ALU to operate independently. Further, since it can operate on arbitrary sections of the cache, it should be possible for it to operate on multiple operands, or operands of arbitrary size. This allows for efficient matrix and VLW operations as a natural extension of the ALU design. It should be able, in principle, to operate upon the entire addressable memory; the ALU would call cache refills and continue operating completely independently of the CPU itself.
The final logical step of this design is to add multiple CPUs. Since there are already redundant caches, and the CPU structure is so minimal, it should be possible to have several of these simplified CPUs on a single chip, switching between a large collection of semi-autonomous cache/ALU sets as needed. A reasonable plan, given current chip densities, is a 4 CPU, 8 cache-set design, which would have a total of 24M of cache memory on-chip (8 cache sets, each with 1M caches for code, data and stack). While this is a very large amount of memory, it is well within current densities, and the simplified structure of the processor overall eliminates most of the processor hardware.
My ideas have evolved quite a bit since then, as I was wearing a lemon juice mask on the topic of real-world hardware design at the time (and probably still am).
Despite this, the basic idea - that throughput and task switching are bigger bottlenecks today than sheer processing speed - still seems valid to me. I could, however, be wrong - in fact, I probably am. I know
for a fact that this design presents critical, possibly show-stopping, problems in process management and in how the hardware recognizes tasks.
In any case, none of this really matters, because who the hell is going to spend $5 billion on the silicon wafer development and design work for it? It's the same problem that Geri has, but at least I recognize that it is
a problem. He seems to think that a 'simpler' ISA means a simpler, and hence less expensive, development program, which is absolutely false in both particulars: it would still require a great deal of silicon to implement, and the baked-in costs of new chip development approach or even outweigh the cost of any particular design anyway - $2 billion is about the opening cost for any
new CPU regardless of complexity, because of how ICs are made. Developing silicon for an OISC - any
OISC - as a hardware CPU, regardless of performance, would cost as much as developing a new x86 CPU microarchitecture, if not more - at least Intel and AMD are on familiar ground with that piece of trash.