In a pure asynchronous system there's no thread switching: a "thread" is just a function call with no state of its own to preserve, so the only context switching left is switching between processes. This is effectively the same as the worker threads idea (where a "control thread" hands tasks to whichever thread is idle), just enforced at the system level.
I don't think it'd be possible to build a useful system like this if the function calls have no state of their own. For a simple example, imagine implementing something that receives "increment counter value" messages and "get counter value" messages - the counter has to be somewhere.
Erlang's solution to this is "lightweight processes", which aren't really processes in the usual sense (especially not from a traditional OS's point of view) but more like a convenient language construct for capturing actor state (e.g. the "counter that has to be somewhere" in the simple example above). In that case a "lightweight process ID" is more like an object reference in OOP (e.g. "this" in C++), where the process's state is the object's member variables.
The only real difference (beyond terminology) between something like Erlang's lightweight processes and (e.g.) C++ objects is the semantics of communication between processes/actors/objects: putting some kind of "structure containing a function pointer plus its args" onto the receiving process/actor/object's queue for later processing, versus direct method calls.
It's this "difference in semantics of communication between processes/actors/objects" that makes it possible (e.g. for Erlang) to support SMP and distributed systems without changing any of the code (and impossible to support SMP and distributed systems in C++ without significantly changing almost everything, deciding it's easier to rewrite from scratch, then quitting your job because it's too hard).
Hmmm, I suppose the real problem here is actually kernel/userspace switching (which, if I recall correctly, is quite expensive on x86). I guess the operating system could provide its own "main loop" running in the userspace part of the process, leaving switching only for syscalls. The part about buffering syscalls would be useful here too: software wouldn't expect syscalls to complete immediately anyway, due to the async nature, so buffering them would allow multiple syscalls to be queued up and sent all at once (reducing the switching).
Depending on how it's done, kernel/userspace switching on 80x86 is typically measured in the "tens of cycles" range. All the stuff on top of the raw kernel/userspace switch (e.g. making a C compiler's ABI happy, calling the right function for the "kernel API function number", etc.) tends to (very roughly) double this (or worse if you go with "one kernel stack per CPU", or account for "CPU time spent in user-space" and "CPU time spent in kernel space" separately, or ...).
The real problem with "no threads" is anything that involves lengthy processing. You send a message to something asking it to find the first 4 billion digits of PI, and the entire OS grinds to a halt, because nothing else can respond to messages until the CPU finishes finding those digits of PI.