Geri's platform

Combuster · **Posted:** Thu Aug 01, 2013 6:29 pm

In other words, self-modifying code, and therefore encoding-specific implementation details, are mandatory for even the simplest things such as array indexing.

Have fun doing that with W^X.
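To make the array-indexing point concrete, here is a minimal sketch in C of an indexed load on a subleq machine. It assumes the usual three-word encoding `a, b, c` (subtract `mem[a]` from `mem[b]`, jump to `c` if the result is not positive, otherwise fall through to the next instruction). The memory layout, the "negative target means halt" convention, and the names `subleq_run` and `indexed_load` are all hypothetical, chosen only for illustration. The program has to overwrite the operand of its own load instruction (cell 12) to reach `arr[idx]`.

```c
#include <stdint.h>

/* Run until the program counter leaves memory. A negative jump target is
   used as "halt" here; that convention is an assumption. Operand addresses
   are not bounds-checked. */
void subleq_run(int64_t *mem, int64_t size) {
    int64_t pc = 0;
    while (pc >= 0 && pc + 2 < size) {
        int64_t a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        mem[b] -= mem[a];
        pc = (mem[b] <= 0) ? c : pc + 3;
    }
}

/* Returns arr[idx] for arr = {10, 20, 30, 40}, computed by the subleq
   program itself. Cell 12 (called P below) is the `a` operand of the load
   instruction and is overwritten at run time with the address 25 + idx. */
int64_t indexed_load(int64_t idx) {
    int64_t mem[29] = {
        23, 12,  3,    /*  0: P   -= NEG_BASE  -> P = 25           */
        22, 21,  6,    /*  3: Z   -= IDX       -> Z = -idx         */
        21, 12,  9,    /*  6: P   -= Z         -> P = 25 + idx     */
        21, 21, 12,    /*  9: Z   -= Z         -> Z = 0            */
         0, 21, 15,    /* 12: Z   -= mem[P]    -> Z = -arr[idx]    */
        21, 24, 18,    /* 15: DST -= Z         -> DST = arr[idx]   */
        21, 21, -1,    /* 18: Z   -= Z, then jump to -1 (halt)     */
         0,            /* 21: Z, scratch cell                      */
         0,            /* 22: IDX, filled in below                 */
       -25,            /* 23: NEG_BASE, minus the array's address  */
         0,            /* 24: DST                                  */
        10, 20, 30, 40 /* 25..28: the array                        */
    };
    mem[22] = idx;
    subleq_run(mem, 29);
    return mem[24];
}
```

Every jump target is simply the next instruction, so the branch outcome never matters in this sketch; the only "addressing mode" is the patched operand word, which is why such code cannot live in write-protected pages.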

zeitue · **Posted:** Thu Aug 01, 2013 7:00 pm

Geri wrote:

zeitue wrote:

You claim your CPU is easy to emulate; what method do you plan to use to emulate it (dynamic translation, virtualization, JIT, or some kind of mix of methods)?

since subleq does not have multiple opcodes, there is no need for dynamic translation or jit, since the code can be executed directly on the cpu/gpu using the cpu's/gpu's own instruction set. maybe 2-5 clocks per op is possible. i am not sure yet, since i have not tried it yet. basically, executing a subleq instruction is:

mem[b]-=mem[a];
if(mem[b]<=0) eip=c;

(there is no need to dynamically compile x86 code from this, since it's already native)
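As a rough illustration of how small such an emulator core can be, here is a sketch in C of one subleq step. It assumes each instruction is three consecutive words `a, b, c` and that execution falls through to the next instruction when the branch is not taken; the quoted two-liner leaves that part implicit. The names `subleq_step` and `subleq_add` and the memory layout are hypothetical.

```c
#include <stdint.h>

/* Execute the instruction at pc and return the next program counter. */
int64_t subleq_step(int64_t *mem, int64_t pc) {
    int64_t a = mem[pc];
    int64_t b = mem[pc + 1];
    int64_t c = mem[pc + 2];
    mem[b] -= mem[a];
    return (mem[b] <= 0) ? c : pc + 3;
}

/* Adds x to y using three subleq instructions and a zeroed scratch cell Z.
   Layout: cells 0..8 hold code, 9 = Z, 10 = X, 11 = Y. */
int64_t subleq_add(int64_t x, int64_t y) {
    int64_t mem[12] = {
        10,  9, 3,   /* Z -= X  -> Z = -x     */
         9, 11, 6,   /* Y -= Z  -> Y = y + x  */
         9,  9, 9,   /* Z -= Z  -> Z = 0      */
         0,  x, y
    };
    int64_t pc = 0;
    while (pc < 9)   /* stop once control leaves the code area */
        pc = subleq_step(mem, pc);
    return mem[11];
}
```

Each jump target here is just the following instruction, so the result does not depend on whether a branch is taken. Even a plain addition costs three instructions, which gives a feel for the per-operation overhead discussed in this thread.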

How can the X86 code run natively on your platform when your platform is a different architecture altogether?
The X86 is a CISC architecture and you said yours was URISC; do you plan a compatibility layer, or is there something special I'm missing?

Geri wrote:

zeitue wrote:

Would this be usable in a bytecode virtual machine as the CPU of a process virtual machine?

this will be usable in any kind of virtual machine or emulator, including a full system emulator, or as a user-mode-only bytecode that runs inside, for example, a game, to process your scripts.

also, the compiler will be able to create a non-os binary that does not require an operating system under it. the operating system's kernel will be compiled with this, and it's able to compile itself too, so it will hopefully be self-hosting.

i will release the specifications of the platform and the virtual machine required to run it.

Do you mean that the code will be dynamically expanding and changing?
If you manage this you could possibly produce features of an A.I. in your platform.

Geri wrote:

zeitue wrote:

How does this architecture perform as far as speed and power in comparison to the X86, ARM, PowerPC, ....?

-since this architecture has large "instructions", and sometimes requires a lot of operations, it's approx 40-95% slower than a modern x86, arm, mips, or powerpc core at the same clock speed (depends on the algo). however, the alu and fpu of a modern 64 bit x86 or arm system require around 100-200 million transistors per core, while the alu of this architecture requires only a few hundred transistors.

-when counting 1 billion transistors for cache, we have room for ~5000 smp-capable cores with current manufacturing technologies.

-when emulating, it's much faster than emulating arm, x86 or mips.

-when we target extremely low power consumption, such as mobiles or tetris games, any kind of extreme-low-end chinese cpu manufacturer can build it for the same price as their current fixed-function processors, at the same speed. also, it's easy to implement in hardware.

Aren't large instructions a feature of the CISC architecture?
So it's faster when running on platforms like ARM, MIPS, or X86? Or just in general?
There are a lot of Chinese MIPS clones out there and they are usually used in $99 Laptops

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

Combuster wrote:

Geri wrote:

mem[b]-=mem[a];
if(mem[b]<=0) eip=c;

In that formulation, it doesn't even look turing-complete. You can't access memory unit x if the number isn't hardcoded into the machine.

maybe you have already heard of "software", haven't you?

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

zeitue:

part1: no, you misunderstood my answer.
part2: i can't give more precise details, since i don't know them either. when i see the first performance results, i will inform the community about it. for now, the speed seems okay.

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

okay, so i have partially implemented the stack and the function calls. however, this does not yet translate to subleq; it translates to an assembly-like language, which can later be compiled easily to subleq by a postprocessor. the c compiler will have 20-30 passes in total, which will not make it a fast c compiler, but since this is the most important - and probably the only usable - part of the project, this is what must be done as the primary goal; the others are just an encore.

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

i have defined the passes well. the compilation will have 27 passes, and also, i have added c++ compatibility.

so let's see what the compiler will support, according to the current code rudiments:

data types:

void: nothing

int: 64 bit signed integer (full performance)
unsigned int: 64 bit unsigned integer (full performance)
signed int: 64 bit signed integer (full performance)

long int: 64 bit signed integer (full performance)
unsigned long int: 64 bit unsigned integer (full performance)
signed long int: 64 bit signed integer (full performance)

long long int: 64 bit signed integer (full performance)
unsigned long long int: 64 bit unsigned integer (full performance)
signed long long int: 64 bit signed integer (full performance)

long long: 64 bit signed integer (full performance)
unsigned long long: 64 bit unsigned integer (full performance)
signed long long: 64 bit signed integer (full performance)

long: 64 bit signed integer (full performance)
unsigned long: 64 bit unsigned integer (full performance)
signed long: 64 bit signed integer (full performance)

short int: 16 bit signed integer (~60% performance drop, for compatibility purposes only)
short unsigned int: 16 bit unsigned integer (~60% performance drop, for compatibility purposes only)
short signed int: 16 bit signed integer (~60% performance drop, for compatibility purposes only)

char: 8 bit signed integer (~60% performance drop, for compatibility purposes only)
unsigned char: 8 bit unsigned integer (~60% performance drop, for compatibility purposes only)
signed char: 8 bit signed integer (~60% performance drop, for compatibility purposes only)

int32_t: 32 bit signed integer (~60% performance drop, for compatibility purposes only)
uint32_t: 32 bit unsigned integer (~60% performance drop, for compatibility purposes only)

float: 64 bit fixed point real number (full performance in almost all cases)
double: 64 bit fixed point real number (full performance in almost all cases)
long double: 64 bit fixed point real number (full performance in almost all cases)

float_srsly: 32 bit floating point number (brutally slow (~2000 clocks). only for compatibility reasons)
double_srsly: 64 bit floating point number (brutally slow (~2000 clocks). only for compatibility reasons)
long double_srsly: 80 bit floating point number (brutally slow (~2000 clocks). only for compatibility reasons)
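The list maps `float` and `double` to a 64 bit fixed point number but does not say where the binary point sits. A minimal sketch in C, assuming a 32.32 split (32 integer bits, 32 fraction bits), shows why such a type runs at close to integer speed. The type name `fix64` and the helper names are hypothetical, and `__int128` is a GCC/Clang extension used here only to hold the wide intermediate product.

```c
#include <stdint.h>

#define FIX_FRAC_BITS 32          /* assumed position of the binary point */
#define FIX_ONE 4294967296.0      /* 2^32, the scale factor as a double   */

typedef int64_t fix64;

fix64 fix_from_int(int64_t n)   { return n * ((int64_t)1 << FIX_FRAC_BITS); }
fix64 fix_from_double(double d) { return (fix64)(d * FIX_ONE); }
double fix_to_double(fix64 f)   { return (double)f / FIX_ONE; }

/* Addition and subtraction are plain integer operations. */
fix64 fix_add(fix64 a, fix64 b) { return a + b; }
fix64 fix_sub(fix64 a, fix64 b) { return a - b; }

/* Multiplication needs a 128-bit intermediate, then a shift back down.
   The right shift of a negative value is arithmetic on GCC and Clang. */
fix64 fix_mul(fix64 a, fix64 b) {
    return (fix64)(((__int128)a * b) >> FIX_FRAC_BITS);
}
```

With this split, 1.5 is stored as 6442450944 and adding two values is a single integer addition; only multiplication and division need extra work, which may be what "full performance in almost all cases" refers to.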

#define - works of course
#ifdef - #endif
arrays - up to 12 dimensions.
pointers to pointers - up to 12 "dimensions"
functions, function calls, stack is implemented
static variables - just converted to global variables
const - ignored
volatile - ignored
register - ignored
inline - ignored for now.
class - wrapped on structs
struct - wrapped to arrays
namespace - works
continue - works
break - works
using namespace - will result in a compilation error.
switch - case -> converted to if
goto -> i don't know if i will support this or not. if i do, it will only work in a very limited way
if -> supported
for -> converted to while
do loop -> converted to while
while - supported
sizeof - supported
virtual - not supported. a compilation error will be emitted.
public - ignored
private - ignored
protected - ignored
* / + - % | || & && ( ) , -
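The "converted to" entries in the list above describe source-level lowering. As a sketch of what that could look like (the function names are hypothetical, and this is not taken from the actual compiler), here is a `for` loop rewritten as a `while` and a `switch` rewritten as an `if` chain, in C. The one subtle point is `continue`: in the `while` form the increment has to run before jumping back to the condition.

```c
/* Original form: a for loop that skips odd numbers with continue. */
int sum_even_for(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (i % 2 != 0) continue;
        sum += i;
    }
    return sum;
}

/* Lowered form: the same loop as a while. The increment is duplicated in
   front of continue so the counter still advances on skipped iterations. */
int sum_even_while(int n) {
    int sum = 0;
    int i = 0;
    while (i < n) {
        if (i % 2 != 0) { i++; continue; }
        sum += i;
        i++;
    }
    return sum;
}

/* Original form: switch with a default case. */
int classify_switch(int x) {
    switch (x) {
        case 0:  return 10;
        case 1:  return 20;
        default: return -1;
    }
}

/* Lowered form: the same decision as an if chain. */
int classify_if(int x) {
    if (x == 0) return 10;
    else if (x == 1) return 20;
    else return -1;
}
```

A `switch` with fall-through between cases needs more care than this simple chain, since control can enter one case and continue into the next.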

implemented fully or partly as os-call, or directly compiled in as a runtime library:
malloc - implemented through OS (stdlib.h)
free - implemented through OS (stdlib.h)
new - constructor call + wrapped on malloc (stdlib.h)
delete - destructor call + wrapped on free (stdlib.h)
fputc - through os (stdlib.h)
printf - nowhere to print - therefore, compilation error. (stdlib.h)
snprintf - implemented internally in compiler for compatibility purposes, however, such syntax is not supported (stdlib.h)
sprintf - implemented internally in compiler for compatibility purposes, however, such syntax is not supported (stdlib.h)
fputs - wrapped on fputc (stdlib.h)
fprintf - wrapped on fputc (stdlib.h), however, such syntax is not supported
fgetc - through os (stdlib.h)
fgets - wrapped on fgetc (stdlib.h)
isnan - not implemented, compilation error
isinf - not implemented, compilation error
isnum - implemented (stdlib.h)
atoi - implemented (math.h)
atof - implemented (math.h)
atol - wrapped on atoi (math.h)
atod - wrapped on atof (math.h)
cos - through internal compiled-in library (math.h)
cosf, fcos - low precision version through internal compiled-in library (math.h)
sin - through internal compiled-in library (math.h)
sinf, fsin - low precision version through internal compiled-in library (math.h)
tan - tangent, internal compiled-in library (math.h)
tanf - faster tangent, internal compiled-in library (math.h)
atan, atan2 - through internal compiled-in library (math.h)
atanf, atanf2 - low precision version through internal compiled-in library (math.h)
sqrtf - low precision sqrt through internal compiled-in library (math.h)
sqrt - square root (math.h)
expf - low precision version through internal compiled-in library (math.h)
exp - through internal compiled-in library (math.h)
strcat - through internal compiled-in library (stdio.h)
strlen - through internal compiled-in library (stdio.h)
strcpy - through internal compiled-in library (stdio.h)
memcpy - through internal compiled-in library (stdio.h)
strstr - through internal compiled-in library (stdio.h)
strcmp - through internal compiled-in library (stdio.h)
div - integer division through internal compiled-in library (math.h)
fdiv - float division through internal compiled-in library (math.h)
fdivf - low precision version through internal compiled-in library (math.h)
mem.h - ignored
time.h - ignored
createthread - through os-call (???.h)
createprocess - through os-call (???.h)
loadlibrary - through os-call (???.h)
getprocaddress - through os-call (???.h)
closelibrary - through os-call (???.h)
killthread - through os-call (???.h)
get_permission_for - privileges and permission handling through os (???.h)
get_permission_ask - privileges and permission handling through os (???.h)
get_realmode_permissions - privileges and permission handling through os (???.h)
handle_realmode - driver operations (???.h)

default stack size: 128 kbyte. can be adjusted at compilation. only use a larger one if necessary, since it can limit thread creation. for large arrays, it's recommended to use malloc/free instead. maybe it will be raised to 256 kbyte.

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

i am writing a layered compiler; it's possible to make it multithreaded later, if required. now i am about 40% done with the compiler. it's only ~3000 lines; it will be approx ~8000 lines in total.

AndrewAPrice · **Posted:** Tue Oct 08, 2013 1:22 pm

Combuster wrote:

Geri wrote:

mem[b]-=mem[a];
if(mem[b]<=0) eip=c;

In that formulation, it doesn't even look turing-complete. You can't access memory unit x if the number isn't hardcoded into the machine.

You could do a memory-triggered MMU for that.
E.g. write the address into 0xFFFF FFF8.
Read from or write to 0xFFFF FFF0 to access what is at that address.
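A small sketch in C of how an emulator could model that memory-triggered indirection. The two magic addresses come from the post; the `Bus` struct, the function names, and the 64-word RAM size are assumptions made only to keep the example self-contained.

```c
#include <stdint.h>

#define RAM_WORDS 64u
#define ADDR_REG  0xFFFFFFF8u   /* write the target address here        */
#define DATA_WIN  0xFFFFFFF0u   /* read/write here to reach the target  */

typedef struct {
    int64_t  ram[RAM_WORDS];
    uint32_t latched;           /* last address written to ADDR_REG */
} Bus;

/* Reads go through the window when the magic address is used. */
int64_t bus_read(const Bus *bus, uint32_t addr) {
    if (addr == DATA_WIN) return bus->ram[bus->latched % RAM_WORDS];
    if (addr == ADDR_REG) return bus->latched;
    return bus->ram[addr % RAM_WORDS];
}

/* Writes either latch a new target address or hit RAM (directly or via the window). */
void bus_write(Bus *bus, uint32_t addr, int64_t value) {
    if (addr == ADDR_REG)      bus->latched = (uint32_t)value;
    else if (addr == DATA_WIN) bus->ram[bus->latched % RAM_WORDS] = value;
    else                       bus->ram[addr % RAM_WORDS] = value;
}
```

With this in place a program whose operands are all hardcoded can still reach a computed address: it stores the address at `ADDR_REG` and then uses `DATA_WIN` as an ordinary operand, with no self-modifying code.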

I once thought about drawing up plans for a "move machine."

The only instruction is MOV [dst], [src]. The program counter is memory-mapped so you can branch by writing into it. The ALU is also memory mapped.

To perform an addition, you can write to MMU_A, MMU_B, then read from MMU_ADD_RESULT.

To perform a conditional branch, you can write a boolean to MMU_COND, the true branch address to MMU_A, the false branch address to MMU_B, then copy MMU_COND_RESULT to PC.

Such a processor would be very trivial to implement.
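As a rough check of how little logic that needs, here is a sketch in C of the move machine's data path. The cell names follow the post; the numeric addresses, the `MoveMachine` struct, and the function names are hypothetical.

```c
#include <stdint.h>

enum {                  /* hypothetical address assignments */
    PC = 0, MMU_A = 1, MMU_B = 2, MMU_ADD_RESULT = 3,
    MMU_COND = 4, MMU_COND_RESULT = 5, MEM_WORDS = 32
};

typedef struct { int64_t cell[MEM_WORDS]; } MoveMachine;

/* Reading a result cell computes it from the operand cells. */
int64_t mm_load(const MoveMachine *m, int addr) {
    switch (addr) {
    case MMU_ADD_RESULT:  return m->cell[MMU_A] + m->cell[MMU_B];
    case MMU_COND_RESULT: return m->cell[MMU_COND] ? m->cell[MMU_A]
                                                   : m->cell[MMU_B];
    default:              return m->cell[addr];
    }
}

/* The single instruction: MOV [dst], [src]. */
void mm_mov(MoveMachine *m, int dst, int src) {
    m->cell[dst] = mm_load(m, src);
}
```

An addition of the values in cells 10 and 11 is then `mm_mov(&m, MMU_A, 10); mm_mov(&m, MMU_B, 11); mm_mov(&m, 12, MMU_ADD_RESULT);`, and a conditional branch ends with `mm_mov(&m, PC, MMU_COND_RESULT);`. Since `MMU_A` and `MMU_B` are shared between the adder and the branch selector, the order of moves matters.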

I would also add an MMU that allows me to map a window that I can slide up/down, so I can perform recursive calls.

AndrewAPrice · **Posted:** Tue Oct 08, 2013 2:02 pm

MessiahAndrw wrote:

I once thought about drawing up plans for a "move machine."

After researching into this more, this is apparently how many microprocessors work at the microcode level (the code that interprets opcodes).

Let's say we invent a single-operation basic processor that has 16 ports (addressed as 0-15) - each port is either mapped to a register, or an IO unit (your ALU, your RAM chip, etc).

Port 0: Program Counter
Port 1-9: 9 general purpose registers

Port 10: ALU: Operand 1
Port 11: ALU: Operand 2
Port 12: ALU: Operation to perform
Port 13: ALU: Result

Port 14: MMU: RAM address we want to access.
Port 15: MMU: Read/write what is at the RAM address in port 14.

In this example processor, a port address takes up 4 bits. A src, dst instruction can be encoded in 8-bits.

That's 240 possible combinations (the 16 encodings with the same src and dst are NOPs, as copying a port onto itself would do nothing). You could hardcode 240 paths between the 16 ports; as a result, every instruction would take exactly 1 cycle to execute.
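The encoding and the number of useful moves can be checked with a few lines of C. Putting the source port in the high nibble is an assumption (the post does not fix the bit layout), and the function names are hypothetical.

```c
#include <stdint.h>

#define PORT_COUNT 16

/* Pack a move into one byte: src in the high nibble, dst in the low nibble. */
uint8_t encode_move(unsigned src, unsigned dst) {
    return (uint8_t)(((src & 0xFu) << 4) | (dst & 0xFu));
}

unsigned move_src(uint8_t insn) { return (unsigned)insn >> 4; }
unsigned move_dst(uint8_t insn) { return (unsigned)insn & 0xFu; }

/* Count the encodings whose source and destination ports differ. */
int count_distinct_moves(void) {
    int n = 0;
    for (unsigned s = 0; s < PORT_COUNT; s++)
        for (unsigned d = 0; d < PORT_COUNT; d++)
            if (s != d) n++;
    return n;
}
```

`count_distinct_moves()` returns 240: 16 x 16 = 256 encodings, minus the 16 that name the same port twice.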

So, our instructions may be 8-bits each, but we can still use 16-bit/32-bit/64-bit words, which we can write into port 14, which gives us a full 16/32/64-bit address range we can write to/read from.

With our MMU working with 64-bits at a time, we can read in 8 instructions in a single memory read. Asking our compiler to memory align our code blocks/branches at 8 bytes, we could forego caching and have a very fast processor.
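A sketch in C of the two pieces this implies: splitting one 64-bit fetch into eight instruction bytes, and rounding a branch target up to an 8-byte boundary. The byte order (lowest byte executes first) and the function names are assumptions.

```c
#include <stdint.h>

/* Split one 64-bit fetch into 8 instruction bytes, lowest byte first. */
void unpack_fetch(uint64_t word, uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(word >> (8 * i));
}

/* Round a code address up to the next multiple of 8, so that a branch
   target always starts a fresh fetch. */
uint64_t align_up_8(uint64_t addr) {
    return (addr + 7u) & ~(uint64_t)7u;
}
```

Aligning branch targets this way means a jump never lands in the middle of a fetched word, at the cost of padding bytes in front of each target.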

AndrewAPrice · **Posted:** Tue Oct 08, 2013 4:25 pm

I would love to invent the above processor in Logisim or something, but the tool is quite limited (I'd love to tie it to a graphical output beyond the simple dot matrix display.)

However, that would make me want to start a new hobby project to build a better digital circuit simulator, that would allow me to run the digital circuit in 'high speed' mode without any of the fancy graphics, just letting me pass input/send output, while still allowing me to watch the wires on the circuit light up if I wanted to, plug in the ROM component to a file, etc.

Then comes the question, what platform should I build this for? Java, C++, .Net, web? My OS, of course! Now I have motivation to get my OS working, so I can develop this great simulator that people will want to use my OS for.

qw · **Joined:** Mon Jan 26, 2009 2:48 am **Posts:** 792

Nice idea, MessiahAndrw. Don't you need a flags register?

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

MessiahAndrw: if i can get this thing working, i will give you more details on why i have decided on precisely this current design. it would be nice to run it on a circuit simulator, and it's simple enough to keep things clear.

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

MessiahAndrw wrote:

After researching into this more, this is apparently how many microprocessors work at the microcode level (the code that interprets opcodes).

Let's say we invent a single-operation basic processor that has 16 ports (addressed as 0-15) - each port is either mapped to a register, or an IO unit (your ALU, your RAM chip, etc).

Port 0: Program Counter
Port 1-9: 9 general purpose registers

Port 10: ALU: Operand 1
Port 11: ALU: Operand 2
Port 12: ALU: Operation to perform
Port 13: ALU: Result

Port 14: MMU: RAM address we want to access.
Port 15: MMU: Read/write what is at the RAM address in port 14.

In this example processor, a port address takes up 4 bits. A src, dst instruction can be encoded in 8-bits.

That's 240 possible combinations (the 16 encodings with the same src and dst are NOPs, as copying a port onto itself would do nothing). You could hardcode 240 paths between the 16 ports; as a result, every instruction would take exactly 1 cycle to execute.

So, our instructions may be 8-bits each, but we can still use 16-bit/32-bit/64-bit words, which we can write into port 14, which gives us a full 16/32/64-bit address range we can write to/read from.

With our MMU working with 64-bits at a time, we can read in 8 instructions in a single memory read. Asking our compiler to memory align our code blocks/branches at 8 bytes, we could forego caching and have a very fast processor.

So if you prep your CPU for doing an ALU operation, what triggers it to write the result in the result port? You probably want to pipeline all of this, and that means that the result will just be there for a cycle or so until the next operation comes along. You don't really need to trigger your ALU operations, but in that case you have an ISA which is very dependent on timing, and you probably also want to do several moves each cycle, but that is perhaps what you had in mind.

Why not buy an FPGA board and try to realize this approach?

AndrewAPrice · **Posted:** Wed Oct 09, 2013 9:04 am

Hobbes wrote:

Nice idea, MessiahAndrw. Don't you need a flags register?

Possibly, I did not consider that. You could get creative with the values in those 4 ALU registers.

AndrewAPrice · **Posted:** Wed Oct 09, 2013 9:19 am

OSwhatever wrote:

So if you prep your CPU for doing an ALU operation, what triggers it to write the result in the result port? You probably want to pipeline all of this, and that means that the result will just be there for a cycle or so until the next operation comes along. You don't really need to trigger your ALU operations, but in that case you have an ISA which is very dependent on timing, and you probably also want to do several moves each cycle, but that is perhaps what you had in mind.

It is not perfect - there are many flaws in this design that need to be worked out. Most of the time I just throw ideas out there.

One problem I noticed with a transport triggered architecture like that is that you would easily break compatibility: add an extra port for IO, you lose a general register, and all compiled code breaks. However, if you're like me and believe the future of general purpose computers lies in JIT systems (JVM, CLR, JS) then this would have very little relevance.

OSDev.org
