OSDev.org

The Place to Start for Operating System Developers
 Post subject: Custom Programming Language
PostPosted: Fri Nov 28, 2014 8:55 am 

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Off and on, I have spent the last few years looking for a C compiler which was:

a.) Was 64-bit
b.) Ran under Windows
c.) Produced ELF files.

If there had been a 64-bit version, Open Watcom might have filled the bill. They have now got around to writing a 64-bit version of their compiler, but it is anybody's guess when it will appear.

Anyway, to cut a long story short, I have finally bitten the bullet and decided to write my own compiler. I have also decided to come up with my own language whilst I was about it, and one very specifically intended for osdev. I have a specification for it, but before I get too far advanced with a compiler, I thought I might get some other people's comments, if anyone is interested in making any. The specification can be found here:

http://www.leslie-dellow.co.uk/aspl/ASPL_Language_Specification.pdf

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 9:30 am

Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
Quote:
There is no equivalent to C’s ! operator, because De Morgan’s Law seems to me to make it more or less redundant, and there is no point putting in things which seldom, or in my case never, get used. One other difference is that a typecast looks like:

What? You're seriously saying that nobody needs a logical not? :?

The syntax looks familiar enough that I guess I would be able to write programs in it without too much difficulty. I do think that important features are missing (to name just one: no elseif in a language that requires end if?), but you'll probably see what you need when you actually write some code in it.

One small inconsistency is that you have end if/while/repeat/for, but <struct-type> end. Perhaps you should make that end <struct-type> as well. It's also funny that while you use mostly the same constructs as Pascal except that you removed the "begin", you actually add a "begin" for switch (and a redundant "end repeat", too), where Pascal doesn't have one.

_________________
Developer of tyndur - community OS of Lowlevel (German)


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 10:02 am

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Kevin wrote:
Quote:
There is no equivalent to C’s ! operator, because De Morgan’s Law seems to me to make it more or less redundant, and there is no point putting in things which seldom, or in my case never, get used. One other difference is that a typecast looks like:

What? You're seriously saying that nobody needs a logical not? :?


I know I never use it.


Quote:
The syntax looks familiar enough that I guess I would be able to write programs in it without too much difficulty. I do think that important features are missing (to name just one: no elseif in a language that requires end if?), but you'll probably see what you need when you actually write some code in it.


The majority of procedural languages seem to be either C like or Pascal like. I have pinched some ideas from Ada. I didn't see the point in running the two words else if into the single word elseif.


Quote:
One small inconsistency is that you have end if/while/repeat/for, but <struct-type> end. Perhaps you should make that end <struct-type> as well. It's also funny that while you use mostly the same constructs as Pascal except that you removed the "begin", you actually add a "begin" for switch (and a redundant "end repeat", too), where Pascal doesn't have one.


I will probably fix the inconsistency you mention. The begin after switch isn't strictly necessary; I just preferred 'begin' as a delimiter, rather than 'case', and the end repeat is there to delimit the conditional expression. C uses parentheses.

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 10:10 am

Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
mathematician wrote:
The majority of procedural languages seem to be either C like or Pascal like. I have pinched some ideas from Ada. I didn't see the point in running the two words else if into the single word elseif.

Without it, your code will look like this:
Code:
if a = 42 then
    ...
else
    if b > 0 then
        ...
    else
        if c != a then
            ...
        end if;
    end if;
end if;

You could certainly write "else if" to hide that you're not formatting your source code like the syntax is defined, but you pile up lots of "end if"s at the end. The "end if" is the reason why Ada has an elsif in its syntax, whereas Pascal and C don't need it.

Quote:
the end repeat is there to delimit the conditional expression. C uses parentheses.

Isn't the semicolon good enough, like in Pascal?

_________________
Developer of tyndur - community OS of Lowlevel (German)


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 10:32 am

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Kevin wrote:
mathematician wrote:
The majority of procedural languages seem to be either C like or Pascal like. I have pinched some ideas from Ada. I didn't see the point in running the two words else if into the single word elseif.

Without it, your code will look like this:
Code:
if a = 42 then
    ...
else
    if b > 0 then
        ...
    else
        if c != a then
            ...
        end if;
    end if;
end if;

You could certainly write "else if" to hide that you're not formatting your source code like the syntax is defined, but you pile up lots of "end if"s at the end. The "end if" is the reason why Ada has an elsif in its syntax, whereas Pascal and C don't need it.


Quote:
the end repeat is there to delimit the conditional expression. C uses parentheses.



If I wrote it, it would look like:

Code:
if a = 42 then
......
else if b > 0
......
else if c != a
......
endif


Running the two words into one would just seem to be stylistic.


Quote:
Isn't the semicolon good enough, like in Pascal?


I am not sure how long the semicolon at the end of a line is going to survive. For the purposes of parsing the source, a newline would seem to be good enough. In fact I am not sure why I need to know when a new line occurs, except for the purposes of making sure there is a semicolon there. The tokens emerging from my lexer carry line numbers with them.

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 2:14 pm

Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
mathematician wrote:
If I wrote it, it would look like:

Code:
if a = 42 then
......
else if b > 0
......
else if c != a
......
endif


Running the two words into one would just seem to be stylistic.

Yes. Except that you changed more than just making two words out of one, and therefore your code wouldn't compile. The last line needs to be "endif endif endif" rather than just "endif" with your syntax.

_________________
Developer of tyndur - community OS of Lowlevel (German)


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 2:28 pm

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Kevin wrote:
mathematician wrote:
If I wrote it, it would look like:

Code:
if a = 42 then
......
else if b > 0
......
else if c != a
......
endif


Running the two words into one would just seem to be stylistic.

Yes. Except that you changed more than just making two words out of one, and therefore your code wouldn't compile. The last line needs to be "endif endif endif" rather than just "endif" with your syntax.


Since I am writing the compiler, it will compile if I say so. The only question is whether or not it is unambiguous, and, so far as I can see, it is.

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 7:07 pm

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1925
Location: Athens, GA, USA
If someone doesn't either get this discussion back on topic, or lock the thread, I'm gonna start discussing the advantages and disadvantages of hygienic macro systems as applied to assembly-level kernel development. At length. And no one wants that.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 8:31 pm

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Schol-R-LEA wrote:
If someone doesn't either get this discussion back on topic, or lock the thread, I gonna start discussing the advantages and disadvantages of hygienic macro systems as applied to assembly-level kernel development. At length. And no one wants that.


I'm game if you are. One possibility I am considering is to use the output of the front end as input for an interpreter, which can double as a debugger. The tokens which emerge from the scanner already have the beginnings of a byte code, because it is easier and quicker to compare numbers than it is to compare strings. Then the compiler could be written in its own language, with the result that, by the time it is finished, it would already have been used to compile one non trivial program.

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Fri Nov 28, 2014 11:18 pm

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

This is just random notes about the ASPL language specification.

If a character is always 4 bytes, then a character is UTF-32. If a string is always UTF-8, then doing something like "myString[2] = '9';" becomes a massive disaster. In my opinion it's better to only say "string literals are UTF-8" and "character constants are UTF-32" (and possibly also support UTF-32 string literals); and then don't provide any string or character data type at all and let the programmer build their own (using "typedef", arrays, whatever).

If the only other data types are byte, word, dword and qword; then how are people supposed to do floating point? Also don't forget that pointers and arrays are normally data types too; and that most languages support "user defined data types" (structures, unions, etc).

Also note that word, dword and qword are really bad names (they only make sense on 80x86, and any programmer familiar with other CPUs may mistake "word" for 32-bit or anything else). I'd recommend using u8, u16, u32, u64 instead. I'd also recommend getting rid of the "signed" type qualifier and having s8, s16, s32, s64 too. Finally, for bitfields, why not have "u3" for a 3-bit bitfield?

If assigning 9 to a 3-bit field is guaranteed to cause a compile time error, then how about "const byte foo := 9;" followed by "page.avail := foo;" or maybe "const byte foo := 3;" followed by "page.avail := foo*3;"?

If you do this for bitfields, do you also do it for other data types (e.g. "const word hexnum := 0x12345678;")? What about "word dayOfWeek range 1 to 7 := 9;"?

For integer ranges, why does the compiler need to be told the size? For example, instead of "word dayOfWeek range 1 to 7" why can't you just do "range 1 to 7 dayOfWeek" and let the compiler choose an underlying data type capable of storing values from 1 to 7?

For structures, can I do this:
Code:
struct myStructure align 9
    myStructure * next;
    word something;
myStructure end;



For bitsets (continued from the earlier "u3 for a 3-bit bitfield" suggestion), why can't I just do:
Code:
struct pgLevel1_t
    u1 present;
    u1 readWrite;
    ...
    u40 pageAddress;
    u11 avail;
    u1 noExecute;
pgLevel1_t end;


The increment and decrement operators are a mistake. Because they have side effects they become a pain in the neck. A simple example would be "index += myArray[ incr index]".
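For what it's worth, the same trap exists in C; a minimal sketch (variable names invented), where index is modified and read without any sequencing between the two side effects, so the behaviour is undefined:
Code:
#include <stdio.h>

int main(void)
{
    int myArray[4] = {10, 20, 30, 40};
    int index = 1;

    /* index is modified by the ++ and also updated by the +=, with no
       sequencing between the two side effects, so this is undefined
       behaviour in C; different compilers may legitimately give
       different results. */
    index += myArray[index++];

    printf("index = %d\n", index);
    return 0;
}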

How would De Morgan re-arrange something like 'if( !myBool1 or myBool2 )" to avoid the not? Is something like 'if( (myBool1 ^ true) or myBool2 )" easier to read? Given that your boolean operators are already words (and not symbols like "||"), is there any sane reason to refuse to support 'if( not myBool1 or myBool2 )"?

Where did all the relational operators go (==, <, <=, >, >=, !=)? Did you forget to put them in the "operator precedence" table, or is it impossible to do things like "if( myValue > 4)"?

If you only forgot the relational operators; what would "myBool := left < middle > right;" mean? Would it be the like "myBool := (left < middle) > right;" or would it be like "myBool := left (<middle>right);" where "<middle>" is a typecast, or would it depend on whether "middle" is a type or a variable?

For the "machine code" (assembly language and not machine code?); what does "movzx" look like? For example, would you do "movzxd eax,[esi]" and not know if it's a byte being extended or a word being extended? Does this apply to all instructions and not just "mov"?

What happens if I do "movd al,bl"? Does the assembler accept this (and if it does, which instruction would I actually get?), or does it give me an "incompatible operands" error. If it gives me an "incompatible operands" error then the assembler must've known the sizes of the operands; and if the assembler knows the sizes of the operands why does it need to be told the size to begin with?


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


Post subject: Re: Custom Programming Language
PostPosted: Sat Nov 29, 2014 8:55 am

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1925
Location: Athens, GA, USA
OK, I've got the first part of my monologue on macros ready. Let me begin by defining some terms:

A macro definition (or simply a macro) is a piece of code within
a program which describes a set of transformations to be performed on another
set of code prior to that code being translated (either compiled or
interpreted).

A macro application is an instance of a macro being used to alter a
section of code.

The macro expansion of a given macro application is the resulting
code that is actually translated and executed.

To give a simple example using the familiar C pre-processor, the following
macro definition,
Code:
#define square(x) x * x

is illustrative of a typical textual macro as seen in C, C++ and many
assembly languages. It defines the name of the macro, a parameter, and a simple
set of string changes to make to the argument. One could apply it to a piece of
code, in this case a numerical literal,
Code:
    a = square(42);

and get the following macro expansion:
Code:
    a = 42 * 42;

Now, it has to be understood that this simple type of macro, called a textual
macro, is simply performing a string interpolation on the argument; that is to
say, it is replacing the parameter with the text of the argument given to
the macro application, and inserting it verbatim into the body of the macro,
which then replaces the macro application. One could have written
Code:
    a = square(48 - 6);

and gotten the expansion
Code:
    a = 48 - 6 * 48 - 6;

The result of which is clearly not the same as the previous application, even
though they would appear on the surface to be identical semantically. This
highlights one of the flaws in naive macros of this sort, and introduces the
most basic issue of macro hygiene: avoiding duplicate evaluation.
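The usual C-side patch, for what it's worth, is to parenthesise the expansion; that cures the precedence fault but not the duplicate evaluation (a small C sketch, with an invented helper so the double evaluation is visible):
Code:
#include <stdio.h>

#define square(x) ((x) * (x))   /* parentheses fix only the precedence problem */

static int calls = 0;

static int next(void)
{
    ++calls;      /* side effect, so we can count the evaluations */
    return 7;
}

int main(void)
{
    int a = square(next());   /* expands to ((next()) * (next())) */
    printf("a = %d, next() was evaluated %d times\n", a, calls);
    return 0;
}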

The macros possible with the C preprocessor are extremely limited; they do not
compose effectively without heroic effort, they are generally restricted to a
single line of code (though they can span multiple lines with a line extension)
and they are rather simplistic in how they are expanded. They also represent
something of an exceptional case for the C language, as they do not match the
syntactic structures of the language overall. Macros, and the C pre-processor
directives in general, form a separate mini-language independent from C itself,
and indeed the pre-processor can be run separately from the compiler in most
implementations.

Similarly, the m4 macro processor, frequently used in conjunction with the
Unix as assembler, is actually a stand-alone program in its own right, and is
invoked automatically by the assembler. It is a significantly more sophisticated
macro processor than CPP, but it is still basically a textual macro expansion.

A key reason why these textual macros are limited is because they are separate
from the languages they pre-process. However, languages which possess the
property of homoiconicity - that is to say, languages in which the code
is itself a data structure of the language and can be manipulated
programmatically within the code without needing to call out to a separate
translator - can support a more powerful form of macro, known as
lexical macros. The classic examples of lexical macros are to be
found in the various dialects of the Lisp family of languages.

In Common Lisp, a macro is a piece of Lisp code the same as any other, except
that the compiler recognizes that it is a macro when it scans it and calls the
macro expander to perform the transformation at compile time, prior to compiling
the code. To use the simple example from before,
Code:
(defmacro square (x)
  `(* ,x ,x))

As it happens, Common Lisp provides a simple means with which to check what the
expansion of a given macro application would be, the macroexpand-1
special form. So, if we write
Code:
(macroexpand-1 '(square 3))

at the Lisp listener, we get
Code:
(* 3 3)

as expected. Unfortunately, we still haven't solved the duplicate evaluation
problem, as shown here:
Code:
> (macroexpand-1 '(square (- 3 1)))
(* (- 3 1) (- 3 1)) ;

(the right angle bracket here is the listener prompt.) As it happens, this does
not have the consequences it did in C, thanks to a quirk of Lisp
syntax: because Lisp functions are always fully parenthesized, the specific
fault of mismatching the order of operations doesn't happen, though this
doesn't mean we're free of the duplicate evaluation problem.

If you look at the macro again, you should note the backquote and the commas;
these are relevant, as they indicate what parts of the code should be evaluated
at compile time and which at run time. To put it succinctly, a Lisp macro is a
Lisp function that is run before compile time, with its expansion being the
function's output. The backquote says that the following part of the code
should be passed through to the expansion unaltered, except that it may
contain elements which should be evaluated, which are indicated by the commas.
Had we omitted these indicators,
Code:
(defmacro square2 (x)
  (* x x))

it would result in the whole code being evaluated at compile time,
Code:
> (macroexpand-1 '(square2 3))
9 ;
> (macroexpand-1 '(square2 (- 3 1)))

*** - *: (- 3 1) is not a number

You can see how, in the first case, it can be a very useful thing to have the
expression evaluated at compile time, as the expression essentially becomes a
simple constant; but you can see in the second example the limitations of this,
as the arguments passed to the macro are not evaluated at all.

Because they have the full Common Lisp language available to them for
generating their expansion, Common Lisp macros do not have many of the
limitations that textual macros had. For example, one could write a macro that
does evaluate an arbitrary numeric argument to its square at compile
time:
Code:
(defmacro square3 (x)
  (let ((y (eval x)))
    (if (numberp y)
      (* y y)
      nil)))

which does indeed expand as desired:
Code:
> (macroexpand-1 '(square3 (- 3 1)))
4 ;

Furthermore, because it evaluates the value of x and binds it
to y before it is squared, it avoids re-evaluating the expression;
no matter how complex the expression is, so long as the value is a number,
it will be evaluated exactly once, at compile time. Of course, this still has its
limitations: the expression must be evaluable at compile time, so it can't
depend on any variables entered by the user at run time, for example. Still,
this can be a useful property for certain applications.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


Post subject: Re: Custom Programming Language
PostPosted: Sat Nov 29, 2014 2:44 pm

Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
mathematician wrote:
Since I am writing the compiler, it will compile if I say so. The only question is whether or not it is unambiguous, and, so far as I can see, it is.

The other question is just whether you're still implementing the language described by your spec... The grammar of the spec doesn't allow it, so if your compiler compiles it, it implements a different language.

_________________
Developer of tyndur - community OS of Lowlevel (German)


Post subject: Re: Custom Programming Language
PostPosted: Sat Nov 29, 2014 6:09 pm

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Brendan wrote:
Hi,

This is just random notes about the ASPL language specification.

If a character is always 4 bytes, then a character is UTF-32. If a string is always UTF-8, then doing something like "myString[2] = '9';" becomes a massive disaster. In my opinion it's better to only say "string literals are UTF-8" and "character constants are UTF-32" (and possibly also support UTF-32 string literals); and then don't provide any string or character data type at all and let the programmer build their own (using "typedef", arrays, whatever).


As I understand it, UTF-32 is no encoding at all, but just the code point - myString[2] = '9' would need the compiler to put in a function call.


Quote:
Also don't forget that pointers and arrays are normally data types too


pntr pntr word data[10]; will appear in the symbol table as a variable of type word, but two other fields will be set to:

levels (of indirection) = 2
elems = 10


Quote:
If the only other data types are byte, word, dword and qword; then how are people supposed to do floating point? Also don't forget that pointers and arrays are normally data types too; and that most languages support "user defined data types" (structures, unions, etc).


As I say, ASPL is intended for one very specific task, and I cannot imagine that there will be very much call for floating point arithmetic in the kernel of an operating system. If there is, it can go into version 2.

Quote:
Also note that word, dword and qword are really bad names (they only make sense on 80x86 and any programmer familiar with other CPUs may mistake "word" for 32-bit or anything else. I'd recommend using u8, u16, u32, u64 instead. I'd also recommend getting rid of the "signed" type qualifier and having s8, s16, s32, s64 too. Finally, for bitfields, why not have "u3" for a 3-bit bitfield?


To be honest, they don't appeal to me for aesthetic reasons alone.


Quote:
If assigning 9 to a 3-bit field is guaranteed to cause a compile time error, then how about "const byte foo := 9;" followed by "page.avail := foo;" or maybe "const byte foo := 3;" followed by "page.avail := foo*3;"?

If you do this for bitfields, do you also do it for other data types (e.g. "const word hexnum := 0x12345678;")? What about "word dayOfWeek range 1 to 7 := 9;"?


There are all sorts of things which can and should be checked for at compile time. The fact that I haven't mentioned them does not mean that they will be left out. There are so many syntactic and semantic mistakes which could be made, the difficulty will be in making sure I have covered all the possibilities. I have already written part of the parser, and about every sixth line is error checking.


Quote:
For integer ranges, why does the compiler need to be told the size? For example, instead of "word dayOfWeek range 1 to 7" why can't you just do "range 1 to 7 dayOfWeek" and let the compiler choose an underlying data type capable of storing values from 1 to 7?


Maybe it could, but typing "word" (or byte) doesn't seem any great hardship.


Quote:
For structures, can I do this:
Code:
struct myStructure align 9
    myStructure * next;
    word something;
myStructure end;


No, the only values the compiler will buy are 1, 2, 4 and 8.


Quote:
For bitsets (continued from the earlier "u3 for a 3-bit bitfield" suggestion), why can't I just do:
Code:
struct pgLevel1_t
    u1 present;
    u1 readWrite;
    ...
    u40 pageAddress;
    u11 avail;
    u1 noExecute;
pgLevel1_t end;


Why would that be an improvement?


Quote:
The increment and decrement operators are a mistake. Because they have side effects they become a pain in the neck. A simple example would be "index += myArray[ incr index]".


I tried writing some code in ASPL, although I obviously can't compile it yet, and x := x + 1 cropped up so often that it had to have special provision made for it. Admittedly that was before I had thought of x += 1.


Quote:
How would De Morgan re-arrange something like 'if( !myBool1 or myBool2 )" to avoid the not? Is something like 'if( (myBool1 ^ true) or myBool2 )" easier to read? Given that your boolean operators are already words (and not symbols like "||"), is there any sane reason to refuse to support 'if( not myBool1 or myBool2 )"?


The only time I envisage the "or" and "and" keywords being used is in conditional expressions, and not((x = 2) or (y > 10)) could just as easily (and in my opinion more naturally) be written as x != 2 and y <= 10.
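(The equivalence is easy enough to check mechanically; a throwaway C verification, just for illustration:)
Code:
#include <stdio.h>

int main(void)
{
    /* Check that !(x == 2 || y > 10) and (x != 2 && y <= 10) always agree,
       i.e. De Morgan plus negating each relation. */
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 15; y++)
            if ((!(x == 2 || y > 10)) != (x != 2 && y <= 10)) {
                printf("mismatch at x=%d y=%d\n", x, y);
                return 1;
            }

    printf("equivalent over the tested range\n");
    return 0;
}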


Quote:
Where did all the relational operators go (==, <, <=, >, >=, !=)? Did you forget to put them in the "operator precedence" table, or is it impossible to do things like "if( myValue > 4)"?


I didn't forget them but it seemed weird to lump them in with the arithmetical and bitwise operators; especially in a language without a boolean data type. In C I write if(x < (y + 2)) just to be on the safe side, but it seems to me intuitively obvious that the calculation y + 2 should come before the comparison. For (x < y) + 2 to even make sense, you have to postulate a boolean data type, cast it into an integer, and then add two to it.


Quote:
If you only forgot the relational operators; what would "myBool := left < middle > right;" mean?


Good question. Have you ever written an expression like that? Because I haven't. Any compiler I write will only look for conditional operators in connection with if, while, for or repeat instructions.


Quote:
For the "machine code" (assembly language and not machine code?); what does "movzx" look like? For example, would you do "movzxd eax,[esi]" and not know if it's a byte being extended or a word being extended? Does this apply to all instructions and not just "mov"?

What happens if I do "movd al,bl"? Does the assembler accept this (and if it does, which instruction would I actually get?), or does it give me an "incompatible operands" error. If it gives me an "incompatible operands" error then the assembler must've known the sizes of the operands; and if the assembler knows the sizes of the operands why does it need to be told the size to begin with?


Writing the assembler isn't going to happen any time soon. I will be giving the mnemonics my fuller consideration nearer the time.

_________________
The continuous image of a connected set is connected.


Post subject: Re: Custom Programming Language
PostPosted: Sun Nov 30, 2014 4:50 am

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

mathematician wrote:
Brendan wrote:
If a character is always 4 bytes, then a character is UTF-32. If a string is always UTF-8, then doing something like "myString[2] = '9';" becomes a massive disaster. In my opinion it's better to only say "string literals are UTF-8" and "character constants are UTF-32" (and possibly also support UTF-32 string literals); and then don't provide any string or character data type at all and let the programmer build their own (using "typedef", arrays, whatever).


As I understand it, UTF-32 is no encoding at all, but just the code point - myString[2] = '9' would need the compiler to put in a function call.


In my opinion (for low level languages), everything should do what it looks like it does and additional overhead should not be hidden (including things like "myStruct1 = myStruct2", which should be banned as it hides a potentially costly copying operation).

Your function call would need to convert the codepoint into a variable length string of bytes. For something like this:
Code:
dword foo = 0x87654321;
myString[2] = foo;

The codepoint is not valid and can't be converted to UTF-8; but you have no way to return an error from your function.

For something like this:
Code:
myString[0] = firstChar;
myString[??] = secondChar;

There's no way to determine how many bytes the first character consumed, and no way to figure out where to put the second character.

For both of these reasons, the first thing any sane programmer is going to do is to write their own function that tests for errors and does return a "number of bytes". If that's the case, then why bother supporting it in the first place (e.g. rather than providing a library function that does it properly)?
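Such a function is small enough to sketch in C (the name and the error convention here are invented for illustration, not taken from any existing library):
Code:
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8 into buf (room for at least 4 bytes).
   Returns the number of bytes written, or 0 if the codepoint is not a
   valid Unicode scalar value (surrogates, or anything above 0x10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;                                   /* e.g. 0x87654321 */

    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}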

mathematician wrote:
Brendan wrote:
If the only other data types are byte, word, dword and qword; then how are people supposed to do floating point? Also don't forget that pointers and arrays are normally data types too; and that most languages support "user defined data types" (structures, unions, etc).


As I say, ASPL is intended for one very specific task, and I cannot imagine that there will be very much call for floating point arithmetic in the kernel of an operating system. If there is, it can go into version 2.


In that case (building an entire tool-chain to use for one "80x86 kernel" binary and nothing else) the entire idea is a stupid waste of time - you'll spend more time writing and fixing the compiler than you actually spend using the compiler (especially if you want the resulting code to be optimised rather than extremely bad).

mathematician wrote:
Brendan wrote:
Also note that word, dword and qword are really bad names (they only make sense on 80x86 and any programmer familiar with other CPUs may mistake "word" for 32-bit or anything else. I'd recommend using u8, u16, u32, u64 instead. I'd also recommend getting rid of the "signed" type qualifier and having s8, s16, s32, s64 too. Finally, for bitfields, why not have "u3" for a 3-bit bitfield?


They don't appeal to me for aesthetic reasons alone.


Given the choice between "aesthetics" and actually being good for the purpose it's designed, aesthetics are for people that don't belong anywhere near engineering who should be playing with Barbie dolls instead. Note: What I mean here is that "aesthetics" are not an acceptable reason. Perhaps when you said "for aesthetic reasons" you really meant to say something like "I think it's worse for code readability/maintainability" or "I think it's harder for the compiler to parse". These would have both been valid (but incorrect) reasons.

mathematician wrote:
Brendan wrote:
If assigning 9 to a 3-bit field is guaranteed to cause a compile time error, then how about "const byte foo := 9;" followed by "page.avail := foo;" or maybe "const byte foo := 3;" followed by "page.avail := foo*3;"?

If you do this for bitfields, do you also do it for other data types (e.g. "const word hexnum := 0x12345678;")? What about "word dayOfWeek range 1 to 7 := 9;"?


There are all sorts of things which can and should be checked for at compile time.


I doubt you understand the complexity involved. For "page.avail := foo*3;" you need to do whole program analysis and prove that "foo" is always within the range 0 to 2.333.

mathematician wrote:
Brendan wrote:
For integer ranges, why does the compiler need to be told the size? For example, instead of "word dayOfWeek range 1 to 7" why can't you just do "range 1 to 7 dayOfWeek" and let the compiler choose an underlying data type capable of storing values from 1 to 7?


Maybe it could, but typing "word" (or byte) doesn't seem any great hardship.


Except now you have type information in 2 places (on the left and right of the name) which will complicate parsing. For a simple example, consider:
Code:
typedef range 1 to 12 myMonth_t;
typedef range 1 to 31 myDayOfMonth_t;
// typedef byte myMonth_t range 1 to 12;
// typedef byte myDayOfMonth_t range 1 to 31;

word birthYear;
myMonth_t birthMonth;
myDayOfMonth_t birthDayOfMonth;


mathematician wrote:
Brendan wrote:
For bitsets (continued from the earlier "u3 for a 3-bit bitfield" suggestion), why can't I just do:
Code:
struct pgLevel1_t
    u1 present;
    u1 readWrite;
    ...
    u40 pageAddress;
    u11 avail;
    u1 noExecute;
pgLevel1_t end;


Why would that be an improvement?


Because it's simpler (for both humans and the compiler) than having different syntax for 2 almost identical things.

mathematician wrote:
Brendan wrote:
The increment and decrement operators are a mistake. Because they have side effects they become a pain in the neck. A simple example would be "index += myArray[ incr index]".


I tried writing some code in ASPL, although I obviously can't compile it yet, and x := x + 1 cropped up so often that it had to have special provision made for it. Admittedly that was before I had thought of x += 1.


They could be statements instead of operators. E.g. "incr index; index += myArray[index];" and
"index += myArray[index]; incr index;" would both be valid, but "index += myArray[ incr index]" wouldn't be allowed.

mathematician wrote:
Brendan wrote:
How would De Morgan re-arrange something like 'if( !myBool1 or myBool2 )" to avoid the not? Is something like 'if( (myBool1 ^ true) or myBool2 )" easier to read? Given that your boolean operators are already words (and not symbols like "||"), is there any sane reason to refuse to support 'if( not myBool1 or myBool2 )"?


The only time I envisage the "or" and "and" keywords being used is in conditional expressions, and not((x = 2) or (y > 10)) could just as easily (and in my opinion more naturally) be written as x != 2 and y <= 10.


That's short sighted at best:
Code:
    isPresent := myPage.present;
    isUserSpace := myPage.pageAddress >= KERNEL_SPACE_START_ADDRESS;

    if( not isPresent and isUserSpace ) {
        // Handle the most likely/expected case
        allocPage();
    } else {
        // Handle the error cases
        if( isPresent ) {
            return ERR_PAGE_ALREADY_PRESENT;
         } else if( not isUserSpace ) {
            return ERR_PERMISSION_DENIED;
         }
    }

It is possible to do it without "not"; but it forces the programmer to use less natural, less readable and less maintainable alternatives; so failing to provide "not" is pointlessly stupid.

mathematician wrote:
Brendan wrote:
Where did all the relational operators go (==, <, <=, >, >=, !=)? Did you forget to put them in the "operator precedence" table, or is it impossible to do things like "if( myValue > 4)"?


I didn't forget them but it seemed weird to lump them in with the arithmetical and bitwise operators; especially in a language without a boolean data type. In C I write if(x < (y + 2)) just to be on the safe side, but it seems to me intuitively obvious that the calculation y + 2 should come before the comparison. For (x < y) + 2 to even make sense, you have to postulate a boolean data type, cast it into an integer, and then add two to it.


Um, what?

There are only 2 sane choices:
  • You do have a boolean data type; and therefore your type checker can check if (e.g.) the operands of a boolean OR are both booleans, and can check if (e.g.) the operands to an addition are both numbers.
  • You do not have a boolean data type (so people just use integers instead); and therefore your type checker can not check if (e.g.) the operands of a boolean OR are both booleans, and can not check if (e.g.) the operands to an addition are both numbers.

For both cases, the relational operators are still operators, and the operands to the relational operators are numbers. The only difference is what the resulting type is - e.g. if there is a boolean type then something like "x < y" has a result of type boolean; and if there isn't any boolean type then something like "x < y" has a result of type number.

Also don't forget that the compiler converts expressions into AST (which is where your precedence rules are taken into account), and the type checker works recursively on the AST. For example, the expression "x + (y < z)" might get converted into the AST:
Code:
    expression
     |_addition
        |_x
        |_less than
           |_y
           |_z

The type checker starts at the top and tries to find the resulting type of "addition". To do that it has to find the types of both "x" and "less than" (which is where it gets recursive). To do that it has to find the types of "y" and "z". This means that (after all the recursion) it becomes like this:
Code:
    expression
     |_addition
        |_type=number
        |_less than
           |_type=number
           |_type=number

Then it checks the operands for "less than", then finds the resulting type of "less than":
Code:
    expression
     |_addition
        |_type=number
        |_type=number

Then checks the operands for "addition", then finds the resulting type of "addition":
Code:
    expression
     |_type=number

And now it's completed the type checking. Notice that (because there was no boolean type) the resulting type of the "less than" had to be "type=number"? This is what makes it impossible for you to generate an error when someone tries to add a boolean to a number. If there was a boolean type, then it would've become:
Code:
    expression
     |_addition
        |_type=number
        |_type=boolean

And in this case the type checker can/would complain that addition expects a number type.

Note: The type checker typically does a lot more than this - I've simplified a lot to make the recursive nature of it clear.
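To make the recursion concrete, here is a toy version in C (node shapes and type names are invented for illustration; a real checker also handles variables, scopes, error reporting and so on):
Code:
#include <stdio.h>

typedef enum { TY_NUMBER, TY_BOOLEAN, TY_ERROR } Type;
typedef enum { NODE_NUM, NODE_ADD, NODE_LESS } Kind;

typedef struct Node {
    Kind kind;
    const struct Node *left, *right;   /* unused for NODE_NUM */
} Node;

/* Recursively compute the type of an expression node. */
static Type check(const Node *n)
{
    Type l, r;
    switch (n->kind) {
    case NODE_NUM:
        return TY_NUMBER;
    case NODE_LESS:                 /* number < number  ->  boolean */
        l = check(n->left);
        r = check(n->right);
        return (l == TY_NUMBER && r == TY_NUMBER) ? TY_BOOLEAN : TY_ERROR;
    case NODE_ADD:                  /* number + number  ->  number  */
        l = check(n->left);
        r = check(n->right);
        return (l == TY_NUMBER && r == TY_NUMBER) ? TY_NUMBER : TY_ERROR;
    }
    return TY_ERROR;
}

int main(void)
{
    /* x + (y < z): the comparison yields a boolean, so the addition is
       rejected - exactly the error that a boolean type makes possible. */
    Node x = { NODE_NUM, 0, 0 }, y = { NODE_NUM, 0, 0 }, z = { NODE_NUM, 0, 0 };
    Node cmp = { NODE_LESS, &y, &z };
    Node sum = { NODE_ADD, &x, &cmp };

    printf("%s\n", check(&sum) == TY_ERROR ? "type error" : "ok");
    return 0;
}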

Now, try to explain how you're planning on doing parsing and type checking when there are no relational operators. For example, for the expression "(x > y) and (x < z)" describe what the AST would look like and how the type checker would check it.

mathematician wrote:
Brendan wrote:
If you only forgot the relational operators; what would "myBool := left < middle > right;" mean?


Good question. Have you ever written an expression like that? Because I haven't.


Is this some sort of joke? When given correct source code a compiler must generate correct output (and when given invalid source code the compiler must generate a hopefully meaningful error message). Whether or not it's likely to occur in source code is entirely irrelevant.

Is it valid? If it is valid what is the correct output?

mathematician wrote:
Any compiler I write will only look for conditional operators in connection with if, while, for or repeat instructions.


And then you'll get to common sub-expression elimination and realise that you end up with conditional operators in expressions anyway. For a simple example, this:
Code:
    if(x < y*3) {
        foo();
    }
    if( (x < y*3) and (z = 3) ) {
        bar();
    }

Should be converted/optimised into:
Code:
    temp = x < y*3;
    if(temp) {
        foo();
    }
    if( (temp) and (z = 3) ) {
        bar();
    }



Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


Post subject: Re: Custom Programming Language
PostPosted: Sun Nov 30, 2014 8:10 am

Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
To prevent the quotes becoming nested too deeply, I have removed some of them.

Brendan wrote:
In my opinion (for low level languages), everything should do what it looks like it does and additional overhead should not be hidden (including things like "myStruct1 = myStruct2", which should be banned as it hides a potentially costly copying operation).


Well maybe. I already have a range of functions for UTF-8 strings, written in C. When the time comes it might make sense to rewrite them in ASPL; especially if I decide to bootstrap the compiler. However, "string" will have to be a type, because they are not simply arrays, as they are in C. The variable-length characters mean that some additional information needs to be embedded with them.


Quote:
Your function call would need to convert the codepoint into a variable length string of bytes. For something like this:
Code:
dword foo = 0x87654321;
myString[2] = foo;

The codepoint is not valid and can't be converted to UTF-8; but you have no way to return an error from your function.


Functions for doing the conversion, in both directions, already exist, but, having written them, I think I have yet to use them.


Quote:
For something like this:
Code:
myString[0] = firstChar;
myString[??] = secondChar;

There's no way to determine how many bytes the first character consumed, and no way to figure out where to put the second character.


The number of bytes a UTF-8 character occupies is encoded into the first byte of that character.
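A minimal C sketch of that test, just to illustrate (the helper name is invented, not taken from the compiler's actual source):
Code:
#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, read from its first (lead) byte.
   Returns 0 for a continuation byte or an invalid lead byte. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx            */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx            */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx            */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx            */
    return 0;                              /* 10xxxxxx or invalid */
}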


Quote:
For both of these reasons, the first thing any sane programmer is going to do is to write their own function that tests for errors and does return a "number of bytes". If that's the case, then why bother supporting it in the first place (e.g. rather than providing a library function that does it properly)?


As I have said, there already exist functions which the compiler either already uses, or probably will use, internally. Later on they could possibly be rewritten in ASPL.


Quote:
In that case (building an entire tool-chain to use for one "80x86 kernel" binary and nothing else) the entire idea is a stupid waste of time - you'll spend more time writing and fixing the compiler than you actually spend using the compiler (especially if you want the resulting code to be optimised rather than extremely bad).


In a sense, writing an operating system which isn't going to challenge Windows, or even Linux, is pretty stupid. For the purposes of writing the code you have to pretend to yourself that other people will be using your system, but, away from the keyboard, you know full well that that is highly unlikely to be the case. The only chance you might stand is if you could find a niche market whose needs couldn't be met by an application running on top of an existing OS. Failing that, you do it just for the hell of it.


Quote:
Given the choice between "aesthetics" and actually being good for the purpose it's designed, aesthetics are for people that don't belong anywhere near engineering who should be playing with Barbie dolls instead. Note: What I mean here is that "aesthetics" are not an acceptable reason. Perhaps when you said "for aesthetic reasons" you really meant to say something like "I think it's worse for code readability/maintainability" or "I think it's harder for the compiler to parse". These would have both been valid (but incorrect) reasons.


I meant what I said. What the data types are called is a fairly marginal issue, and the ones you suggested just strike me as ugly. Why code it in a high or medium level language at all, rather than just assembly code? If you thought your compiler and/or operating system was going to take over the world, the answer might be portability, but, for me personally, it is aesthetics. I prefer the look of the code on the page.

If you are interested, the type names are now: int8_t, int16_t, int32_t, int64_t, int8s_t, int16s_t, int32s_t, int64s_t.


Quote:
I doubt you understand the complexity involved. For "page.avail := foo*3;" you need to do whole program analysis and prove that "foo" is always within the range 0 to 2.333.


Only if, in z = x/y, checking that y could never be zero is also a job you would give to the compiler. As for constants, checking that 9 isn't being or'ed into a three bit field would be straightforward, because the structure (which already exists) would record details about the field, one of which is the maximum value it can hold.
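Roughly what I have in mind for the constant case, sketched in C (the data structure and names are invented here for illustration, not lifted from the actual compiler source):
Code:
#include <stdio.h>
#include <stdint.h>

/* The symbol-table entry for a bitfield records its width, so a constant
   assignment can be compared against the largest value the field can hold. */
typedef struct {
    const char *name;
    unsigned    width;   /* field width in bits, e.g. 3 for "avail" */
} FieldInfo;

static int check_const_assign(const FieldInfo *f, uint64_t value)
{
    uint64_t max = (f->width >= 64) ? UINT64_MAX
                                    : (((uint64_t)1 << f->width) - 1);
    if (value > max) {
        fprintf(stderr, "error: %llu does not fit in %u-bit field '%s' (max %llu)\n",
                (unsigned long long)value, f->width, f->name,
                (unsigned long long)max);
        return 0;
    }
    return 1;
}

int main(void)
{
    FieldInfo avail = { "avail", 3 };
    check_const_assign(&avail, 9);   /* rejected: the maximum is 7 */
    return 0;
}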



Quote:
Because it's simpler (for both humans and the compiler) than having different syntax for 2 almost identical things.


That a letter followed by a number on the left hand side is preferable to a colon followed by a number on the right hand side is something about which we will have to disagree.



Quote:
That's short sighted at best:
Code:
    isPresent := myPage.present;
    isUserSpace := myPage.pageAddress >= KERNEL_SPACE_START_ADDRESS;

    if( not isPresent and isUserSpace ) {
        // Handle the most likely/expected case
        allocPage();
    } else {
        // Handle the error cases
        if( isPresent ) {
            return ERR_PAGE_ALREADY_PRESENT;
         } else if( not isUserSpace ) {
            return ERR_PERMISSION_DENIED;
         }
    }

It is possible to do it without "not"; but it forces the programmer to use less natural, less readable and less maintainable alternatives; so failing to provide "not" is pointlessly stupid.


For good or ill, I would probably write
Code:
if myPage.present = 0 and myPage.pageAddress < KERNEL_SPACE_START_ADDRESS
               allocPage()


Quote:
Um, what?

There are only 2 sane choices:
  • You do have a boolean data type; and therefore your type checker can check if (e.g.) the operands of a boolean OR are both booleans, and can check if (e.g.) the operands to an addition are both numbers.
  • You do not have a boolean data type (so people just use integers instead); and therefore your type checker can not check if (e.g.) the operands of a boolean OR are both booleans, and can not check if (e.g.) the operands to an addition are both numbers.

For both cases, the relational operators are still operators, and the operands to the relational operators are numbers. The only difference is what the resulting type is - e.g. if there is a boolean type then something like "x < y" has a result of type boolean; and if there isn't any boolean type then something like "x < y" has a result of type number.

Also don't forget that the compiler converts expressions into AST (which is where your precedence rules are taken into account), and the type checker works recursively on the AST. For example, the expression "x + (y < z)" might get converted into the AST:
Code:
    expression
     |_addition
        |_x
        |_less than
           |_y
           |_z

The type checker starts at the top and tries to find the resulting type of "addition". To do that it has to find the types of both "x" and "less than" (which is where it gets recursive). To do that it has to find the types of "y" and "z". This means that (after all the recursion) it becomes like this:
Code:
    expression
     |_addition
        |_type=number
        |_less than
           |_type=number
           |_type=number

Then it checks the operands for "less than", then finds the resulting type of "less than":
Code:
    expression
     |_addition
        |_type=number
        |_type=number

Then checks the operands for "addition", then finds the resulting type of "addition":
Code:
    expression
     |_type=number

And now it's completed the type checking. Notice that (because there was no boolean type) the resulting type of the "less than" had to be "type=number"? This is what makes it impossible for you to generate an error when someone tries to add a boolean to a number. If there was a boolean type, then it would've become:
Code:
    expression
     |_addition
        |_type=number
        |_type=boolean

And in this case the type checker can/would complain that addition expects a number type.

Note: The type checker typically does a lot more than this - I've simplified a lot to make the recursive nature of it clear.

Now, try to explain how you're planning on doing parsing and type checking when there are no relational operators. For example, for the expression "(x > y) and (x < z)" describe what the AST would look like and how the type checker would check it.

mathematician wrote:
Brendan wrote:
If you only forgot the relational operators; what would "myBool := left < middle > right;" mean?


Good question. Have you ever written an expression like that? Because I haven't.


Is this some sort of joke? When given correct source code a compiler must generate correct output (and when given invalid source code the compiler must generate a hopefully meaningful error message). Whether or not it's likely to occur in source code is entirely irrelevant.

Is it valid? If it is valid what is the correct output?

mathematician wrote:
Any compiler I write will only look for conditional operators in connection with if, while, for or repeat instructions.


And then you'll get to common sub-expression elimination and realise that you end up with conditional operators in expressions anyway. For a simple example, this:
Code:
    if(x < y*3) {
        foo();
    }
    if( (x < y*3) and (z = 3) ) {
        bar();
    }

Should be converted/optimised into:
Code:
    temp = x < y*3;
    if(temp) {
        foo();
    }
    if( (temp) and (z = 3) ) {
        bar();
    }



Cheers,

Brendan

As I said, I only envisage conditional expressions occurring in the context of if or looping instructions. So to take an example

Code:
if a + b > z then
      x = 2;
else
      x = 3;


In the "if" node there would be pointers to three sub trees - one to the conditional expression, one to the "if" sub tree and one to the "else" sub tree. When it came to type checking, it would be necessary to check (for example) that z wasn't a character, whereas a + b evaluated to an integer, but when the time came to generate intermediate code the condition would form the basis for a branching instruction and wouldn't evaluate to any kind of type.

Code:
temp = a+b
cmp temp, z
jump_if_less_than_or_equal  lab_1
x = 2
jump lab_2
lab_1
x = 3
lab_2


Personally I have never had much use for the bool type, which is why I didn't put it in, but I must admit that "if( not isPresent and isUserSpace)" is stylistically pleasing, so maybe I will.

_________________
The continuous image of a connected set is connected.


Last edited by mathematician on Sun Nov 30, 2014 6:15 pm, edited 6 times in total.
