OSDev.org

The Place to Start for Operating System Developers
It is currently Sat Dec 16, 2017 2:48 pm

All times are UTC - 6 hours




Post new topic Reply to topic  [ 28 posts ]  Go to page Previous  1, 2
 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 8:57 am 

Joined: Sun Oct 22, 2006 7:01 am
Posts: 2562
Location: Devon, UK
Wajideus wrote:
Everything you've said makes absolutely no sense at all. To begin with, there are a bunch of open-source C compilers out there. There's literally no reason to write one at all aside from personal ego.


I feel that I need to at least defend "~" on this point. After all, we are on a forum dedicated to hobby OS development for which you could make a similar argument.

I find the other criticisms in this thread to be pretty valid, however.

Cheers,
Adam


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 9:55 am 

Joined: Thu Nov 16, 2017 3:01 pm
Posts: 31
Quote:
I feel that I need to at least defend "~" on this point. After all, we are on a forum dedicated to hobby OS development for which you could make a similar argument.

I find the other criticisms in this thread to be pretty valid, however.


I would make the same argument to any people working on a hobby OS. In fact, I already have:

Wajideus wrote:
I honestly just don't see the point in creating Unix again. There are already enough Unixes. Lets quit rubbing our crotches for a second here, and look at where we've made mistakes so we can make improvements.
(From the TumuxOS Post)

It makes absolutely no sense to reimplement something for the umpteen-billionth time, especially when there are already open-source implementations of it available. If you're going to go out of your way to design a programming language, write a compiler, or make an OS, at least try to do something new and/or different.


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 11:03 am 

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1040
Location: Athens, GA, USA
AJ wrote:
Wajideus wrote:
Everything you've said makes absolutely no sense at all. To begin with, there are a bunch of open-source C compilers out there. There's literally no reason to write one at all aside from personal ego.


I feel that I need to at least defend "~" on this point. After all, we are on a forum dedicated to hobby OS development for which you could make a similar argument.


I don't think that Solar has any problem with that; he basically said so in the next paragraph, even if he was a bit dismissive (mostly regarding ~'s ability to accomplish the task, I suspect, rather than the idea itself).

I think that you missed the parts where ~ was saying that he was writing one because no suitable one exists, and when several examples were mentioned, he never acknowledged them or said what he thought 'suitable' meant. ~'s statements in this thread and others make it clear that he doesn't see it as just him climbing the mountain because it is there.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
μή εἶναι βασιλικήν ἀτραπόν ἐπί γεωμετρίαν
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 12:17 pm 

Joined: Tue Mar 06, 2007 11:17 am
Posts: 1032
Solar wrote:
~ wrote:
I only answered Schol-R-LEA about which data structures I was going to use to parse the code.


Only, you didn't.

Schol-R-LEA explicitly asked you about the token data structures.
I will simply make a file called "elements.dat" that will contain a long at the start with the number of elements and the sequence of elements as they appear on the whole code, their type (start of line, end of line, preprocessor, keyword, identifier, number, operator, blank space block, comment block, string, open parenthesis, close parenthesis, backslash, asterisk...). I need a main loop with helper functions capable of recognizing every element with precedence (spaces, comments, preprocessor, strings, numbers, keywords, identifiers...). They have to record the start and end of each element, then a tree of IFs will call a specialized program only for each element to fully process it separated from the rest of compiler/language elements. The end of the processing of an element or set of elements results in default assembly to then write to the assembly to assemble with NASM. You will be able to see the on-file structure array to handle #includes because that's what I need to implement now.



Solar wrote:
Linking is also suspiciously absent from your deliberations.
This is why I made pure assembly skeletons for PE EXE and DLL. The compiler is supposed to be capable of producing assembly only, instead of producing object code to link, even for the executable file formats. In the meantime I can use the capabilities of NASM to produce Linux, DOS or Win16 binaries, but it's something that the compiler should do once we learn how. It's aimed to produce raw assembly code for raw binaries, so if I have executable skeletons or NASM, producing PE EXEs or the like will be doable.

____________________________________
Compilers could be made simple to compile by using only plain C, so that they can be built with any compiler available while still providing additional language support beyond that of the producing compiler.

The code from the latest GCC or other compilers could be inspected and made fully portable to any producing compiler for old OSes/machines, but it would need to be massively rewritten in plain C, and then modified further to support extensions from all the other compilers, to make a potent tool that doesn't fragment the language along compiler brands.

So we only have a few main choices:

- Modify existing compilers (practically the same as knowing how to develop one from scratch).

- Write a compiler from scratch gradually as we find existing good code to use/clean so it compiles anywhere.

- Maybe write a set of libraries private to GCC, based only on the most basic OS/system features, so that the latest GCC truly compiles anywhere. It's little more than a text processor, so it shouldn't be so difficult to make OS-independent.


In any case, if we manage to reimplement everything or modify the code toward old C, we will be doing the very same job of cleaning up existing software technology to make it freely accessible, because modern software, libraries and the modern C/C++/Java/JavaScript languages are currently only accessible on half-decade-old NT and UNIX systems.

The language standards don't move nearly as fast as the rest of software, so if we have a compiler written in plain C that recognizes all existing compiler extensions, we will break out of the trap of having to use only the latest OS releases just to port/run applications written in the newer language versions.

_________________
http://www.archefire.org/_PROJECTS_/ (udocproject@yahoo.com)

YouTube Development Videos:
http://www.youtube.com/user/AltComp126/videos

Current IP address for hosts file (all subdomains):
190.150.9.244 archefire.org


Last edited by ~ on Wed Nov 29, 2017 12:32 pm, edited 1 time in total.

 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 12:28 pm 

Joined: Sat Mar 31, 2012 3:07 am
Posts: 3027
Location: Chichester, UK
Quote:
It's aimed to produce raw assembly code for raw binaries
So, no ability to use libraries. That's going to be a little restrictive.


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 12:37 pm 

Joined: Tue Mar 06, 2007 11:17 am
Posts: 1032
iansjack wrote:
Quote:
It's aimed to produce raw assembly code for raw binaries
So, no ability to use libraries. That's going to be a little restrictive.
The code can use includes for libraries. The EXE skeletons or NASM can be provided with import data. It would be the same as building a project file/compiler makefile.

Producing assembly only makes it easy to link if desired.

The difference is that it wouldn't use additional IDE/suite layers beyond the actual file formats, library data, etc. It's clearer, at least for OS development, and is portable since it deals with knowledge of the raw formats, adding more library skeletons as more programs compile without errors.

_________________
http://www.archefire.org/_PROJECTS_/ (udocproject@yahoo.com)

YouTube Development Videos:
http://www.youtube.com/user/AltComp126/videos

Current IP address for hosts file (all subdomains):
190.150.9.244 archefire.org


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 1:12 pm 

Joined: Thu Nov 16, 2017 3:01 pm
Posts: 31
iansjack wrote:
Quote:
It's aimed to produce raw assembly code for raw binaries
So, no ability to use libraries. That's going to be a little restrictive.

I'm not sure what he's doing, but I had thought of doing this before in my compiler to segregate the assembly language from the object file format. To pull it off, I had planned on dumping the relocations, symbols, and load mapping to separate files. The relocation file would be a sort of binary diff format, the symbol file would be something like an ini, and the load mapping would be something like a linker script.

There are a couple of neat things you can do with this, like streaming assembly code to stdout and piping it into a virtual machine, or consolidating load maps into a table for a class loader (something I planned on doing for loading and unloading actors in a game engine).


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 2:21 pm 

Joined: Sat Mar 31, 2012 3:07 am
Posts: 3027
Location: Chichester, UK
I don't quite understand how you can use dynamic libraries if your output is raw binary. How does relocation work without relocation information?


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 5:54 pm 

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1040
Location: Athens, GA, USA
~ wrote:
Solar wrote:
Schol-R-LEA explicitly asked you about the token data structures.
I will simply make a file called "elements.dat" that will contain a long at the start with the number of elements and the sequence of elements as they appear on the whole code, their type (start of line, end of line, preprocessor, keyword, identifier, number, operator, blank space block, comment block, string, open parenthesis, close parenthesis, backslash, asterisk...).


OK, I can see that this is indeed a table of the tokens... sort of... but why are you putting it in a file? Unless there is a memory crunch - and generally speaking, even large programs won't eat 64KiB for their symbol tables, and even in real mode I think you can spare a whole data segment for something this important - there is no reason for a C compiler to save it to a file, and every reason for it to keep it as a tree in memory, unless you intend to have the lexer and the parser as separate programs with no sharing of memory.

Such compiler designs have existed in the past; the Microsoft Pascal and Fortran 77 compilers for MS-DOS, circa 1983, come to mind. But they were designed that way to accommodate the CP/M version of the compiler, and the design was retained for the first few MS-DOS versions to allow it to run on 64KiB IBM PCs; even by 1983, those were only a small fraction of PCs, with newer ones shipping with at least 256KiB and many IBM PC/XTs and Compaq Deskpros already hitting the 640KiB limit (in fact, memory-hungry programs such as Lotus 1-2-3 were already running into problems with that limit, and in 1984 both bank-switched Expanded Memory for 8088s and Extended Memory for 80286s were introduced to get around it).

It wasn't really necessary even before that, though. After Turbo Pascal came around in late 1983 - a single-pass, all-in-memory compiler that ran in 64KiB under both 8080 CP/M and 8088 MS-DOS, even including its simple full-screen text editor, and which blew the older multi-pass compilers away in terms of speed and useful error messages - that technique vanished even in the 8-bit world (where there were still plenty of Apple //es, IIcs, and Commodore 64s in use).

Why you think that bringing that approach back is a good idea isn't at all clear to me, unless you are actually talking about for the program listing rather than the symbol table, in which case my question becomes, why are you talking about that instead of the symbol table and the in-memory Token struct/class?

~ wrote:
I need a main loop with helper functions capable of recognizing every element with precedence (spaces, comments, preprocessor, strings, numbers, keywords, identifiers...).


In other words... the tokens. And yes, you would definitely need a set of helper functions for this - specifically, a set of functions that combine to form a lexical analyzer.

Also, you would usually handle precedences later, in the parser. More on this later.

~ wrote:
They have to record the start and end of each element, then a tree of IFs will call a specialized program only for each element to fully process it separated from the rest of compiler/language elements.


In other words, a lexical analyzer, specifically an ad-hoc lexer. This is indeed one of the things I was talking about, though I get the sense that you don't know all of the English names for these things, which may be part of the problem we are having. I can't tell whether this is due to a communication problem (from your README file on Archefire, I gather that your native tongue is Spanish, and I get the impression that your English isn't particularly strong - though if this is so, then I have to say your writing is still better than many native English speakers), or because you haven't read up on the existing techniques, or both, and I am willing to give you some benefit of the doubt on this.

~ wrote:
The end of the processing of an element or set of elements results in default assembly to then write to the assembly to assemble with NASM. You will be able to see the on-file structure array to handle #includes because that's what I need to implement now.


OK, now this is a worrying statement, because it sounds as if you are skipping a few steps. My impression is that you are combining three roles - the lexer, the parser, and the code generator - by doing substring matches on the input stream, and structuring the parser so that it is calling the matching function repeatedly against the input strings, walking through the set of possible matches, and then working from there until you collect the end of the expression (what in parsing is called a 'terminal' of the grammar), at which point you output a stream of one or more lines of assembly code.

It is entirely possible to do it this way - it is how the original versions of Small C did it - but you seem to be missing some details as to how you can make that approach work.

The approach in question is called 'recursive descent parsing', a type of top-down, left-to-right, leftmost-derivation (LL) parsing. It is an old, tried, and true method for writing a simple compiler by hand, and is the starting point for almost every compiler course and textbook that doesn't jump directly into using tools like flex and bison. It was developed in the early to mid 1960s, and was probably first investigated by Edsger Dijkstra some time before his 1961 paper on the topic; a number of others experimented with it around that time, and Tony Hoare seems to have been one of the first to write a complete compiler in that fashion, the Elliott ALGOL compiler.

In the early 1970s, Niklaus Wirth popularized it for use in the first formal compiler courses, as a method easier to use when writing a parser by hand than the earlier canonical LR parsing method developed in 1965 by Donald Knuth (canonical LR parsers, and bottom-up parsers in general, require large tables to represent the grammar, and are an unholy nightmare to develop entirely by hand, but they are much more efficient than R-D parsers and are well suited to automatic parser generation).

Recursive descent works pretty well... for small projects done by hand. It is where just about everyone studying compilers starts out, and I can't fault you for going that route... except that I am not sure you really understand it yet, as I get the impression that you still haven't read up on formal compiler design.

This is almost certainly a mistake. Lexical analysis and parsing are, far and away, the best understood topics in the entire field of computer programming, with the possible exception of relational algebra, and they have uses far beyond compilers and interpreters. Notice the dates I quoted - most of them are from over 50 years ago. These are topics that academic computer scientists and working programmers alike understand better than anything else, and the techniques for them are varied, effective, and solid.

If you don't at least try to learn more about the prior art before tackling writing an actual compiler, even a toy one, then you are doing yourself a disservice.

Maybe I am wrong, and you are simply having trouble expressing what you are doing in a foreign language. But you have to understand that we are trying to help you, trying to give you what we consider the best advice possible.

It's dangerous to go alone - take this! hands ~ a copy of the free PDF version of Compiler Construction by Wirth

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
μή εἶναι βασιλικήν ἀτραπόν ἐπί γεωμετρίαν
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


Last edited by Schol-R-LEA on Wed Nov 29, 2017 11:22 pm, edited 9 times in total.

 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Wed Nov 29, 2017 10:23 pm 

Joined: Thu Nov 16, 2017 3:01 pm
Posts: 31
iansjack wrote:
I don't quite understand how you can use dynamic libraries if your output is raw binary. How does relocation work without relocation information?


There's still relocation information; it's just in a separate file. Basically, a relocation is just a tuple of a pointer into the raw assembly code and an algorithm for decoding and encoding the offset from a base address (usually the load address).


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Thu Nov 30, 2017 8:59 am 

Joined: Tue Mar 06, 2007 11:17 am
Posts: 1032
@iansjack, At least for PE DLLs, relocation is currently left to the OS via its internal paging. The only detail is that the DLL has a high base address that is fairly generic and common to other normal DLLs, so the OS has the chance to virtualize that address. I've tested it with my skeleton DLL/EXE, and it can load as many instances of the program as desired without failing.

For other formats like ELF I will have to use NASM at this point, so I have to learn the actual details of relocation so that I can generate them from the compiler.

You can inspect the base address used for the skeleton PE DLL and EXE here. The one from the DLL is much higher in this case:
http://sourceforge.net/projects/x86winapiasmpedllskeleton32/files/



@Schol-R-LEA, One reason to put structure arrays in files and load elements individually in memory is that I also want the compiler to serve as a "source code explainer", which can be able to display the list of all functions in order, the list of global variables, and solve them from their custom data types down to generic-only C types as well as down to their assembly code for the target platform. It will be very informative.

Another reason is that I suspect that there could be a point where the programs will be so big that I will have to free and reload certain source files to hold their identifiers in memory.

In any case, the compiler is full of wrapper functions that are opcode-like, so they can easily be rewritten internally if in-memory structures are used later. Besides, I will have to write the data to disk anyway to be able to study the pieces that the source code is using (list of files, identifiers, variables, functions...).

_________________
http://www.archefire.org/_PROJECTS_/ (udocproject@yahoo.com)

YouTube Development Videos:
http://www.youtube.com/user/AltComp126/videos

Current IP address for hosts file (all subdomains):
190.150.9.244 archefire.org


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Thu Nov 30, 2017 9:51 am 

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1040
Location: Athens, GA, USA
~ wrote:
@Schol-R-LEA, One reason to put structure arrays in files and load elements individually in memory is that I also want the compiler to serve as a "source code explainer", which can be able to display the list of all functions in order, the list of global variables, and solve them from their custom data types down to generic-only C types as well as down to their assembly code for the target platform. It will be very informative.


I still think you don't really get what I am trying to tell you. This approach simply isn't sufficient. Source code in a high-level language can't simply be matched to a string of opcodes - you actually need to parse it. I am not seeing any evidence so far that you even understand what that means, or that you need to start with at least a basic defined grammar, in some notation such as Backus-Naur Form or Railroad Diagrams, in order to know how to parse it.

In fact, this talk about 'structure arrays' makes me wonder if you even know the sorts of basic data structures you will need - this is something which really calls for a tree, as an array is simply too monolithic to be suitable (you can use one, but you'd waste a lot of time and effort doing so). At the very least, if you insist on using a linear data structure, a linked list would make more sense, as it is a lot better for handling data whose size isn't known ahead of time, and is a lot less likely to waste memory than a fixed-size array (or even a dynamic array).

My overall impression so far is that you have lemon juice all over your face, and don't realize that it isn't hiding anything.

But again, that impression on my part could be simply because you haven't made enough of your knowledge clear to us for me to judge it. I at least am aware that I am ignorant of how much you actually know. If you do know these things already, I would appreciate it if you demonstrated that knowledge, because so far, all you've shown is ignorance of a particularly pig-headed, stubborn, and utterly willful sort.

I am going to give some free advice again (relating to a PM conversation I had on this subject with someone else here, actually). Both MIT OpenCourseWare and Stanford Open Classroom have free video courses and e-texts on their websites that cover compiler theory and design, and while I will admit that I haven't gone over their courses, I intend to - I want to see which one I think is the better one so I can be more focused in my recommendations, and besides, the topics are deep enough that you can almost always learn something new from a different presentation of them.

If you don't feel comfortable taking such a course in English, you could try taking one in Spanish, or some other language which you feel more comfortable with (I don't know what languages you know other than Spanish and English). I would expect that there is at least one Spanish-language online course on the topic, and would guess that there are several around if you look.

But the point is, you don't seem to know enough about even the basics of this to proceed as things are. You need to Get Good, and for this topic, that means you need to study.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
μή εἶναι βασιλικήν ἀτραπόν ἐπί γεωμετρίαν
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


 Post subject: Re: Public Domain C/C++ Compiler
PostPosted: Fri Dec 01, 2017 1:09 pm 

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1040
Location: Athens, GA, USA
Sorry for the double post, but I didn't want to edit the previous one in case ~ had already read it.

I was re-reading some recent articles on The Daily WTF, and came across one which I think should serve as a warning to ~ about where his current approach appears to be going: Theory vs. Reality

TL;DR: in this case it is the lack of applying theory to reality that is the problem - specifically, not considering beforehand whether there might be an approach better suited to the task using an appropriate data structure - which is why I thought it apt.

@~: Again, I don't know if my impression is correct or not, but based on what you have said about how you are going about this project, you will soon find yourself drowning in ad-hoc code, just as the storyteller did, and at that point you will very likely end up having to scrap your current approach and reconsider it in much the same manner. All Solar and I are trying to do is save you the trouble.

Your choices are these:
  • Follow our advice by setting the code aside for a while; study up on the known solutions; choose the ones you feel would work best for your goals; plan out a design for the compiler; then get back to writing it only after all of that; or,
  • Try out your current approach, fail, then do everything I just said.

This isn't something where you can wing it and expect success. No matter how long reading up on the topic might take, trying to do it without reading up on it first will certainly take longer.

The only advantage to how you are doing it now is that, by failing at it, you will be forced to learn that lesson in a very forceful and painful manner, though at least you can expect the lesson to stick ("Experience keeps a dear school, but fools will learn in no other." - Benjamin Franklin). The choice, at this point, is yours.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
μή εἶναι βασιλικήν ἀτραπόν ἐπί γεωμετρίαν
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group