Making a Programming Language From Scratch/Decisions
Introduction
You are aiming to make a programming language. However, many languages already exist. What is the purpose of yours? The diversity of languages comes mostly from small differences in syntax and in the code they generate, and above all from differences in memory format and speed of execution. These two goals generally cannot be fully satisfied together, so you will have to make tradeoffs, and those tradeoffs will in the end determine the importance of your language and where it stands among the others. Some factors, such as the general scope of variables, redundancy of instructions, and style of compilation, may seem trivial at first, but when they add up over a long source file they can mean the difference between a program that runs in minutes and one that runs in hours. These small factors remain the driving force behind the continuing research into new and better programming languages. The following sections cover some major decisions you have to make before you even start on your language.
Base Languages
It is very difficult to compile directly to machine language, so compilers generally target the next best thing, assembly language. Writing the compiler itself in assembly language is another matter: even the file input and output operations needed for compilation are complex and difficult to write. Fortunately, there are high-level languages in which we can write our compiler. Note that the choice of base language does not affect the quality of the output code; it affects compilation speed and how easy the compiler is to write. This project is likely to be very long, so it is important that you select a language in which you are comfortable.
Programs written in compiled languages such as C and C++ usually run much faster than those written in interpreted languages. However, compiled programs are less portable, so the choice of base language depends entirely on your priorities.
Memory Format
Memory nowadays is plentiful (8 GB of RAM is common), so a format that was once unthinkable, a default static memory layout, can be considered. In this format, the variables declared within a function remain active, i.e. they retain their values even after the function returns control, up until the termination of the program. Functions can therefore hold state between calls, which can allow for shorter and more efficient programs.
Most languages, however, use 'stack' variables: variables are pushed onto a region of memory called the stack and remain there until the function ends. When their life ends, the stack is not cleared; the space is simply made available to later variables, which overwrite the previous ones. A new variable that is read before it has been initialized therefore holds a 'garbage value' left behind by an earlier one, which can lead to program errors. Stack allocation also demands a more complicated system of data management. On the other hand, stack variables occupy much less space, because the space is reused, and they are also accessed faster.
Further, there is the option of register variables, which stay in processor registers and can be accessed up to 30 times faster than stack variables. However, register variables require a great deal of complex data management, which makes the compiler more complex.
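The difference between static and stack storage can be seen in C, which offers both. The following is only a minimal sketch: the function names `count_static`, `count_stack` and `show_storage_difference` are made up for this illustration and are not part of any compiler developed in this book.

```c
#include <assert.h>
#include <stdio.h>

/* Static storage: `calls` exists for the whole run of the program,
   so its value survives from one call to the next. */
int count_static(void)
{
    static int calls = 0;   /* initialized once, before the first call */
    calls += 1;
    return calls;
}

/* Stack (automatic) storage: `calls` is created afresh on every call.
   It must be initialized explicitly; otherwise it would hold whatever
   garbage value the previous user of that stack slot left behind. */
int count_stack(void)
{
    int calls = 0;
    calls += 1;
    return calls;
}

/* Hypothetical driver that prints the results of two calls to each. */
void show_storage_difference(void)
{
    int s1 = count_static();
    int s2 = count_static();
    int a1 = count_stack();
    int a2 = count_stack();

    assert(s2 == s1 + 1);        /* the static value was carried over */
    assert(a1 == 1 && a2 == 1);  /* the stack value started again     */

    printf("static: %d then %d\n", s1, s2);
    printf("stack : %d then %d\n", a1, a2);
}
```

A language with a default static layout would behave like `count_static` for every local variable, without the programmer having to ask for it.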
After that, there is the .[MODE] directive in assembly, which depends on the type of processor you are writing the compiler for. For the 80386 processor use .386, for the 80486 use .486, and so on. Some assemblers support multiple modes for one processor, but each mode is best suited to its own specific processor. Furthermore, there is the .MODEL directive, which selects a memory model such as FLAT, TINY, MEDIUM, LARGE or HUGE. Each is best suited to its own type of program. In this book we use the FLAT model. A lot also depends on the mode of the OS. Here we use protected mode, which provides "virtual memory": in the 32-bit FLAT model each program sees a 4 GB linear address space, although the amount actually usable can vary from machine to machine.
Moreover, when it comes to actually addressing memory, there are NEAR pointers, which hold only an offset within the current segment, and FAR pointers, which hold a segment as well as an offset. In protected mode with the FLAT model, NEAR pointers are assumed.
Another thing to consider is the size of your variables. Three types are commonly used: int, char and float. In C compilers for 32-bit environments, int takes 4 bytes, char takes 1 byte, and float takes 4 bytes (double takes 8). There are further type keywords such as short, long, single and double, which you may or may not include in your programming language.
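To check the sizes a particular C compiler actually uses, you can ask it directly with `sizeof`. The sketch below assumes nothing beyond the standard library; `width_in_bits` and `print_type_sizes` are names invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: width in bits of an object that occupies
   `bytes` bytes on this machine. */
size_t width_in_bits(size_t bytes)
{
    return bytes * CHAR_BIT;
}

/* Prints the sizes the host C compiler uses for the common types. */
void print_type_sizes(void)
{
    /* The C standard guarantees only that char is exactly 1 byte.
       Every other size printed below is chosen by the compiler
       and the target machine. */
    assert(sizeof(char) == 1);

    printf("char   : %zu byte(s), %zu bits\n",
           sizeof(char), width_in_bits(sizeof(char)));
    printf("short  : %zu byte(s)\n", sizeof(short));
    printf("int    : %zu byte(s)\n", sizeof(int));
    printf("long   : %zu byte(s)\n", sizeof(long));
    printf("float  : %zu byte(s)\n", sizeof(float));
    printf("double : %zu byte(s)\n", sizeof(double));
}
```

Calling `print_type_sizes()` on the machine you are targeting gives a reasonable starting point for the sizes of your own language's types, though nothing obliges you to copy them.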
Most assemblers handle memory allocation on their own. With some, however, you have to be more explicit and give precise instructions as to exactly where you want your variables to be.
Speed
Every language-maker knows that by far the most important factor distinguishing programming languages these days is speed of execution. However, speed does not come easily: if you want it, be prepared for many days of hard labor on what is known as optimization. Moreover, speed often clashes with memory layout, so you need to decide which is more important to you. Redundancy is another factor, but removing redundant instructions requires long algorithms that check for every possible case of redundancy.
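As a small illustration of what checking for redundancy involves, the sketch below removes two simple kinds of redundant instruction from a list of generated instructions: a move of a register onto itself, and a push immediately followed by a pop of the same operand. The `Instr` record and the function names are assumptions made for this example, not the representation used elsewhere in this book, and the self-move rule assumes 32-bit code. A real optimizer has to handle far more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified record for one generated instruction. */
typedef enum { OP_MOV, OP_PUSH, OP_POP, OP_ADD } Opcode;

typedef struct {
    Opcode op;
    char dst[8];   /* destination operand, or the only operand of push/pop */
    char src[8];   /* source operand; empty for push/pop */
} Instr;

/* "mov eax, eax" changes nothing in 32-bit code. */
static int is_self_move(const Instr *i)
{
    return i->op == OP_MOV && strcmp(i->dst, i->src) == 0;
}

/* "push ebx" directly followed by "pop ebx" leaves ebx and the
   stack pointer exactly as they were. */
static int is_push_pop_pair(const Instr *a, const Instr *b)
{
    return a->op == OP_PUSH && b->op == OP_POP
        && strcmp(a->dst, b->dst) == 0;
}

/* Removes the redundant instructions in place and returns the new
   number of instructions in `code`. */
size_t remove_redundant(Instr *code, size_t n)
{
    size_t out = 0;

    assert(code != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (is_self_move(&code[i])) {
            continue;                 /* drop the instruction */
        }
        if (i + 1 < n && is_push_pop_pair(&code[i], &code[i + 1])) {
            i++;                      /* drop both instructions */
            continue;
        }
        code[out++] = code[i];        /* keep the instruction */
    }
    return out;
}
```

Even this tiny pass looks at only one or two instructions at a time. Every further pattern you want to catch needs its own check, which is why redundancy removal grows long so quickly.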