Programming Language Concepts Using C and C++/Program Level Structure

Our investigation of the program level structure will initially be guided by the structure of a compiler. We will first take a look at how program entities, tokens, are formed linearly and how these are further grouped into hierarchic structures. The former corresponds to the way a lexical analyzer views the source program while the latter corresponds to the hierarchic structure imposed on the stream of tokens by the syntax analyzer. This will be followed by a treatment of subprogram structure and the data objects available to subprograms. We will end our discussion with an elaboration of different types of parameter passing mechanisms.

Program Text

The syntax of any language is a set of rules governing the construction of strings of characters that are grammatically correct. Program text may be considered a string of characters that is ultimately translated into an ordered sequence of machine instructions. We will now examine these characters themselves, the way they are organized into the substrings that make up the pieces of program text, and the way they are arranged over the many lines of code that make up a complete program.

Character Set

Just like alphabet symbols in communication using natural languages, we use a character set to compose programs in a programming language. While all languages can boast in their character sets the 26 uppercase letters of the English alphabet, the 10 decimal digits, blank symbol, and symbols for punctuation and fundamental operators, some older languages do not support the lowercase letters. Examples to programming languages excluding lowercase letters from their character sets—because the character codes commonly used in the ‘50’s, when these programming languages were first developed, did not include them—are FORTRAN and COBOL. Another programming language example with this property is APL with its special character set full of Greek letter symbols.

In computer memory, these characters may be encoded in different ways. The most popular choice used for representing these characters is the ASCII standard. The Unicode standard, basically a superset of the Anglo-American ASCII, provides support for all natural languages spoken in the world and much more. Therefore, Unicode, supported by Java, C#, and many other programming languages, has gained acceptance as software internationalization has become more important.

Example: A Java program to calculate the summation of numbers from 1 to 10.
`public class TryUnicode { public static void main(String[] args) { int toplam = 0; // "toplam" is the Turkish word for "sum" for (int sayaç = 1; sayaç < 10; sayaç++) toplam += sayaç; // "sayaç" is the Turkish word for "counter" System.out.println("1'den 10'a kadar sayıların toplamı: " + toplam); // "Summation of numbers from 1 to 10: " } // end of void main(String[]) } // end of class TryUnicode`

In addition to these two, we have a series of ISO/IEC standards that enable single-byte representations of various alphabets. Known as ISO-8859-n, these character encodings are ASCII-based and provide letters special to the language/language group in the upper half of the relevant character table; lower half of all ISO-8859-n encodings are exact replicas of the ASCII character encoding. Thanks to ISO-8859-n, English and various other languages can now coexist in the same program source. However, using languages from different ISO/IEC encodings means we must use multiple encodings, which may require quite a bit of trickery. Solution to our worries lies in making use of UTF-8, the variable-length encoding based on Unicode. In the UTF-8 encoding, ASCII-subset of Unicode, which has quite a few characters in common with other Latin-based alphabets, is represented in a single byte while the rest are represented in 2, 3, or 4 bytes.

Operators

Operators are value returning, generic functions. Operator names usually consist of a single character or a string of two characters. Although many languages confine the semantics of available operators to the language specification, some, including C++, C#, and Ada, provide the programmer with a facility to redefine this semantics. Some, such as PROLOG, may even let you construct new operators and provide the semantics for them.

Example: An Ada83 function definition that overloads the “+” operator to add two operands of type Fraction.
`FUNCTION "+" (Left, Right: Fraction) RETURN Fraction IS Result: Fraction; BEGIN Result.Numerator := Left.Numerator * Right.Denominator + Right.Numerator * Left.Denominator; Result.Denominator := Left.Denominator * Right.Denominator; Reduce(Result); END "+"; ... f1, f2, f3: Fraction; ... f3 := f1 + f2; ...`

Operators may be roughly divided into the following groups:

Arithmetic operators are those that provide fundamental operations such as addition, subtraction, multiplication, and division. Some languages, such as FORTRAN, also include an exponentiation operator.
Boolean operators are those that work only on logical values, true and false. Examples include conjunction, disjunction, and negation.
Relational operators test for the truth-value of a relation between two operands. If the relation holds, result returned by the relational operator is true. Otherwise, it will be false. Well-known operators in this class are less-than, equality, inequality, and greater-than.
Bitwise operators involve boolean manipulation of individual bits. Typical operators in this category are bitwise and, bitwise or, bitwise negation, and bitwise exclusive-or.

One other possible division, which in many cases serves to figure out the precedence levels, is as follows:

Unary operators such as -, +, NOT (Highest precedence)
Multiplicative operators such as *, /, DIV, MOD, AND
Additive operators such as +, -, OR
Relational operators such as <, >, <=, >=, <> (Lowest precedence)

Delimiters

Delimiters are symbols or keywords that serve to separate pieces of program text. In a way, they are the punctuation for your program.

They come in two flavors:

Separators indicate the end of one program entity and the beginning of the next. For example, a blank space often serves to separate identifiers and expressions within statements; Pascal-based languages use a colon to separate a variable name from its type declaration.

Example: Variable definition in Pascal.
`var ID_Number : integer;`

Many different symbols may be used to delimit the end of one program statement and the beginning of the next. A popular choice among ALGOL-based programming languages is semicolon. However, it should be kept in mind that Pascal-based and C-based languages use semicolon slightly differently. In the former group, semicolon is used kind of like a separator operator. That’s why the last statement of blocks in these languages does not end with a semicolon. (That would be like writing a + b + instead of a + b.) In the latter group, semicolon is used as a terminator. That is, every statement, regardless of where it lies, must end with a semicolon.

Brackets are symbols or keywords that come in pairs. They are used to set off a program code segment. Examples are ( ) for delimiting subprogram parameter list, [ ] for specifying the relevant array component, and { } (or BEGIN-END) for building compound statements

Example: A C block to swap the values of two variables.
`if (a < b) { int temp = a; a = b; b = temp; }`

Program Text Appearance

Fixed vs. Free Format

In 1950’s, people, or rather a select group of researchers, used to enter programs by punching individual instructions on cards. These punched cards had fixed formats. In order to ease the backward-compatibility with these cards, early programming languages were designed to be fixed-format.

Structure of a pre-FORTRAN 90 source line
Col #	Function
1	C for comment
1-5	Line number
6	Continuation line indicator
7-72	Fortran statement
73-80	Identification and sequencing, not read by the compiler

Structure of a COBOL source line
Col #	Function
1-6	Sequencing, not used by the compiler
7	* for comment or - for continuing word or string
8-11	Headers of program units (A margin)
12-72	Statements (B margin)
73-80	Program identification, not used by the compiler

Later on, as terminals started to be used for input/output purposes, free-format languages gained the upper hand. The lineage of these languages goes back to ALGOL.

Needless to say, there is not much merit in comparing these two formats. An indication is the fact that FORTRAN, starting with FORTRAN 90, is now free-format. But, this does not mean that all those code written in fixed-format programming languages will disappear into thin air. One still has to have an idea about fixed-format for being able to maintain such code.

Pragmatic Concerns

A program is not only a recipe for implementing an algorithm in a specific language, but also a way to communicate your ideas to fellow programmers. So, in addition to what is provided by the programming language and the environment we are working in, we should strive to make our code as clear as possible. In order to achieve this we need to make use of a number of methods and conventions. These are source code indentation, blank lines, and comments. Liberal and consistent use of these three, without obscuring the code, will make the program easier to understand and therefore easier to maintain.

Program Structures

Next in our investigation is the hierarchic structure of a program. In other words, what is contained in a program? Can this containment relationship be extended to multiple levels? The answer to the second question is ‘yes.’ What we will do in this section is to figure out these levels.

Identifiers

Identifiers are words that provide access to program and data entities. An example to a data-level identifier is a variable name. An example to a program-level identifier is a procedure name.

Used in a similar context with a variable name is a reference. A reference must first undergo some degree of evaluation first. Examples are a function reference and an array component reference.

Some identifiers are supplied by the programmer, whereas some are built into the programming language. The latter are often referred to as keywords. In general, programming languages forbid the re-use of these keywords as user defined identifiers. A notable exception to this is PL/1.

Expressions

Expressions, composed of operators and operands, are sequences of operations to be evaluated. An expression evaluates to a single result and is, thus, very much like a variable except that it does not have a name. As a matter of fact, temporary variable names are given to (sub)expressions during execution, but these are not available to the programmer.

Expressions are inherently recursive; one expression may be contained within and defined in terms of another. This means that the result obtained by evaluating one expression may then be used as an operand in evaluating the next.

Statements

Statements may be defined similarly to a program or a subprogram. The major difference is their abstraction level. What is considered a simple statement at one level may be considered a compound statement at another level. They (statements, subprograms, and programs) all accomplish a particular task.

Statements may be classified as

Executable: This type of statements cause an action on the part of the computer such as assignment of a (new) value to a variable. Informative: Also known as non-executable statements, these statements describe the nature of the data used by the program. An example to this type is the variable declaration statement.

A further classification may be made on the basis of the statement grouping.

Simple statements: Simple statements contain a single action. Typical examples are I/O statements and assignment statement. Compound statements: A compound statement is a program structure that is composed of zero or more simple statements, often delimited by a pair of matching keywords or brackets. An example to the former is the BEGIN/ END pair in Pascal. Curly braces, { and }, used for the same purpose in C-based programming languages are examples to the latter. Python programming language is rather unique in its way of delimiting compound statements. Instead of using delimiters the language processor figures out blocks by making use of indentation in the program.

The compound statement may be used wherever a single statement is allowed since it is treated, syntactically, as a single statement.

Subprograms

A subprogram is a block of statements often with a name; it is relatively independent and it performs a specific task.

A subprogram is perceived as a single statement by its user, whereas the implementer sees it as a group of related informative and executable statements. This goes to emphasize the similarity between a statement and a subprogram: What is considered a subprogram at one abstraction level may be considered a simple statement at another level or vice versa.

Internal Subprograms: These are named program units that are part of and contained within the text of the program. An internal subprogram is compiled together with the program in which it is called.
External Subprograms: These are named program units that are outside the text of the program and may be independently (and/or separately) compiled into object modules that are then used (linked) as needed.

Modules

A module is a grouping of related subprograms packaged as an independent unit for reuse by other units, probably accompanied with relevant data. Examples include classes in Java and C#, modules in Modula-2, and C source files.

Source code for a module and corresponding object code are generally discriminated and referred to as source module and object module, respectively.

Programs

A program may be loosely defined as an independent unit that can be executed. It often has a name and may invoke one or more subprograms, in which case, it might be called the main program.

One should note the difference between a program and a process. A program is an executable file, and a process is an instance of the program in execution. That is, in a way a program is to a process what a class is to an object. When a process is spawned its image in memory is created using the related program as its template.

Jobs

In addition to these program structures, a job may be defined as a unit that interfaces with the operating system and may contain within it several programs that are to be linked together and/or executed in sequence. Examples are scripting language programs, such as batch programs in DOS.

Example:
`gcc –o $1 $1.c if [ $? –eq 0 ]; then mv –f $1 Executables/$1 fi`

The above job compiles a C program and moves the resulting executable into a directory named Executables. Second action is not taken unless the first one is successful. This is accomplished by checking the exit value of the gcc command. Sticking to a widely adopted convention; on success gcc exits with 0, on failure it exits with a nonzero value. In bash, the exit value of the last command can be found by inspecting $0, as is done in the second line.

Subprogram as a Program Structure

A subprogram, like the program itself, contains both informative and executable statements.^[1] The declarative portion of the subprogram contains the informative statements consisting of declarations that pertain to the subprogram and provides a syntactic interface between the action statements contained in the subprogram and the caller.

The action portion contains executable statements; i.e. the implementation. One (or more) statement is designated as the entry point to the subprogram. Usually, there is only one entry point and it is the first statement. The exit point may be the last statement in the subprogram or an explicit return or exit statement. It is not advisable to have more than one entry and exit points.

Blocks

A block is used to collect a group of related statements. Similar to a subprogram it opens a new scope of its own. It is not activated by an explicit call statement, but by execution of the normal sequence of instructions in the program.

Programming languages permitting nested blocks are called block structured languages. This is sometimes contented by puritans, who claim that for a language to be classified as block structured it must let the user define nested subprograms.

Example: C fragment to implement bubble sort.
`#define FALSE 0 #define TRUE 1 ... pass = 1; do { swapped = FALSE; for (j = 0; j < noOfComponents - pass; j++) { if (a[j] <= a[j + 1]) continue; swap: { int temp = a[j]; a[j] = a[j + 1]; a[j + 1] = temp; swapped = TRUE; } /* end of swap block / } / end of for loop */ pass += 1; } while (pass < noOfComponents && swapped);`
Note: This is not the way to go about implementing bubble sort; some unnecessary and ugly code has been added to show the use of blocks in C.

Paragraphs

A paragraph is an internal subprogram, a named sequence of statements. It is invoked by means of a pseudo-call statement such as the PERFORM paragraph-name in COBOL. They do not have a declarative part. In other words; they do not let you declare local variables, nor do they accept parameters.

Example: A COBOL paragraph to swap the values of two variables.
`PERFORM SWAP. ... SWAP. MOVE A TO TEMP. MOVE B TO A. MOVE TEMP TO B. SOME_OTHER_PARAGRAPH. ...`

Note that the paragraph does not take any parameters. That’s why, all variables used in the example must be declared as global variables.

Procedures, Subroutines

Often, the terms procedure and subroutine are used to refer to subprograms (internal or external) that do not return any value. They are used for their side-effects; that is, for printing output on the screen, changing the memory through a group of assignment statements, and so on. They are activated through an explicit call and take parameters. In some programming languages, such as the Pascal-based ones, they can be nested.

Functions

A function may be called a value returning procedure or a typed procedure. They are activated simply by using the name of the function (along with any needed arguments) in any place in the program text where a variable name would be allowed as an r-value.

Built-in functions are those that are supplied along with the compiler and are thus part of the vocabulary of the language. A generic function is one whose type depends on the types of the arguments used at the point of call.

Example: A generic C++ function that finds the minimum of its two arguments.
`template <class Type> Type min(Type a, Type b) { return a < b ? a : b; } // end of <Type> min(<Type>, <Type>)`

Access to Data in Subprograms

Definition: The period of time during program execution when storage is bound to an object—that is, a region of memory that can be examined and stored into—is referred to as the object’s extent.

Definition: The scope of a declaration is the region of the program text over which that declaration is active.

Realize scope is a static notion while extent is a dynamic one. Scope can be determined by scanning the source, something that cannot change once a program starts running, and hence is a static notion. On the other hand, extent depends on the control flow of the program, something that may be changed by the input, and hence is a dynamic notion.

Global Variables

When a subprogram is invoked some data, the global variables, may already be accessible to it without anything special being done. Scope of a global variable extends from its declaration point to the end of the source file and there is only one copy of it. Having a single copy means that it can have only one variable name-address binding, which implies that the memory location allocated to it does not change throughout the program execution. Such an object is said to have static extent; its relative address can be determined before the beginning of execution, i.e. at the compile time; it is stored in the memory area called static data region. Looking at the source code, one can calculate the address of all static objects—that is, all global and static local objects—relative to the beginning of this region.

Global variables in memory

int i2; int i1; void f(...) { ... } /* end of void f(...) */ int main(void) { ... ... } /* end of void main(void) */

Code
0 int i2 4 int i1
?

Class Variables

Class variables are used to represent attributes that do not change from one instance to another. This means a single copy can be shared by all instances, which makes it possible to determine the [relative] address at compile-time.

Local Variables

Local variables are made accessible on invocation of the subprogram. Such variables are said to have a local scope from the point of their declaration to the end of subprogram they are declared in. Likewise, they are visible, unless hidden by another identifier with the same name, from the point of their declaration to the end of the subprogram. Their extent (lifetime), normally limited to the particular subprogram invocation, may in some programming languages be extended to the execution of the entire program by declaring them to be static.

Parameters of a subprogram may be thought of as special local variables initialized from outside the subprogram.

If we limit ourselves to simple call, treatment of local variables will be very much like global variables: there will be only one copy of them; there will be a single (variable name-address) binding. The only difference will be the scope of the variable.

This type of allocation for local variables has the advantage of enabling address determination statically. This can be done, because at any time during the program execution a function can be activated at most once. On the down side, recursion is not possible; the programmer must simulate recursive algorithms or code the non-recursive version of the algorithm.

Local variables in memory (no recursion)

int i2; int i1; void f(...) { double d1; double d2; ... } /* end of void f(...) */ int main(void) { ... ... } /* end of void main(void) */

Code
0 int i2 4 int i1 8 double d1 16 double d2
?

Question

Before FORTRAN 90, FORTRAN supported simple call only. What do you think is the reason for this?

Answer

When it (FORTRAN) first came out FORTRAN programs had to rival the speed of programs written in assembly and machine code. This meant any choice that would slow down the programs had to be avoided. For this reason recursion, which is absent in simple call, was excluded from the programming language.

In simple call, a subprogram cannot call itself, directly or indirectly. This means there will be no cycles in function invocation graph; at most one set of local variables will be allocated in memory at any time. For supporting such a scheme, the allocation method provided above, where memory for local variables of all subprograms are allocated before runtime, is sufficient and compliant with the goal set in the above paragraph.

This is yet another case showing design decisions—in this case, exclusion of recursion from FORTRAN—that look like a blunder at a later date may actually have very good reasons.

An attempt at placing local variables in memory (with recursion)

int i2; int i1; void f(...) { double d1; double d2; f(...); } /* end of void f(...) */ int main(void) { ... ... } /* end of void main(void) */

Code
0 int i2 4 int i1 8 double d1 16 double d2 8 double d1 16 double d2 8 double d1 16 double d2 ⇓
?

The possibility of recursion, the indefinite number of times a subprogram may be called, a separate set of parameters and local variables for each subprogram invocation, and the nature of subprogram call termination order impose the use of a separate dynamic, stack-like memory region for storing local variables and parameters together with some other information. This region is called the run-time stack. Each time a subprogram is invoked, a new record of the invocation is constructed and pushed onto this stack. This record is referred to as the activation record of that particular subprogram invocation; the slot into which the activation record is inserted into is called the activation frame.

The fields contained in a typical activation record:

Temporary values, such as those arising in the evaluation of expressions, are stored in the field for temporaries.
The field for local data holds data that is local to an execution of a subprogram.
The field for saved machine status holds information about the state of the machine just before the subprogram is called. It includes the values of the program counter and machine registers that have to be restored when control returns from the subprogram to the caller.
The textual link is used to refer to nonlocal data held in other activation frames. In C-based languages textual links are not needed because nested definition of subprograms are not allowed. Languages allowing nested subprogram definition, such as Pascal, require the use of this field.
The dynamic link points to the activation frame of the caller.
The caller supplies the parameters to the callee in the actual parameters field. The size of this area for a particular function does not change. (Variable length argument list forms an exception to this.) In practice parameters are often passed in machine registers for greater efficiency.
The callee returns the result value to the caller in the returned value field. Again, in practice this value is often returned in a register for greater efficiency.

Typical fields in an activation record
Returned value
Actual parameters
Textual link
Dynamic link
Saved machine status
Local data
Temporaries

Subprogram Call and Return

Before the control is passed to the callee, arguments are evaluated and pushed on top of the run-time stack and LB (Local Base) is adjusted to point to the activation frame of the callee. Previous value of LB, that is the starting address of the activation frame of the caller, is, along with some other control information, pushed onto the run-time stack. Each time something new is pushed onto the stack ST (Stack Top) is advanced so that it shows the first available position.

Insert figure here

Upon entry to the callee, local variables are allocated on the stack. As need may arise in the evaluation of expressions, temporary variables are also created on this stack. When the subprogram completes execution and returns back to the caller; local variables are freed, parameters are disassociated by popping the current activation record off the run-time stack, and, in the case of a function subprogram, a value is returned to the caller.

Question

As a rule of thumb, returning pointers to local variables as the result of a subprogram is considered bad programming and should be avoided. What do you think is the reason behind this?

Answer

When a subprogram completes its execution and returns back to the caller, its activation record is popped off the run-time stack. This means the area of memory holding the variables local to the subprogram is now available for activation records that are to be created as a result of subsequent subprogram calls. So, it is very likely that the address held in our pointer is now overwritten with parts of other activation records; what we get when we dereference the pointer is not the value of the local variable but the value of some unrelated piece of data.

For optimization purposes, a compiler may choose to store some information about the current subprogram invocation in registers, instead of the run-time stack.

Local vs. Global Variables

Variables defined within the confines of a subprogram are local variables. They come to life only when the subprogram is invoked (an activation record is created for the subprogram and pushed onto the run-time stack) and, once the execution of the subprogram is completed, they cease to exist (the activation record of the subprogram is destroyed by popping it off the run-time stack). They do not retain their values between different activations of the same function.

Local declarations are used in order to limit the scope of a variable. The use of local variables cuts down on the problem of data name clashes. It also helps create independent units, namely subprograms. Global variables, on the other hand, are declared in the main program, remain in existence throughout the course of execution of the program, and are accessible from all program units. For this reason they can be allocated at compile-time, before the program even starts to run, and deallocated at the termination of the program.

The use of global variables in subprograms is strongly discouraged. The side-effects caused by the changes to the global variables in the subprograms may give rise to their returning different results for the same parameters. It also gives rise to strong coupling between different subprograms. This makes testing of subprograms more difficult and causes real debugging and maintenance headaches. This effect may be minimized by grouping such strongly-coupled subprograms and placing them inside a module, class, or, if such constructs are not directly supported by the language, source file.

Some languages provide a cross between local and global variables: static local variables. Such variables have local scope, like local variables, and static extent, like global variables. That is, they retain their values between calls while still remaining accessible only to the subprogram. Memory is allocated to them statically at compile-time (from the static data region) and deallocated at the end of the program execution.

Example: Static local variables in C.
`void silly(void) { static int noOfTimes = 1; printf("This function has been called %d times.\n", noOfTimes++); } /* end of void silly(void) */`
In the first invocation of this function, the local variable `noOfTimes` has 1 as its value. At the end of subprogram execution, the `noOfTimes` variable has the value 2. This value is preserved between the end of the subprogram and the next invocation. Next time the subprogram is invoked, `noOfTimes` has value 2.

Since the initialization of static local variables occurs, unlike plain local variables, before run-time, they are guaranteed to have consistent value. As a demonstration, run the following program and you will see that 5 is printed for staticLocal_i while a random value is printed for plainLocal_i.

Example: Initialization of `static` local variables in C.
`#include <stdio.h> void f(void) { goto L1; { static int staticLocal_i = 5; int plainLocal_i = 6; L1: printf("Static i: %d\nNon-static i: %d\n", staticLocal_i, plainLocal_i); } } /* void f(void) / int main(void) { f(); exit(0); } / int main(void) */` Remove the initialization of `staticLocal_i` and run the program for several times. You will see that it is always initialized to 0.

Question

In C, static local variables are by default initialized to zero whereas plain local variables are left uninitialized? What is the rationale behind this?

Answer

A static local variable comes into scope and becomes visible each time the subprogram it is defined in is activated. But unlike plain local variables, the memory allocated to them is not disposed of at the end of the subprogram call; they exist between subprogram calls. Only one copy of such a variable is shared among all subprogram activations. That is, there is only one copy of a static local variable in memory. In other words, there is a one-to-one mapping between variable name and address. So, it is possible for the compiler to compute the (logical) address of such a variable at compile-time. Now that such a variable can be easily initialized before run-time and this does not hamper the performance of the program, all such locations are initialized to 0 or the user provided value. For the sake of completeness, it should be noted that it is not the compiler that performs initialization to 0. What the compiler does is to count the number of such bytes and take a note of it. When the file created by the compiler/linker duo is later loaded into memory and made ready to run, loader in coordination with the virtual memory subsystem makes sure these variables are initialized to 0.

On the other hand, a plain local variable has to be initialized each time the subprogram is invoked and this incurs a performance penalty. For this reason, a programmer, if she wants to ensure that local variables are initialized, has to make the initializations of such variables explicitly.

Heap Variables

All data objects we have covered so far have lifetimes that correspond to the block they are defined in. Some languages, however, support features whereby the lifetime of data objects is not related to the program structure creating them. Records created dynamically in most ALGOL-based programming languages and manipulated through a pointer or a handle are examples of such data objects. Some other means of storage allocation and recovery must be used for the run-time representation of such subjects. Because of the random order of storage allocation and recovery made possible by these language facilities the storage involved, and its controlling mechanisms, are often referred as a heap.

In some programming languages the programmer explicitly discards heap memory by means of a language-supported operation (such as dispose in Pascal or free in C), while in others this is taken care of by part of the language runtime library called garbage collection.

Now that lifetime of a heap object is not imposed by the programming structure it is created in, such objects are said to have dynamic extent. Accessing heap objects indirectly, either through a pointer or a handle, gives these objects an anonymous nature. Do not forget, it is not the reference that is anonymous, but rather the heap object that is being referenced. Thanks to this property, by simple assignment, we can get two handles to stand for the same heap object. This means we can share the same data object. Realize that it is the heap object that is being shared, not the handle; handle is duplicated. If handles on the same heap object have different extents the heap object itself is expected to remain in existence until the last handle on it ceases to exist. That's why such objects are said to have dynamic extent.

With sharing instead of copying, indirect manipulation through handles instead of direct manipulation; one has to adopt a different programming style, a much more disciplined one. This is even more so in programming languages with no garbage collection facility. In addition to the dangling pointer problem, where a pointer holds the address of a non-exsting object, one has to avoid creating garbage: heap objects that are not pointed to by any pointers.

In addition to the dynamic structures such as lists and dynamic arrays, heap memory is most notably used for allocating instance variables—that is, non-static data used to represent the state of an object—in object-oriented programming languages. In order to provide for polymorphism, which requires the same handle to hold reference of objects with variable sizes, dynamic allocation of memory in the heap turns out to be the only choice.^[2]

Summary

The preceding presentation brings us to the figure given below. Basically, memory layout of a running program comprises code and data sections. The former is immutable, that is, cannot be altered during the runtime. For this simple reason, compilers mark the code pages as read-only. This prevents (accidental or malicious) overwriting of these pages. The data section, with the exception of compile-time constants, is meant to be read-write. Compile time constants can be incorporated into the code section as immediate mode data. Runtime constants (as found in C#, Java) can be manipulated through an address, very much like variables. But they must be stored in a read-only page.

Parts of a running program figure

As for the subdivision of the data section, static data region is used to allocate global and static local data objects. Initializations of such data objects are done before runtime. In case the programmer may not provide explicit initial values, the compiler guarantees them to have all zeros in their representation. This value is interpreted to be 0 for an integer, 0.0 for a real, null for a handle or a pointer, false for a Boolean, etc. This region is managed by the compiler, is allocated before runtime, and objects in it have static extent.

The run-time stack is used to allocate plain local variables, arguments, temporary variables, and some control information. Unless the programmer does not provide initial values, no default initial values are provided. This is because this region is allocated at runtime and anything done as extra incurs a performance penalty. So, if the program makes assumptions about initial values, the programmer must explicitly provide these. Although a subprogram may be called an indefinite number of times, the actions to be performed in the course of calling it (prelude) and the actions performed in the course of returning from it (postlude) are always the same. These templates are put in the code: every time a subprogram is invoked the prelude is executed and a jump is made to the start of the callee; every time a subprogram returns, the postlude is executed and control is returned to the instruction contained in the IP. This sequence being the same all the time means the compiler can manage the run-time stack. The extent of run-time stack data objects is limited to the lifetime of the frame they are created in; that is, they have local extent.

Finally, heap is used to allocate data objects with dynamic extent. Their usage pattern not being predictable means that it must be managed by the programmer, although some of this burden can be relieved by a garbage collector. The reasoning we made on the initialization of run-time stack data objects applies to the heap objects as well: if needed, the programmer must walk the extra mile.

Properties of different data sections table

Parameters

A subprogram, like a program, may be considered to be a mapping of a particular set of values into a particular set of results. Names for these values and results are collectively referred to as parameters. Input parameters refer to values that are input to, and operated on by, the subprogram. Output parameters receive their values by processing done within the subprogram; these are results. A specially designated output parameter is also called the return value of a subprogram. Input/output parameters provide two-way communication between the caller and the callee: They may be used both for data input to and result output from the subprogram.^[3]

Example: An Ada subprogram that finds the larger of its two arguments.
`PROCEDURE Max(num1: IN Integer; num2: IN Integer; larger: OUT Integer);` or `FUNCTION Max(num1: IN Integer; num2: IN Integer) RETURN Float;`

Example: An Ada subprogram that swaps the values of its arguments.
`PROCEDURE Swap(num1: IN OUT Float; num2: IN OUT Float);`

When a subprogram is called, the name of the subprogram is followed by a list of arguments, which is often enclosed in parentheses. The arguments correspond in number and order to the parameters listed in the subprogram specification. The parameters provided by the caller are called actual parameters, whereas the ones supplied in the subprogram specification are referred to as formal parameters.

Some programming languages may relieve the programmer from the restrictions stated in the previous paragraph. For instance, programming languages such as C, C++ , and Java provide the programmer with a facility to define subprograms with variable-size parameter lists. Similarly, programming languages such as Lisp and Ada enable programmers to match arguments with formal parameters in an out-of-order fashion.

Example: Passing variable number of arguments in C.
`int printf(const char* format, ...);`
Ellipsis in the above prototype means printf can be passed zero or more arguments after the first fixed argument. Format information passed in the first argument will presumably provide data to figure out the number and type of the following arguments.
`printf("Sum of %d and %d is %d\n", num1, num2, num1 + num2); printf("Class average in CSE224 came out as %f", avg);`

Example: Named parameters in Lisp.
`(defun make-complex (&key real-part imag-part) (cons real-part imag-part))`
The `&key` keyword above lets the users of `make-complex` pass the arguments by associating them with formal parameter names. Arguments passed in this way are called keyword arguments.
`(setq c1 (make-complex :imag-part 4.0 :real-part 3.0))`

Parameter Passing Mechanism

The mechanism by which an argument is transferred to the corresponding parameter of a subprogram is called parameter passing. The activity of parameter passing follows that of parameter evaluation in which each argument in the calling program, also referred to as the caller, is associated with its corresponding parameter in the subprogram, also known as the callee, and any necessary computation is carried out.

Call-by-value

When arguments are passed to parameters by value, the parameters are set up as variables local to the subprogram and initialized to the value of the argument. In effect, the values of the arguments are copied to locations set aside for the parameters. The storage locations of the arguments are, therefore, protected from change within the subprogram.

Example: A C function that returns the cube of its argument.
`// Not a particularly good implementation of calculating the cube of a given value. long cube(int n) { long result = n * n * n; return(result); } /* end of long cube(int) */ a = cube(m); ... ...`

Call-by-location

When passing an argument to a subprogram using this mechanism, the caller passes a reference, or machine address, for the argument. The parameter, then, is a pointer variable that point to the location of the argument. Hence, it is also called call-by-reference or call-by-address.

Rather than sending a value to the subprogram, this method results in the two modules sharing the data residing in some storage location. The effect of the data sharing that occurs is that the variable argument passed by location is accessible from the subprogram without limitation. In other words, the statements in the subprogram have access to the data stored in the storage location, both to read from and write into it. When the subprogram terminates, the pointers are freed, but any changes that were made to the arguments will, of course, be permanent. Thus, there is very little difference in effect between this and a global variable. They both produce side-effects.

When global variables are used, the data names used in the callee must be exactly the same as the data names used in the caller. This is the only method for data sharing available in BASIC and COBOL. An advantage of using location parameters rather than global variables is that the data names used for the parameters are independent of those used for arguments.

Example: An OBERON procedure to swap the values of two variables.
`PROCEDURE swap(VAR a, b: INTEGER); VAR temp : INTEGER; BEGIN temp := a; a : = b; b := temp; END swap; swap(x, y); ... ...`

Example: A C function to swap the values of two variables.
`int x = 3, y = 5; ... ... void swap(int a, int b) { int temp = a; a = b; b = temp; } /* end of void swap(int, int) */ ... ... swap(&x, &y); ... ...`

The above C function and the previous example accomplish the same task: swap two integer variables. The way parameter information is communicated between the caller and the callee is the same: through the addresses of the actual parameters. But still, one (OBERON) is said to use call-by-reference while the other (C) call-by-value. Users of the OBERON utility do not need to know how data is passed between the caller and the callee; all they have to know is what the procedure does. On the other hand, users of the C utility have to know that they must pass an address in order to see the modified value of an actual parameter, which is something done by the compiler in the OBERON case; they have to know the answer to the question 'how'. So, one can say that the user is actually passing an address by value, not the variable itself; she is simulating call-by-reference using call-by-value.

Question

In C, arrays, whatever their size might be, are passed as pointers to their first components. What is the rationale behind this?

Answer

This is a rare case where we win on both time and memory. As the size of an array gets larger, it takes longer to make a copy of it in the activation frame of the callee. But if we use a single address to refer to the entire array, this [copying time] is not affected and with the exception of very small arrays takes shorter time. Also, since all we have to do is to copy this single address value, not the entire array, we keep a single copy of the array in memory, which means saving on storage.

For these reasons, designers of C have made the decision mentioned in the question. This may be thought of as forming an exception to parameter passing mechanism used in C: all parameters in C are passed by value, except arrays. The answer provided in the previous paragraph can be seen as a justification for this irregularity.

A variation on call-by-location is the call-by-result mechanism, where unlike call-by-location, the caller is not required to initialize the passed arguments. In other words, call-by-result implements output parameters whereas call-by-location implements input-output parameters.

Example: A C# example to the call-by-result mechanism.
`class CName { ... public static void Add(int lhs, int rhs, out long result) { result = lhs + rhs; return; } // end of void Add(int, int, long) ... } // end of class CName ... long res; int lhs = 3; ... CName.Add(lhs, 5, res); Console.WriteLine(“Sum of {0} and 5 is {1}”, lhs, res); ...`
The above fragment will produce the following output: Sum of 3 and 5 is 8

Call-by-result parameters are used for returning values from the called subprogram. For this reason, forgoing the modification of result in the above example will be seen as an error by the C# compiler.

Effect of using such a parameter can be obtained by returning parameter value as part of the subprogram result. The above example could have been written as follows:

public static long Add(int lhs, int rhs) { return (lhs + rhs); } // end of long Add(int, int)

Call-by-name

This mechanism defers evaluation of each argument until it is actually needed during execution of the subprogram. Instead of passing a value, or a location of a value, to a parameter, a rule for evaluating the parameter is passed. It is, conceptually at least, a text replacement mechanism. Each time the parameter name appears in the text of the subprogram it is "replaced" (not really, but the effect is as if it were replaced) by the exact text of the argument.

With call-by-name, if the argument is a scalar variable, the effect is the same as with call-by-location; if the argument is an expression, the effect is the same as with call-by-value.

Example: Call-by-name in pseudo-Pascal.
`function SIGMA(name x: real; name k, n: integer):integer; begin real s = 0; for k := 1 to n do s := s + x; SIGMA := s end ... ... SIGMA(A[i], i, 10); SIGMA(A[i] * B[i], i, 10);`
The first call returns the sum of elements in the vector, `A`, while the second call returns the scalar product of two vectors, `A` and `B`.

Functions, called thunks, that return the address of a given parameter can be used to simulate this behavior.

Call-by-need

In some programming languages—most notably, the normal order functional programming languages—an expression is evaluated only when some other expression or function requires its output. This lazy evaluation, or delayed evaluation, of arguments translates into a parameter passing mechanism known as call-by-need. Its effect is very similar to call-by-name since an argument is evaluated only when its value is needed and not immediately at the point of call.

Example: Lazy evalution in Haskell.
`non_strict_func a b = if a == 0 then 0 else a / b non_strict_func 0 (1/0) 0.0 :: Double`

Since Haskell evaluates its arguments lazily, the above code fragment produces the "more natural" result: 0. However, running the C version of the same fragment would have ended with a division-by-zero exception.^[4] We say that a subprogram is non-strict in a lazily evaluated argument, whereas a subprogram is said to be strict in an argument that is evaluated before entry to the subprogram.

Notes

↑ In older programming languages the declarative part is generally required to precede the executable part whereas in others the two types of statements can be given in a mixed order.
↑ Note that the keyword is polymorphism, not inheritance. Because natural use of inheritance is not possible without polymorphism. However, in some programming languages such as C++ one can also talk of inheritance without any reference to polymorphism. In such a case, instance variables can be allocated in other parts of the data section.
↑ Use of output parameters to return results from a subprogram is considered bad programming. If possible, the subprogram should be implemented as a function and the output parameters should be returned as part of the function result.
↑ For fairness it should be stressed that this statement is true for almost all Algol-based programming languages, which include C, Pascal, C++, Java, C#, and so on.

[1] In older programming languages the declarative part is generally required to precede the executable part whereas in others the two types of statements can be given in a mixed order.

[2] Note that the keyword is polymorphism, not inheritance. Because natural use of inheritance is not possible without polymorphism. However, in some programming languages such as C++ one can also talk of inheritance without any reference to polymorphism. In such a case, instance variables can be allocated in other parts of the data section.

[3] Use of output parameters to return results from a subprogram is considered bad programming. If possible, the subprogram should be implemented as a function and the output parameters should be returned as part of the function result.

[4] For fairness it should be stressed that this statement is true for almost all Algol-based programming languages, which include C, Pascal, C++, Java, C#, and so on.

[1]

[2]

[3]

[4]