Polymorphic Data Structures in C/Introduction to C Constructs

← Preface | Pointers →

This book is built on a fundamental understanding of the different constructs that C can utilize to create data structures. In this chapter, we examine those constructs, their uses, and the pitfalls a new C programmer can encounter.

Structures

A structure is a simple C construct that can store multiple types of data simultaneously. In other programming languages (such as Pascal), these are called records or clusters. Any type can be stored in a structure, including another structure. The syntax for defining a structure is simple:

struct tag {
   member declarations
} ;

Where struct is the keyword defining the structure, tag is the name of the structure (this is optional, but very helpful), and the member declarations are normal variable declarations, defining the names of the "fields" inside the struct. Note, however, that defining a structure does not declare any variables. Remember that defining a variable tells the compiler to record the information about the new type for use when the variables are actually declared. In order to declare a new structure, the standard C syntax is used:

struct tag variablename ;

variablename is, of course, the name of the new structure.

Structures are useful when a developer needs to keep track of multiple (but related) data types. For example, say a developer needs to keep track of how many times a particular word occurs in a given string. They would define a structure to hold the search string, as well as the count of the word.

struct word_count {
   char word[MAX_WORDLENGTH] ;
   int frequency ;
} ;

To declare the new structure, they would use:

struct word_count search ;

It is possible, as with all types in C, to declare and define the structure variable in the same statement. The example given above could be written as:

struct word_count {
   char word[MAX_WORDLENGTH] ;
   int frequency ;
} search ;

If the structure is only going to exist once as referenced, then the tag (word_count) may be omitted in this case. This usage is most helpful when defining nested structures. For example:

struct employee_data {
   char name[16] ;
   struct {
      char street[32] ;
      char city[8] ;
      char state[3] ;
      int zipcode ;
   } address ;
   struct {
      int salary ;
      int years_employed ;
   } misc ;
} ;

Operations on Structures

The primary operation performed on structures is member reference. This is achieved using the member reference operator, "." (the period). This is similar to the method of reference used in object-oriented languages, such as C++.

variablename.member

In order to get the frequency of the search term declared above, one would use the code search.frequency. The first character of the word would be search.word[0]. Similarly, with the declaration

struct employee_data d ;

The developer would write d.address.zipcode to access the ZIP code of the employee.

Reading data into a structure must be performed at the member level, meaning that in order to input data to a structure a developer must fully reference the member. For example, to read in the name, address zipcode and salary of the structure d above on a single line, one would use the following code:

scanf( "%s %d %d" , d.name , &d.address.zipcode , &d.misc.salary ) ;

However, it is possible to copy an entire structure without copying its members individually.

struct1 = struct2 ;

Unions

Unions are very similar to structures, because they too can store multiple types of data. However, unions can only hold data for a single member at any given time. For example:

union symbol {
   char name[4] ;
   int value ;
} ;

Defines a type that can hold either an array of four characters or an integer, but not both simultaneously. Unions allocate an amount of memory equal to the largest member, then overlay the reference point for all of the members. The statement above (assuming sizeof ( char ) is 1 and sizeof ( int ) is 2) can be shown graphically like so:

Unions are most practical when members use roughly the same amount of space, or when larger members are the ones used most often. Otherwise, unions are wasteful in terms of memory management and should be avoided in favor of more sophisticated methods.

Unions are not type-aware; that is, unions cannot tell exactly which member is in use. Therefore, it is always helpful to keep track of what type is in use through an external variable. A simple way of doing this would be to define a structure containing two fields: the union, and a type field to specify the type in use in the union.

Enumerated Types

Enumerated types are types that have multiple values (also called tokens) that are listed according to a specified order. Enumerated types are defined using the syntax:

enum tag { tokens } ;

Where enum is the keyword defining the enumerated type, tag is the name of the type (unlike structures, this is not optional), and tokens are the possible values separated by commas. For example, to make a field for the union used above, one could write

enum member_type { name , value } ;

Then, a structure could be built around the union, as so:

struct symbol {
   enum member_type type ;
   union {
      char name[4] ;
      int value ;
   } symbol ;
} ;

Using the declaration struct symbol sym ; the programmer could then make assignments using the code:

sym.type = value ;
sym.symbol.value = 6 ;

Remember that the type and symbol fields are syntactically independent (that is, the names of the fields are not dependent on each other), so it is up to the programmer to ensure their semantic relationship.

Enumerated types can be used for any purpose that needs a fixed set of values. The main problem with enumerated types is that the program may not be able to properly handle formatted input and output in regards to the enumerated type (usually in string format). Therefore, enumerated types are defined in C to hold integer values. The first token in the list is 0, and each token following it are numbered one higher from the last one. It is also possible to give explicit values to a particular token by listing them, followed by an assignment operator (=) and the value.

enum numbers { zero , two = 2 , three , ten = 10 , eleven } ;

Assigning an explicit value to a token causes the numbering to be "re-set" at that number; three and eleven above hold the expected values.

Enumerated types can also be handled using preprocessor directives (discussed next in this chapter), but this method is far less flexible and the error messages caused by faulty directives tend to be more complex to debug.

Preprocessor Directives

Preprocessor directives are shortcuts in C. They allow the developer to perform a "find-and-replace" on the code at the time of compilation. They are usually placed above the function main() or in a global header file (in a multi-file project). They are defined using the syntax:

#directive_type shortcut( parameters ) actual_expression

The parameters are optional, if parameters do not need to be passed to the directive. The actual_expression only needs to be included for certain directives. One of the most commonly-used preprocessor directives is the #include statement. It takes the header file (ether stored in the default libraries directory or the local directory) and inserts it into the program at that location. The code:

#include <stdio.h>

Causes the file stdio.h to be fetched from the C library directory (/usr/include/ on Unix-based systems) and inserted at the top of the program. Note the lack of a semicolon at the end of the definition. Here is an example of a user-created directive that will be used later on in the book:

#define DATA( L )   ( ( L ) -> datapointer )

In this instance, we are using the statement DATA( L ) to replace the normal statement ( ( L ) -> datapointer ). This system can make many complex statements much easier to create, since they only need to be created once. The preprocessor (of the C compiler) performs a "blind search" of all of the programmer's code and replaces code according to the #define rules. Because of this, error messages generated by the compiler due to preprocessor directives may be somewhat cryptic.

Type Definition

C provides a tool to define new types using the typedef statement. Unlike preprocessor directives, a proper C compiler will recognize the names used in the type definition and can perform more complex substitutions. Type definition is generally used to create type aliases for complex constructs such as structures and unions, as well as for arrays of predetermined size. A less common use is to redefine the name of a particular type (though this is generally considered bad form).

For example, consider this structure:

struct student {
   char name[64] ;
   int id ;
} ;

Normally, defining a variable to use the structure declared above would use the following statement:

struct student student_data ;

Instead, using type definition, we can create a type called student to make variable definitions easier.

typedef struct student student ;

Now, instead of typing struct student student_data, the programmer would only need to type student student_data. Type definition can help developers overcome some of C's inherent weaknesses, such as the lack of a string type.

typedef char string[128] ;

The above statement defines the string type to be an array of 127 characters (plus one null byte). Since strings in C are handled as arrays of characters terminated with a null byte, this implementation of strings is acceptable.