Introduction to Programming Languages/Algebraic Data Types
This chapter presents an overview on how some of the most popular abstract data types can be implemented in a more functional way. We begin with the important concept of Algebraic Data Types (ADTs). They are the main mechanism used to implement data structures in functional languages. We then proceed to present functional implementations of many popular data types such as stacks and queues.
Algebraic Data Types
editMost programming languages employ the concept of types. Put simply, a type is a set. For instance, the type int in the Java programming language is the set of numbers whereas the type boolean is made of only two values: true and false.
Being sets, types can be combined with mathematical operations to yield new (and perhaps more complex) types. For instance, the type int boolean (where denotes the Cartesian product) is a set of tuples given by:
Algebraic Data Types are a formalism used to express these compositions of types. In ML, these types are declared with the keyword datatype. For instance, to declare a type weekday containing seven elements, in ML, we write:
datatype weekday = Monday
| Tuesday
| Wednesday
| Thursday
| Friday
| Saturday
| Sunday
The vertical bar is used to denote union. Indeed, an algebraic datatype is often the union of simpler types. Each of these subsets has a unique label. In our example above, we have seven different labels. Each label distinguishes a singleton set. In other words, the cardinality of our weekday type is seven, as this type is the union of seven types which have cardinality one. A Boolean type can be declared as follows:
datatype Boolean = True
| False
And the union of booleans and week days is declared as follows:
datatype BoolOrWeek = Day of Weekday
| Bool of Boolean;
Whereas its Cartesian product can be declared as:
data BoolAndWeek = BoolWeekPair of Boolean * Week;
Algebraic Data Types have two important properties. The first is the fact that they can be pattern matched. Pattern Matching is a simple mechanism that allows a function to be implemented as a series of cases triggered by the tags that appear on the left hand side of the declarations. For example, we could write a function to test if a Weekday is part of the Weekend as follows:
fun is_weekend Sunday = True
| is_weekend Saturday = True
| is_weekend _ = False
Patterns are tried one at a time, from top to bottom, until one of them matches. If none match, the ML runtime throws an exception. Wildcards can be used, as we did above with _. This pattern matches any Weekday. The label that is associated with each subset of a datatype is essential for pattern matching, as it distinguishes one subset from the other.
The other important property of Algebraic Data Types is the fact that they can have recursive definitions. Thanks to that, structures such as binary trees can be defined quite concisely. The example bellow shows a tree of integer keys.
datatype tree = Leaf
| Node of tree * int * tree;
Algebraic Data Types can also be _parametric_, thus depending on other types. This allows us to define more generic structures, which increases modularity and reuse. For instance, a generic tree can be defined as follows:
datatype 'a tree = Leaf
| Node of 'a tree * 'a * 'a tree;
Where the ' sign is used to indicate that 'a can be any type. A tree of weekdays, for instance, has type weekday tree.
The fact that Algebraic Data Types can be defined recursively offers a great deal of flexibility. In the next sessions, we will show examples of how Algebraic Data Types can be used to implement some common data structures.
Disjoin Unions
editAs we have explained before, algebraic Data types represent disjoin unions of simpler types. As a new example, the type bunch, below, describes the union of two types: a polymorphic 'a type, and a polymorphic list type.
datatype 'x bunch = One of 'x
| Group of 'x list;
The function size below, of type 'a bunch -> int illustrates how this type can be used. This function returns the number of elements in an instance of bunch:
fun size (One _) = 1
| size (Group x) = length x;
Disjoint unions can also be implemented in languages that do not support the concept of labeled types. For instance, we could implement the type bunch in C, but, in this case, the compiler does not have enough information to check the correctness of the program:
#include <stdio.h>
#include <stdlib.h>
enum bunch_tag {ONE, GROUP};
typedef union {
int one;
int* group;
} bunch_element_type;
typedef struct {
bunch_element_type bunch_element;
unsigned tag;
} bunch;
bunch* createOne(int i) {
bunch* b = (bunch*)malloc(sizeof(bunch));
b->tag = ONE;
b->bunch_element.one = i;
return b;
}
void printBunch(bunch* b) {
switch(b->tag) {
case ONE:
printf("%d\n", b->bunch_element.one);
break;
case GROUP:
{
int i = 0;
while (b->bunch_element.group[i] != 0) {
printf("%8d", b->bunch_element.group[i]);
i++;
}
}
}
}
int main(int argc, char** argv) {
while (argc > 1) {
int i;
argc--;
i = atoi(argv[argc]);
bunch* b1 = createOne(i);
printBunch(b1);
}
}
A union of types in C is defined by the union construct. In this example, we use a structure of such a union, plus an integer representing the label, i.e., the type identifier. The programmer must be careful to never produce inconsistent code, because the language is not able to enforce the correct binding of labels and types. For instance, the function below is valid, yet wrong, from a logic perspective:
bunch* createOne(int i) {
bunch* b = (bunch*)malloc(sizeof(bunch));
b->tag = GROUP;
b->bunch_element.one = i;
return b;
}
The function is wrong because we are labeling the type with the constant GROUP, which was designed, originally, to denote instances of bunch having several, and not just one integer element. In ML, or in any other language that supports labeled disjoint unions, such mistake would not be possible.