Expert Systems/Printable version

Expert Systems

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.


About This Book


This book is about expert systems, an artificial intelligence (AI) programming technique.

Target Audience


This book is designed for undergraduate and graduate students in computer science, computer engineering, or a related field. As this book is an introduction to the field of expert systems, and to artificial intelligence in general, students do not need to have a background in either of these areas.



Readers of this book are expected to be familiar with computer programming, and know at least one high level language. Students are also expected to have a background in logic, and probability. Some sections may require additional mathematics skills, such as calculus.

Introduction to Expert Systems

Computer Intelligence


AI research has been one of the most frenzied areas of computer science since the inception of the discipline. However, despite the massive effort and money that has gone into research, computers are still unable to perform simple tasks that humans do on a regular basis. Many researchers believed that a comprehensive system of logic would enable computers to successfully complete high-level reasoning tasks that humans can perform. However, logical computer programs require knowledge on which to base decisions. Converting human knowledge into a form that is both meaningful and useful for a computer has proven to be a difficult task.

Expert Systems


Expert systems are an area of AI research that attempts to codify the knowledge and reasoning processes of a human expert into a computer program.

How Expert Systems Work


Expert systems interact with another entity, such as a human user or an application, to discover information about a problem and evaluate possible solutions. The simplest form of an expert system is a question-and-answer system, where a human user is presented with questions. The user answers these questions, and those answers are used to further the reasoning process of the expert system.

Uses of Expert Systems


Expert systems are used for problems where there is incomplete data about a subject, and insufficient theory available for the creation of an algorithmic solution. Some problems, such as medical diagnosis, are not easily solved with an algorithm, but instead require reasoning and induction.

Numerical algorithms are more efficient than expert systems, and are typically more exact. However, many problems cannot be easily modeled mathematically, and in these cases numerical algorithms are not an option. Other AI techniques, such as artificial neural networks, are suited for problems where there is very little theory but a wealth of experimental data.

Expert systems tend to be slow, and often require extensive human interaction. However, well-designed expert systems can be very rigorous, and some expert systems have been shown to outperform the human experts that helped to develop them.

Shortcomings of Expert Systems


Expert systems are based on human knowledge and reasoning patterns. This knowledge must be extracted from a human expert by a specialized knowledge engineer. Knowledge engineers ask the expert questions about their knowledge and reasoning processes, and attempt to translate that knowledge into a computer-readable format known as a knowledge base. Expert systems generated in this way will be flawed if the information received from the expert is flawed, or if it is incorrectly translated by the knowledge engineer.

Expert systems, because they are focused on a single problem area, tend to fail catastrophically if presented with a problem or information that is outside their domain.

Components of Expert Systems

Components of an Expert System


An expert system is typically composed of at least three primary components: the inference engine, the knowledge base, and the user interface. We will introduce these components below.

Knowledge Base


The knowledge base is a collection of rules or other information structures derived from the human expert. Rules are typically structured as If/Then statements of the form:

IF <antecedent> THEN <consequent>

The antecedent is the condition that must be satisfied. When the antecedent is satisfied, the rule is triggered and is said to "fire". The consequent is the action that is performed when the rule fires.
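This antecedent/consequent structure can be sketched in Python. The representation below is illustrative only and is not taken from any particular expert system shell; the rule and fact names are hypothetical.

```python
# A minimal sketch of an IF/THEN rule: the antecedent is a set of
# conditions that must all hold, and the consequent is the action
# (here, a new fact) performed when the rule fires.

class Rule:
    def __init__(self, name, antecedent, consequent):
        self.name = name
        self.antecedent = antecedent  # set of required facts
        self.consequent = consequent  # fact asserted when the rule fires

    def is_satisfied(self, facts):
        """The rule is satisfied when every condition in the antecedent holds."""
        return self.antecedent <= facts

# Hypothetical car-diagnosis rule and facts:
rule = Rule("starter_fault",
            antecedent={("engine", "wont_start"), ("ignition_key", "on")},
            consequent=("starter", "faulty"))

facts = {("engine", "wont_start"), ("ignition_key", "on")}
print(rule.is_satisfied(facts))  # True: the rule would fire
```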



When rules are satisfied by the program, they are added to a list called the agenda. The agenda is an unordered list of all the rules whose antecedents are currently satisfied.

Knowledge bases are typically not ordered, because order tends to play very little role in an expert system. Rules may be placed on the agenda in any order, and they may be fired in any order once they are on the agenda.

Inference Engine


The inference engine is the main processing element of the expert system. It chooses rules from the agenda to fire; if there are no rules on the agenda, it must obtain information from the user in order to add more rules to the agenda. Using the knowledge base, it gathers information from the user by asking questions, applies that information where necessary, and follows the relationships in the knowledge base to provide answers, predictions, and suggestions the way a human expert would.
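The cycle this describes (build the agenda from satisfied rules, fire one, update the facts, repeat) can be sketched in Python. This is a simplified forward-chaining sketch, not the design of any real inference engine; the rule format (antecedent frozenset, consequent fact) and the fact names are assumptions.

```python
# A minimal recognize-act cycle: rules whose antecedents are satisfied
# form the agenda; firing a rule asserts its consequent as a new fact,
# which may in turn satisfy further rules.

def run(rules, facts):
    facts = set(facts)
    fired = set()
    while True:
        # "Recognize": the agenda is every satisfied rule not yet fired.
        agenda = [r for r in rules if r[0] <= facts and r not in fired]
        if not agenda:
            return facts
        # "Act": fire one rule from the agenda.
        rule = agenda[0]
        fired.add(rule)
        facts.add(rule[1])

# Hypothetical rules: (antecedent facts, consequent fact)
rules = [
    (frozenset({"key_on", "engine_wont_start"}), "starter_faulty"),
    (frozenset({"starter_faulty"}), "replace_starter"),
]

final = run(rules, {"key_on", "engine_wont_start"})
print(final)  # includes "starter_faulty" and "replace_starter"
```

Note how the second rule only becomes satisfied after the first one fires: firing rules changes the facts, which changes the agenda.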

User Interface


A user interface is the method by which the expert system interacts with a user. These can be through dialog boxes, command prompts, forms, or other input methods. Some expert systems interact with other computer applications, and do not interact directly with a human. In these cases, the expert system will have an interaction mechanism for transactions with the other application, and will not have a user interface.

Other Components


There are other components that are relatively common in an expert system, but are not strictly needed.

Working Memory


Working memory contains the data that is received from the user during the expert system session. Values in working memory are used to evaluate antecedents in the knowledge base. Consequences from rules in the knowledge base may create new values in working memory, update old values, or remove existing values.

Explanation Mechanism


The method by which an expert system reaches a conclusion may not be obvious to a human user, so many expert systems include a method for explaining the reasoning process that led to the final answer of the system.

Knowledge Engineering



Knowledge engineering is the task of extracting information from a human expert, typically through an interview, and translating that knowledge for use in an expert system.

Interview Techniques



Expert System Shells


An expert system shell simplifies the process of creating a knowledge base. It is the shell that actually processes the information entered by a user, relates it to the concepts contained in the knowledge base, and provides an assessment or solution for a particular problem. The shell thus provides a layer between the user interface and the operating system to manage the input and output of data, and it manipulates the information provided by the user in conjunction with the knowledge base to arrive at a particular conclusion.




CLIPS is a public-domain software tool for building expert systems. The name is an acronym for "C Language Integrated Production System." The syntax and name were inspired by Charles Forgy's OPS. The first versions of CLIPS were developed starting in 1985 at NASA's Johnson Space Center, where development continued until the mid-1990s, when the group's responsibilities ceased to focus on expert system technology.

CLIPS is probably the most widely used expert system tool because it is fast, efficient and free. Although it is now in the public domain, it is still updated and supported by the original author, Gary Riley. CLIPS incorporates a complete object-oriented language "COOL" for writing expert systems. Though it is written in C, its interface more closely resembles that of the programming language LISP. Extensions can be written in C, and CLIPS can be called from C.

Like other expert system languages, CLIPS deals with rules and facts. Facts can make a rule applicable; an applicable rule is activated and may fire, asserting new facts in turn. Facts and rules are created by first defining them, as shown below:

 ; templates must be defined before the facts and rules that use them
 (deftemplate car_problem (slot name) (slot status))

 (deffacts trouble_shooting
     (car_problem (name ignition_key) (status on))
     (car_problem (name engine) (status wont_start))
     (car_problem (name headlights) (status work)))

 (defrule rule1
     (car_problem (name ignition_key) (status on))
     (car_problem (name engine) (status wont_start))
     =>
     (assert (car_problem (name starter) (status faulty))))

Descendants of the CLIPS language include Jess, Haley Eclipse (of Haley Systems -> RuleBurst -> Oracle), FuzzyCLIPS, EHSIS, and others.


About Jess


Jess, a rule engine for the Java platform, is a superset of CLIPS, developed by Ernest Friedman-Hill of Sandia National Labs. It was first written in late 1995.

It provides rule-based programming suitable for automating an expert system, and is often referred to as an expert system shell. In recent years, intelligent agent systems have also been developed that depend on a similar capability.

Rather than the procedural paradigm, in which a single program runs through its logic just once, the declarative paradigm used by Jess continuously applies a collection of rules to a collection of facts by a process called pattern matching. Rules can modify the collection of facts, or they can execute any Java code.

Jess can be used to build Java servlets, EJBs, applets, and full applications that use knowledge in the form of declarative rules to draw conclusions and make inferences. Since many rules may match many inputs, there are few effective general purpose matching algorithms. The Jess rules engine uses the Rete algorithm.

While CLIPS is licensed as open source, Jess is not open source.



Example 1


Code examples:

 ; a semicolon starts a comment

 (bind ?x 100)
 ; x = 100

 (deffunction max (?a ?b)
   (if (> ?a ?b) then ?a else ?b))

 (deffacts myroom
    (furniture chair)
    (furniture table)
    (furniture bed))

 (deftemplate car
    (slot color)
    (slot mileage)
    (slot value))

 (assert (car (color red) (mileage 10000) (value 400)))

Example 2


Sample code:

 (deftemplate blood-donor (slot name) (slot type))
 (deffacts blood-bank ; put names & their types into working memory
       (blood-donor (name "Alice")(type "A"))
       (blood-donor (name "Agatha")(type "A"))
       (blood-donor (name "Bob")(type "B"))
       (blood-donor (name "Barbara")(type "B"))
       (blood-donor (name "Jess")(type "AB"))
       (blood-donor (name "Karen")(type "AB"))
       (blood-donor (name "Onan")(type "O"))
       (blood-donor (name "Osbert")(type "O")))
 (defrule can-give-to-same-type-but-not-self ; handles A > A, B > B, O > O, AB > AB
       (blood-donor (name ?name)(type ?type))
       (blood-donor (name ?name2)(type ?type2 &:(eq ?type ?type2) &:(neq ?name ?name2)))
       =>
       (printout t ?name " can give blood to " ?name2 crlf))
 (defrule O-gives-to-others-but-not-itself ; O to O covered in the rule above
       (blood-donor (name ?name)(type ?type &:(eq ?type "O")))
       (blood-donor (name ?name2)(type ?type2 &:(neq ?type ?type2) &:(neq ?name ?name2)))
       =>
       (printout t ?name " can give blood to " ?name2 crlf))
 (defrule A-or-B-gives-to-AB ; O gives to AB and AB gives to AB already dealt with
       (blood-donor (name ?name)(type ?type &:(or (eq ?type "A") (eq ?type "B"))))
       (blood-donor (name ?name2)(type ?type2 &:(eq ?type2 "AB") &:(neq ?name ?name2)))
       =>
       (printout t ?name " can give blood to " ?name2 crlf))
 ;(watch all)




Prolog is a logic programming language. It is a general purpose language often associated with artificial intelligence and computational linguistics. It has a purely logical subset, called "pure Prolog", as well as a number of extralogical features.

It was created around 1972 by Alain Colmerauer with Philippe Roussel, based on Robert Kowalski's procedural interpretation of Horn clauses. It was motivated in part by the desire to reconcile the use of logic as a declarative knowledge representation language with the procedural representation of knowledge that was popular in North America in the late 1960s and early 1970s.

Much of the modern development of Prolog came from the impetus of the fifth generation computer systems project (FGCS), which developed a variant of Prolog named Kernel Language for its first operating system.

Pure Prolog was originally restricted to the use of a resolution theorem prover with Horn clauses of the form

H :- B1, …, Bn.

The application of the theorem-prover treats such clauses as procedures

to show/solve H, show/solve B1 and … and Bn.

Pure Prolog was soon extended, however, to include negation as failure, in which negative conditions of the form not(Bi) are shown by trying and failing to solve the corresponding positive conditions Bi.

Data types


Prolog's single data type is the term. Terms are either atoms, numbers, variables or compound terms.

Example atoms are: x, blue, 'Some atom', and [].

Numbers can be floats or integers. Many Prolog implementations also provide unbounded integers and rational numbers.

Variables are denoted by a string consisting of letters, numbers and underscore characters, and beginning with an upper-case letter or underscore. Variables closely resemble variables in logic in that they are placeholders for arbitrary terms. A variable can become instantiated via unification. A single underscore (_) denotes an anonymous variable and means "any term".

A compound term has a functor and a number of arguments, which are again terms. The number of arguments is called the term's arity. (Atoms can be regarded as a special case of compound terms of arity zero.)

Examples for terms are f(a,b) and p(x,y,z). Some operators can be used in infix notation. For example, the terms +(a,b) and =(X,Y) can also be written as a+b and X=Y, respectively. Users can also declare arbitrary sequences of symbols as new infix or postfix operators to allow for domain-specific notations. The notation f/n is commonly used to denote a term with functor f and arity n.
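One possible Python sketch of this term representation follows. The encoding (a functor paired with a tuple of arguments) is an assumption for illustration, not a standard Prolog API.

```python
# Sketch: a compound term is a (functor, args) pair; an atom is a
# compound term of arity zero; f/n denotes functor and arity.

def term(functor, *args):
    return (functor, args)

def arity(t):
    return len(t[1])

def indicator(t):
    """The f/n notation for a term."""
    return f"{t[0]}/{arity(t)}"

f_ab = term("f", term("a"), term("b"))   # f(a,b)
plus = term("+", term("a"), term("b"))   # +(a,b), written a+b in Prolog

print(indicator(f_ab))         # f/2
print(indicator(term("x")))    # x/0 -- an atom
```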

Special cases of compound terms:

  • Lists are defined inductively: The atom [] is a list. A compound term with functor . (dot) and arity 2, whose second argument is a list, is itself a list. There exists special syntax for denoting lists: .(A, B) is equivalent to [A|B]. For example, the list .(1, .(2, .(3, []))) can also be written as [1,2,3].
  • Strings: A sequence of characters surrounded by quotes is equivalent to a list of (numeric) character codes, generally in the local character encoding or Unicode if the system supports Unicode.
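Both special cases can be sketched in Python. The tuple encoding of the dot functor below is illustrative only.

```python
# Sketch of Prolog's inductive list encoding: '[]' is the empty list,
# and .(Head, Tail) is a pair whose second argument is a list.

EMPTY = "[]"

def cons(head, tail):
    return (".", head, tail)

def from_python(xs):
    """Build .(x1, .(x2, ... [])) from a Python list."""
    out = EMPTY
    for x in reversed(xs):
        out = cons(x, out)
    return out

print(from_python([1, 2, 3]))
# ('.', 1, ('.', 2, ('.', 3, '[]')))

# A quoted string corresponds to a list of character codes:
print([ord(c) for c in "abc"])  # [97, 98, 99]
```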

Programming in Prolog


Prolog programs describe relations, defined by means of clauses. Pure Prolog is restricted to Horn clauses, a Turing-complete subset of first-order predicate logic. There are two types of clauses: Facts and rules. A rule is of the form

Head :- Body.

and is read as "Head is true if Body is true". A rule's body consists of calls to predicates, which are called the rule's goals. The built-in predicate ,/2 denotes conjunction of goals, and ;/2 denotes disjunction. Conjunctions and disjunctions can only appear in the body, not in the head of a rule. Clauses with empty bodies are called facts. An example of a fact is:

cat(tom).

which is equivalent to the rule:

cat(tom) :- true.

The built-in predicate true/0 is always true.

Given the above fact, one can ask:

is tom a cat?

?- cat(tom).
Yes

what things are cats?

?- cat(X).  
X = tom

Due to the relational nature of many built-in predicates, they can typically be used in several directions. For example, length/2 can be used to determine the length of a list (length(List, L), given a list List) as well as to generate a list skeleton of a given length (length(X, 5)), and also to generate both list skeletons and their lengths together (length(X, L)). Similarly, append/3 can be used both to append two lists (append(ListA, ListB, X) given lists ListA and ListB) as well as to split a given list into parts (append(X, Y, List), given a list List). For this reason, a comparatively small set of library predicates suffices for many Prolog programs. All predicates can also be used to perform unit tests: Queries can be embedded in programs and allow for automatic compile-time regression testing.

As a general purpose language, Prolog also provides various built-in predicates to perform routine activities like input/output, using graphics and otherwise communicating with the operating system. These predicates are not given a relational meaning and are only useful for the side-effects they exhibit on the system. For example, the predicate write/1 displays a term on the screen.



Execution of a Prolog program is initiated by the user's posting of a single goal, called the query. Logically, the Prolog engine tries to find a resolution refutation of the negated query. The resolution method used by Prolog is called SLD resolution. If the negated query can be refuted, it follows that the query, with the appropriate variable bindings in place, is a logical consequence of the program. In that case, all generated variable bindings are reported to the user, and the query is said to have succeeded. Operationally, Prolog's execution strategy can be thought of as a generalization of function calls in other languages, one difference being that multiple clause heads can match a given call. In that case, the system creates a choice-point, unifies the goal with the clause head of the first alternative, and continues with the goals of that first alternative. If any goal fails in the course of executing the program, all variable bindings that were made since the most recent choice-point was created are undone, and execution continues with the next alternative of that choice-point. This execution strategy is called chronological backtracking. For example:

sibling(X, Y)      :- parent_child(Z, X), parent_child(Z, Y).

parent_child(X, Y) :- father_child(X, Y).
parent_child(X, Y) :- mother_child(X, Y).

mother_child(trude, sally).

father_child(tom, sally).
father_child(tom, erica).
father_child(mike, tom).

This results in the following query being evaluated as true:

?- sibling(sally, erica).

This is obtained as follows: Initially, the only matching clause-head for the query sibling(sally, erica) is the first one, so proving the query is equivalent to proving the body of that clause with the appropriate variable bindings in place, i.e., the conjunction (parent_child(Z,sally), parent_child(Z,erica)). The next goal to be proved is the leftmost one of this conjunction, i.e., parent_child(Z, sally). Two clause heads match this goal. The system creates a choice-point and tries the first alternative, whose body is father_child(Z, sally). This goal can be proved using the fact father_child(tom, sally), so the binding Z = tom is generated, and the next goal to be proved is the second part of the above conjunction: parent_child(tom, erica). Again, this can be proved by the corresponding fact. Since all goals could be proved, the query succeeds. Since the query contained no variables, no bindings are reported to the user. A query with variables, like:

?- father_child(Father, Child).

enumerates all valid answers on backtracking.

Notice that with the code as stated above, the query sibling(sally, sally) also succeeds (X = Y). One would insert additional goals to describe the relevant restrictions, if desired.
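As an analogy only, the same database can be written in Python, where nested generators play the role of choice-points: each loop is a choice-point, and exhausting an inner loop "backtracks" to the next alternative of the outer one. Python performs no unification, so the query arguments are checked explicitly.

```python
# The family database above, re-expressed as Python data.
father_child = [("tom", "sally"), ("tom", "erica"), ("mike", "tom")]
mother_child = [("trude", "sally")]

def parent_child():
    # Two clauses for parent_child/2: try fathers first, then mothers.
    yield from father_child
    yield from mother_child

def sibling(x, y):
    # sibling(X, Y) :- parent_child(Z, X), parent_child(Z, Y).
    for z1, a in parent_child():
        if a != x:
            continue  # goal fails: backtrack to next alternative
        for z2, b in parent_child():
            if z1 == z2 and b == y:
                return True  # all goals proved
    return False

print(sibling("sally", "erica"))  # True
print(sibling("sally", "sally"))  # True: X = Y also succeeds
```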

Loops and recursion


Iterative algorithms can be implemented by means of recursive predicates. Prolog systems typically implement a well-known optimization technique called tail call optimization (TCO) for deterministic predicates exhibiting tail recursion or, more generally, tail calls: A clause's stack frame is discarded before performing a call in a tail position. Therefore, deterministic tail-recursive predicates are executed with constant stack space, like loops in other languages.
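Python itself performs no tail-call optimization, so the effect can be illustrated by rewriting a tail-recursive definition as the loop a TCO-capable system would effectively execute. This is a sketch of the idea, not of Prolog semantics.

```python
def length_recursive(xs, acc=0):
    # Tail-recursive: the recursive call is the last action of the clause.
    if not xs:
        return acc
    return length_recursive(xs[1:], acc + 1)

def length_iterative(xs):
    # The constant-stack loop that TCO effectively produces.
    acc = 0
    while xs:
        xs, acc = xs[1:], acc + 1
    return acc

print(length_recursive([1, 2, 3]))  # 3
print(length_iterative([1, 2, 3]))  # 3
```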



The built-in Prolog predicate \+/1 provides negation as failure, which allows for non-monotonic reasoning. The goal "\+ illegal(X)" in the rule

legal(X) :- \+ illegal(X).

is evaluated as follows: Prolog attempts to prove illegal(X). If a proof for that goal can be found, the original goal (i.e., \+ illegal(X)) fails. If no proof can be found, the original goal succeeds. Therefore, the \+/1 prefix operator is called the "not provable" operator, since the query "?- \+ Goal" succeeds if Goal is not provable. This kind of negation is sound if its argument is ground. Soundness is lost if the argument contains variables. In particular, the query "?- legal(X)." can now not be used to enumerate all things that are legal.
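The closed-world flavour of this can be sketched in Python: something is legal exactly when no proof of its illegality can be found among the known facts. The fact set below is hypothetical.

```python
# Negation as failure under a closed world: absence of proof counts
# as falsity.
illegal_facts = {"theft", "fraud"}

def illegal(x):
    return x in illegal_facts

def legal(x):
    # legal(X) :- \+ illegal(X).  -- succeeds when illegal(X) fails
    return not illegal(x)

print(legal("parking"))  # True: no proof of illegality exists
print(legal("theft"))    # False
```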

Operational considerations


Under a declarative reading, the order of rules, and of goals within rules, is irrelevant since logical disjunction and conjunction are commutative. Procedurally, however, it is often important to take into account Prolog's execution strategy, either for efficiency reasons, or due to the semantics of impure built-in predicates for which the order of evaluation matters.

DCGs and parsing


There is a special notation called definite clause grammars (DCGs). A rule defined via -->/2 instead of :-/2 is expanded by the preprocessor (expand_term/2, a facility analogous to macros in other languages) according to a few straight-forward rewriting rules, resulting in ordinary Prolog clauses. Most notably, the rewriting equips the predicate with two additional arguments, which can be used to implicitly thread state around, analogous to monads in other languages. DCGs are often used to write parsers or list generators, as they also provide a convenient interface to list differences.

Parser example


A larger example will show the potential of using Prolog in parsing.

Given the sentence expressed in BNF:

<sentence>    ::=  <stat_part>
<stat_part>   ::=  <statement> | <stat_part> <statement>
<statement>   ::=  <id> = <expression> ;
<expression>  ::=  <operand> | <expression> <operator> <operand>
<operand>     ::=  <id> | <digit>
<id>          ::=  a | b
<digit>       ::=  0..9
<operator>    ::=  + | - | *

This can be written in Prolog using DCGs, corresponding to a predictive parser with one token look-ahead:

sentence(S)                --> statement(S0), sentence_r(S0, S).
sentence_r(S, S)           --> [].
sentence_r(S0, seq(S0, S)) --> statement(S1), sentence_r(S1, S).

statement(assign(Id,E)) --> id(Id), [=], expression(E), [;].

expression(E) --> term(T), expression_r(T, E).
expression_r(E, E)  --> [].
expression_r(E0, E) --> [+], term(T), expression_r(plus(E0,T), E).
expression_r(E0, E) --> [-], term(T), expression_r(minus(E0, T), E).

term(T)       --> factor(F), term_r(F, T).
term_r(T, T)  --> [].
term_r(T0, T) --> [*], factor(F), term_r(times(T0, F), T).

factor(id(ID))   --> id(ID).
factor(digit(D)) --> [D], { (number(D) ; var(D)), between(0, 9, D)}.

id(a) --> [a].
id(b) --> [b].

This code defines a relation between a sentence (given as a list of tokens) and its abstract syntax tree (AST). Example query:

?- phrase(sentence(AST), [a,=,1,+,3,*,b,;,b,=,0,;]).
AST = seq(assign(a, plus(digit(1), times(digit(3), id(b)))), assign(b, digit(0))) ;

The AST is represented using Prolog terms and can be used to apply optimizations, to compile such expressions to machine-code, or to directly interpret such statements. As is typical for the relational nature of predicates, these definitions can be used both to parse and generate sentences, and also to check whether a given tree corresponds to a given list of tokens. Using iterative deepening for fair enumeration, each arbitrary but fixed sentence and its corresponding AST will be generated eventually:

?- length(Tokens, _), phrase(sentence(AST), Tokens).
 Tokens = [a, =, a, (;)], AST = assign(a, id(a)) ;
 Tokens = [a, =, b, (;)], AST = assign(a, id(b)) 

Higher-order programming


Since arbitrary Prolog goals can be constructed and evaluated at run-time, it is easy to write higher-order predicates like maplist/2, which applies an arbitrary predicate to each member of a given list, and sublist/3, which filters elements that satisfy a given predicate, also allowing for currying.

To convert solutions from temporal representation (answer substitutions on backtracking) to spatial representation (terms), Prolog has various all-solutions predicates that collect all answer substitutions of a given query in a list. This can be used for list comprehension. For example, perfect numbers equal the sum of their proper divisors:

perfect(N) :-
    between(1, inf, N), U is N // 2,
    findall(D, (between(1,U,D), N mod D =:= 0), Ds),
    sumlist(Ds, N).

This can be used to enumerate perfect numbers, and also to check whether a number is perfect.
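The same collection step can be written in Python, with a list comprehension playing the role of findall/3:

```python
def perfect(n):
    # Gather all proper divisors (up to n // 2), then compare their sum.
    divisors = [d for d in range(1, n // 2 + 1) if n % d == 0]
    return sum(divisors) == n

print([n for n in range(1, 500) if perfect(n)])  # [6, 28, 496]
```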

Meta-interpreters and reflection


Prolog is a homoiconic language and provides many facilities for reflection. Its implicit execution strategy makes it possible to write a concise meta-circular evaluator (also called meta-interpreter) for pure Prolog code. Since Prolog programs are themselves sequences of Prolog terms (:-/2 is an infix operator) that are easily read and inspected using built-in mechanisms (like read/1), it is easy to write customized interpreters that augment Prolog with domain-specific features.

Implementation techniques


For efficiency, Prolog code is typically compiled to abstract machine code, often influenced by the register-based Warren Abstract Machine (WAM) instruction set. Some implementations employ abstract interpretation to derive type and mode information of predicates at compile time, or compile to real machine code for high performance. Devising efficient implementation techniques for Prolog code is a field of active research in the logic programming community, and various other execution techniques are employed in some implementations. These include clause binarization and stack-based virtual machines.



Here follow some example programs written in Prolog.



The QuickSort sorting algorithm, relating a list to its sorted version:

partition([], _, [], []).
partition([X|Xs], Pivot, Smalls, Bigs) :-
    (   X @< Pivot ->
        Smalls = [X|Rest],
        partition(Xs, Pivot, Rest, Bigs)
    ;   Bigs = [X|Rest],
        partition(Xs, Pivot, Smalls, Rest)
    ).

quicksort([])     --> [].
quicksort([X|Xs]) --> 
    { partition(Xs, X, Smaller, Bigger) },
    quicksort(Smaller), [X], quicksort(Bigger).
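For comparison, here is a sketch of the same algorithm in plain Python, without the DCG-style difference lists; partition/4 becomes a helper that splits the remaining elements around the pivot.

```python
def partition(xs, pivot):
    # Split xs into elements smaller than the pivot and the rest.
    smalls, bigs = [], []
    for x in xs:
        (smalls if x < pivot else bigs).append(x)
    return smalls, bigs

def quicksort(xs):
    if not xs:
        return []
    pivot, rest = xs[0], xs[1:]
    smalls, bigs = partition(rest, pivot)
    return quicksort(smalls) + [pivot] + quicksort(bigs)

print(quicksort([3, 1, 2]))  # [1, 2, 3]
```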

Turing machine


Turing completeness of Prolog can be shown by using it to simulate a Turing machine:

turing(Tape0, Tape) :-
    perform(q0, [], Ls, Tape0, Rs),
    reverse(Ls, Ls1),
    append(Ls1, Rs, Tape).

perform(qf, Ls, Ls, Rs, Rs) :- !.
perform(Q0, Ls0, Ls, Rs0, Rs) :-
    symbol(Rs0, Sym, RsRest),
    once(rule(Q0, Sym, Q1, NewSym, Action)),
    action(Action, Ls0, Ls1, [NewSym|RsRest], Rs1),
    perform(Q1, Ls1, Ls, Rs1, Rs).

symbol([], b, []).
symbol([Sym|Rs], Sym, Rs).

action(left, Ls0, Ls, Rs0, Rs) :- left(Ls0, Ls, Rs0, Rs).
action(stay, Ls, Ls, Rs, Rs).
action(right, Ls0, [Sym|Ls0], [Sym|Rs], Rs).

left([], [], Rs0, [b|Rs0]).
left([L|Ls], Ls, Rs, [L|Rs]).

A simple example Turing machine is specified by the facts:

rule(q0, 1, q0, 1, right).
rule(q0, b, qf, 1, stay).

This machine performs incrementation by one of a number in unary encoding: It loops over any number of "1" cells and appends an additional "1" at the end. Example query and result:

?- turing([1,1,1], Ts).
Ts = [1, 1, 1, 1] ;
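The same unary-increment machine can be sketched in Python; the rule table below mirrors the two rule/5 facts, and blank cells are read as "b" just as symbol/3 does above.

```python
# (state, symbol) -> (new state, written symbol, head action)
RULES = {
    ("q0", 1):   ("q0", 1, "right"),
    ("q0", "b"): ("qf", 1, "stay"),
}

def turing(tape):
    left, right = [], list(tape)
    state = "q0"
    while state != "qf":
        sym = right[0] if right else "b"        # blank beyond the tape
        state, new, action = RULES[(state, sym)]
        right = [new] + right[1:]               # write the new symbol
        if action == "right":
            left.append(right.pop(0))
        elif action == "left":
            right.insert(0, left.pop() if left else "b")
    return left + right

print(turing([1, 1, 1]))  # [1, 1, 1, 1]
```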

This example illustrates how any computation can be expressed declaratively as a sequence of state transitions, implemented in Prolog as a relation between successive states of interest. As another example for this, an optimizing compiler with three optimization passes could be implemented as a relation between an initial program and its optimized form:

program_optimized(Prog0, Prog) :-
    optimization_pass_1(Prog0, Prog1),
    optimization_pass_2(Prog1, Prog2),
    optimization_pass_3(Prog2, Prog).

or equivalently using DCG notation:

program_optimized --> optimization_pass_1, optimization_pass_2, optimization_pass_3.

Dynamic programming


The following Prolog program uses dynamic programming to find the longest common subsequence of two lists in polynomial time. The clause database is used for memoization:

:- dynamic stored/1.

memo(Goal) :- ( stored(Goal) -> true ; Goal, assertz(stored(Goal)) ).

lcs([], _, []) :- !.
lcs(_, [], []) :- !.
lcs([X|Xs], [X|Ys], [X|Ls]) :- !, memo(lcs(Xs, Ys, Ls)).
lcs([X|Xs], [Y|Ys], Ls) :-
    memo(lcs([X|Xs], Ys, Ls1)), memo(lcs(Xs, [Y|Ys], Ls2)),
    length(Ls1, L1), length(Ls2, L2),
    (   L1 >= L2 -> Ls = Ls1 ; Ls = Ls2 ).

Example query:

?- lcs([x,m,j,y,a,u,z], [m,z,j,a,w,x,u], Ls).
Ls = [m, j, a, u]
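A Python sketch of the same computation, where functools.lru_cache plays the role that the asserted stored/1 facts play in the Prolog version:

```python
from functools import lru_cache

def lcs(xs, ys):
    # Memoized recursion over index pairs instead of list suffixes.
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(xs) or j == len(ys):
            return ()
        if xs[i] == ys[j]:
            return (xs[i],) + go(i + 1, j + 1)
        a, b = go(i, j + 1), go(i + 1, j)
        return a if len(a) >= len(b) else b
    return list(go(0, 0))

print(lcs("xmjyauz", "mzjawxu"))  # ['m', 'j', 'a', 'u']
```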

Some Prolog systems, like BProlog and XSB, implement an extension called tabling, which frees the user from manually storing intermediate results.


  • Constraint logic programming is important for many Prolog applications in industrial settings, like time tabling and other scheduling tasks. Most Prolog systems ship with at least one constraint solver for finite domains, and often also with solvers for other domains like rational numbers.
  • HiLog and λProlog extend Prolog with higher-order programming features.
  • F-logic extends Prolog with frames/objects for knowledge representation.
  • OW Prolog was created to address Prolog's lack of graphics and interface facilities.
  • Logtalk is an open source object-oriented extension to the Prolog programming language. Integrating logic programming with object-oriented and event-driven programming, it is compatible with most Prolog compilers. It supports both prototypes and classes. In addition, it supports component-based programming through category-based composition.
  • Prolog-MPI is an open-source SWI-Prolog extension for distributed computing over the Message Passing Interface.
  • Visual Prolog, also formerly known as PDC Prolog and Turbo Prolog. Visual Prolog is a strongly-typed object-oriented dialect of Prolog, which is considerably different from standard Prolog. As Turbo Prolog it was marketed by Borland, but it is now developed and marketed by the Danish firm PDC (Prolog Development Center) that originally produced it.
  • Datalog is actually a subset of Prolog. It is limited to relationships that may be stratified and does not allow compound terms. In contrast to Prolog, Datalog is not Turing-complete.
  • In some ways Prolog is a subset of Planner, e.g., see Kowalski's early history of logic programming. The ideas in Planner were later further developed in the Scientific Community Metaphor.

Frameworks also exist which can provide a bridge between Prolog and the Java programming language:

  • InterProlog, a programming library bridge between Java and Prolog, implementing bi-directional predicate/method calling between both languages. Java objects can be mapped into Prolog terms and vice-versa. Allows the development of GUIs and other functionality in Java while leaving logic processing in the Prolog layer. Supports XSB, SWI-Prolog and YAP.
  • Prova provides native syntax integration with Java, agent messaging and reaction rules. Prova positions itself as a rule-based scripting (RBS) system for middleware. The language breaks new ground in combining imperative and declarative programming.


  • Michael A. Covington, Donald Nute, Andre Vellino, Prolog Programming in Depth, 1996, ISBN 0-13-138645-X.
  • Michael A. Covington, Natural Language Processing for Prolog Programmers, 1994, ISBN 0-13-629213-5.
  • Leon Sterling and Ehud Shapiro, The Art of Prolog: Advanced Programming Techniques, 1994, ISBN 0-262-19338-8.
  • Ivan Bratko, PROLOG Programming for Artificial Intelligence, 2000, ISBN 0-201-40375-7.
  • Robert Kowalski, The Early Years of Logic Programming, CACM January 1988.
  • ISO/IEC 13211: Information technology — Programming languages — Prolog. International Organization for Standardization, Geneva.
  • Alain Colmerauer and Philippe Roussel, The birth of Prolog, in The second ACM SIGPLAN conference on History of programming languages, p. 37-52, 1992.
  • Richard O'Keefe, The Craft of Prolog, ISBN 0-262-15039-5.
  • Patrick Blackburn, Johan Bos, Kristina Striegnitz, Learn Prolog Now!, 2006, ISBN 1-904987-17-6.


What is Knowledge?





Rules are structured as "if"/"then" rules as follows:

IF X THEN Y

Here, X is known as the antecedent and Y is known as the consequent. The antecedent represents a series of requirements; if those requirements are satisfied, the actions in the consequent are performed.
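
To make this concrete, a rule can be represented as a pair of fact lists. The sketch below is illustrative Python, not the syntax of any particular expert-system shell; the Rule class and the animal facts are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: list  # requirements that must all hold (the "if" part)
    consequent: list  # facts asserted when the rule fires (the "then" part)

    def matches(self, facts):
        # the rule is satisfied only if every requirement is a known fact
        return all(req in facts for req in self.antecedent)

# IF "has fur" AND "says woof" THEN "is a dog"
rule = Rule(antecedent=["has fur", "says woof"], consequent=["is a dog"])

facts = {"has fur", "says woof"}
if rule.matches(facts):
    facts.update(rule.consequent)  # perform the consequent's actions
```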


Productions are sets of situation-action pairs. They are used to reach conclusions and to make decisions in problem-solving situations: whenever a situation is recognized, the corresponding action is taken, giving an algorithmic way of making decisions.

The Agenda



The agenda is a queue that contains all the rules whose conditions have been met and are capable of firing. The inference engine will examine the rules on the agenda, and determine which rules will be acted upon.



When two or more rules are on the agenda (that is, when several rules' conditions have been met and they are ready to fire), the inference engine needs to decide which rules to fire, and in which order.

Rule Order


There are many different ways that the inference engine can decide which rules to fire on the agenda, and in which order.

First In, First Out


First in, first out (FIFO) means that rules are added to the agenda in the order in which they appear in the knowledge base, and the inference engine fires rules in the same order in which they appear on the agenda.



Precedence


In a precedence system, each rule is given a precedence value, and rules with higher precedence are fired first. The problem with these systems is that the programmer is tempted to assign precedence values in such a way as to duplicate a procedural program.

Manually scripting the exact order in which rules fire is generally a bad idea. Expert systems are far more robust when the knowledge engineer instead collects rules that hold true no matter which order they are fired in.

Number of Antecedents


Rules which have more antecedents have more requirements to satisfy, so they are likely to be more accurate and more likely to lead to a goal quickly. This is the logic behind systems that fire rules based on the number of antecedents.

Forward Chaining



In a forward chaining system, the inference engine is presented with some starting information, and works forwards through its rule base to reach some conclusion. ...
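
A forward chainer can be sketched in a few lines of Python. The rules and facts below are invented for illustration, and real inference engines add conflict resolution and refraction on top of this basic FIFO loop.

```python
from collections import deque

# (antecedent, consequent) pairs; the animal facts are invented examples
rules = [
    ({"croaks", "eats flies"}, "is a frog"),
    ({"is a frog"}, "is green"),
]

def forward_chain(facts, rules):
    """Fire satisfied rules in FIFO order until no new facts appear."""
    facts = set(facts)
    agenda = deque(r for r in rules if r[0] <= facts and r[1] not in facts)
    while agenda:
        antecedent, consequent = agenda.popleft()
        if consequent not in facts:
            facts.add(consequent)
            # newly satisfied rules join the back of the queue (FIFO)
            agenda.extend(r for r in rules
                          if r[0] <= facts and r[1] not in facts)
    return facts
```

Starting from {"croaks", "eats flies"}, the first rule fires and asserts "is a frog", which in turn satisfies the second rule.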


Backward Chaining


Illustration of backward chaining.

In a backward chaining system, the inference engine selects a goal, and then attempts to find evidence to support or disprove that goal. If a goal is disproved, a new goal is selected and new evidence is searched for to support it.
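
The same kind of rule base can be queried goal-first. This is a minimal sketch with invented rule contents; it has no variables or unification, which real backward chainers such as Prolog provide.

```python
# consequent -> list of antecedents; contents invented for illustration
rules = {
    "is a frog": ["croaks", "eats flies"],
    "is green": ["is a frog"],
}

def prove(goal, facts, rules):
    """True if the goal is a known fact or derivable via some rule."""
    if goal in facts:
        return True
    conditions = rules.get(goal)
    if conditions is None:
        return False  # no fact and no rule concludes this goal
    # the goal holds only if every condition of its rule can be proved
    return all(prove(c, facts, rules) for c in conditions)
```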




Refraction is the act of preventing a rule from firing multiple times in succession. Without refraction, the same rules can be added to the agenda at each loop iteration, because the same conditions remain satisfied. To prevent a single rule from firing repeatedly, a refraction condition is implemented. There are several types of refraction rules.

Refraction Rules


Once Only


Some systems use a refraction condition that each rule may only be fired once. Once a rule has been fired, it may never be fired again until the system is reset.

Intermittent Rules


In an intermittent rule refraction condition, a rule may never fire twice consecutively, and may not be added to the agenda again until a different rule has fired.

Changing Antecedent


In a changing antecedent refraction condition, a rule may only fire if the attributes in the antecedent have changed.
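
As a sketch, the once-only condition can be implemented by keeping a set of already-fired rules alongside the agenda. The class and method names here are invented for illustration.

```python
class OnceOnlyAgenda:
    """Agenda with a once-only refraction condition (illustrative sketch)."""

    def __init__(self):
        self.queue = []
        self.fired = set()

    def add(self, rule_name):
        # refraction: a rule that has already fired is never re-queued
        if rule_name not in self.fired and rule_name not in self.queue:
            self.queue.append(rule_name)

    def fire_next(self):
        if not self.queue:
            return None
        rule_name = self.queue.pop(0)
        self.fired.add(rule_name)
        return rule_name

    def reset(self):
        # resetting the system lifts the refraction condition
        self.fired.clear()
```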

Dempster-Shafer Theory

About Dempster-Shafer Theory


The Dempster-Shafer theory is a mathematical theory of evidence[1] based on belief functions and plausible reasoning, which is used to combine separate pieces of information (evidence) to calculate the probability of an event. The theory was developed by Arthur P. Dempster and Glenn Shafer.

Consider two possible gambles


The first gamble is that we bet on a head turning up when we toss a coin that is known to be fair. Now consider the second gamble, in which we bet on the outcome of a fight between the world's greatest boxer and the world's greatest wrestler. Assume we are fairly ignorant about martial arts and would have great difficulty making a choice of who to bet on.

Many people would feel more unsure about taking the second gamble, in which the probabilities are unknown, rather than the first gamble, in which the probabilities are easily seen to be one half for each outcome. Dempster-Shafer theory allows one to consider the confidence one has in the probabilities assigned to the various outcomes.



Let X be the universal set: the set of all states under consideration. The power set, 2^X, is the set of all possible subsets of X, including the empty set ∅. For example, if:

X = {a, b}

then the power set is:

2^X = {∅, {a}, {b}, X}
The elements of the power set can be taken to represent propositions that one might be interested in, by containing all and only the states in which this proposition is true.

The theory of evidence assigns a belief mass to each subset of the power set. Formally, a function m : 2^X → [0, 1] is called a basic belief assignment (BBA) when it verifies two axioms. First, the mass of the empty set is zero:

m(∅) = 0

Second, the masses of the remaining members of the power set add up to a total of 1:

Σ_{A ∈ 2^X} m(A) = 1
The mass m(A) of a given member of the power set, A, expresses the proportion of all relevant and available evidence that supports the claim that the actual state belongs to A but to no particular subset of A. The value of m(A) pertains only to the set A and makes no additional claims about any subsets of A, each of which has, by definition, its own mass.

From the mass assignments, the upper and lower bounds of a probability interval can be defined. This interval contains the precise probability of a set of interest (in the classical sense), and is bounded by two non-additive continuous measures called belief (or support) and plausibility:

bel(A) ≤ P(A) ≤ pl(A)

The belief bel(A) for a set A is defined as the sum of all the masses of (not necessarily proper) subsets of the set of interest:

bel(A) = Σ_{B ⊆ A} m(B)

The plausibility pl(A) is the sum of all the masses of the sets B that intersect the set of interest A:

pl(A) = Σ_{B : B ∩ A ≠ ∅} m(B)

The two measures are related to each other as follows:

pl(A) = 1 − bel(A̅)

where A̅ denotes the complement of A.
It follows from the above that you need know but one of the three (mass, belief, or plausibility) to deduce the other two, though you may need to know the values for many sets in order to calculate one of the other values for a particular set.
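
The two definitions can be checked numerically. In this sketch (plain Python, with focal sets represented as frozensets), the mass assignment is the dead/alive cat example discussed later in this chapter.

```python
def bel(A, m):
    """Belief: sum of the masses of all subsets of A."""
    return sum(mass for B, mass in m.items() if B <= A)

def pl(A, m):
    """Plausibility: sum of the masses of all sets intersecting A."""
    return sum(mass for B, mass in m.items() if B & A)

# mass assignment for the cat example:
# m({alive}) = 0.2, m({dead}) = 0.5, m({alive, dead}) = 0.3
m = {
    frozenset({"alive"}): 0.2,
    frozenset({"dead"}): 0.5,
    frozenset({"alive", "dead"}): 0.3,
}
```

Here bel({dead}) = 0.5 and pl({dead}) = 0.5 + 0.3 = 0.8, and the relation pl(A) = 1 − bel(A̅) holds as required.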

Dempster's rule of combination


The problem we now face is how to combine two independent sets of mass assignments. The original combination rule, known as Dempster's rule of combination, is a generalization of Bayes' rule. This rule strongly emphasises the agreement between multiple sources and ignores all the conflicting evidence through a normalization factor. Use of that rule has come under serious criticism when significant conflict in the information is encountered.

Specifically, the combination (called the joint mass) is calculated from the two sets of masses m1 and m2 in the following manner:

m1,2(∅) = 0

m1,2(A) = (1 / (1 − K)) Σ_{B ∩ C = A ≠ ∅} m1(B) m2(C)

where

K = Σ_{B ∩ C = ∅} m1(B) m2(C)

K is a measure of the amount of conflict between the two mass sets. The normalization factor, 1 − K, has the effect of completely ignoring conflict and attributing any mass associated with conflict to the null set. Consequently, this operation yields counterintuitive results in the face of significant conflict in certain contexts.
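
Dempster's rule can be transcribed directly into Python (focal sets as frozensets; the two mass functions in the usage example are invented):

```python
def combine(m1, m2):
    """Dempster's rule of combination over frozenset-keyed mass dicts."""
    # K: total mass of conflict (pairs of focal sets with empty intersection)
    K = sum(a * b
            for B, a in m1.items()
            for C, b in m2.items()
            if not (B & C))
    joint = {}
    for B, a in m1.items():
        for C, b in m2.items():
            A = B & C
            if A:  # the empty set keeps zero mass
                joint[A] = joint.get(A, 0.0) + a * b
    # renormalize by 1 - K, redistributing the conflicting mass
    return {A: v / (1.0 - K) for A, v in joint.items()}

# two invented, partially conflicting mass assignments
m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
m12 = combine(m1, m2)
```

In this example the conflict is K = 0.6 × 0.5 = 0.3, and the joint masses are renormalized by 1 − K = 0.7 so that they again sum to 1.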



Dempster-Shafer theory is a generalization of the Bayesian theory of subjective probability; whereas the latter requires probabilities for each question of interest, belief functions base degrees of belief (or confidence, or trust) for one question on the probabilities for a related question. These degrees of belief may or may not have the mathematical properties of probabilities; how much they differ depends on how closely the two questions are related.[2] Put another way, it is a way of representing epistemic plausibilities but it can yield answers which contradict those arrived at using probability theory.

Often used as a method of sensor fusion, Dempster-Shafer theory is based on two ideas: obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule[3] for combining such degrees of belief when they are based on independent items of evidence. In essence, the degree of belief in a proposition depends primarily upon the number of answers (to the related questions) containing the proposition, and the subjective probability of each answer. Also contributing are the rules of combination that reflect general assumptions about the data.

In this formalism a degree of belief (also referred to as a mass) is represented as a belief function rather than a Bayesian probability distribution. Probability values are assigned to sets of possibilities rather than single events: their appeal rests on the fact they naturally encode evidence in favor of propositions.

Dempster-Shafer theory assigns its masses to all of the subsets of the entities that comprise a system. Suppose for example that a system has five members, that is to say five independent states, exactly one of which is actual. If the original set is called S, then the set of all subsets (the power set) is called 2^S. Since you can express each possible subset as a binary vector (describing whether any particular member is present or not by writing a "1" or a "0" for that member's slot), it can be seen that there are 2^5 = 32 possible subsets (2^n in general), ranging from the empty subset (0, 0, 0, 0, 0) to the "everything" subset (1, 1, 1, 1, 1). The empty subset represents a contradiction, which is not true in any state, and is thus assigned a mass of zero; the remaining masses are normalised so that their total is 1. The "everything" subset is often labelled "unknown" as it represents the state where all elements are present, in the sense that you cannot tell which is actual.

Belief and plausibility


Shafer's framework allows for belief about propositions to be represented as intervals, bounded by two values, belief (or support) and plausibility:

bel(A) ≤ pl(A)
Belief in a hypothesis is constituted by the sum of the masses of all sets enclosed by it (i.e. the sum of the masses of all subsets of the hypothesis). It is the amount of belief that directly supports the hypothesis at least in part, and it forms a lower bound. Plausibility is 1 minus the sum of the masses of all sets whose intersection with the hypothesis is empty (equivalently, it is the sum of the masses of all sets whose intersection with the hypothesis is non-empty). It forms an upper bound: the hypothesis "could possibly be true" up to that value, because there is only so much evidence that contradicts it.

For example, suppose we have a belief of 0.5 and a plausibility of 0.8 for a proposition, say "the cat in the box is dead." This means that we have evidence that allows us to state strongly that the proposition is true with a confidence of 0.5. However, the evidence contrary to that hypothesis (i.e. "the cat is alive") only has a confidence of 0.2. The remaining mass of 0.3 (the gap between the 0.5 supporting evidence on the one hand, and the 0.2 contrary evidence on the other) is "indeterminate," meaning that the cat could either be dead or alive. This interval represents the level of uncertainty based on the evidence in your system.

Hypothesis                      Mass   Belief   Plausibility
Null (neither alive nor dead)   0      0        0
Alive                           0.2    0.2      0.5
Dead                            0.5    0.5      0.8
Either (alive or dead)          0.3    1.0      1.0

The null hypothesis is set to zero by definition (it corresponds to "no solution"). The orthogonal hypotheses "Alive" and "Dead" have probabilities of 0.2 and 0.5, respectively. This could correspond to "Live/Dead Cat Detector" signals, which have respective reliabilities of 0.2 and 0.5. Finally, the all-encompassing "Either" hypothesis (which simply acknowledges there is a cat in the box) picks up the slack so that the sum of the masses is 1. The support for the "Alive" and "Dead" hypotheses matches their corresponding masses because they have no subsets; support for "Either" consists of the sum of all three masses (Either, Alive, and Dead) because "Alive" and "Dead" are each subsets of "Either". The "Alive" plausibility is m(Alive)+m(Either), since only "Either" intersects "Alive". Likewise, the "Dead" plausibility is m(Dead)+m(Either). Finally, the "Either" plausibility sums m(Alive)+m(Dead)+m(Either). The universal hypothesis ("Either") will always have 100% support and plausibility —it acts as a checksum of sorts.

Here is a somewhat more elaborate example where the behaviour of support and plausibility begins to emerge. Using a variety of detector modes, we are looking at a faraway object that can be only one of three colours (red, white, or blue):

Hypothesis      Mass   Belief   Plausibility
Null            0      0        0
Red             0.35   0.35     0.56
White           0.25   0.25     0.45
Blue            0.15   0.15     0.34
Red or white    0.06   0.66     0.85
Red or blue     0.05   0.55     0.75
White or blue   0.04   0.44     0.65
Any             0.1    1.0      1.0

These are, however, rather poor examples: events of this kind would not be modeled as disjoint sets in a probability space. Rather, the event "red or blue" would be considered the union of the events "red" and "blue", so that (see the axioms of probability theory) p(red or white) ≥ p(white) = 0.25 and p(any) = 1, and only the three disjoint events "Red", "White", and "Blue" would need to add up to 1. In fact, one could model a probability measure on the space proportional to the plausibilities (normalized so that p(red) + p(white) + p(blue) = 1, with the exception that all probabilities remain ≤ 1).

Combining probability sets


Beliefs corresponding to independent pieces of information are combined using Dempster's rule of combination, which is a generalisation of the special case of Bayes' theorem where events are independent (there is as yet no method of combining non-independent pieces of information). Note that the probability masses from propositions that contradict each other can also be used to obtain a measure of how much conflict there is in a system. This measure has been used as a criterion for clustering multiple pieces of seemingly conflicting evidence around competing hypotheses.

In addition, one of the computational advantages of the Dempster-Shafer framework is that priors and conditionals need not be specified, unlike Bayesian methods which often use a symmetry (minimax error) argument to assign prior probabilities to random variables (e.g. assigning 0.5 to binary values for which no information is available about which is more likely). However, any information contained in the missing priors and conditionals is not used in the Dempster-Shafer framework unless it can be obtained indirectly - and arguably is then available for calculation using Bayes equations.

Dempster-Shafer theory allows one to specify a degree of ignorance in this situation instead of being forced to supply prior probabilities which add to unity. This sort of situation, and whether there is a real distinction between risk and ignorance, has been extensively discussed by statisticians and economists. See, for example, the contrasting views of Daniel Ellsberg, Howard Raiffa, Kenneth Arrow and Frank Knight.



Judea Pearl (1988a, chapter 9;[4] 1988b[5] and 1990);[6] has argued that it is misleading to interpret belief functions as representing either "probabilities of an event," or "the confidence one has in the probabilities assigned to various outcomes," or "degrees of belief (or confidence, or trust) in a proposition," or "degree of ignorance in a situation." Instead, belief functions represent the probability that a given proposition is provable from a set of other propositions, to which probabilities are assigned. Confusing probabilities of truth with probabilities of provability may lead to counterintuitive results in reasoning tasks such as (1) representing complete knowledge, (2) belief-updating and (3) evidence pooling. He further demonstrated that, if partial knowledge is encoded and updated by belief function methods, the resulting beliefs cannot serve as a basis for rational decisions.


  1. Shafer, Glenn; A Mathematical Theory of Evidence, Princeton University Press, 1976
  2. Shafer, Glenn; Dempster-Shafer theory, 2002
  3. Dempster, Arthur P.; A generalization of Bayesian inference, Journal of the Royal Statistical Society, Series B, Vol. 30, pp. 205-247, 1968
  4. Pearl, J. (1988a), Probabilistic Reasoning in Intelligent Systems, (Revised Second Printing) San Mateo, CA: Morgan Kaufmann.
  5. Pearl, J. (1988b) "On Probability Intervals," International Journal of Approximate Reasoning, 2(3):211-216.
  6. Pearl, J. (1990) Reasoning with Belief Functions: An Analysis of Compatibility. The International Journal of Approximate Reasoning, 4(5/6):363-389.

Fuzzy Logic

About Fuzzy Logic


Fuzzy logic is derived from fuzzy set theory dealing with reasoning that is approximate rather than precisely deduced from classical predicate logic. It can be thought of as the application side of fuzzy set theory dealing with well thought out real world expert values for a complex problem (Klir 1997).

Degrees of truth are often confused with probabilities. However, they are conceptually distinct; fuzzy truth represents membership in vaguely defined sets, not likelihood of some event or condition. For example, if a 100-ml glass contains 30 ml of water, then, for two fuzzy sets, Empty and Full, one might define the glass as being 0.7 empty and 0.3 full. Note that the concept of emptiness would be subjective and thus would depend on the observer or designer. Another designer might equally well design a set membership function where the glass would be considered full for all values down to 50 ml. A probabilistic setting would first define a scalar variable for the fullness of the glass, and second, conditional distributions describing the probability that someone would call the glass full given a specific fullness level. Note that the conditioning can be achieved by having a specific observer that randomly selects the label for the glass, a distribution over deterministic observers, or both. While fuzzy logic avoids talking about randomness in this context, this simplification at the same time obscures what exactly is meant by the statement "the glass is 0.3 full".

Fuzzy logic allows for set membership values to range (inclusively) between 0 and 1, and in its linguistic form, imprecise concepts like "slightly", "quite" and "very". Specifically, it allows partial membership in a set. It is related to fuzzy sets and possibility theory. It was introduced in 1965 by Lotfi Zadeh at the University of California, Berkeley.

Fuzzy logic is controversial in some circles and is rejected by some control engineers and by most statisticians who hold that probability is the only rigorous mathematical description of uncertainty. Critics also argue that it cannot be a superset of ordinary set theory since membership functions are defined in terms of conventional sets.



Fuzzy logic can be used to control household appliances such as washing machines (which sense load size and detergent concentration and adjust their wash cycles accordingly) and refrigerators.

A basic application might characterize subranges of a continuous variable. For instance, a temperature measurement for anti-lock brakes might have several separate membership functions defining particular temperature ranges needed to control the brakes properly. Each function maps the same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to determine how the brakes should be controlled.


In this image, cold, warm, and hot are functions mapping a temperature scale. A point on that scale has three "truth values" — one for each of the three functions. For the particular temperature illustrated with the vertical line, the three truth values could be interpreted as describing the temperature as, say, "fairly cold" (blue arrow), "slightly warm" (yellow arrow), and "not hot" (red arrow).
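
Membership functions like these are easy to sketch in Python. The triangular shapes and breakpoint temperatures below are illustrative assumptions, not values read from the figure:

```python
def cold(t):
    # fully cold at or below 10 degrees, not cold at all above 20 (assumed)
    return max(0.0, min(1.0, (20.0 - t) / 10.0))

def warm(t):
    # peaks at 20 degrees, tapering to zero at 10 and 30 (assumed)
    return max(0.0, 1.0 - abs(t - 20.0) / 10.0)

def hot(t):
    # not hot at all below 20 degrees, fully hot at or above 30 (assumed)
    return max(0.0, min(1.0, (t - 20.0) / 10.0))

# one temperature, three simultaneous truth values
t = 14.0
degrees = (cold(t), warm(t), hot(t))  # fairly cold, slightly warm, not hot
```

At 14 degrees the three memberships are (0.6, 0.4, 0.0): the same temperature is "fairly cold" and "slightly warm" at once, and "not hot" at all.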

Misconceptions and controversies

Fuzzy logic is the same as "imprecise logic".
Fuzzy logic is not any less precise than any other form of logic: it is an organized and mathematical method of handling inherently imprecise concepts. The concept of "coldness" cannot be expressed in an equation, because although temperature is a quantity, "coldness" is not. However, people have an idea of what "cold" is, and agree that there is no sharp cutoff between "cold" and "not cold": it is not the case that something is "cold" at N degrees but "not cold" at N+1 degrees. Classical logic cannot easily handle such concepts due to the principle of bivalence; because the question has no single crisp answer, it is treated as "fuzzy".
Fuzzy logic is a new way of expressing probability.
Fuzzy logic and probability are different ways of expressing uncertainty. While both fuzzy logic and probability theory can be used to represent subjective belief, fuzzy set theory uses the concept of fuzzy set membership (i.e. how much a variable is in a set), probability theory uses the concept of subjective probability (i.e. how probable do I think that a variable is in a set). While this distinction is mostly philosophical, the fuzzy-logic-derived possibility measure is inherently different from the probability measure, hence they are not directly equivalent. However, many statisticians are persuaded by the work of Bruno de Finetti that only one kind of mathematical uncertainty is needed and thus fuzzy logic is unnecessary. On the other hand, Bart Kosko argues that probability is a subtheory of fuzzy logic, as probability only handles one kind of uncertainty. He also claims to have proven a derivation of Bayes' theorem from the concept of fuzzy subsethood. Lotfi Zadeh, the creator of fuzzy logic, argues that fuzzy logic is different in character from probability, and is not a replacement for it. He has created a fuzzy alternative to probability, which he calls possibility theory. Other controversial approaches to uncertainty include Dempster-Shafer theory and rough sets.
Fuzzy logic will be difficult to scale to larger problems.
This criticism is mainly due to the fact that there exist problems with conditional possibility, the fuzzy set theory equivalent of conditional probability (see Halpern (2003), section 3.8). This makes it difficult to perform inference. However there have not been many studies comparing fuzzy-based systems with probabilistic ones.

Examples where fuzzy logic is used

  • Automobile and other vehicle subsystems, such as automatic transmissions, ABS and cruise control (e.g. Tokyo monorail)
  • Air conditioners
  • The Massive engine used in the Lord of the Rings films, which was used to simulate huge-scale armies with random yet orderly movement
  • Cameras
  • Digital image processing, such as edge detection
  • Rice cookers
  • Dishwashers
  • Elevators
  • Washing machines and other home appliances
  • Video game artificial intelligence
  • Language filters on message boards and chat rooms for filtering out offensive text
  • Pattern recognition in Remote Sensing
  • Fuzzy logic has also been incorporated into some microcontrollers and microprocessors, for instance, the Freescale 68HC12.

How fuzzy logic is applied


Fuzzy Set Theory defines Fuzzy Operators on Fuzzy Sets. The problem in applying this is that the appropriate Fuzzy Operator may not be known. For this reason, Fuzzy logic usually uses IF/THEN rules, or constructs that are equivalent, such as fuzzy associative matrices.

Rules are usually expressed in the form:
IF variable IS set THEN action

For example, an extremely simple temperature regulator that uses a fan might look like this:
IF temperature IS very cold THEN stop fan
IF temperature IS cold THEN turn down fan
IF temperature IS normal THEN maintain level
IF temperature IS hot THEN speed up fan

Notice there is no "ELSE". All of the rules are evaluated, because the temperature might be "cold" and "normal" at the same time to differing degrees.
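
This parallel evaluation can be sketched as follows. The membership degrees are assumed inputs; in a real controller they would come from membership functions over the measured temperature:

```python
# membership degrees of the measured temperature in each set (assumed values)
memberships = {"very cold": 0.0, "cold": 0.3, "normal": 0.6, "hot": 0.0}

# consequent of each rule, keyed by the set named in its condition
actions = {
    "very cold": "stop fan",
    "cold": "turn down fan",
    "normal": "maintain level",
    "hot": "speed up fan",
}

# every rule fires to the degree its condition holds; no rule is skipped
firing = {actions[s]: degree for s, degree in memberships.items() if degree > 0}
```

Here both "turn down fan" (0.3) and "maintain level" (0.6) are active at once; a defuzzification step would blend them into a single fan setting.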

The AND, OR, and NOT operators of boolean logic exist in fuzzy logic, usually defined as the minimum, maximum, and complement; when they are defined this way, they are called the Zadeh operators, because they were first defined as such in Zadeh's original papers. So for the fuzzy variables x and y:

NOT x = (1 - truth(x))

x AND y = minimum(truth(x), truth(y))

x OR y = maximum(truth(x), truth(y))
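
In Python, the Zadeh operators are one-liners over truth degrees in [0, 1] (the function names are arbitrary):

```python
def f_not(x):
    # complement: NOT x = 1 - truth(x)
    return 1.0 - x

def f_and(x, y):
    # conjunction as the minimum of the two truth degrees
    return min(x, y)

def f_or(x, y):
    # disjunction as the maximum of the two truth degrees
    return max(x, y)
```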

There are also other operators, more linguistic in nature, called hedges that can be applied. These are generally adverbs such as "very", or "somewhat", which modify the meaning of a set using a mathematical formula.

In application, the programming language Prolog is well suited to implementing fuzzy logic, with its facilities for setting up a database of "rules" which are queried to deduce conclusions. This sort of programming is known as logic programming.

Once fuzzy relations are defined, it is possible to develop fuzzy relational databases. The first fuzzy relational database, FRDB, appeared in Maria Zemankova's dissertation. Later, other models arose, like the Buckles-Petry model, the Prade-Testemale model, the Umano-Fukami model and the GEFRED model by J.M. Medina, M.A. Vila et al. In the context of fuzzy databases, some fuzzy querying languages have been defined, notably SQLf by P. Bosc et al. and FSQL by J. Galindo et al. These languages define some structures in order to include fuzzy aspects in SQL statements, like fuzzy conditions, fuzzy comparators, fuzzy constants, fuzzy constraints, fuzzy thresholds, linguistic labels and so on.

Other examples

  • If a man is 1.8 meters tall, consider him tall:

IF male IS true AND height >= 1.8 THEN is_tall IS true; is_short IS false

  • The fuzzy rules avoid the sharp distinction between tall and short, which is not very realistic:

IF height <= medium male THEN is_short IS agree somewhat
IF height >= medium male THEN is_tall IS agree somewhat

In the fuzzy case, there are no crisp heights like 1.83 meters; instead, there are fuzzy values, like the following assignments:

dwarf male = [0, 1.3] m
short male = (1.3, 1.5]
medium male = (1.5, 1.8]
tall male = (1.8, 2.0]
giant male > 2.0 m

For the consequent, there are also not only two values, but five, say:

agree not = 0
agree little = 1
agree somewhat = 2
agree a lot = 3
agree fully = 4

In the binary, or "crisp", case, a person 1.79 meters tall is considered short, while persons of 1.8 meters or 2.25 meters are considered tall.

The crisp example differs deliberately from the fuzzy one. We did not put in the antecedent

IF male >= agree somewhat AND ...

as gender is usually treated as binary information, unlike tallness, which is a matter of degree.

Formal fuzzy logic


In mathematical logic, there are several formal systems that model the above notions of "fuzzy logic"; most of them belong among so-called t-norm fuzzy logics. Note that they use a different set of operations than above mentioned Zadeh operators.

Propositional fuzzy logics


The most important propositional fuzzy logics are:

  • Basic propositional fuzzy logic BL is an axiomatization of logic where conjunction is defined by a continuous t-norm, and implication is defined as the residuum of the t-norm. Its models correspond to BL-algebras.
  • Łukasiewicz fuzzy logic is a special case of basic fuzzy logic where conjunction is Łukasiewicz t-norm. It has the axioms of basic logic plus an axiom of double negation (so it is not intuitionistic logic), and its models correspond to MV-algebras.
  • Gödel fuzzy logic is a special case of basic fuzzy logic where conjunction is Gödel t-norm. It has the axioms of basic logic plus an axiom of idempotence of conjunction, and its models are called G-algebras.
  • Product fuzzy logic is a special case of basic fuzzy logic where conjunction is product t-norm. It has the axioms of basic logic plus another axiom, and its models are called product algebras.
  • Monoidal t-norm logic MTL is a generalization of basic fuzzy logic BL where conjunction is realized by a left-continuous t-norm. Its models (MTL-algebras) are prelinear commutative bounded integral residuated lattices.
  • Rational Pavelka logic is a generalization of multi-valued logic. It is an extension of Łukasiewicz fuzzy logic with additional constants.

All these logics encompass the traditional propositional logic (whose models correspond to Boolean algebras).

Predicate fuzzy logics


These extend the above-mentioned fuzzy logics by adding universal and existential quantifiers in a manner similar to the way that predicate logic is created from propositional logic. The semantics of the universal resp. existential quantifier in t-norm fuzzy logics is the infimum resp. supremum of the truth degrees of the instances of the quantified subformula.

Effectiveness for fuzzy logics


The notions of a "decidable subset" and a "recursively enumerable subset" are basic ones for classical mathematics and classical logic, so the question arises of a suitable extension of these concepts to fuzzy set theory. A first proposal in this direction was made by E.S. Santos with the notions of fuzzy Turing machine, Markov normal fuzzy algorithm and fuzzy program. Subsequently, L. Biacino and G. Gerla proposed the following definition, where Ü denotes the set of rational numbers in [0,1]. A fuzzy subset μ : S → [0,1] of a set S is recursively enumerable if a recursive map h : S×N → Ü exists such that, for every x in S, the function h(x,n) is increasing with respect to n and μ(x) = lim h(x,n). We say that μ is decidable if both μ and its complement –μ are recursively enumerable. An extension of this theory to the general case of L-subsets is proposed in a paper by G. Gerla. The proposed definitions are well related to fuzzy logic. Indeed, the following theorem holds true (provided that the deduction apparatus of the fuzzy logic satisfies some obvious effectiveness property).

Theorem. Any axiomatizable fuzzy theory is recursively enumerable. In particular, the fuzzy set of logically true formulas is recursively enumerable in spite of the fact that the crisp set of valid formulas is not recursively enumerable, in general. Moreover, any axiomatizable and complete theory is decidable.

It is an open question to give support for a Church thesis for fuzzy logic, claiming that the proposed notion of recursive enumerability for fuzzy subsets is the adequate one. To this end, further investigations of the notions of fuzzy grammar and fuzzy Turing machine would be necessary (see for example Wiedermann's paper). Another open question is to start from this notion to find an extension of Gödel's theorems to fuzzy logic.


  • Von Altrock, Constantin (1995). Fuzzy logic and NeuroFuzzy applications explained. Upper Saddle River, NJ: Prentice Hall PTR. ISBN 0-13-368465-2.
  • Biacino, L. (2002). "Fuzzy logic, continuity and effectiveness". Archive for Mathematical Logic. 41 (7): 643–667. doi:10.1007/s001530100128. ISSN 0933-5846.
  • Cox, Earl (1994). The fuzzy systems handbook: a practitioner's guide to building, using, maintaining fuzzy systems. Boston: AP Professional. ISBN 0-12-194270-8.
  • Elkan, C. (1994). "The Paradoxical Success of Fuzzy Logic". IEEE Expert. 9 (4): 3–8. doi:10.1109/64.336150. ISSN 0885-9000. Available from Elkan's home page.
  • Gerla, Giangiacomo (2006). "Effectiveness and Multivalued Logics". Journal of Symbolic Logic. 71 (1): 137–162. ISSN 0022-4812.
  • Hájek, Petr (1998). Metamathematics of fuzzy logic. Dordrecht: Kluwer. ISBN 0-7923-5238-6.
  • Hájek, Petr (1995). "Fuzzy logic and arithmetical hierarchy". Fuzzy Sets and Systems. 3 (8): 359–363. doi:10.1016/0165-0114(94)00299-M. ISSN 0165-0114.
  • Halpern, Joseph Y. (2003). Reasoning about uncertainty. Cambridge, Mass: MIT Press. ISBN 0-262-08320-5.
  • Höppner, Frank (1999). Fuzzy cluster analysis: methods for classification, data analysis and image recognition. New York: John Wiley. ISBN 0-471-98864-2.
  • Ibrahim, Ahmad M. (1997). Introduction to Applied Fuzzy Electronics. Englewood Cliffs, NJ: Prentice Hall. ISBN 0-13-206400-6.
  • Klir, George J. (1988). Fuzzy sets, uncertainty, and information. Englewood Cliffs, NJ: Prentice Hall. ISBN 0-13-345984-5.
  • Klir, George J. (1997). Fuzzy set theory: foundations and applications. Englewood Cliffs, NJ: Prentice Hall. ISBN 0-13-341058-7.
  • Klir, George J. (1995). Fuzzy sets and fuzzy logic: theory and applications. Upper Saddle River, NJ: Prentice Hall PTR. ISBN 0-13-101171-5.
  • Kosko, Bart (1993). Fuzzy thinking: the new science of fuzzy logic. New York: Hyperion. ISBN 0-7868-8021-X.
  • Montagna, F. (2001). "Three complexity problems in quantified fuzzy logic". Studia Logica. 68 (1): 143–152. doi:10.1023/A:1011958407631. ISSN 0039-3215.
  • Mundici, Daniele (1999). Algebraic foundations of many-valued reasoning. Dordrecht: Kluwer Academic. ISBN 0-7923-6009-5.
  • Novák, Vilém (1999). Mathematical principles of fuzzy logic. Dordrecht: Kluwer Academic. ISBN 0-7923-8595-0.
  • Passino, Kevin M. (1998). Fuzzy control. Boston: Addison-Wesley. ISBN 0-201-18074-X.
  • Scarpellini, Bruno (1962). "Die Nichtaxiomatisierbarkeit des unendlichwertigen Prädikatenkalküls von Łukasiewicz". Journal of Symbolic Logic. 27 (2): 159–170. doi:10.2307/2964111. ISSN 0022-4812.
  • Wiedermann, J. (2004). "Characterizing the super-Turing computing power and efficiency of classical fuzzy Turing machines". Theoretical Computer Science. 317: 61–69.
  • Yager, Ronald R. (1994). Essentials of fuzzy modeling and control. New York: Wiley. ISBN 0-471-01761-2.
  • Zadeh, L.A. (1968). "Fuzzy algorithms". Information and Control. 12 (2): 94–102. doi:10.1016/S0019-9958(68)90211-8. ISSN 0019-9958.
  • Zadeh, L.A. (1965). "Fuzzy sets". Information and Control. 8 (3): 338–353. doi:10.1016/S0019-9958(65)90241-X. ISSN 0019-9958.
  • Zemankova-Leech, M. (1983). Fuzzy Relational Data Bases. Ph.D. dissertation, Florida State University.
  • Zimmermann, H. (2001). Fuzzy set theory and its applications. Boston: Kluwer Academic Publishers. ISBN 0-7923-7435-5.


Rete Algorithm

About the Rete Algorithm


The Rete algorithm is an efficient pattern matching algorithm for implementing production rule systems. The Rete algorithm was designed by Dr Charles L. Forgy of Carnegie Mellon University, first published in a working paper in 1974, and later elaborated in his 1979 Ph.D. thesis and a 1982 paper (see References). Rete has become the basis for many popular expert systems, including CLIPS, Jess, JBoss Rules, and Soar.

A naïve implementation of an expert system might check each rule against the known facts in the knowledge base, firing that rule if necessary, then moving on to the next rule (and looping back to the first rule when finished). Even for moderately sized collections of rules and facts, this naïve approach performs far too slowly.
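To see why this is slow, consider a minimal sketch of such a naïve matcher. The rule, the facts, and the loop structure are invented for illustration; the point is that every cycle re-scans all rules against all fact combinations:

```python
# A deliberately naive production system: every cycle re-checks every rule
# against every fact.  Rule and fact names are invented for illustration.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def grandparent_rule(facts):
    """If X is a parent of Y and Y is a parent of Z, assert (grandparent, X, Z)."""
    new = set()
    for (r1, x, y1) in facts:
        for (r2, y2, z) in facts:
            if r1 == r2 == "parent" and y1 == y2:
                new.add(("grandparent", x, z))
    return new - facts

rules = [grandparent_rule]

# Loop until no rule produces a new fact.  The cost per cycle is
# O(rules x facts^2) here -- the inefficiency Rete is designed to remove.
changed = True
while changed:
    changed = False
    for rule in rules:
        derived = rule(facts)
        if derived:
            facts |= derived
            changed = True

print(("grandparent", "alice", "carol") in facts)  # True
```

Rete avoids this repeated re-evaluation by remembering, between cycles, which facts matched which patterns.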

The Rete algorithm (usually pronounced either 'REET', 'REE-tee' or, in Europe, 're-tay' after the Latin pronunciation; from the Latin 'rete' for net, or network) provides the basis for a more efficient implementation of an expert system. A Rete-based expert system builds a network of nodes, where each node (except the root) corresponds to a pattern occurring in the left-hand side of a rule. The path from the root node to a leaf node defines a complete rule left-hand side. Each node has a memory of facts which satisfy that pattern. This structure is essentially a generalized trie.

As new facts are asserted or modified, they propagate along the network, causing nodes to be annotated when that fact matches that pattern. When a fact or combination of facts causes all of the patterns for a given rule to be satisfied, a leaf node is reached and the corresponding rule is triggered.

The Rete algorithm is designed to sacrifice memory for increased speed. In most cases, the speed increase over naïve implementations is several orders of magnitude (because Rete performance is theoretically independent of the number of rules in the system). In very large expert systems, however, the original Rete algorithm tends to run into memory consumption problems. Other algorithms, both novel and Rete-based, have since been designed which require less memory.



The Rete algorithm provides a generalized logical description of an implementation of functionality responsible for matching data tuples (‘facts’) against productions (‘rules’) in a pattern-matching production system (a category of rule engine). A production consists of one or more conditions and a set of actions which may be undertaken for each complete set of facts that match the conditions. Conditions test fact attributes, including fact type specifiers/identifiers. The Rete algorithm exhibits the following major characteristics:

  • It reduces or eliminates certain types of redundancy through the use of node sharing.
  • It stores partial matches when performing joins between different fact types. This, in turn, allows production systems to avoid complete re-evaluation of all facts each time changes are made to the production system’s working memory. Instead, the production system needs only to evaluate the changes (deltas) to working memory.
  • It allows for efficient removal of memory elements when facts are retracted from working memory.

The Rete algorithm is widely used to implement matching functionality within pattern-matching engines that exploit a match-resolve-act cycle to support forward chaining and inferencing.

Retes are directed acyclic graphs that represent higher-level rule sets. They are generally represented at run-time using a network of in-memory objects. These networks match rule conditions (patterns) to facts (relational data tuples). Rete networks act as a type of relational query processor, performing projections, selections and joins conditionally on arbitrary numbers of data tuples.

Productions (rules) are typically captured and defined by analysts and developers using some high-level rules language. They are collected into rule sets which are then translated, often at run time, into an executable Rete.

When facts are ‘asserted’ to working memory, the engine creates ‘working memory elements’ (WMEs) for each fact. Facts are n-tuples, and may therefore contain an arbitrary number of data items. Each WME may hold an entire n-tuple, or, alternatively, each fact may be represented by a set of WMEs where each WME contains a fixed-length tuple. In this case, tuples are typically triplets (3-tuples).
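For example, under the fixed-length-tuple representation, a single fact about one entity can be decomposed into several (identifier, attribute, value) triples sharing the same entity identifier. The fact data here is hypothetical:

```python
from collections import namedtuple

# Classic fixed-length WME representation: (identifier, attribute, value)
# triples.  The fact data is invented for illustration.
WME = namedtuple("WME", ["id", "attr", "value"])

# A single n-tuple fact "block b1 is a red block on the table" becomes
# several 3-tuple WMEs sharing the same entity identifier:
wmes = [
    WME("b1", "type", "block"),
    WME("b1", "color", "red"),
    WME("b1", "on", "table"),
]

for w in wmes:
    print(w.id, w.attr, w.value)
```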

Each WME enters the Rete network at a single root node. The root node passes each WME on to its child nodes, and each WME may then be propagated through the network, possibly being stored in intermediate memories, until it arrives at a terminal node.

Alpha Network


The ‘left’ (alpha) side of the node graph forms a discrimination network responsible for selecting individual WMEs based on simple conditional tests which match WME attributes against constant values. Nodes in the discrimination network may also perform tests that compare two or more attributes of the same WME. If a WME is successfully matched against the conditions represented by one node, it is passed to the next node. In most engines, the immediate child nodes of the root node are used to test the entity identifier or fact type of each WME. Hence, all the WMEs which represent the same entity type typically traverse a given branch of nodes in the discrimination network.

Within the discrimination network, each branch of alpha nodes (also called 1-input nodes) terminates at a memory, called an ‘alpha’ memory. These memories store collections of WMEs that match each condition in each node in a given node branch. WMEs that fail to match at least one condition in a branch are not materialised within the corresponding alpha memory. Alpha node branches may fork in order to minimise condition redundancy.
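A minimal sketch of one alpha branch follows, assuming a simple object model in which 1-input nodes chain to an alpha memory. The class names, the test, and the WME data are all invented for illustration:

```python
# Sketch of an alpha (discrimination) network branch: a chain of 1-input
# nodes applying constant tests, terminating in an alpha memory.
class AlphaNode:
    def __init__(self, test, child):
        self.test = test          # predicate over a single WME
        self.child = child        # next node, or an alpha memory

    def activate(self, wme):
        if self.test(wme):        # WMEs failing the test go no further
            self.child.activate(wme)

class AlphaMemory:
    def __init__(self):
        self.items = []

    def activate(self, wme):
        self.items.append(wme)    # materialise the matching WME

# WMEs are (id, attr, value) triples; build the branch bottom-up.
memory = AlphaMemory()
color_test = AlphaNode(lambda w: w[1] == "color" and w[2] == "red", memory)

for wme in [("b1", "color", "red"), ("b2", "color", "blue"), ("b3", "color", "red")]:
    color_test.activate(wme)

print(memory.items)  # [('b1', 'color', 'red'), ('b3', 'color', 'red')]
```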

A possible variation is to introduce additional memories for each intermediate node in the discrimination network. This increases the overhead of the Rete, but may have advantages in situations where rules are dynamically added to or removed from the Rete, making it easier to vary the topography of the discrimination network dynamically.

An alternative implementation is described by Doorenbos. In this case, the discrimination network is replaced by a set of memories and an index. The index may be implemented using a hash table. Each memory holds WMEs that match a single conditional pattern, and the index is used to reference memories by their pattern. This approach is only practical when WMEs represent fixed-length tuples, and the length of each tuple is short (e.g., 3-tuples). In addition, the approach only applies to conditional patterns that perform equality tests against constant values. When a WME enters the Rete, the index is used to locate a set of memories whose conditional pattern matches the WME attributes, and the WME is then added directly to each of these memories. In itself, this implementation contains no 1-input nodes. However, in order to implement non-equality tests, the Rete may contain additional 1-input node networks through which WMEs are passed before being placed in a memory. Alternatively, non-equality tests may be performed in the beta network described below.
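A rough sketch of this memory-plus-index variant, using a Python dict as the hash table. The pattern scheme with None as a wildcard is illustrative, not Doorenbos' exact layout:

```python
from collections import defaultdict

# Alternative alpha network: no 1-input nodes; an index (hash table)
# maps constant patterns directly to alpha memories.  Patterns use None
# as a wildcard; all names are illustrative.
memories = defaultdict(list)

def patterns_for(wme):
    """All constant equality patterns a 3-tuple WME can match."""
    i, a, v = wme
    return [(i, a, v), (i, a, None), (i, None, v), (None, a, v),
            (i, None, None), (None, a, None), (None, None, v),
            (None, None, None)]

def assert_wme(wme):
    # One hash lookup per pattern replaces traversal of a node chain.
    for p in patterns_for(wme):
        memories[p].append(wme)

assert_wme(("b1", "color", "red"))
assert_wme(("b2", "color", "blue"))

print(memories[(None, "color", None)])   # both WMEs
print(memories[(None, "color", "red")])  # only b1's WME
```

As the text notes, this only handles equality tests against constants; other tests must be handled by additional node networks or in the beta network.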

Beta Network


The ‘right’ (beta) side of the graph chiefly performs joins between different WMEs. It is optional, and is only included if required. It consists of 2-input nodes where each node has a ‘left’ and a ‘right’ input. Each beta node sends its output to a ‘beta’ memory.

Beta nodes process tokens. A token is a unit of storage within a memory and also a unit of exchange between memories and nodes. In many implementations, tokens are introduced within alpha memories where they are used to hold single WMEs. These tokens are then passed to the beta network.

Each beta node performs its work and, as a result, may create new tokens to hold a list of WMEs representing a partial match. These extended tokens are then stored in beta memories and passed to subsequent beta nodes. In this case, the beta nodes typically pass lists of WMEs through the beta network by copying existing WME lists from each received token into new tokens, and then adding further WMEs to the lists as a result of performing a join or some other action. The new tokens are then stored in the output memory.

A common variation is to build linked lists of tokens where each token holds a single WME. In this case, lists of WMEs for a partial match are represented by the linked list of tokens. This approach may be more efficient because it eliminates the need to copy lists of WMEs from one token to another. Instead, a beta node needs only to create a new token to hold a WME it wishes to join to the partial match list, and then link the new token to a parent token stored in the input beta memory. The new token now forms the head of the token list, and is stored in the output beta memory.
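The linked-list variation might be sketched as follows. This is a hypothetical minimal implementation; field names and data are invented:

```python
# Linked-list token variant: each token holds one WME plus a pointer to
# its parent, so extending a partial match never copies a WME list.
class Token:
    def __init__(self, parent, wme):
        self.parent = parent   # Token earlier in the partial match, or None
        self.wme = wme

    def wmes(self):
        """Walk the parent chain to recover the full WME list."""
        out, t = [], self
        while t is not None:
            out.append(t.wme)
            t = t.parent
        return list(reversed(out))

# Joining a third WME to a two-WME partial match allocates one token:
t1 = Token(None, ("b1", "on", "b2"))
t2 = Token(t1, ("b2", "on", "table"))
t3 = Token(t2, ("b1", "color", "red"))

print(t3.wmes())
# [('b1', 'on', 'b2'), ('b2', 'on', 'table'), ('b1', 'color', 'red')]
```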

In descriptions of Rete, it is common to refer to token passing within the beta network. In this article, however, we will describe data propagation in terms of WME lists, rather than tokens, in recognition of different implementation options and the underlying purpose and use of tokens. As any one WME list passes through the beta network, new WMEs may be added to it, and the list may be stored in beta memories. A WME list in a beta memory represents a partial match for the conditions in a given production.

WME lists that reach the end of a branch of beta nodes represent a complete match for a single production, and are passed to terminal nodes. These nodes are sometimes called ‘p-nodes’, where ‘p’ stands for ‘production’. Each terminal node represents a single production, and each WME list that arrives at a terminal node represents a complete set of matching WMEs for the conditions in that production. For each WME list it receives, a production node will ‘activate’ a new production instance on the ‘agenda’. Agendas are typically implemented as prioritised queues.

Beta nodes typically perform joins between WME lists stored in beta memories and individual WMEs stored in alpha memories. Each beta node is associated with two input memories. An alpha memory holds WMEs and performs ‘right’ activations on the beta node each time it stores a new WME. A beta memory holds WME lists and performs ‘left’ activations on the beta node each time it stores a new WME list. When a join node is right-activated, it compares one or more attributes of the newly stored WME from its input alpha memory against given attributes of specific WMEs in each WME list contained in the input beta memory. When a join node is left-activated it traverses a single newly stored WME list in the beta memory, retrieving specific attribute values of given WMEs. It compares these values with attribute values of each WME in the alpha memory.
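A simplified 2-input join node, showing left and right activations against its two input memories. The join test, class names, and data are all illustrative:

```python
# Sketch of a 2-input beta join node.  Left input: WME lists (partial
# matches) from a beta memory; right input: WMEs from an alpha memory.
class JoinNode:
    def __init__(self, test, output):
        self.left_memory = []    # stored WME lists
        self.right_memory = []   # stored WMEs
        self.test = test         # test(wme_list, wme) -> bool
        self.output = output     # collects extended WME lists

    def left_activate(self, wme_list):
        self.left_memory.append(wme_list)
        for wme in self.right_memory:
            if self.test(wme_list, wme):
                self.output.append(wme_list + [wme])

    def right_activate(self, wme):
        self.right_memory.append(wme)
        for wme_list in self.left_memory:
            if self.test(wme_list, wme):
                self.output.append(wme_list + [wme])

# Join on a shared variable: the value of the first WME's "on" attribute
# must equal the identifier of the newly arriving WME.
out = []
node = JoinNode(lambda wl, w: wl[0][2] == w[0], out)
node.left_activate([("b1", "on", "b2")])
node.right_activate(("b2", "color", "red"))   # matches: b2 == b2
node.right_activate(("b3", "color", "red"))   # no match

print(out)  # [[('b1', 'on', 'b2'), ('b2', 'color', 'red')]]
```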

Each beta node outputs WME lists which are either stored in a beta memory or sent directly to a terminal node. WME lists are stored in beta memories whenever the engine will perform additional left activations on subsequent beta nodes.

Logically, a beta node at the head of a branch of beta nodes is a special case because it takes no input from any beta memory higher in the network. Different engines handle this issue in different ways. Some engines use specialised adapter nodes to connect alpha memories to the left input of beta nodes. Other engines allow beta nodes to take input directly from two alpha memories, treating one as a ‘left’ input and the other as a ‘right’ input. In both cases, ‘head’ beta nodes take their input from two alpha memories.

In order to eliminate node redundancies, any one alpha or beta memory may be used to perform activations on multiple beta nodes. As well as join nodes, the beta network may contain additional node types, some of which are described below. If a Rete contains no beta network, alpha nodes feed tokens, each containing a single WME, directly to p-nodes. In this case, there may be no need to store WMEs in alpha memories.

Conflict Resolution


During any one match-resolve-act cycle, the engine will find all possible matches for the facts currently asserted to working memory. Once all the current matches have been found, and corresponding production instances have been activated on the agenda, the engine determines an order in which the production instances may be ‘fired’. This is termed ‘conflict resolution’, and the list of activated production instances is termed the ‘conflict set’. The order may be based on rule priority (salience), rule order, the time at which facts contained in each instance were asserted to the working memory, the complexity of each production or some other criteria. Many engines allow rule developers to select between different conflict resolution strategies or to chain a selection of multiple strategies.

Conflict resolution is not defined as part of the Rete algorithm, but is used alongside the algorithm. Some specialised production systems do not perform conflict resolution.
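One possible conflict resolution strategy can be sketched as ordering the conflict set by salience and then by recency of the matched facts. The field names, rule names, and values are invented for illustration:

```python
from dataclasses import dataclass

# An activated production instance on the agenda.  Field names are invented.
@dataclass
class Activation:
    rule: str
    salience: int
    recency: int   # timestamp of the newest fact in the match

agenda = [
    Activation("cleanup", salience=-10, recency=5),
    Activation("diagnose", salience=0, recency=3),
    Activation("alarm", salience=100, recency=1),
    Activation("diagnose", salience=0, recency=7),
]

# Order by salience (descending), breaking ties by recency (descending).
conflict_set = sorted(agenda, key=lambda a: (-a.salience, -a.recency))
print([a.rule for a in conflict_set])  # ['alarm', 'diagnose', 'diagnose', 'cleanup']
```

Chaining strategies, as some engines allow, amounts to extending the sort key with further criteria.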

Production Execution


Having performed conflict resolution, the engine now ‘fires’ the first production instance, executing a list of actions associated with the production. The actions act on the data represented by the production instance’s WME list.

By default, the engine will continue to fire each production instance in order until all production instances have been fired. Each production instance will fire only once, at most, during any one match-resolve-act cycle. This characteristic is termed ‘refraction’. However, the sequence of production instance firings may be interrupted at any stage by performing changes to the working memory. Rule actions can contain instructions to assert or retract WMEs from the ‘working memory’ of the engine. Each time any single production instance performs one or more such changes, the engine immediately enters a new match-resolve-act cycle. This includes ‘updates’ to WMEs currently in the working memory. Updates are represented by retracting and then re-asserting the WME. The engine undertakes matching of the changed data which, in turn, may result in changes to the list of production instances on the agenda. Hence, after the actions for any one specific production instance have been executed, previously activated instances may have been de-activated and removed from the agenda, and new instances may have been activated.

As part of the new match-resolve-act cycle, the engine performs conflict resolution on the agenda and then executes the current first instance. The engine continues to fire production instances, and to enter new match-resolve-act cycles, until no further production instances exist on the agenda. At this point the rule engine is deemed to have completed its work, and halts.
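The overall cycle described above can be sketched as follows, with refraction implemented by remembering fired instances. The rules, facts, and the trivial 'first wins' resolution strategy are all invented for illustration:

```python
# Minimal match-resolve-act loop with refraction: each production instance
# (rule name, matched fact) fires at most once.
def match(facts, rules):
    """Return (rule_name, fact) instances whose condition holds."""
    return [(name, f) for name, (cond, _) in rules.items()
            for f in facts if cond(f)]

rules = {
    "wet-means-rain": (lambda f: f == "grass-wet", lambda f: {"rained"}),
    "rain-means-clouds": (lambda f: f == "rained", lambda f: {"cloudy"}),
}

facts = {"grass-wet"}
fired = set()   # refraction: remember already-fired instances

while True:
    agenda = [inst for inst in match(facts, rules) if inst not in fired]
    if not agenda:
        break                       # empty agenda: the engine halts
    name, fact = agenda[0]          # trivial conflict resolution: first wins
    fired.add((name, fact))
    facts |= rules[name][1](fact)   # act: assert new facts, re-enter matching

print(sorted(facts))  # ['cloudy', 'grass-wet', 'rained']
```

A real Rete engine would, of course, compute the agenda incrementally from working-memory deltas rather than re-matching from scratch each cycle.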

Some engines support advanced refraction strategies in which certain production instances executed in a previous cycle are not re-executed in the new cycle, even though they may still exist on the agenda.

It is possible for the engine to enter into never-ending loops in which the agenda never reaches the empty state. For this reason, most engines support explicit ‘halt’ verbs that can be invoked from production action lists. They may also provide automatic loop detection in which never-ending loops are automatically halted after a given number of iterations. Some engines support a model in which, instead of halting when the agenda is empty, the engine enters a wait state until new facts are asserted externally.

As with conflict resolution, the firing of activated production instances is not a feature of the Rete algorithm itself. However, it is a central feature of engines that use Rete networks. Some of the optimisations offered by Rete networks are only useful in scenarios where the engine performs multiple match-resolve-act cycles.

Existential and Universal Quantifications


Conditional tests are most commonly used to perform selections and joins on individual tuples. However, by implementing additional beta node types, it is possible for Rete networks to perform quantifications. Existential quantification involves testing for the existence of at least one set of matching WMEs in working memory. Universal quantification involves testing that an entire set of WMEs in working memory meets a given condition. A variation of universal quantification might test that a given number of WMEs, drawn from a set of WMEs, meets a given criterion. This might be in terms of testing for either an exact number or a minimum number of matches.

Quantification is not universally implemented in Rete engines, and, where it is supported, several variations exist. A variant of existential quantification referred to as ‘negation’ is widely, though not universally, supported, and is described in seminal documents. Existentially negated conditions and conjunctions involve the use of specialised beta nodes that test for non-existence of matching WMEs or sets of WMEs. These nodes propagate WME lists only when no match is found. The exact implementation of negation varies. In one approach, the node maintains a simple count on each WME list it receives from its left input. The count specifies the number of matches found with WMEs received from the right input. The node only propagates WME lists whose count is zero. In another approach, the node maintains an additional memory on each WME list received from the left input. These memories are a form of beta memory, and store WME lists for each match with WMEs received on the right input. If a WME list does not have any WME lists in its memory, it is propagated down the network. In this approach, negation nodes generally activate further beta nodes directly, rather than storing their output in an additional beta memory. Negation nodes provide a form of 'negation as failure'.
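The counter-based approach to negation might be sketched as follows, showing left activation only. The class, test, and data are illustrative:

```python
# Sketch of the counter-based negation node: a WME list from the left is
# propagated only while its count of right-side matches is zero.
class NegationNode:
    def __init__(self, test, output):
        self.counts = {}        # WME list (as tuple) -> number of right matches
        self.right_memory = []  # WMEs from the input alpha memory
        self.test = test        # test(wme_list, wme) -> bool
        self.output = output

    def left_activate(self, wme_list):
        key = tuple(wme_list)
        self.counts[key] = sum(1 for w in self.right_memory
                               if self.test(wme_list, w))
        if self.counts[key] == 0:
            self.output.append(wme_list)   # no match found: propagate

out = []
# Negation test: block if any right WME shares the first WME's identifier.
node = NegationNode(lambda wl, w: wl[0][0] == w[0], out)

node.right_memory.append(("b1", "color", "red"))
node.left_activate([("b1", "on", "table")])   # blocked: b1 has a match
node.left_activate([("b2", "on", "table")])   # propagated: no match for b2

print(out)  # [[('b2', 'on', 'table')]]
```

A complete implementation would also handle right activations, incrementing counts and retracting previously propagated WME lists whose count becomes non-zero.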

When changes are made to working memory, a WME list that previously matched no WMEs may now match newly asserted WMEs. In this case, the propagated WME list and all its extended copies need to be retracted from beta memories further down the network. The second approach described above is often used to support efficient mechanisms for removal of WME lists. When WME lists are removed, any corresponding production instances are de-activated and removed from the agenda.

Existential quantification can be performed by combining two negation beta nodes. This represents the semantics of double negation (e.g., ‘If NOT NOT any matching WMEs, then...’). This is a common approach taken by several production systems.

Memory Indexing


The Rete algorithm does not mandate any specific approach to indexing the working memory. However, most modern production systems provide indexing mechanisms. In some cases, only beta memories are indexed, whilst in others, indexing is used for both alpha and beta memories. A good indexing strategy is a major factor in deciding the overall performance of a production system, especially when executing rule sets that result in highly combinatorial pattern matching (i.e., intensive use of beta join nodes), or, for some engines, when executing rule sets that perform a significant number of WME retractions during multiple match-resolve-act cycles. Memories are often implemented using combinations of hash tables, and hash values are used to perform conditional joins on subsets of WME lists and WMEs, rather than on the entire contents of memories. This, in turn, often significantly reduces the number of evaluations performed by the Rete network.
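A sketch of hash-based indexing of a beta memory by its join-attribute value, so that a right activation probes a single bucket rather than scanning the whole memory. The structure and names are invented:

```python
from collections import defaultdict

# A beta memory indexed by the value of its join attribute.
class IndexedBetaMemory:
    def __init__(self, key_of):
        self.key_of = key_of              # join-attribute value of a WME list
        self.buckets = defaultdict(list)  # hash table: key -> WME lists

    def store(self, wme_list):
        self.buckets[self.key_of(wme_list)].append(wme_list)

    def matches(self, key):
        # Probe one bucket instead of testing every stored WME list.
        return self.buckets.get(key, [])

# Index partial matches by the value of their first WME's third field.
mem = IndexedBetaMemory(lambda wl: wl[0][2])
mem.store([("b1", "on", "b2")])
mem.store([("b4", "on", "b5")])

# A right activation for WME ('b2', 'color', 'red') probes bucket 'b2':
print(mem.matches("b2"))  # [[('b1', 'on', 'b2')]]
```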

Removal of WMEs and WME lists


When a WME is retracted from working memory, it must be removed from every alpha memory in which it is stored. In addition, WME lists that contain the WME must be removed from beta memories, and activated production instances for these WME lists must be de-activated and removed from the agenda. Several implementation variations exist, including tree-based and rematch-based removal. Memory indexing may be used in some cases to optimise removal.

Handling OR’ed conditions


When defining productions in a rule set, it is common to allow conditions to be grouped using an ‘OR’ connective. In many production systems, this is handled by interpreting a single production containing multiple OR’ed patterns as the equivalent of multiple productions. The resulting Rete network contains sets of terminal nodes which, together, represent single productions. This approach disallows any form of short-circuiting of the OR’ed conditions. It can also, in some cases, lead to duplicate production instances being activated on the agenda where the same set of WMEs match multiple internal productions. Some engines provide agenda de-duplication in order to handle this issue.
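The expansion of OR'ed condition groups into multiple internal productions might look like this. The data structures and rule content are invented for illustration:

```python
# Interpret a production whose conditions are a disjunction of condition
# groups as the equivalent set of internal productions, one per disjunct.
def expand_or(production):
    name, condition_groups, action = production
    return [(f"{name}#{i}", conds, action)
            for i, conds in enumerate(condition_groups)]

# "fire-risk IF (smoke AND heat) OR (flame)"
rule = ("fire-risk", [["smoke", "heat"], ["flame"]], "sound-alarm")

for internal in expand_or(rule):
    print(internal)
# ('fire-risk#0', ['smoke', 'heat'], 'sound-alarm')
# ('fire-risk#1', ['flame'], 'sound-alarm')
```

Agenda de-duplication then amounts to recognising that both internal productions share the original rule name and action.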

Miscellaneous Considerations


Although not defined by the Rete algorithm, some engines provide extended functionality to support greater control of truth maintenance. For example, when a match is found for one production, this may result in the assertion of new WMEs which, in turn, match the conditions for another production. If a subsequent change to working memory causes the first match to become invalid, it may be that this implies that the second match is also invalid. The Rete algorithm does not define any mechanism to define and handle these logical truth dependencies automatically. Some engines, however, support additional functionality in which truth dependencies can be automatically maintained. In this case, the retraction of one WME may lead to the automatic retraction of additional WMEs in order to maintain logical truth assertions.

The Rete algorithm does not define any approach to justification. Justification refers to mechanisms commonly required in expert and decision systems in which, at its simplest, the system reports each of the inner decisions used to reach some final conclusion. For example, an expert system might justify a conclusion that an animal is an elephant by reporting that it is large, grey, has big ears, a trunk and tusks. Some engines provide built-in justification systems in conjunction with their implementation of the Rete algorithm.

This article does not provide an exhaustive description of every possible variation or extension of the Rete algorithm. Other considerations and innovations exist. For example, engines may provide specialised support within the Rete network in order to apply pattern-matching rule processing to specific data types and sources such as programmatic objects, XML data or relational data tables. Another example concerns additional time-stamping facilities provided by many engines for each WME entering a Rete network, and the use of these time-stamps in conjunction with conflict resolution strategies. Engines exhibit significant variation in the way they allow programmatic access to the engine and its working memory, and may extend the basic Rete model to support forms of parallel and distributed processing.

Optimisation and Performance


Several optimisations for Rete have been identified and described in academic literature. Several of these, however, apply only in very specific scenarios, and therefore often have little or no application in a general-purpose rules engine. In addition, alternative algorithms such as TREAT and LEAPS have been formulated which may provide additional performance improvements. There are currently very few commercial or open source examples of production systems that support these alternative algorithms.

The Rete algorithm is oriented towards scenarios where forward chaining and 'inferencing' are used to calculate new facts from existing facts, or to filter and discard facts in order to arrive at some conclusion. It is also exploited as a reasonably efficient mechanism for performing highly combinatorial evaluations of facts where large numbers of joins must be performed between fact tuples. Other approaches to performing rule evaluation, such as the use of decision trees, or the implementation of sequential engines, may be more appropriate for simple scenarios, and should be considered as possible alternatives.

The following diagram illustrates the basic Rete topography, and shows the associations between different node types and memories.

  • Most implementations use type nodes to perform the first level of selection on n-tuple working memory elements. Type nodes can be considered as specialized select nodes. They discriminate between different tuple relation types.
  • The diagram does not illustrate the use of specialized nodes types such as negated conjunction nodes. Some engines implement several different node specialisations in order to extend functionality and maximise optimisation.
  • The diagram provides a logical view of the Rete. Implementations may differ in physical detail. In particular, the diagram shows ‘dummy’ inputs providing right activations at the head of beta node branches. Engines may implement other approaches, such as adapters that allow alpha memories to perform right activations directly.
  • The diagram does not illustrate all node-sharing possibilities.

For a more detailed and complete description of the Rete algorithm, see chapter 2 of Production Matching for Large Learning Systems by Robert Doorenbos (see link below).

Rete II


In the 1980s, Charles L. Forgy developed a successor to the Rete algorithm, named Rete II [1]. Unlike the original Rete (which is in the public domain), this algorithm was not disclosed. Rete II is claimed to deliver better performance on more complex problems, in some cases by orders of magnitude.


  • Charles Forgy, "A network match routine for production systems." Working Paper, 1974.
  • Charles Forgy, "On the efficient implementation of production systems." Ph.D. Thesis, Carnegie-Mellon University, 1979.
  • Charles Forgy, "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem", Artificial Intelligence, 19, pp 17–37, 1982.


Dendral was an influential pioneer project in artificial intelligence (AI) of the 1960s, and the computer software expert system that it produced. Its primary aim was to help organic chemists identify unknown organic molecules by analyzing their mass spectra and using knowledge of chemistry.[1] It was developed at Stanford University by Edward Feigenbaum, Bruce Buchanan, Joshua Lederberg, and Carl Djerassi.[2] It began in 1965 and spans approximately half the history of AI research.[3]

The software program Dendral is considered the first expert system because it automated the decision-making process and problem-solving behavior of organic chemists.[4] It consists of two sub-programs, Heuristic Dendral and Meta-Dendral.[5] It was written in Lisp, which was considered the language of AI.[6]

Many systems were derived from Dendral, including MYCIN, MOLGEN, MACSYMA, PROSPECTOR, XCON, and STEAMER.

The name Dendral is a portmanteau of the term "Dendritic Algorithm".[7]

Heuristic Dendral


Heuristic Dendral is a program that uses mass spectra or other experimental data, together with a knowledge base of chemistry, to produce a set of possible chemical structures that may be responsible for producing the data.[8] A mass spectrum of a compound is produced by a mass spectrometer, and is used to determine its molecular weight, the sum of the masses of its atomic constituents. For example, the compound water (H2O) has a molecular weight of about 18, since hydrogen has a mass of 1.01 and oxygen 16.00, and its mass spectrum has a peak at 18 units. Heuristic Dendral would use this input mass, together with knowledge of atomic mass numbers and valence rules, to determine the possible combinations of atomic constituents whose mass would add up to 18.[9] As the weight increases and the molecules become more complex, the number of possible compounds increases drastically. Thus, a program that is able to reduce this number of candidate solutions through the process of hypothesis formation is essential.
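A toy version of this first step enumerates elemental compositions whose masses sum to a target molecular weight. Only three elements and rounded integer masses are used here, and valence rules are omitted, so this is far simpler than Dendral itself:

```python
# Toy candidate generation: find all elemental compositions whose integer
# masses sum to a target molecular weight.  Masses are rounded for
# illustration; real spectra use precise masses and valence constraints.
MASSES = {"H": 1, "C": 12, "O": 16}

def compositions(target, elements=("C", "O", "H")):
    """All {element: count} dicts whose total mass equals `target`."""
    if not elements:
        return [{}] if target == 0 else []
    head, rest = elements[0], elements[1:]
    results = []
    for n in range(target // MASSES[head] + 1):
        for tail in compositions(target - n * MASSES[head], rest):
            if n:
                tail = {head: n, **tail}
            results.append(tail)
    return results

# Mass 18: H2O is among the candidates; valence rules would prune the rest.
print(compositions(18))  # includes {'O': 1, 'H': 2}, i.e. water
```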

Meta-Dendral

Meta-Dendral is a knowledge acquisition system that receives the set of possible chemical structures and corresponding mass spectra as input, and proposes a set of hypotheses to explain the correlation between some of the proposed structures and the mass spectrum.[10] These hypotheses would be fed back to Heuristic Dendral to test their applicability.[11] Thus, "Heuristic Dendral is a performance system and Meta-Dendral is a learning system".[12] The program is based on two important features: the plan-generate-test paradigm and knowledge engineering.[13]
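The idea of proposing structure–spectrum correlations can be caricatured in a few lines of Python. The features and peaks here are invented toy data; the real Meta-Dendral reasoned over bond-cleavage processes in chemical graphs, not bare feature sets:

```python
# Toy Meta-Dendral-style rule induction: propose "feature -> peak" rules
# that hold in every example containing the feature. Illustrative only.

def induce_rules(examples):
    """examples: list of (structure features, observed spectral peaks)."""
    features = {f for feats, _ in examples for f in feats}
    peaks = {p for _, pks in examples for p in pks}
    rules = set()
    for f in features:
        # Spectra of all examples whose structure contains feature f.
        with_f = [pks for feats, pks in examples if f in feats]
        for p in peaks:
            if with_f and all(p in pks for pks in with_f):
                rules.add((f, p))
    return rules

examples = [
    ({"OH"}, {31, 46}),          # structure features, spectrum peaks
    ({"OH", "CH3"}, {31, 15}),
    ({"CH3"}, {15}),
]
rules = induce_rules(examples)   # proposes ("OH", 31) and ("CH3", 15)
```

Each proposed rule would then be handed back to the performance system for testing against further data.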

Plan-generate-test paradigm


The plan-generate-test paradigm is the basic organization of the problem-solving method, and is a common paradigm used by both the Heuristic Dendral and Meta-Dendral systems.[14] The generator generates potential solutions for a particular problem, which in Dendral are expressed as chemical graphs.[15] However, this is feasible only when the number of candidate solutions is small. When there are large numbers of possible solutions, Dendral has to find a way to impose constraints that rule out large sets of candidate solutions.[16] This is the primary aim of the Dendral planner, which is a "hypothesis-formation" program that employs "task-specific knowledge to find constraints for the generator".[17] Finally, the tester analyzes each proposed candidate solution and discards those that fail to fulfill certain criteria.[18] This plan-generate-test mechanism is what holds Dendral together.[19]

Knowledge Engineering


The primary aim of knowledge engineering is to attain a productive interaction between the available knowledge base and problem-solving techniques.[20] This is possible through the development of a procedure in which large amounts of task-specific information are encoded into heuristic programs.[21] The first essential component of knowledge engineering is therefore a large "knowledge base." Dendral's knowledge base includes specific knowledge about the mass spectrometry technique, a large amount of information underlying chemistry and graph theory, and any information that might be helpful in solving a particular chemical structure elucidation problem.[22] Through knowledge engineering, Dendral is able to use this knowledge base both to determine the set of possible chemical structures that correspond to the input data, and to form new "general rules" that help it reduce the number of candidate solutions. The user is thus provided with a small set of possible solutions, which a non-expert can then test to find the right one.

Heuristics

A heuristic is a rule of thumb: an algorithm that does not guarantee a solution, but reduces the number of possible solutions by discarding unlikely and irrelevant ones.[23] The use of heuristics to solve problems is called "heuristics programming", which was used in Dendral to replicate in a machine the process by which human experts induce solutions to problems from rules of thumb and specific information.

Heuristics programming was a major approach and a giant step forward in artificial intelligence,[24] as it allowed scientists to finally automate certain traits of human intelligence. It became prominent among scientists in the late 1940s through George Polya’s book, How to Solve It: A New Aspect of Mathematical Method.[25] As Herbert Simon said in The Sciences of the Artificial, "if you take a heuristic conclusion as certain, you may be fooled and disappointed; but if you neglect heuristic conclusions altogether you will make no progress at all."

History

During the mid-20th century, the question "can machines think?" became intriguing and popular among scientists, primarily as an effort to add humanistic characteristics to machine behavior. John McCarthy, one of the leading researchers in this field, coined the term "artificial intelligence" (AI) for this concept of machine intelligence during the Dartmouth summer workshop of 1956. AI is usually defined as the capacity of a machine to perform operations that are analogous to human cognitive capabilities.[26] Much research to create AI was done during the 20th century.

Also around the mid-20th century, science, and especially biology, faced a fast-growing need for a "man-computer symbiosis" to aid scientists in solving problems.[27] For example, the structural analysis of myoglobin, hemoglobin, and other proteins demanded continual instrumentation development because of its complexity.

In the early 1960s, Joshua Lederberg started working with computers and quickly became tremendously interested in creating interactive computers to help him in his exobiology research.[28] Specifically, he was interested in designing computing systems to help him study alien organic compounds.[29] As he was not an expert in either chemistry or computer programming, he collaborated with the Stanford chemist Carl Djerassi to help him with chemistry, and with Edward Feigenbaum on programming, to automate the process of determining chemical structures from raw mass spectrometry data.[30] Feigenbaum was an expert in programming languages and heuristics, and helped Lederberg design a system that replicated the way Carl Djerassi solved structure elucidation problems.[31] They devised a system called Dendritic Algorithm (Dendral) that was able to generate possible chemical structures corresponding to the mass spectrometry data as an output.[32]

At that stage Dendral was still very inaccurate in assessing the spectra of ketones, alcohols, and isomers of chemical compounds.[33] Thus, as seen in figure 1, Djerassi "taught" general rules to Dendral that could help eliminate most of the "chemically implausible" structures, producing a set of structures that could then be analyzed by a "non-expert" user to determine the right structure.[34] Figure 2 shows how Dendral operates without an expert, after all the general rules were entered into Dendral's knowledge base.[35]

The Dendral team recruited Bruce Buchanan to refine Feigenbaum's Lisp program.[36] Buchanan had ideas and interests similar to those of Feigenbaum and Lederberg, but his special interest was hypothesis formation.[37] As Joseph November said in Digitizing Life: The Introduction of Computers to Biology and Medicine, "(Buchanan) wanted the system (Dendral) to make discoveries on its own, not just help humans make them". Buchanan designed Meta-Dendral, a "hypothesis maker".[38] Meta-Dendral and Dendral were fused together to create Heuristic Dendral in 1966–67.[39] The prime difference was that Heuristic Dendral "would serve as a template for similar systems in other areas" rather than concentrating only on the field of organic chemistry.[40]

Notes

  1. November, 2006
  2. Lederberg, 1987
  3. Lindsay et al., 1980
  4. November, 2006
  5. Lindsay et al., 1980
  6. November, 2006
  7. Lindsay et al., 1980
  8. Lindsay et al., 1980
  9. November, 2006
  10. Lindsay et al., 1980
  11. November, 2006
  12. Lindsay et al., 1980
  13. Lindsay et al., 1980
  14. Lindsay et al., 1980
  15. Lindsay et al., 1980
  16. Lindsay et al., 1980
  17. Lindsay et al., 1980
  18. Lindsay et al., 1980
  19. Lindsay et al., 1980
  20. Lindsay et al., 1980
  21. Lindsay et al., 1980
  22. Lindsay et al., 1980
  23. November, 2006
  24. Lindsay et al., 1980
  25. November, 2006
  26. Berk, 1985
  27. Lederberg, 1963
  28. November, 2006
  29. November, 2006
  30. November, 2006
  31. November, 2006
  32. November, 2006
  33. November, 2006
  34. November, 2006
  35. November, 2006
  36. November, 2006
  37. November, 2006
  38. November, 2006
  39. November, 2006
  40. November, 2006

References

  1. Berk, A A. LISP: the Language of Artificial Intelligence. New York: Van Nostrand Reinhold Company, 1985. 1-25.
  2. Lederberg, Joshua. An Instrumentation Crisis in Biology. Stanford University Medical School. Palo Alto, 1963.
  3. Lederberg, Joshua. How Dendral Was Conceived and Born. ACM Symposium on the History of Medical Informatics, 05 Nov. 1987, Rockefeller University. New York: National Library of Medicine, 1987.
  4. Lindsay, Robert K., Bruce G. Buchanan, Edward A. Feigenbaum, and Joshua Lederberg. Applications of Artificial Intelligence for Organic Chemistry: The Dendral Project. McGraw-Hill Book Company, 1980.
  5. November, Joseph A. “Digitizing Life: The Introduction of Computers to Biology and Medicine.” Doctoral dissertation, Princeton University, 2006.


Mycin was an early expert system developed over five or six years in the early 1970s at Stanford University. It was written in Lisp by Edward Shortliffe under the direction of Bruce Buchanan and others; it derived from the earlier Dendral expert system, but considerably modified and extended the basic Dendral software. The system was designed to diagnose infectious blood diseases and recommend antibiotics, with the dosage adjusted for the patient's body weight. The name derives from the antibiotics themselves, as many antibiotics have the suffix "-mycin".

Method

Mycin operated using a fairly simple inference engine and a knowledge base of about 500 rules. It would query the physician running the program with a long series of simple yes/no or textual questions. At the end, it provided a list of candidate bacteria ranked from high to low by the probability of each diagnosis, its confidence in each diagnosis's probability, the reasoning behind each diagnosis (that is, Mycin would also list the questions and rules that led it to rank a diagnosis a particular way), and its recommended course of drug treatment.
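The flavor of Mycin's rule evaluation can be sketched as follows. The rule, evidence names, and certainty factors (CFs) are invented for illustration (real Mycin had ~500 rules written in Lisp), but the minimum-of-premises conjunction, the 0.2 activation threshold, and the combining function cf1 + cf2(1 - cf1) follow the published certainty-factor scheme:

```python
# Toy Mycin-style rule interpreter with certainty factors.

def cf_combine(a, b):
    """Combine two positive CFs supporting the same conclusion."""
    return a + b * (1 - a)

RULES = [
    # (conclusion, premise findings, rule certainty factor) - invented
    ("organism-is-bacteroides",
     ["gram-stain-negative", "morphology-rod", "growth-anaerobic"],
     0.6),
]

def evaluate(findings):
    """findings maps evidence name -> certainty factor in [0, 1]."""
    belief = {}
    for conclusion, premises, rule_cf in RULES:
        # The CF of a conjunction is the minimum of its members' CFs.
        premise_cf = min(findings.get(p, 0.0) for p in premises)
        if premise_cf > 0.2:  # rule fires only above this threshold
            cf = premise_cf * rule_cf
            belief[conclusion] = cf_combine(belief.get(conclusion, 0.0), cf)
    return belief

# Example consultation: strong evidence for an anaerobic gram-negative rod.
belief = evaluate({"gram-stain-negative": 1.0,
                   "morphology-rod": 0.8,
                   "growth-anaerobic": 0.7})
# belief["organism-is-bacteroides"] is 0.7 * 0.6 = 0.42
```

With several rules supporting the same conclusion, cf_combine accumulates their contributions, which is what produced Mycin's ranked list of diagnoses.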

Despite Mycin's success, it is now sometimes used as a cautionary tale in AI courses for people tempted to invent their own ad hoc probability frameworks. Mycin's reasoning hierarchy was limited in depth by the noise introduced by its certainty factor system. This problem can be avoided by using a rigorous probabilistic framework, such as Bayesian statistics.
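The Bayesian alternative updates a genuine probability with Bayes' rule rather than an ad hoc score. This sketch uses invented numbers for a single binary test:

```python
# Bayes' rule for a single binary diagnostic test. Numbers are invented.

def posterior(prior, sensitivity, specificity, positive=True):
    """P(disease | test result), given P(disease) = prior,
    P(positive | disease) = sensitivity,
    P(negative | no disease) = specificity."""
    if positive:
        num = sensitivity * prior
        den = num + (1 - specificity) * (1 - prior)
    else:
        num = (1 - sensitivity) * prior
        den = num + specificity * (1 - prior)
    return num / den

# A positive result raises an invented 10% prior to about 67%.
p = posterior(0.10, sensitivity=0.90, specificity=0.95)
```

Because posteriors remain true probabilities, they can be chained through many layers of evidence without the compounding noise that limited the depth of certainty-factor reasoning.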

Results

Research conducted at the Stanford Medical School found Mycin to have a correct diagnosis rate of about 65%. This was better than most physicians who were not specialists in diagnosing bacterial infections, though worse than the physicians who were themselves experts in the field, whose average correct diagnosis rate was about 80%.

Practical use


Mycin was never actually used in practice. This was not because of any weakness in its performance; in tests it outperformed members of the Stanford medical school. It was as much because of ethical and legal issues surrounding the use of computers in medicine: if the program gives a wrong diagnosis, who can be held responsible? There were also questions about whether human experts would find such a system acceptable to use.

A difficulty that rose to prominence during the development of Mycin and subsequent complex expert systems was extracting the necessary knowledge from the human experts in the relevant fields into the rule base for the inference engine to use (so-called knowledge engineering).


  • The AI Business: The commercial uses of artificial intelligence, ed. Patrick Winston and Karen A. Prendergast. ISBN 0-262-23117-4


External Sources

  • Turing, Alan, Computing Machinery and Intelligence, Mind, 1950, pp. 433–460.
  • Minsky, Marvin, Why People Think Computers Can't, AI Magazine, Vol 3 No 4, Fall 1982.
  • Kurzweil, Ray, The Singularity is Near, Penguin Books, 2005. ISBN 0143037889

Initial outline based in part on:

  • Giarratano, Joseph C., Riley, Gary D., Expert Systems: Principles and Programming, Fourth Edition, Thomson Course Technology, 2005. ISBN 0534384471




The text of this book is released under the following license: