# Optimal Classification/Printable version

This is the print version of Optimal ClassificationYou won't see this message or any elements not part of the book's content when you print or preview this page. |

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at

https://en.wikibooks.org/wiki/Optimal_Classification

# Rypka Method

## Rypka's Method

editRypka's^{[1]} method^{[2]} utilizes the theoretical and empirical separatory equations shown below to perform the task of optimal classification. The method finds the optimal order of the fewest attributes, which in combination define a **bounded** class of elements.

Application of the method begins with construction of an attribute-valued system in truth table^{[3]} or spreadsheet form with elements listed in the left most column beginning in the second row. Characteristics^{[4]} are listed in the first row beginning in the second column with the title of the attributes in the upper left most cell. Normally the file name of the data is given the title of the element class. The values which connect each characteristic with each element are placed in the intersecting cells. Selecting characteristics which all elements share may be the most difficult part of creating a database which can utilizing this method.

The elements are first sorted in descending order according to their truth table value, which is calculated from the existing sequence and value of characteristics for each element. Duplicate truth table values or multisets for the entire **bounded** class reveal either the need to eliminate duplicate elements or the need to include additional characteristics.

An empirical separatory value is calculated for each characteristic in the set and the characteristic with the greatest empirical separatory value is exchanged with the characteristic which occupies the most significant attribute position.

Next the second most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first characteristic. The characteristic which produces the greatest separatory value is then exchanged with the characteristic which occupies the second most significant attribute position.

Next the third most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first and second characteristics. The characteristic which produces the greatest empirical separatory value is then exchanged with the characteristic which occupies the third most significant attribute position. This procedure may continue until all characteristics have been processed or until one hundred percent separation of the elements has been achieved.

A larger radix will allow faster identification by excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.^{[5]}
What follows is an elucidation of the matrix and separatory equations.^{[6]}

## Examples

edit## Notes and References

edit- ↑ :Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.
- ↑
**Primary reference:**Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification." ISBN 0125448503 - ↑
- ↑ Characteristics and attributes are used interchangeably.
- ↑ See Table II. page 158 of the
**primary reference** - ↑ The
**primary reference**should be consulted for a more detailed and in depth explanation of the method.

# Rypka Method/Equations/Truth Table Size

## Truth Table Size-Related Equations

editThese equations derive the truth table^{[1]} size *from* the highest value of logic (V) and the highest number of characteristics (C} in the target set and in the group of the **bounded** class^{[2]} of elements which denote the range for the truth table values of the set and the group.^{[3]}

### Group size

edit- , where:

is the group size**Q**^{[4]}is the highest value of logic in the group,**V**is the highest number of characteristics in the group.**C**

### Target Set size

edit- , where:

is the target set size**R**^{[5]}is the highest value of logic in the group,**V**is the highest number of characteristics in the target set.**K**

## Notes

edit

# Rypka Method/Equations/Separatory/Elements

## Element Related Equations

edit### Order of elements

editThe elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.^{[1]} The element truth table value allows elements to be sorted and identified as unique or equivalent and the **bounded** class identified as a set or multiset.

- , where:

is the element truth table value in the group,**e**_{i}is the highest value of logic in the group,**V**is the value of logic of each characteristic in the group,**v**is the**j**characteristic index, where:**jth**

and where:**j = 0..C**is the number of characteristics in the group,**C**

is the**i**element index, where:**ith**

and where:**i = 0..G**is the number of elements in the**G****bounded**class.

### Maximum number of pairs of elements to separate

editMaximum number of pairs of elements to separate refers to triangularization of the matrix to permit comparison of each element with every other element to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory, therefore the maximum possible number of pairs that can be separated is determined by the following equation:^{[2]}

- , where:
^{[3]}

is the maximum number of pairs to separate, and**p**_{max}is the number of elements in the**G****bounded**class.

## Notes

edit- ↑ See Number Systems - essentially using the characteristic values to compute a network or memory address, followed by sorting.
- ↑ See page 176 Table XI of the
**primary reference**. - ↑ Triangular Number

# Rypka Method/Equations/Separatory/Characteristic/Theoretical

## Theoretical Separation

edit### The general identification equation

edit- , where:
^{[1]}

is the theoretical separatory value per**S**_{j}characteristic,**jth**is the number of characteristics in the group,**C**is the highest value of logic in the group and**V**is the**j**characteristic index in the target set, where:**jth**

and where:**j = 0..K**is the number of characteristics in the target set.**K**

### Minimal number of characteristics to result in theoretical separation

edit- , where:
^{[2]}

is the minimal number of characteristics to result in theoretical separation,**t**_{min}is the number of elements in the**G****bounded**class andis the highest value of logic in the group.**V**

## Notes

edit

# Rypka Method/Equations/Separatory/Characteristic/Empirical/Target Set

## Target Set Truth Table Value

edit , where:^{[1]}

is the truth table value of the target set,**t**_{i}

is the element's attribute value,**v**_{i,j}

is the**i**th element's index value, where,**i**

where**i = 0...G'**is the number of elements in the**G****bounded**class, and,

is the**j**th characteristic's index value, where,**j**

and where,**j = 0...K**

is the number of characteristics in the target set,**K**

is highest value of logic in the group,**V**is the positional value of the**V**^{(K-j)}th characteristic.**j**

, where,

**n**_{ti}contains the multiset count for each truth table value^{[2]}.

## Notes

edit- ↑ As the characteristic with the greatest separatory value is moved to the next most significant position, K is incremented to expand the target set size from two characteristics to the number of characteristics in the group or the number of characteristics which result in 100% separation. For the initial target set with one characteristic the separatory value is computed individually for each characteristic in the group to find the initial characteristic with the highest separatory value.
- ↑ coefficient of association, , see page 172 of the
**primary reference**

# Rypka Method/Equations/Separatory/Characteristic/Empirical/Separation

## Separation Stages

edit### Initial separation

edit , where:^{[1]}

is the initial empirical separatory value for each characteristic, where,**S**_{j}

and is the index of the**j = 0...C**th characteristic in the group and**j**is the number of characteristics in the group, and,**C**and is the truth table value of the**l = 0...R**th characteristic, where**j**is the truth table size, where:**R**=**R****V**, and,^{0}is the highest value of logic in the group and,**V**is the target set exponent for a single characteristic, and,**0**

is the number of elements in the**G****bounded**class.

### Subsequent separation

edit, where:

is the initial empirical separatory value for each characteristic, where,**S**_{j}

and is the target set truth table index value, where**l = 0...R**is the target set truth table size value, where:**R**=**R****V**, and,^{K}is the highest value of logic in the group and,**V**is the number of characteristics in the target set, and,**K**

is the number of elements in the**G****bounded**class.

## Notes

edit- ↑ Must be applied initially to each characteristic. (The equations have been implemented in Mathcad and using the Visual Basic programming language. See Application_Example

# Rypka Method/Examples/Computational

**Bounded** Class Data

edit
**Bounded** Class Dimensions

edit
* G = 28 * - 28 elements -

**i = 0...G-1**^{[1]}

* C = 10* - 10 characteristics or attributes -

**j = 0...C-1*** V = 5* - 5 valued logic -

**l = 0...V-1**## Order of Elements

edit## Separatory Values

edit

max(T) = 309 = S_{8} = highest initial separatory value

## Notes

edit- ↑ Mathcad's ORIGIN function applies to all arrays such that if more than one array is being used and one array requires a zero origin then the other arrays must use a zero origin with all variables being adapted as well.

# Rypka Method/Examples/Application

#### Flag Recognition

editAlthough the following example lacks an intuitive sense of optimization it serves as a good example of how optimization can reduce the number of queries. While the method of optimal classification is highly beneficial for reducing the number of queries required for manual identification, automated identification may be better served by use of a neural network.

##### Flag overlay grid

edit*Designated areas (*characteristics

*) for sampling background*states

**colors**(*) of all*elements

**flags**(*).*

##### Data set

editFLAGS/LOC,A,B,C,D,E,F,G,H,I BELGIUM,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE FRANCE,BLUE,WHITE,RED,BLUE,WHITE,RED,BLUE,WHITE,RED GERMANY,BLACK,BLACK,BLACK,RED,RED,RED,YELLOW,YELLOW,YELLOW IRELAND,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE ITALY,GREEN,WHITE,RED,GREEN,WHITE,RED,GREEN,WHITE,RED JAPAN,WHITE,WHITE,WHITE,WHITE,RED,WHITE,WHITE,WHITE,WHITE LUXEMBOURG,RED,RED,RED,WHITE,WHITE,WHITE,BABY,BABY,BABY NETHERLANDS,RED,RED,RED,WHITE,WHITE,WHITE,BLUE,BLUE,BLUE SPAIN,RED,RED,RED,YELLOW,YELLOW,YELLOW,RED,RED,RED HELIGOLAND,GREEN,GREEN,GREEN,RED,RED,RED,WHITE,WHITE,WHITE

##### Original order

edit##### Systematic query

editStarting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G" where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish the identity of the flag in our possession as belonging to the Netherlands.

##### Optimized order

edit##### Minimized query

editThe results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The method has minimized the number of necessary queries required to identify the flag by optimizing the order of characteristics. (Please note that there may be more than one minimal solution.)

##### Computer program GUI

edit##### Identification by Polychotomous Key

edit