Optimal Classification/Printable version
This is the print version of Optimal Classification You won't see this message or any elements not part of the book's content when you print or preview this page. |
The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Optimal_Classification
Rypka Method
Rypka's Method
editRypka's[1] method[2] utilizes the theoretical and empirical separatory equations shown below to perform the task of optimal classification. The method finds the optimal order of the fewest attributes, which in combination define a bounded class of elements.
Application of the method begins with construction of an attribute-valued system in truth table[3] or spreadsheet form with elements listed in the left most column beginning in the second row. Characteristics[4] are listed in the first row beginning in the second column with the title of the attributes in the upper left most cell. Normally the file name of the data is given the title of the element class. The values which connect each characteristic with each element are placed in the intersecting cells. Selecting characteristics which all elements share may be the most difficult part of creating a database which can utilizing this method.
The elements are first sorted in descending order according to their truth table value, which is calculated from the existing sequence and value of characteristics for each element. Duplicate truth table values or multisets for the entire bounded class reveal either the need to eliminate duplicate elements or the need to include additional characteristics.
An empirical separatory value is calculated for each characteristic in the set and the characteristic with the greatest empirical separatory value is exchanged with the characteristic which occupies the most significant attribute position.
Next the second most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first characteristic. The characteristic which produces the greatest separatory value is then exchanged with the characteristic which occupies the second most significant attribute position.
Next the third most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first and second characteristics. The characteristic which produces the greatest empirical separatory value is then exchanged with the characteristic which occupies the third most significant attribute position. This procedure may continue until all characteristics have been processed or until one hundred percent separation of the elements has been achieved.
A larger radix will allow faster identification by excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.[5] What follows is an elucidation of the matrix and separatory equations.[6]
Examples
editNotes and References
edit- ↑ :Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.
- ↑ Primary reference: Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification." ISBN 0125448503
- ↑
- ↑ Characteristics and attributes are used interchangeably.
- ↑ See Table II. page 158 of the primary reference
- ↑ The primary reference should be consulted for a more detailed and in depth explanation of the method.
Rypka Method/Equations/Truth Table Size
Truth Table Size-Related Equations
editThese equations derive the truth table[1] size from the highest value of logic (V) and the highest number of characteristics (C} in the target set and in the group of the bounded class[2] of elements which denote the range for the truth table values of the set and the group.[3]
Group size
edit- , where:
- Q is the group size[4]
- V is the highest value of logic in the group,
- C is the highest number of characteristics in the group.
Target Set size
edit- , where:
- R is the target set size[5]
- V is the highest value of logic in the group,
- K is the highest number of characteristics in the target set.
Notes
edit
Rypka Method/Equations/Separatory/Elements
Element Related Equations
editOrder of elements
editThe elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.[1] The element truth table value allows elements to be sorted and identified as unique or equivalent and the bounded class identified as a set or multiset.
- , where:
- ei is the element truth table value in the group,
- V is the highest value of logic in the group,
- v is the value of logic of each characteristic in the group,
- j is the jth characteristic index, where:
- j = 0..C and where:
- C is the number of characteristics in the group,
- j = 0..C and where:
- i is the ith element index, where:
- i = 0..G and where:
- G is the number of elements in the bounded class.
- i = 0..G and where:
Maximum number of pairs of elements to separate
editMaximum number of pairs of elements to separate refers to triangularization of the matrix to permit comparison of each element with every other element to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory, therefore the maximum possible number of pairs that can be separated is determined by the following equation:[2]
- , where:[3]
- pmax is the maximum number of pairs to separate, and
- G is the number of elements in the bounded class.
Notes
edit- ↑ See Number Systems - essentially using the characteristic values to compute a network or memory address, followed by sorting.
- ↑ See page 176 Table XI of the primary reference.
- ↑ Triangular Number
Rypka Method/Equations/Separatory/Characteristic/Theoretical
Theoretical Separation
editThe general identification equation
edit- , where:[1]
- Sj is the theoretical separatory value per jth characteristic,
- C is the number of characteristics in the group,
- V is the highest value of logic in the group and
- j is the jth characteristic index in the target set, where:
- j = 0..K and where:
- K is the number of characteristics in the target set.
- j = 0..K and where:
Minimal number of characteristics to result in theoretical separation
edit- , where:[2]
- tmin is the minimal number of characteristics to result in theoretical separation,
- G is the number of elements in the bounded class and
- V is the highest value of logic in the group.
Notes
edit
Rypka Method/Equations/Separatory/Characteristic/Empirical/Target Set
Target Set Truth Table Value
edit, where:[1]
- ti is the truth table value of the target set,
- vi,j is the element's attribute value,
- i is the ith element's index value, where,
- i = 0...G' where G is the number of elements in the bounded class, and,
- j is the jth characteristic's index value, where,
- j = 0...K and where,
- K is the number of characteristics in the target set,
- V is highest value of logic in the group,
- V(K-j) is the positional value of the jth characteristic.
, where,
- nti contains the multiset count for each truth table value[2].
Notes
edit- ↑ As the characteristic with the greatest separatory value is moved to the next most significant position, K is incremented to expand the target set size from two characteristics to the number of characteristics in the group or the number of characteristics which result in 100% separation. For the initial target set with one characteristic the separatory value is computed individually for each characteristic in the group to find the initial characteristic with the highest separatory value.
- ↑ coefficient of association, , see page 172 of the primary reference
Rypka Method/Equations/Separatory/Characteristic/Empirical/Separation
Separation Stages
editInitial separation
edit, where:[1]
- Sj is the initial empirical separatory value for each characteristic, where,
- j = 0...C and is the index of the jth characteristic in the group and C is the number of characteristics in the group, and,
- l = 0...R and is the truth table value of the jth characteristic, where R is the truth table size, where:
- R = V0, and,
- V is the highest value of logic in the group and,
- 0 is the target set exponent for a single characteristic, and,
- R = V0, and,
- G is the number of elements in the bounded class.
Subsequent separation
edit, where:
- Sj is the initial empirical separatory value for each characteristic, where,
- l = 0...R and is the target set truth table index value, where R is the target set truth table size value, where:
- R = VK, and,
- V is the highest value of logic in the group and,
- K is the number of characteristics in the target set, and,
- R = VK, and,
- G is the number of elements in the bounded class.
Notes
edit- ↑ Must be applied initially to each characteristic. (The equations have been implemented in Mathcad and using the Visual Basic programming language. See Application_Example
Rypka Method/Examples/Computational
Bounded Class Data
editBounded Class Dimensions
editG = 28 - 28 elements - i = 0...G-1[1]
C = 10 - 10 characteristics or attributes - j = 0...C-1
V = 5 - 5 valued logic - l = 0...V-1
Order of Elements
editSeparatory Values
edit
max(T) = 309 = S8 = highest initial separatory value
Notes
edit- ↑ Mathcad's ORIGIN function applies to all arrays such that if more than one array is being used and one array requires a zero origin then the other arrays must use a zero origin with all variables being adapted as well.
Rypka Method/Examples/Application
Flag Recognition
editAlthough the following example lacks an intuitive sense of optimization it serves as a good example of how optimization can reduce the number of queries. While the method of optimal classification is highly beneficial for reducing the number of queries required for manual identification, automated identification may be better served by use of a neural network.
Flag overlay grid
editDesignated areas (characteristics) for sampling background colors (states) of all flags (elements).
The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until an optimal empirical separatory value is obtained.Data set
editFLAGS/LOC,A,B,C,D,E,F,G,H,I BELGIUM,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE FRANCE,BLUE,WHITE,RED,BLUE,WHITE,RED,BLUE,WHITE,RED GERMANY,BLACK,BLACK,BLACK,RED,RED,RED,YELLOW,YELLOW,YELLOW IRELAND,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE ITALY,GREEN,WHITE,RED,GREEN,WHITE,RED,GREEN,WHITE,RED JAPAN,WHITE,WHITE,WHITE,WHITE,RED,WHITE,WHITE,WHITE,WHITE LUXEMBOURG,RED,RED,RED,WHITE,WHITE,WHITE,BABY,BABY,BABY NETHERLANDS,RED,RED,RED,WHITE,WHITE,WHITE,BLUE,BLUE,BLUE SPAIN,RED,RED,RED,YELLOW,YELLOW,YELLOW,RED,RED,RED HELIGOLAND,GREEN,GREEN,GREEN,RED,RED,RED,WHITE,WHITE,WHITE
Original order
editSystematic query
editStarting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G" where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish the identity of the flag in our possession as belonging to the Netherlands.
Optimized order
editMinimized query
editThe results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The method has minimized the number of necessary queries required to identify the flag by optimizing the order of characteristics. (Please note that there may be more than one minimal solution.)
Computer program GUI
editIdentification by Polychotomous Key
edit