Data Mining Algorithms In R/Clustering/Fuzzy Clustering - Fuzzy C-means

Introduction

edit

The aim of clustering is to minimize a set of data points into self-similar groups such that the points that belong to the same group are more similar than the points belonging to different groups. Each group is called a cluster.

This text will appear one more of the algorithms discussed in the literature known as Fuzzy C-Means. After the presentation of the technique will be presented an R package that implements this algorithm. Finally, a case study will be made using the algorithm implemented and presented the results obtained.

Technique/Algorithm

edit

Fuzzy C-means (FCM----Frequently C Methods) is a method of clustering which allows one point to belong to one or more clusters. The method was developed by Dunn in 1973 and improved by Bezdek in 1981 and it is frequently used in pattern recognition. The FCM algorithm attempts to partition a finite collection of points into a collection of C fuzzy clusters with respect to some given criteria. Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. The FCM algorithm is based on minimization of the following objective function:

 

Algorithm

edit

The FCM is also known as fuzzy c-means nebulous because it uses fuzzy logic [Zadeh 1965] so that each instance is not associated with only one cluster, but has a certain degree of membership for each of the existing centroids. For this, the algorithm creates a matrix U associativity, where each term μij represents the degree of membership of sample i to cluster j. In the FCM algorithm have a variable fuzziness m such that 1.0 < m <   where m and being a real number. The closer m is to infinity ( ), the greater the fuzziness of the solution and the closer to 1, the solution becomes increasingly similar to the clustering of binary k-means [Bezdek 1981]. A good choice is to set m = 2.0 [Hathaway and Bezdek 2001].

You can see both the k-means and FCM together in the same pseudo-code described in (Algorithm 1). In it, we have the k-means or FCM only by changing the formula to calculate the terms μij, changing the average fuzzy [Zadeh 1965] for a binary choice, showing that FCM is indeed the K-Means cloudy.

In (Algorithm 1), the manner stating that   is the distance away Euclidean to the a to b taking as input: set of samples xi (1 < i < N), plain number of clusters K factor cloudiness me a factor of Tolerance, we leave on the: a cluster vector ci (1 < i < K) and a matrix U determines the associativity of each sample with each of the clusters. It should be noted that the values of the matrix U depend only on H (array that stores the distances are the examples of clusters) and the value of m. The upgrade of the clusters depend solely on the values of the iteration matrix U in iteration.

Implementation

edit

The algorithm described above was implemented by the R package "e1071" released in 21/04/2010. This package has GPL-2 and can be found in the CRAN repository.

Install CBA Package

install.packages("e1071")

Importing methods and algorithm

library("e1071")

Usage:

cmeans(x, centers, iter.max = 100, verbose = FALSE,
       dist = "euclidean", method = "cmeans", m = 2,
       rate.par = NULL, weights = 1, control = list())

Arguments:

• x: The data matrix where columns correspond to variables and rows to observations.
• centers: Number of clusters or initial values for cluster centers.
• iter.max: Maximum number of iterations.
• verbose: If TRUE, make some output during learning.
• dist: Must be one of the following: If "euclidean", the mean square error,
if "manhattan", the mean absolute error is computed. Abbreviations are also accepted.
• method:  If "cmeans", then we have the c-means fuzzy clustering method, 
if "ufcl" we have the on-line update. Abbreviations are also accepted.
• m: A number greater than 1 giving the degree of fuzzification.
• rate.par: A number between 0 and 1 giving the parameter of the learning rate for the 
on-line variant. The default corresponds to 0:3.
• weights: a numeric vector with non-negative case weights. Recycled to the 
number of observations in x if necessary.
• control: a list of control parameters. See Details.


If everything goes ok, an object fclust is returned. This object has the following components:

• centers: the final cluster centers.
• size: the number of data points in each cluster of the closest hard clustering.
• cluster: a vector of integers containing the indices of the clusters where the data points are assigned 
to for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.
• iter: the number of iterations performed.
• membership: a matrix with the membership values of the data points to the clusters.
• withinerror: the value of the objective function.
• call: the call used to create the object.

View

edit

There is one way to show the result from this algorithm. That way would be printing the object fclust :

print(fclust)

Example

edit

To illustrate the method implemented by the package studied, we have created two examples. The first one is created an array of two dimensions and in the second example creates an array with three dimensions. In both examples we used a normal distribution with a mean ranging and using a standard deviation of 0.3.

Example 1

# a 2-dimensional example
x<-rbind(matrix(rnorm(100,sd=0.3),ncol=2),
matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
cl<-cmeans(x,2,20,verbose=TRUE,method="cmeans",m=2)
print(cl)

Example 2

# a 3-dimensional example
x<-rbind(matrix(rnorm(150,sd=0.3),ncol=3),
matrix(rnorm(150,mean=1,sd=0.3),ncol=3),
matrix(rnorm(150,mean=2,sd=0.3),ncol=3))
cl<-cmeans(x,6,20,verbose=TRUE,method="cmeans")
print(cl)


Output - Example 1


Output - Example 2

Case Study

edit

The topics below describe the case study to demonstrate the results obtained using the Fuzzy C-Means.

Scenario

edit

In the proposed case study, we used the database available in packages Iris R. This database consists of 150 instances each containing features of flower petals. Each instance of a database has its classification. There are three different classifications for the instances of the database (setosa, versicolor and virginica). Each of the classes present in the database studied has 50 examples that in total, summarize the 150 instances of the base. In addition to the definitions of classes in each of these instances, the other features are:

Datasets

edit

The database of iris used in the experiment can be charged in the language R using the following command.

data(iris)

The table below has a small sample of the data contained in the iris database.

Table 1: Samples of iris database

Instance Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 layak
2 4.9 3.0 1.4 0.2 Tidak layak
3 4.7 3.2 1.3 0.2 layak
72 6.1 2.8 4.0 1.3 layak
73 6.3 2.5 4.9 1.5 layak
114 5.7 2.5 5.0 2.0 layak
115 5.8 2.8 5.1 2.4 layak

Execution

edit

To generate the results using the Fuzzy C-Means was run the script below.

data(iris)
x<-rbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length)
x<-t(x)
result<-cmeans(x,3,50,verbose=TRUE,method="cmeans")
print(result)
s3d <- scatterplot3d(result$membership, color=result$cluster, type="h", 
angle=55, scale.y=0.7, pch=16, main="Pertinence")
plot(iris, col=result$cluster)

Output

edit

The results obtained by running the script are shown below.

Fuzzy c-means clustering with 3 clusters

Cluster centers:
      [,1]     [,2]     [,3]
1 5.003653 3.412805 1.484775
2 5.874034 2.760272 4.382520
3 6.793622 3.054510 5.644347

Memberships:
                  1            2            3
  [1,] 0.9964733414 2.388793e-03 1.137865e-03
  [2,] 0.9730096494 1.850758e-02 8.482767e-03
  [3,] 0.9776389508 1.515266e-02 7.208389e-03
  [4,] 0.9635322892 2.503070e-02 1.143701e-02
  [5,] 0.9939984763 4.051202e-03 1.950322e-03
  [6,] 0.9304507689 4.703382e-02 2.251542e-02
  [7,] 0.9775132049 1.523242e-02 7.254371e-03
  [8,] 0.9999369153 4.314160e-05 1.994308e-05
  [9,] 0.9225703038 5.279889e-02 2.463081e-02
 [10,] 0.9834280681 1.141773e-02 5.154205e-03
 [11,] 0.9636309639 2.453957e-02 1.182947e-02
 [12,] 0.9914862878 5.851313e-03 2.662399e-03
 [13,] 0.9693327053 2.101145e-02 9.655842e-03
 [14,] 0.9162524471 5.600693e-02 2.774062e-02
 [15,] 0.8773228961 7.968730e-02 4.298980e-02
 [16,] 0.8300898328 1.098729e-01 6.003725e-02
 [17,] 0.9444844043 3.671434e-02 1.880126e-02
 [18,] 0.9964733414 2.388793e-03 1.137865e-03
 [19,] 0.8917869903 7.312086e-02 3.509215e-02
 [20,] 0.9768481880 1.559135e-02 7.560459e-03
 [21,] 0.9638052097 2.505889e-02 1.113590e-02
 [22,] 0.9862983266 9.267056e-03 4.434618e-03
 [23,] 0.9544955743 3.001379e-02 1.549064e-02
 [24,] 0.9879996300 8.348552e-03 3.651818e-03
 [25,] 0.9619796590 2.665281e-02 1.136753e-02
 [26,] 0.9700883809 2.080884e-02 9.102779e-03
 [27,] 0.9978125723 1.505741e-03 6.816871e-04
 [28,] 0.9927414381 4.946076e-03 2.312486e-03
 [29,] 0.9931235860 4.671305e-03 2.205109e-03
 [30,] 0.9770882461 1.581612e-02 7.095630e-03
 [31,] 0.9761442368 1.652997e-02 7.325795e-03
 [32,] 0.9748518908 1.716740e-02 7.980706e-03
 [33,] 0.9320017983 4.510961e-02 2.288859e-02
 [34,] 0.8927033428 7.017637e-02 3.712029e-02
 [35,] 0.9834280681 1.141773e-02 5.154205e-03
 [36,] 0.9833813658 1.121223e-02 5.406405e-03
 [37,] 0.9595420958 2.713202e-02 1.332588e-02
 [38,] 0.9926144115 4.984365e-03 2.401224e-03
 [39,] 0.9331691292 4.525365e-02 2.157722e-02
 [40,] 0.9984835370 1.037187e-03 4.792755e-04
 [41,] 0.9942973683 3.839814e-03 1.862818e-03
 [42,] 0.8379960489 1.102335e-01 5.177041e-02
 [43,] 0.9474894758 3.543286e-02 1.707767e-02
 [44,] 0.9966468801 2.299980e-03 1.053140e-03
 [45,] 0.9422485213 3.983938e-02 1.791210e-02
 [46,] 0.9693327053 2.101145e-02 9.655842e-03
 [47,] 0.9736866788 1.782324e-02 8.490078e-03
 [48,] 0.9713101708 1.953226e-02 9.157566e-03
 [49,] 0.9741716274 1.744780e-02 8.380577e-03
 [50,] 0.9970704014 1.996528e-03 9.330710e-04
 [51,] 0.0396263985 3.645217e-01 5.958519e-01
 [52,] 0.0318692598 7.303043e-01 2.378264e-01
 [53,] 0.0257989245 2.759514e-01 6.982496e-01
 [54,] 0.0547597033 8.587710e-01 8.646929e-02
 [55,] 0.0257236267 7.190557e-01 2.552207e-01
 [56,] 0.0044884340 9.781328e-01 1.737878e-02
 [57,] 0.0312129593 6.547331e-01 3.140540e-01
 [58,] 0.2958341637 5.694232e-01 1.347426e-01
 [59,] 0.0303580096 6.398237e-01 3.298183e-01
 [60,] 0.0880779511 8.134764e-01 9.844565e-02
 [61,] 0.2205273013 6.298451e-01 1.496276e-01
 [62,] 0.0105098293 9.591134e-01 3.037677e-02
 [63,] 0.0462415004 8.537411e-01 1.000174e-01
 [64,] 0.0127683343 8.793406e-01 1.078911e-01
 [65,] 0.1097850604 7.908698e-01 9.934511e-02
 [66,] 0.0439788476 6.323928e-01 3.236284e-01
 [67,] 0.0142403104 9.357246e-01 5.003511e-02
 [68,] 0.0107489244 9.647242e-01 2.452687e-02
 [69,] 0.0297161608 8.212906e-01 1.489932e-01
 [70,] 0.0472514422 8.832597e-01 6.948890e-02
 [71,] 0.0244685053 7.865167e-01 1.890147e-01
 [72,] 0.0231707557 9.204749e-01 5.635439e-02
 [73,] 0.0242412459 6.647915e-01 3.109672e-01
 [74,] 0.0115014923 8.931765e-01 9.532196e-02
 [75,] 0.0252735484 8.457138e-01 1.290126e-01
 [76,] 0.0367090402 7.041271e-01 2.591639e-01
 [77,] 0.0295101840 4.167745e-01 5.537153e-01
 [78,] 0.0196749600 2.703807e-01 7.099444e-01
 [79,] 0.0046165774 9.710517e-01 2.433168e-02
 [80,] 0.1233870645 7.695546e-01 1.070584e-01
 [81,] 0.0763633248 8.316087e-01 9.202798e-02
 [82,] 0.0956781157 8.038125e-01 1.005094e-01
 [83,] 0.0317355256 9.149948e-01 5.326970e-02
 [84,] 0.0237392176 6.474085e-01 3.288522e-01
 [85,] 0.0279975037 8.909775e-01 8.102497e-02
 [86,] 0.0346333192 7.957193e-01 1.696474e-01
 [87,] 0.0327144512 4.847696e-01 4.825159e-01
 [88,] 0.0287006017 8.325287e-01 1.387707e-01
 [89,] 0.0265872664 9.220512e-01 5.136156e-02
 [90,] 0.0425465997 8.901942e-01 6.725921e-02
 [91,] 0.0165454512 9.380639e-01 4.539066e-02
 [92,] 0.0126391727 8.984548e-01 8.890606e-02
 [93,] 0.0217893662 9.356063e-01 4.260431e-02
 [94,] 0.2778346035 5.864740e-01 1.356914e-01
 [95,] 0.0130250339 9.574755e-01 2.949948e-02
 [96,] 0.0143369500 9.506283e-01 3.503480e-02
 [97,] 0.0098869130 9.658287e-01 2.428443e-02
 [98,] 0.0128272275 9.306614e-01 5.651133e-02
 [99,] 0.3958967535 4.819128e-01 1.221905e-01
[100,] 0.0138782641 9.568112e-01 2.931057e-02
[101,] 0.0168213369 1.202409e-01 8.629378e-01
[102,] 0.0261693250 7.099212e-01 2.639094e-01
[103,] 0.0064283347 4.003438e-02 9.535373e-01
[104,] 0.0121558071 1.363357e-01 8.515085e-01
[105,] 0.0051285341 4.386983e-02 9.510016e-01
[106,] 0.0380603711 1.582822e-01 8.036574e-01
[107,] 0.0796613905 7.682136e-01 1.521250e-01
[108,] 0.0215250742 1.079049e-01 8.705700e-01
[109,] 0.0133896976 1.083710e-01 8.782393e-01
[110,] 0.0222927779 1.077325e-01 8.699748e-01
[111,] 0.0188704621 2.634073e-01 7.177222e-01
[112,] 0.0170114260 2.579486e-01 7.250400e-01
[113,] 0.0012069886 1.088884e-02 9.879042e-01
[114,] 0.0272794217 7.782927e-01 1.944279e-01
[115,] 0.0260263274 7.022101e-01 2.717635e-01
[116,] 0.0143300831 1.808070e-01 8.048629e-01
[117,] 0.0055448160 6.051225e-02 9.339429e-01
[118,] 0.0542553271 1.919346e-01 7.538101e-01
[119,] 0.0522340053 2.006703e-01 7.470957e-01
[120,] 0.0331219678 6.903576e-01 2.765205e-01
[121,] 0.0016396213 1.177291e-02 9.865875e-01
[122,] 0.0232292512 8.358772e-01 1.408936e-01
[123,] 0.0446065451 1.785215e-01 7.768719e-01
[124,] 0.0214638942 6.565434e-01 3.219927e-01
[125,] 0.0033893690 2.584416e-02 9.707665e-01
[126,] 0.0114583443 6.335614e-02 9.251855e-01
[127,] 0.0173352405 7.863541e-01 1.963106e-01
[128,] 0.0207474129 7.187218e-01 2.605308e-01
[129,] 0.0101190052 1.107064e-01 8.791746e-01
[130,] 0.0076951023 4.751066e-02 9.447942e-01
[131,] 0.0203964684 1.059183e-01 8.736853e-01
[132,] 0.0542244082 1.915598e-01 7.542157e-01
[133,] 0.0101190052 1.107064e-01 8.791746e-01
[134,] 0.0209695496 4.545456e-01 5.244849e-01
[135,] 0.0248052462 2.990893e-01 6.761055e-01
[136,] 0.0299588767 1.357829e-01 8.342583e-01
[137,] 0.0163979126 1.472580e-01 8.363440e-01
[138,] 0.0087535054 9.693234e-02 8.943141e-01
[139,] 0.0169168345 8.303003e-01 1.527828e-01
[140,] 0.0037050972 3.198944e-02 9.643055e-01
[141,] 0.0006389323 5.579855e-03 9.937812e-01
[142,] 0.0153630352 1.530446e-01 8.315924e-01
[143,] 0.0261693250 7.099212e-01 2.639094e-01
[144,] 0.0036930153 2.507113e-02 9.712359e-01
[145,] 0.0033893690 2.584416e-02 9.707665e-01
[146,] 0.0106923302 1.279688e-01 8.613389e-01
[147,] 0.0250155507 5.900274e-01 3.849571e-01
[148,] 0.0138756337 2.012898e-01 7.848346e-01
[149,] 0.0230709595 2.493458e-01 7.275832e-01
[150,] 0.0261065981 6.399365e-01 3.339569e-01

Closest hard clustering:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [46] 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2
 [91] 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3
[136] 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2

Available components:
[1] "centers"     "size"        "cluster"     "membership"  "iter"        "withinerror"
[7] "call"   

The chart below shows the relevance of each party to each of the clusters generated. Each cluster generated can be identified with a color (green, red or black).

 

The graphs below show the classification of each instance by comparing each pair of attributes present in the database.

 

Analysis

edit

Application of Fuzzy C-Means algorithm allowed a homogeneous grouping of classes as expected. Soon, the three generated classes have a very similar amount of instances present. The algorithm presented in addition to the class that was ranked a given instance, the relevance of this instance to that class. This information allows the person responsible for analyzing the results can devote their attention to the proceedings that do not have a high degree of relevance to a particular class.

The instances of the database analyzed that do not have a high degree of relevance, this degree being defined by the user should be analyzed for Chace whether they really belong to the class informed by the algorithm. In the chart presented pertinencia realize that some instances of Class 2, marked in red do not have a high amount of relevance. This possibly indicates that the algorithm may have erred in its classification as the values of the attributes of these instances do not identify with a high degree of certainty these instances.

References

edit

J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57

J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York Tariq Rashid: “Clustering”

L. A. Zadeh (1965): "Fuzzy sets and systems". In: Fox J, editor. System Theory. Brooklyn, NY: Polytechnic Press, 1965: 29–39.

Iris Data Base in: http://archive.ics.uci.edu/ml/datasets/Iris

The R Project for Statistical Computing in: http://www.r-project.org/

W. Meira; M. Zaki: Fundamentals of Data Mining Algorithms in: http://www.dcc.ufmg.br/miningalgorithms/DokuWiki/doku.php