Data Mining Algorithms In R/Clustering/CLUES
Introduction
Cluster analysis, the organization of patterns into clusters based on similarity (or dissimilarity) measures, is an unsupervised technique widely applied across a broad range of disciplines. It has many applications in data mining, where large data sets need to be partitioned into smaller, homogeneous groups. Clustering techniques are used in many fields, such as artificial intelligence, pattern recognition, economics, biology and marketing, and their importance grows as the amount of data and the processing power of computers increase.
clues: Nonparametric Clustering Based on Local Shrinking
The R package clues aims to estimate the number of clusters and, at the same time, obtain a partition of the data set via local shrinking. The shrinking procedure in clues is based on the mean-shift algorithm, but it relies on a K-nearest-neighbor approach rather than kernel functions. The value of K starts at a small number and increases gradually until a measure of clustering strength, the CH index or the Silhouette index, is optimized. A major contribution of the CLUES algorithm is its ability to identify and deal with irregular elements. To help validate the quality of the estimated number of clusters and of the clustering itself, five indices are available to support decision making.
Algorithm
The CLUES (CLUstEring based on local Shrinking) algorithm consists of three procedures:
- Shrinking
- Partition
- Determination of optimal K
Shrinking
In the shrinking procedure, the data set is calibrated in a way that pushes each data point towards its focal point, which serves as the cluster center or mode of the probability density function. The number K is chosen iteratively and, owing to the robustness of the median, each data point is moved to the element-wise median of its K nearest neighbors, determined under a dissimilarity measure, either Euclidean distance or Pearson correlation.
For this process, a stopping rule needs to be set by the user, beyond which further iterations no longer contribute significantly to accuracy. After this shrinking procedure, the mutual gaps in the data become apparent.
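As an illustration, the following is a minimal sketch of a single shrinking pass under Euclidean distance. It is not the package's internal code; shrink_once is a hypothetical helper name, and the point itself is counted among its own K nearest neighbors here.
# One shrinking pass: replace each point by the element-wise median
# of its K nearest neighbors under Euclidean distance (illustrative only).
shrink_once <- function(y, K) {
  d <- as.matrix(dist(y))                    # pairwise Euclidean distances
  t(apply(d, 1, function(row) {
    nn <- order(row)[1:K]                    # indices of the K nearest neighbors
    apply(y[nn, , drop = FALSE], 2, median)  # element-wise median of the neighbors
  }))
}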
Partitioning
The partition procedure uses the calibrated data obtained from shrinking in place of the original data set. Partitioning starts by picking an arbitrary data point and moving to its nearest fellow point, recording the distance between them; the same move is then applied to that fellow point. Selection is without replacement: once a data point has been picked, it is not picked again in that run. To separate the groups, a threshold equal to the mean of the recorded distances plus 1.5 times their interquartile range is introduced. Whenever a move surpasses this threshold, a new group is created with an incremented group index.
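A minimal sketch of this chain-and-cut idea, assuming Euclidean distance, is given below; partition_sketch is a hypothetical helper name, and the chain simply starts at the first row.
# Walk a nearest-neighbor chain without replacement, record the jump
# distances, then cut the chain wherever a jump exceeds
# mean + 1.5 * IQR of all jumps (illustrative only).
partition_sketch <- function(y) {
  n <- nrow(y)
  d <- as.matrix(dist(y))
  diag(d) <- Inf                            # a point is never its own fellow
  ord <- integer(n); jumps <- numeric(n - 1)
  ord[1] <- 1                               # arbitrary starting point
  visited <- rep(FALSE, n); visited[1] <- TRUE
  for (i in 2:n) {
    cand <- which(!visited)                 # selection is without replacement
    nxt <- cand[which.min(d[ord[i - 1], cand])]
    jumps[i - 1] <- d[ord[i - 1], nxt]
    ord[i] <- nxt; visited[nxt] <- TRUE
  }
  cutoff <- mean(jumps) + 1.5 * IQR(jumps)  # mean + 1.5 * interquartile range
  grp <- cumsum(c(1, jumps > cutoff))       # increment the group index at big jumps
  mem <- integer(n); mem[ord] <- grp        # map back to the original row order
  mem
}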
Optimal K
The optimal K is found by optimizing the chosen strength measure, either the CH index or the Silhouette index. A speed factor f, by default 0.05, is introduced to reduce computation. Users may modify this factor, but should be aware of the large computation time associated with large factors. The choice of K has little effect on the clustering result as long as it lies in a neighborhood of the optimal K, and the factor 0.05 is chosen to avoid additional computation that would not substantially affect the outcome.
To strengthen the power of the calculation, K is initialized to fn (where n is the number of observations) and incremented by fn at each subsequent iteration, using the calibrated data from the previous step as the new data.
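Putting the pieces together, a rough sketch of this loop might look as follows. It assumes the hypothetical shrink_once and partition_sketch helpers from above and borrows the silhouette index from the cluster package; it is a sketch of the described procedure, not the package's implementation.
library(cluster)  # for silhouette()

choose_K <- function(y, f = 0.05, itmax = 20) {
  n <- nrow(y)
  step <- max(1, floor(f * n))          # K starts at f*n and grows by f*n
  best <- list(strength = -Inf, K = NA, mem = NULL)
  for (K in seq(step, by = step, length.out = itmax)) {
    if (K >= n) break                   # K cannot exceed the number of points
    y <- shrink_once(y, K)              # calibrated data feeds the next pass
    mem <- partition_sketch(y)
    if (length(unique(mem)) > 1) {      # the silhouette needs at least 2 clusters
      s <- mean(silhouette(mem, dist(y))[, "sil_width"])
      if (s > best$strength) best <- list(strength = s, K = K, mem = mem)
    }
  }
  best
}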
Implementation
Installation
R > install.packages("clues")
Usage
library("clues")
clues procedure
clues(y, n0, alpha, eps, itmax, K2.vec, strengthMethod, strengthIni, disMethod, quiet)
Parameters
- y: data matrix which is an R matrix object (for dimension > 1) or vector object (for dimension=1) with rows being observations and columns being variables.
- n0: a guess for the number of clusters. Default value is 5.
- alpha: speed factor. Default set as 0.05.
- eps: a small positive number. A value is regarded as zero if it is less than ‘eps’. Default value is 1.0e-4.
- itmax: maximum number of iterations allowed. Default is 20.
- K2.vec: range for the number of nearest neighbors for the second pass of the iteration. Default is n0 (5).
- strengthMethod: specifies the preferred measure of the strength of the clusters (i.e., compactness of the clusters). Two available methods are “sil” (Silhouette index) and “CH” (CH index).
- strengthIni: initial value for the lower bound of the measure of the strength for the clusters. Any negative values will do.
- disMethod: specification of the dissimilarity measure. The available measures are “Euclidean” and “1-corr”.
- quiet: logical. Indicates if intermediate results should be output.
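For reference, a call that spells out the defaults documented in the list above (the example output below suggests that "sil" and "Euclidean" are the default choices) would be:
R > res <- clues(y, n0 = 5, alpha = 0.05, eps = 1.0e-4, itmax = 20,
+               strengthMethod = "sil", disMethod = "Euclidean")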
Values
This section lists the values that can be viewed when running clues.
- K: number of nearest neighbors used to obtain the final clustering.
- size: vector of cluster sizes (the number of data points in each cluster).
- mem: vector of the cluster membership of data points. The cluster membership takes values: 1, 2, ..., g, where g is the estimated number of clusters.
- g: an estimate of the number of clusters.
- CH: CH index value for the final partition if ‘strengthMethod’ is “CH”.
- avg.s: average of the Silhouette index values for the final partition if ‘strengthMethod’ is “sil”.
- s: vector of Silhouette indices for the data points if ‘strengthMethod’ is “sil”.
- K.vec: number of nearest neighbors used for each iteration.
- g.vec: number of clusters obtained in each iteration.
- myupdate: logical. Indicates if the partition obtained in the first pass is the same as that obtained in the second pass.
- y.old1: data used for shrinking and clustering.
- y.old2: data returned after shrinking and clustering.
- y: a copy of the data from the input.
- strengthMethod: a copy of the strengthMethod from the input.
- disMethod: a copy of the dissimilarity measure from the input.
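Once a result object has been produced (say res, as in the example below), these components can be read off directly, for instance:
R > res$g      # estimated number of clusters
R > res$size   # cluster sizes
R > res$mem    # cluster membership of each data point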
Example
We will show an example of how to run clues using the Maronna data set, a simulated data set with 4 slightly overlapping clusters in two-dimensional space, each containing 50 data points. The data are drawn from 4 bivariate normal distributions with identity covariance matrix and mean vectors μ = {(0,0), (4,0), (1,6), (5,7)}.
R > data(Maronna)
R > maronna <- Maronna$maronna
R > res <- clues(maronna, quiet = TRUE) # run clues
The results are shown below:
R > summary(res)
Number of data points:
[1] 200
Number of variables:
[1] 2
Number of clusters:
[1] 4
Cluster sizes:
[1] 53 47 50 50
Strength method:
[1] "sil"
avg Silhouette:
[1] 0.5736749
dissimilarity measurement:
[1] "Euclidean"
Plotting the results, we can see in Figure 1 the four clusters found by the clues algorithm.
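A quick look at the partition needs nothing beyond base R graphics. A minimal sketch, assuming the res object from the example above, colors and marks the points by their cluster membership res$mem:
R > plot(maronna, col = res$mem, pch = res$mem,
+        xlab = "x1", ylab = "x2",
+        main = "CLUES partition of the Maronna data")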