Data Mining Algorithms In R/Clustering/RockCluster
Clustering is a useful technique for grouping data points such that points within a single group/cluster have similar characteristics ( or are close to each other), while points in different groups are dissimilar.
We describe the ROCK (RObust Clustering using linKs) clustering algorithm which belongs to the class of agglomerative hierarchical clustering algorithms.
The steps involved in clustering using ROCK are described in the following figure. After drawing a random sample from the database, a hierarchical clustering algorithm that employs links is applied to the sampled points. Finally, the clusters involving only the sampled points are used to assign the remaining data points on disk to the appropriate clusters. In the following subsections, we first describe the steps performed by ROCK in greater detail.
ROCK`s hierarchical clustering algorithm is presented in the following figure. It accepts as input the set S of N sampled points to be clustred (that are drawn randomly from the original data set), and the number of desired clusters k. The procedure begins by computing the number of links between pairs of points in Step 1. Initially, each point is separate cluster. For each cluster i, we build a local heap q[i] and maintain the heap during the execution of the algorithm. q[i] contains every cluster j such that link[i,j] is non-zero. The clusters j in q[i] are ordered in the decreasing order of the goodness measure with respect to i, g(i,j).
In addition to the local heaps q[i] for each cluster i, the algorithm also maintains an additional global heap Q that contains all the clusters. Furthermore, the clusters in Q are ordered in the decreasing order of their best goodness measures. Thus, g(j,max(q[j])) is used to order the various clusters j in Q, where max(q[j]), the max element in q[j], is the best cluster to merge with cluster j. At each step, the max cluster j in Q and the max cluster q[j] are the best pair of clusters to be merged.
Computation of LinksEdit
For every point, after computing a list of its neighbors, the algorithm considers all pairs of its neighbors. For each pair, the point contributes one link. If the process is repeated for every point and the link count is incremented for each pair of neighbors, then at the end, the link counts for all pairs of points will be tained. If Mi is the size of the neighbor list for point i, then for point i, we have to increase the link count by one in M^2i entries. This, the complexity of the algorithm is the sum of M^2i which is O(N * Mm * Ma), where Ma and Mm are the average and maximum number of the neighbors for a point, respectively. In the worst case, the value of Mm can be n in which case the complexity of the algorithm becomes O(Ma * N^2). In practice, we expect Mm to be reasonably close to Ma and thus, for these cases, the complexity of the algorithm reduces to O(M^2a * n) on average.
In order to use the ROCK algorithm in R, one must install "CBA" package. This package includes a function that performs the RockCluster process.
Install CBA Package
Importing methods and algorithm
rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE) rockLink(x, beta = 0.5)
X: a data matrix; for rockLink an object of class dist. n: the number of desired clusters. beta: optional distance threshold. theta: neighborhood parameter in the range [0,1). fun: distance function to use. funArgs: a list of named parameter argu
If everything goes ok, an object rockCluster is returned. This object has the following components:
x: the data matrix or a subset of it. cl: a factor of cluster labels. size: a vector of cluster sizes. beta: see above. theta: see above. rockLink: returns an object of class dist.
There is one way to show the result from this algorithm. That way would be printing the object RockCluster:
For an example, we will use the algorithm with the "Mushroom" dataset provided by the CBA package:
data("Mushroom") x <- as.dummy(Mushroom[-1]) rc <- rockCluster(x[sample(dim(x),1000),], n=10, theta=0.8) print(rc) rp <- predict(rc, x) table(Mushroom$class, rp$cl)
Output - Mushroom
In this section, we illustrate a case study with RockCluster:
Historically the U.S.A elections are characterized by two major political parties. One called Republican Party generally reflecting the American conservatism in the political spectrum and the other called Democratic Party known as more "liberal" or "progressive".
The idea is to use an database of the United States Cogressional votes provided by UCI Machine Learning Repository and perform the RockCluster technique to separate Democrats from Republicans.
The Congressional voting dataset was obtained from the UCI Machine Learning Repository. It is the United States Cogressional Voting Records in 1984. Each record corresponds to one Congress man's votes on 16 issues ( E.g., education spending, crime). All attributes are boolean with Yes (that is, 1) and No (0) values, and very few contain missing values. A classification label of Republican or Democrat is provided with each data record. The data set contains records for 168 Republicans and 267 Democrats.
data(Votes) x <- as.dummy(Votes[-17]) rc <- rockCluster(x, n=2, theta=0.73, debug=TRUE) print(rc) rf <- fitted(rc) table(Votes$Class, rf$cl)
The result of printing the components of the class returned by the function application is shown below:
As the table illustrates, ROCK and the traditional algorithm, both identify two clusters, one containing a large number of republicans and the other containing a majority of democrats. However, in the cluster for republicans found by the traditional algorithm, around 25% of the members are democrats, while with ROCK, only 12% are democrats. The improvement in the quality of clustering can be attributed to the usage of links by ROCK.