Data Mining Algorithms In R/Clustering/CLARA

An obvious way of clustering larger datasets is to try and extend existing methods so that they can cope with a larger number of objects. The focus is on clustering large numbers of objects rather than a small number of objects in high dimensions. Kaufman and Rousseeuw (1990) suggested the CLARA (Clustering for Large Applications) algorithm for tackling large applications. CLARA extends their k-medoids approach for a large number of objects. It works by clustering a sample from the dataset and then assigns all objects in the dataset to these clusters.

Technique To Be Discussed

This work is focused on CLARA, a technique for clustering largers datasets.

Algorithm

edit
Symbols Definitions
D Data set to be clustered
n Number of objects in D
O_i Object i in D
k Number of clusters
S A sample of D
s Size of S

Table 1: Summary of symbols and definitions

CLARA (Clustering LARge Applications) relies on the sampling approach to handle large data sets. Instead of finding medoids for the entire data set, CLARA draws a small sample from the data set and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set D and the medoid of its cluster, defined as the following cost function:

 

where M is a set of selected medoids, dissimilarity(Oi, Oj) is the dissimilarity between objects Oi and Oj, and rep(M, Oi) returns a medoid in M which is closest to Oi.

To alleviate sampling bias, CLARA repeats the sampling and clustering process a pre-defined number of times and subsequently selects as the final clustering result the set of medoids with the minimal cost. Assume q to be the number of samplings. The CLARA algorithm is detailed in Figure 1.

File:Algorithm CLARA.png
Figure 1: Clara Algorithm

Since CLARA adopts a sampling approach, the quality of its clustering results depends greatly on the size of the sample. When the sample size is small, CLARA’s efficiency in clustering large data sets comes at the cost of clustering quality.

Implementation

edit

In order to use the CLARA algorithm in R, one must install cluster package. This package includes a function that performs the CLARA process.

Install cluster package

install.packages("cluster")

Import Contents

library("cluster")

The CLARA function, provided by the cluster package, might be used as follow:

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE, 
keep.data = medoids.x, rngR = FALSE)

where the arguments are:

  • x: Data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.
  • k: Integer, the number of clusters. It is required that 0 < k < n where n is the number of observations (i.e., n = nrow(x)).
  • metric: Character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences.
  • stand: Logical, indicating if the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
  • samples: Integer, number of samples to be drawn from the dataset.
  • sampsize: Integer, number of observations in each sample. sampsize should be higher than the number of clusters (k) and at most the number of observations (n = nrow(x)).
  • trace: Integer indicating a trace level for diagnostic output during the algorithm.
  • medoids.x: Logical indicating if the medoids should be returned, identically to some rows of the input data x. If FALSE, keep.data must be false as well, and the medoid indices, i.e., row numbers of the medoids will still be returned (i.med component), and the algorithm saves space by needing one copy less of x.
  • keep.data: Logical indicating if the (scaled if stand is true) data should be kept in the result. Setting this to FALSE saves memory (and hence time), but disables clusplot()ing of the result. Use medoids.x = FALSE to save even more memory.
  • rngR: Logical indicating if R's random number generator should be used instead of the primitive clara()-builtin one. If true, this also means that each call to clara() returns a different result – though only slightly different in good situations.

View

edit

There are actually two ways of viewing the result of a CLARA use. Both of them use the object of class clara returned by the function application.

The first way is to plot the object, creating a chart that represents the data. Thus, if there are N objects divided into K clusters, the chart must contain N points representing the objects, and those points must be colored in K different colors, each one representing a cluster set. For example, given the object clarax, which is a result of the function clara application, all one has to do in order to plot the object is:

plot(clarax)

The second way of viewing the result of a CLARA application is to simply print the components of the object of class clara. For example, given the same object clarax of the previous example, one could print its components using:

print(clarax)

Example

Suppose we have 500 objects and each object have two attributes (or features). Our goal is to group these objects into K=2 groups based on their two features. The function CLARA can be used to define the groups as follow:

## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)), cbind(rnorm(300,50,8), rnorm(300,50,8)))
## run clara
clarax <- clara(x, 2)
clarax
clarax$clusinfo

## print components of clarax
print(clarax)

## plot clusters
plot(x, col = clarax$cluster)
## plot centers
points(clarax$centers, col = 1:2, pch = 8)

Result of printing components of clarax:

Call:    clara(x = x, k = 2) 
Medoids:
          [,1]       [,2]
[1,]  1.091033 -0.5367556
[2,] 51.044099 51.0638017
Objective function:      9.946085
Clustering vector:       int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes:           200 300 
Best sample:
 [1]   6  45  51  56  67  75  85  90  94  97 110 111 160 170 176 181 201 219 249
[20] 260 264 275 296 304 317 319 337 340 361 362 369 370 374 379 397 398 411 420
[39] 422 424 436 448 465 489

Available components:
 [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
 [6] "clusinfo"   "diss"       "call"       "silinfo"    "data" 

Result of plotting "clarax"


Figure 2: Result of plotting clarax

Case study

edit

In this section, we illustrate a case study using CLARA.

Scenario

edit

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

Input Data

edit

A data frame with 50 observations on 4 variables.

[,1] Murder numeric Murder arrests (per 100,000)
[,2] Assault numeric Assault arrests (per 100,000)
[,3] UrbanPop numeric Percent urban population
[,4] Rape numeric Rape arrests (per 100,000)
State Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Hawaii 5.3 46 83 20.2
Idaho 2.6 120 54 14.2
Illinois 10.4 249 83 24.0
Indiana 7.2 113 65 21.0
Iowa 2.2 56 57 11.3
Kansas 6.0 115 66 18.0
Kentucky 9.7 109 52 16.3
Louisiana 15.4 249 66 22.2
Maine 2.1 83 51 7.8
Maryland 11.3 300 67 27.8
Massachusetts 4.4 149 85 16.3
Michigan 12.1 255 74 35.1
Minnesota 2.7 72 66 14.9
Mississippi 16.1 259 44 17.1
Missouri 9.0 178 70 28.2
Montana 6.0 109 53 16.4
Nebraska 4.3 102 62 16.5
Nevada 12.2 252 81 46.0
New Hampshire 2.1 57 56 9.5
New Jersey 7.4 159 89 18.8
New Mexico 11.4 285 70 32.1
New York 11.1 254 86 26.1
North Carolina 13.0 337 45 16.1
North Dakota 0.8 45 44 7.3
Ohio 7.3 120 75 21.4
Oklahoma 6.6 151 68 20.0
Oregon 4.9 159 67 29.3
Pennsylvania 6.3 106 72 14.9
Rhode Island 3.4 174 87 8.3
South Carolina 14.4 279 48 22.5
South Dakota 3.8 86 45 12.8
Tennessee 13.2 188 59 26.9
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
Vermont 2.2 48 32 11.2
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6

Table 2: USArrests Database

Execution

edit

The function "clara" was used as follows:

## import data
x <- USArrests

## run CLARA
clarax <- clara(x[1:4], 3)

## print components of clarax
print(clarax)

## plot clusters
plot(x, col = clarax$cluster)
## plot centers
points(clarax$centers, col = 1:2, pch = 8)
    1. plot(Assault, Murder)

(USArrests) points(254,11.1, pch=16) text(254,11.11, labels ='New York') lines(Assault, (.63168 + (.04191 * Assault)))

Output

edit

The result of printing the components of the class returned by the function application is shown below:

Call:    clara(x = x[1:4], k = 3) 
Medoids:
         Murder Assault UrbanPop Rape
Michigan   12.1     255       74 35.1
Missouri    9.0     178       70 28.2
Nebraska    4.3     102       62 16.5
Objective function:      29.31019
Clustering vector:       Named int [1:50] 1 1 1 2 1 2 3 1 1 2 3 3 1 3 3 3 3 1 ...
 - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" ...
Cluster sizes:           16 14 20 
Best sample:
 [1] Alabama        Alaska         Arizona        Arkansas       California    
 [6] Colorado       Delaware       Florida        Georgia        Idaho         
[11] Illinois       Indiana        Iowa           Kansas         Kentucky      
[16] Louisiana      Maine          Maryland       Massachusetts  Michigan      
[21] Minnesota      Mississippi    Missouri       Montana        Nebraska      
[26] Nevada         New Hampshire  New York       North Carolina North Dakota  
[31] Ohio           Oklahoma       Oregon         Pennsylvania   Rhode Island  
[36] South Carolina South Dakota   Tennessee      Texas          Utah          
[41] Vermont        Virginia       Washington     West Virginia  Wisconsin     
[46] Wyoming       

Available components:
 [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
 [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"  

The result of plotting the class returned by the function application it is shown below:


Figure 3: Results of the example

Analysis

edit

The implementation of CLARA generated three clusters, relatively homogeneous, consisting of 16, 14 and 20 countries. Analyzing the cluster means, we can relate each group with each of the three classes of states:

  • The cluster formed by Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, South Carolina has the highest Murder, Assault and Rape arests (per 100,00) and, not least, the largest population.
  • The cluster formed by Arkansas, Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, Wyoming has the intermediate Murder, Assault and Rape arests (per 100,00) and, not least, the largest population.
  • The cluster formed by Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, Wisconsin has the lowest Murder, Assault and Rape arests (per 100,00) and, not least, the largest population.

Analyzing, based on [3], the states of the two extreme clusters (1,3) it was possible to verify that there is a reason for each country to be in these groups. California, although has a good Human Development Index and Median Personal Earnings rate, has the 3rd biggest Unemployment Rate in the USA, the 2nd is Michigan and the 1st is Nevada, two other states that are also in the cluster one. Connecticut has the highest Human Development Index and is on the cluster three. Wyoming has the best percentage of people with High School Diploma, and is on the cluster two. Others reasons can be verified checking this work together with [3].

References

edit
  1. Chih-Ping, Wei, Yen-Hsien, Lee, and Che-Ming, Hsu, Empirical Comparison of Fast Clustering Algorithms for Large Data Sets. [1]
  2. The R Development Core Team, R: A Language and Environment for Statistical Computing. [2]
  3. American Human Development Project, Mapping the Measure of America [3]
  4. Kaufman, L., Rousseeuw, P. J., Clustering by Means of Medoids.