# Data Mining Algorithms In R/Clustering/Bayesian Hierarchical Clustering


## Introduction

Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data.

The traditional method for hierarchically clustering data as given in (Duda & Hart, 1973) is a bottom-up agglomerative algorithm. It starts with each data point assigned to its own cluster and iteratively merges the two closest clusters together until all the data belongs to a single cluster. The nearest pair of clusters is chosen based on a given distance measure (e.g. Euclidean distance between cluster means, or distance between nearest points).
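As a point of reference, this traditional bottom-up procedure is available in base R through hclust(); a minimal sketch with single linkage (distance between nearest points):

```r
# Traditional agglomerative clustering in base R (for later comparison with BHC)
data <- matrix(c(1.0, 1.1, 1.2, 5.0, 5.1, 9.0), ncol = 1)
d <- dist(data)                       # pairwise Euclidean distances
hc <- hclust(d, method = "single")    # merge by distance between nearest points
# cutting the tree at a chosen number of clusters yields a flat partition
groups <- cutree(hc, k = 3)
print(groups)
```

Note that the number of clusters (k = 3 here) must be supplied by the user; the algorithm itself gives no guidance, which is one of the limitations discussed next.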

There are several limitations to the traditional hierarchical clustering algorithm. The algorithm provides no guide to choosing the “correct” number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as images or sequences.

The traditional algorithm does not define a probabilistic model of the data, so it is hard to ask how “good” a clustering is, to compare to other models, to make predictions and cluster new data into an existing hierarchy.

Like other hierarchical clustering algorithms, Bayesian Hierarchical Clustering (BHC) constructs a cluster hierarchy (also called a dendrogram) from bottom to top by merging two clusters at a time. At the beginning (i.e., at the bottom level of the dendrogram), each datum belongs to a cluster whose only member is the datum itself. For every pair of clusters, BHC calculates the probability of merging the pair and selects for the next merge the pair for which this probability is highest. This merge step takes place N − 1 times for a collection of N data points. The last merge produces a single cluster containing the entire data set. Figure 1 shows an example of a dendrogram.
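The greedy agglomeration loop described above can be sketched in R. Here merge.prob is a hypothetical placeholder for the posterior-based merge score defined in the next section; a real implementation would compute it from the formulas given there:

```r
# Sketch of the BHC-style greedy agglomeration loop.
# merge.prob() is a hypothetical stand-in for the Bayesian merge score.
bhc.sketch <- function(data, merge.prob) {
  # start with one singleton cluster per datum (row)
  clusters <- lapply(seq_len(nrow(data)), function(i) data[i, , drop = FALSE])
  while (length(clusters) > 1) {
    best <- c(1, 2); best.p <- -Inf
    # find the pair of clusters with the highest merge score
    for (i in 1:(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        p <- merge.prob(clusters[[i]], clusters[[j]])
        if (p > best.p) { best.p <- p; best <- c(i, j) }
      }
    }
    # replace the pair by its union; this happens N - 1 times in total
    merged <- rbind(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- c(clusters[-best], list(merged))
  }
  clusters[[1]]
}
```

For instance, passing a merge.prob that returns the negative distance between cluster means reduces this sketch to ordinary centroid-style agglomeration.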

File:Formula Bayesian Hierarchical Clustering dendrogram example.jpg Figure 1: Example of a dendrogram.

## The technique

### Algorithm

Formally, BHC selects the cluster pair whose merge results in the maximum value of the posterior probability P(C|D), where D is a collection of data (i.e., D = {d1, d2, …, dN}) and C is a set of clusters (i.e., C = {c1, c2, …}). Each cluster cj ∈ C is a set of data, and the clusters are mutually exclusive. At the initial stage, each cluster is a singleton set: ci = {di} for all i. P(C|D) defines the probability that a collection of data D is classified into a set of clusters C, and maximizing P(C|D) is a generalization of maximum likelihood estimation.

To see the details of the merge process, consider merge step k + 1 (0 ≤ k ≤ N − 1). By step k + 1, the data collection D has been partitioned into a set of clusters Ck; that is, each datum d ∈ D belongs to a cluster c ∈ Ck. The posterior probability at this point becomes

File:Formula Bayesian Hierarchical Clustering part 1.jpg

Here PC (Ck) corresponds to the prior probability that N random data are classified into a set of clusters Ck. This probability is defined as follows:

File:Formula Bayesian Hierarchical Clustering part 2.jpg

SC(c) defines the probability that all the data in a cluster c are produced from the cluster and is defined as

File:Formula Bayesian Hierarchical Clustering part 3.jpg

When the algorithm merges two clusters cx, cy ∈ Ck, the set of clusters Ck is updated as follows:

File:Formula Bayesian Hierarchical Clustering part 4.jpg

After the merge, the posterior probability is inductively updated as

File:Formula Bayesian Hierarchical Clustering part 5.jpg

File:Formula Bayesian Hierarchical Clustering algorithm.jpg Figure 2: Bayesian Hierarchical Clustering Algorithm (pseudocode).

### Implementation

The R package that implements the Bayesian Hierarchical Clustering technique is named bclust; here we describe its usage. It implements agglomerative clustering for high-dimensional data with variable selection.

Description:

The function clusters data stored in a matrix using an additive linear model with disappearing random effects. The model has built-in spike-and-slab components that quantify which variables are important for clustering; these can be extracted using the imp function.

Usage:

```
bclust(x, rep.id = 1:nrow(x), effect.family = "gaussian",
  var.select = TRUE, transformed.par, labels = NULL)
```

Arguments:

x - A numeric matrix, with the clustering individuals in rows and the variables in columns.

rep.id - A vector of positive integers having the same length as the number of rows of x. This vector identifies replicates of a clustering type, such that the total number of clustering types is max(rep.id). If nothing is declared, the function assumes the data are unreplicated, that is, each row of x is a clustering type.

effect.family - Distribution family of the disappearing random components. Choices are "gaussian" or "alaplace" allowing Gaussian or asymmetric Laplace family, respectively.

var.select - A logical value, TRUE for fitting models that define a spike-and-slab distribution at the variable level, allowing Bayesian variable selection.

transformed.par - The transformed model parameters in a vector. The length of the vector depends on the chosen model and the availability of variable selection. The log transformation is supposed to be applied for the variance parameters, the identity for the mean, and the logit for the proportions. The function loglikelihood can be used to estimate them from the data.

labels - A vector of strings referring to the labels of the clustering types. The length of the vector should match max(rep.id). The first element corresponds to the label of the type having the smallest integer value in rep.id, the second element refers to the label of the type having the second smallest integer in rep.id, and so on.

Details:

The function calls internal C functions depending on the chosen model. The C stack of the system may overflow if you have a large dataset. You may need to increase the stack size before running R, using your operating system's command line. On Linux, open a console, type `ulimit -s unlimited`, then run R in the same console. Microsoft Windows users do not need to increase the stack size.

Example:

```
# unreplicated clustering
gaelle.bclust <- bclust(x = gaelle,
  transformed.par = c(-1.84, -0.99, 1.63, 0.08, -0.16, -1.68))
par(mfrow = c(2, 1))
plot(as.dendrogram(gaelle.bclust))
abline(h = gaelle.bclust$cut)
plot(gaelle.bclust$clust.number, gaelle.bclust$logposterior,
  xlab = "Number of clusters", ylab = "Log posterior")
abline(h = max(gaelle.bclust$logposterior))

# replicated clustering
gaelle.id <- rep(1:14, c(3, rep(4, 13))) # first 3 rows are replications of ColWT, the others mutants
gaelle.lab <- c("ColWT", "d172", "d263", "isa2",
  "sex4", "dpe2", "mex1", "sex3", "pgm", "sex1", "WsWT", "tpt", "RLDWT", "ke103")

gaelle.bclust <- bclust(gaelle, rep.id = gaelle.id, labels = gaelle.lab,
  transformed.par = c(-1.84, -0.99, 1.63, 0.08, -0.16, -1.68))
plot(as.dendrogram(gaelle.bclust))
abline(h = gaelle.bclust$cut)
plot(gaelle.bclust$clust.number, gaelle.bclust$logposterior,
  xlab = "Number of clusters", ylab = "Log posterior")
abline(h = max(gaelle.bclust$logposterior))
```

Other functions in the bclust package are useful companions, and here we describe the most important ones:

imp calculates variable and variable-cluster importance

Description:

The function computes the log Bayes factors for the hypothesis H0: the variable or the variable-cluster combination is useful for clustering against H1: the variable or the variable-cluster combination is useless. The Bayes factors are computed for the optimal allocation found by the bclust function.

Usage:

imp(x)

Arguments:

x - A bclustvs object

Value:

var - A vector being the log Bayes factor of dv = 1 against dv = 0, see bclust for details.

varclust - A vector being the log Bayes factor of γc = 1 against γc = 0, see bclust for details.

repno - The number of replicates producing each row of varclust.

labels - The vector of variable labels extracted from the bclustvs object.

order - The order of var useful to sort var, varclust, and labels.
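Putting these pieces together, the importance measures can be extracted from a fitted bclust object and inspected; the following sketch assumes the package's gaelle example data and the value components documented above (var, labels, order), with the barplot being just one way to view the result:

```r
library(bclust)
data(gaelle)
# fit as in the earlier unreplicated example
gaelle.bclust <- bclust(gaelle,
  transformed.par = c(-1.84, -0.99, 1.63, 0.08, -0.16, -1.68))
gaelle.imp <- imp(gaelle.bclust)
# variables sorted from most to least useful for clustering
gaelle.imp$labels[gaelle.imp$order]
# bar plot of the per-variable log Bayes factors
barplot(gaelle.imp$var[gaelle.imp$order],
  names.arg = gaelle.imp$labels[gaelle.imp$order], las = 2)
```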

loglikelihood computes the model log likelihood, useful for estimation of transformed.par

Description:

The function is useful for deriving the maximum likelihood estimates of the model parameters.

Usage:

```
loglikelihood(x.mean, x.css, repno, transformed.par,
  effect.family = "gaussian", var.select = TRUE)
```

Arguments:

x.mean - The mean matrix of the clustering types from the meancss function.

x.css - The corrected sum of squares matrix of the clustering types from the meancss function.

repno - The vector containing the number of replications of each clustering type, corresponding to each row of x.mean and x.css, from the meancss function.

transformed.par - The vector of transformed model parameters that the data likelihood will be evaluated at. The transformation is the log for the variance parameters, the identity for the mean, and the logit for the proportions. The length of the vector depends on the chosen effect.family and var.select.

effect.family - Distribution family of the disappearing random components. Choices are "gaussian" or "alaplace" allowing Gaussian or asymmetric Laplace family, respectively.

var.select - A logical value, TRUE for fitting models that define spike-and-slab in variable level, thus allowing Bayesian variable selection.
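As the description suggests, transformed.par can be estimated by maximizing this log likelihood numerically. A hedged sketch, assuming meancss() returns components named mean, css, and repno (check the package documentation for the exact names), and using a 6-element parameter vector as in the Gaussian examples above:

```r
library(bclust)
data(gaelle)
gaelle.id <- rep(1:14, c(3, rep(4, 13)))
# sufficient statistics (means and corrected sums of squares) per clustering type
mc <- meancss(gaelle, gaelle.id)
# maximize the log likelihood over the transformed parameters
fit <- optim(rep(0, 6), method = "BFGS",
  fn = function(tp) -loglikelihood(x.mean = mc$mean, x.css = mc$css,
    repno = mc$repno, transformed.par = tp))
fit$par  # estimated transformed.par to pass to bclust()
```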

### Visualization

The bclust package has some functions for plotting visualization graphics. Here we demonstrate some examples of their use.

profileplot produces a profile plot useful for visualizing replicated data

Description:

The profile plot is a flexible function creating profiles of replicated data with many useful options. The resulting plot can be attached to other plots like a horizontal dendrogram plot or a teethplot.

Usage:

```
profileplot(x, rep.id, labels = NULL, scale = 1, col.names = colnames(x),
  plot.order = 1:max(rep.id), xlab.mar = 5, ylab.mar = 5,
  xlab.cex = 0.8, ylab.cex = 1, profile.col = rep(1, max(rep.id)),
  blob.matrix = matrix(0, ncol = ncol(x), nrow = max(rep.id)),
  blob.col = rev(heat.colors(max(blob.matrix))), blob.cex = 1.5)
```

Example:

```
data(gaelle)
# take a subset of the gaelle data
subgaelle <- gaelle[1:11, ]

# use thresholds to define blob colors
blob <- matrix(0, nrow(subgaelle), ncol(subgaelle))
blob[abs(subgaelle) <= 0.74] <- 0
blob[abs(subgaelle) > 0.74] <- 1
blob[abs(subgaelle) > 0.94] <- 2
blob[abs(subgaelle) > 1.28] <- 3

# make a profile plot with colored blobs
profileplot(subgaelle, 1:nrow(subgaelle), scale = 10, blob.matrix = blob)

####### attach a profileplot to a teeth plot
subgaelle.bclust <- bclust(subgaelle,
  transformed.par = c(-1.84, -0.99, 1.63, 0.08, -0.16, -1.68))
# divide plot space into two unequal parts
layout(matrix(c(1, 2), 1, 2, byrow = TRUE), c(9, 1), 10, respect = TRUE)
# associate a color with each subject
subgaelle.color <- c(rep("blue", 3), rep("green", 4), rep("magenta", 4))
# find an appropriate order to plot subjects
subgaelle.order <- order.dendrogram(as.dendrogram(subgaelle.bclust))
# leave some space for labels
x.mar <- 7
y.mar <- 5
profileplot(subgaelle, rep.id = 1:nrow(subgaelle), scale = 10,
  profile.col = subgaelle.color, plot.order = subgaelle.order,
  xlab.mar = x.mar, ylab.mar = y.mar)
par(mar = c(x.mar, 0, 0, 0))
teethplot(subgaelle.bclust)

####### now attach it to a dendrogram plot
layout(matrix(c(2, 1), 1, 2, byrow = TRUE), c(2, 8), 10, respect = TRUE)
profileplot(subgaelle, rep.id = 1:nrow(subgaelle), scale = 10,
  profile.col = subgaelle.color, plot.order = subgaelle.order,
  xlab.mar = x.mar, ylab.mar = y.mar)
par(mar = c(x.mar, 0, 0, 0))
plot(as.dendrogram(subgaelle.bclust), horiz = TRUE, yaxs = "i",
  axes = FALSE, leaflab = "none")
abline(v = subgaelle.bclust$cut)
```

teethplot produces a teeth plot useful for demonstrating a grouping of clustered subjects

Description:

The function alone is useless but can be attached to a horizontal dendrogram, a profile plot, or an image plot to show a specified partitioning.

Usage:

```
teethplot(x, teeth.space = 0.25, teeth.lwd = 1)
```

Example:

```
data(gaelle)
gaelle.id <- rep(1:14, c(3, rep(4, 13))) # the first 3 rows are replications of ColWT, the others mutants
gaelle.lab <- c("ColWT", "d172", "d263", "isa2", "sex4", "dpe2", "mex1",
  "sex3", "pgm", "sex1", "WsWT", "tpt", "RLDWT", "ke103")
gaelle.bclust <- bclust(gaelle, rep.id = gaelle.id, labels = gaelle.lab,
  transformed.par = c(-1.84, -0.99, 1.63, 0.08, -0.16, -1.68), var.select = TRUE)

# start plotting
layout(matrix(c(1, 2), 1, 2, byrow = TRUE), c(9, 1), 10, respect = TRUE) # divide plot space into two unequal parts
par(mar = c(0, 0, 0, 2)) # preserve some space for labels in the dendrogram plot
plot(as.dendrogram(gaelle.bclust), horiz = TRUE, yaxs = "i") # plot the dendrogram
abline(v = gaelle.bclust$cut) # show the optimal allocation with a line on the dendrogram
par(mar = c(0, 0, 0, 0)) # we need no margin for the teeth plot
teethplot(gaelle.bclust) # show the optimal allocation using the teeth plot
```

### Case Study

This case study was retrieved from http://spatial.nhh.no/papers/aag04.pdf. It is a draft paper prepared for presentation at the Association of American Geographers 2004 Annual Conference.

#### Scenery

Choropleth map class intervals

Choropleth maps are among the more frequently used graphical display tools needed in applied spatial data analysis. In addition, this Association has published both landmark legacy papers, such as that by Jenks and Caspall (1971), and recent advances proposed by Armstrong, Xiao and Bennett (2003) in this area. They are of interest because of their broad application, because the media and the public see them as being easy to read, and because both the choice of class intervals and fill colors or hatchings can lead to misunderstanding by design or omission.

#### Input Data

Getting spatial data into R

There are two contributed packages on CRAN for reading shapefiles, shapefiles and maptools; here we use maptools, loaded using the library() function. The shapes and the attribute data are read into a "Map" object using compiled code from shapelib. The attribute data is next moved to a data frame for ease of access.

```
> library(maptools)
Shapefile Type: Polygon # of Shapes: 102
> ill59.df <- ill59$att.data
```

The SpatialCls package is not yet released, but is under active development on SourceForge. It provides object classes for spatial data, and it is hoped that other spatial packages will build on these once they stabilise. It also provides functions for transforming (including datum transformation) coordinates, based on PROJ.4 code and other files. In order to specify a "Polylist4" object, we provide as arguments the "Map" object read from the shapefile, a vector of unique identifiers, and a character string defining the spatial reference system in the format used by PROJ.4 functions. These definitions are checked for validity when the "Polylist4" object is created.

```
> library(SpatialCls)
> state <- as.character(ill59.df$STATE)
> county <- as.character(ill59.df$COUNTY)
> county.id <- paste(state, county, sep = "")
> ill59.ll <- Map2Poly4(ill59, county.id, ill.ll.nad83)
+ "+lon_0=-88.33333333333333 +k=0.999975 +x_0=152400.3048006096",
+ "+units=us-mi")
```

#### Execution

While it is obvious that the use of genetic algorithms and a better choice of measures by Armstrong, Xiao and Bennett (2003) is a better and probably more robust approach, the output displayed on Figure 3 does suggest that raiding the machine learning armoury does seem to yield sensible results, looking much more like a natural breaks 5-class partition of the data set than those we have seen so far. The measures are not the strongest seen on any criterion, but the overall impression is subjectively pleasing. Bagged clustering results provide visually intuitive solutions, although they do point up the question of whether choosing five classes is a good solution, so for a single dimensional class interval explorer it may be necessary to jump between discrete numbers of classes as well as optimizing clustering criteria.

To go beyond this, an attempt is made to explore other sets of breaks that can be constructed using bclust() by repeating its application a number of times and varying the number of classes. The numbers of classes (between 4 and 11) are permuted to add an extra level of randomisation. The measure values are stored in res.score in order to be able to choose solutions for display.

```
> set.seed(20040316)
> ks <- sample(c(rep(4:11, 25)))
> n <- length(ks)
> res <- vector(length = n, mode = "list")
> res.score <- matrix(NA, ncol = 5, nrow = n)
> colnames(res.score) <- c("k", "gvf", "tai", "oai", "jc")
> for (i in 1:n) {
+ bc <- bclust(ill59.df$JENKS71, centers = ks[i], verbose = FALSE)
+ res[[i]] <- bc
+ cols <- match(bc$cluster, order(bc$centers))
+ res.score[i, ] <- c(ks[i], jenks.tests(ill59.df$JENKS71,
+ cols, area.m2, ill59.nb))
+ }

> row.names(res.score) <- 1:n
> table(res.score[, 1])

> res.score1 <- unique(res.score, MARGIN = 1)
> table(res.score1[, 1])
```

#### Output

```4 5 6 7 8 9 10 11
25 25 25 25 25 25 25 25
```

File:Map constructed using Hierarchical Clustering.jpg Figure 3: Choropleth map constructed using hierarchical clustering.

#### Analysis

The hierarchical clustering dendrogram from a single bagged clustering, used to combine the output of bootstrap aggregation, can also be examined by plotting the object returned by bclust() to try to establish an acceptable number of classes.

This will not be examined here, but could arguably provide an interesting alternative to the naive, brute-force approach used above. Single runs of bclust() are not very costly in time, so that a "bclust" object might yield choices of bagged clustering class intervals for different numbers of classes, using access functions clusters.bclust() and centers.bclust(), setting the k= argument to the number of classes required.
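Note that the bclust() used in this case study is the bagged-clustering function from the e1071 package (which is what accepts a centers argument, as in the loop above), not the Bayesian bclust package described earlier. Assuming e1071, a single fitted object can be re-cut at different numbers of classes with the access functions just mentioned; synthetic data stand in for the (unavailable) map attributes:

```r
library(e1071)
set.seed(20040316)
x <- matrix(rnorm(200), ncol = 2)  # synthetic stand-in for the attribute data
bc <- bclust(x, centers = 5, verbose = FALSE)
# re-cut the same bagged-clustering tree at different numbers of classes
for (k in 4:6) {
  cl <- clusters.bclust(bc, k)  # class membership for k classes
  ce <- centers.bclust(bc, k)   # class centers for k classes
  cat(k, "classes:", nrow(ce), "centers\n")
}
```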