Data Mining Algorithms In R/Clustering/Proximus

With the availability of large-scale computing platforms for high-fidelity design and simulations, and instrumentation for gathering scientific as well as business data, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional data sets. These data sets may comprise discrete attributes, such as those from business processes, information retrieval, and bioinformatics, as well as continuous attributes such as those in scientific simulations, astrophysical measurements, and engineering design.

Analysis of high-dimensional data typically takes the form of extracting correlations between data items, discovering meaningful information in data, clustering data items, and finding efficient representations for clustered data, classification, and event association. Since the volume (and dimensionality) of data is typically large, the emphasis of new algorithms must be on efficiency and scalability to large data sets.

Technique/Algorithm edit

In this section, we focus on Proximus. The intended area of application is the compression of high-dimensional binary data into representative patterns. For instance, purchase incidence (market basket data) or term-document matrices may be preprocessed by Proximus for later association rule mining. In the next subsection, we give a brief explanation of how the algorithm works.

Algorithm edit

The Proximus algorithm cluster the rows of a logical matrix. The compression rate of the algorithm can be influenced by the choice of the maximum cluster radius and the minimum cluster size.

The algorithm is of a recursive partitioning type. Specifically, at each step a binary split is attempted using a local rank-one approximation of the current submatrix (row set). That is a specialization of principal components to binary data which represents a matrix as the outer product of two binary vectors. The node expansion stops if a submatrix is pure, i.e., the column (presence set) vector indicates all the rows and the Hamming distances from the row (dominant attribute set) pattern vector, or the size of the row set, are less than or equal the specified threshold. In the case the rank-one approximation does not result in a split but the radius constraint is violated, the matrix is split using a random row and the radius constraint.

The figure shows the recursive structure of proximus, where A represents the original data matrix. Each rectangular internal node is a rank-one approximation and two circular children of these nodes are the matrices that result from partitioning of parent matrix based on this approximation. Leaves of the recursion tree correspond to final decomposition. The overall decomposition is , where and .


Implementation edit

Proximus is part of cba package. In this section you find more information about installing and using it on R Environment.

Type the following command in R console to install cba package

install.packages("cba")

Type the following command in R console to load the package

library("cba")

The Proximus function, provided by the cba package, might be used as follow

proximus(x, max.radius = 2, min.size = 1, min.retry = 10, max.iter = 16, debug = FALSE) 

where arguments are:

x: a logical matrix. 
max.radius: the maximum number of bits a member in a row set may deviate from its dominant pattern. 
min.size: the minimum split size of a row set. 
min.retry: number of retries to split a pure rank-one approximation (translates into a resampling rate). 
max.iter: the maximum number of iterations for finding a local rank-one approximation. 
debug: optional debugging output.

An object of class proximus with the following components:

nr: the number of rows of the data matrix.
nc: the number of columns of the data matrix.
a: a list containing the approximations (patterns).
a$x: a vector of row (presence set) indexes.
a$y: a vector of column (dominant attribute set) indexes.
a$n: the number of ones in the approximated submatrix.
a$c: the absolute error reduction by the approximation.
max.radius: see arguments.
min.size: see arguments.
rownames: rownames of the data matrix.
colnames: colnames of the data matrix.

View edit

There is one way to show the result from this algorithm. That way would be typing:

summary(ProximusObject)

Example edit

Here we have an example of proximus algorithm processing. The example is quite simple and gives an idea of how it works. Basically, a uniform logical matrix is generated. Then proximus does a rank-4 aproximattion of original logical matrix.

x <- rlbmat() 
pr <- proximus(x, max.radius=8, debug=TRUE) 
op <- par(mfrow=c(1,2), pty="s") 
lmplot(x, main="Data") 
box() 
lmplot(fitted(pr)$x, main="Approximation") 
box() 
par(op) 

Output

Case Study edit

In this section, we illustrate a case study with Proximus:

Scenario edit

In this section, we use PROXIMUS to cluster terms in hypothetical database to extract semantic information. It means, we want to know what are the main subjects of documents.

Suppose there's a library with some documents and we want to divide these documents in categories. We can describe each document as a set of words that occurs on that document.

Data edit

We represent the data as a binary term-document matrix by mapping terms to rows and columns to documents, so that a TRUE entry in the matrix indicates the existence of a word in the corresponding document.

The table below shows 14 terms and 10 documents. Clearly, there are two groups of words: those that are related to computers and those that are related to authors. The word love is isolated.

Execution edit

R Code:

x <- matrix(scan("matriz.txt",what=logical(0),n = 14*10), 14, 10, byrow = TRUE)
pr <- proximus(x, max.radius=8, debug = TRUE)
summary(pr)

Output edit

The result of clustering the matrix of terms and documents is shown below:

Analysis edit

As you can see in the output, there's a table with the result of clustering. Clearly, there are three groups of words (each line). The first group is related to computers. The second one is related to authors. The third group represents the isolated word love. The column named size indicates how many words belongs to the same group. Proximus algorithm clustered the words as follows: Group 1 (computers) = {intel, computer, software, linux, windows, Firefox, explorer, programming} Group 2 (authors) = {kuth, shakespeare, grisham, asimov, book} Group 3 (noise) = {love}

References edit

  1. Meira Jr., W.; Zaki, M. Fundamentals of Data Mining Algorithms. [1]
  2. CBA R package. [2]
  3. Koyuturk; Mehmet and Grama; Ananth. PROXIMUS: a framework for analyzing very high dimensional discrete-attributed dataset.

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 147-156, 2003.