Data Mining Algorithms In R/Clustering/Biclust

Introduction

Over the last decade, bicluster methods have become more and more popular in different ﬁelds of two way data analysis and a wide variety of algorithms and analysis methods have been published.

Biclustering is an important new technique in two way data analysis. After Cheng and Church (2000) followed the initial bicluster idea of Hartigan (1972) and started to calculate bicluster on microarray data, a wide range of different articles were published dealing with different kinds of algorithms and methods to preprocess and analyze the results of such methods. Comparisons of several bicluster algorithms can be found, e.g., in Madeira and Oliveira (2004) or Prelic et al. (2006).

Why Biclustering?

Simultaneous clustering of 2 dimensions;
Large datasets where clustering leads to diffuse results;
Only parts of the data inﬂuence each other;

Initial Situation

Two-Way Dataset:

Goal

Finding subgroups of rows and columns which are as similar as possible to each other and as different as possible to the rest.

Technique/Algorithm

Algorithm

Sebastian Kaiser and Friedrich Leisch started to implement a comprehensive bicluster toolbox in R (R Development Core Team, 2007).

It provides a growing list of bicluster methods, together with pre-processing and visualization techniques, using S4 classes and methods (Chambers, 1998). The software is open source and freely available from R-Forge at http://R-Forge.R-project.org.

One of the main design principles of the package is to provide the results as an entity of Biclust-Class, an S4-class containing all information needed for postprocessing of results.

It consists of the four slots Parameters, RowxNumber, NumberxCol and Number. Slot Parameters contains parameters and algorithm used, Number the number of biclusters found. The RowxNumber and NumberxCol slots represents the biclusters that have been found. They are both logical matrices of dimension (rows of data × number of biclusters found) with a TRUE-value in RowxNumber[i,j] if row i is in bicluster j. NumberxCol is the same for the columns, but due to computational reasons, here the rows of the matrix represent the number of biclusters and the columns represent the columns of the data. So by simply calling data [ Biclust@RowxNumber[,a] * Biclust@NumberxCol[a,] ] the values of the bicluster a can be extracted.

Objects of class Biclust-class are created using a uniform interface for all bicluster methods by calls of form biclust(x,method=BiclustMethod,...).

This generic function takes as inputs the preprocessed data matrix x, a bicluster algorithm represented as a Biclustmethod-Class and additional arguments. In the following we give a brief description of the ﬁve algorithms already implemented in the package, subsection headings correspond to the name of the respective Biclustmethod-Class. The naming scheme is BCxxx where xxx is an abbreviation for the name of the algorithm. Some methods have been chosen because open source code from the original authors is available, others have been newly implemented to make the overall toolbox as comprehensive as possible. Of course, there is always room for improvement, and more methods will be added to the package in the future. See also van Mechelen and Schepers (2006) for a discussion on main directions of bicluster calculation. Algorithms are described in alphabetic order and, if not stated otherwise, functions were implemented in interpreted S code.

Implementation

Install

In order to use the Biclust algorithm in R, one must install Biclust package and library:

install.packages("biclust")
library("biclust")

Usage

# S4 method for signature 'matrix,BCBimax': 
biclust(x, method=BCBimax(), minr=2, minc=2, number=100)
# S4 method for signature 'matrix,BCrepBimax': 
biclust(x, method=BCrepBimax(), minr=2, minc=2, number=100, maxc=12)

Where the arguments are:

x - A logical matrix which represents the data.
method - Here BCBimax, to perform Bimax algorithm.
minr - Minimum row size of resulting bicluster.
minc - Minimum column size of resulting bicluster.
number - Number of Bicluster to be found.
maxc - Maximum column size of resulting bicluster.

If everything goes OK, an object Biclust is returned.

View

As we can see below, an object Biclust is returned and we can plot it or just get the final object.

> test <- matrix(rbinom(400, 50, 0.4), 20, 20)
> res1 <- biclust(test, method=BCCC(), delta=1.5, alpha=1, number=10)
> res1

An object of class Biclust 

call:
	biclust(x = test, method = BCCC(), delta = 1.5, alpha = 1, number = 10)

Number of Clusters found:  10 

First  5  Cluster sizes:
                   BC 1 BC 2 BC 3 BC 4 BC 5
Number of Rows:    "7"  "4"  "6"  "6"  "5" 
Number of Columns: "5"  "8"  "4"  "5"  "6"

We'll give some examples of plots in the "Case Study" session.

Case Study

Scenario

As a standard example we ran all the algorithms on the BicatYeast data from Barkow et al. (2006). To do so the data has to be preprocessed and committed to the biclust function

together with the chosen algorithm (here Xmotifs) and parameters.

Datasets

BicatYeast

Subsample of the Saccharomyces Cerevisiae organism (Yeast).
Used to present bicluster algorithms by Barkow et al. (2006)
Microarray data: 419 genes, 80 experiments.

Execution

data(BicatYeast)
x <- discretize(BicatYeast)
res <- biclust(x, method=BCXmotifs(), alpha=0.05, number=50)

Output

To visualize the result you can simply call any visualization function on the result, for example:

> parallelCoordinates( x=BicatYeast, bicResult=res, number=4)

Example for parallel coordinates plot: Expression levels of conditions across their genes in the 4th bicluster in the result of

the Xmotifs algorithm.

Bicluster results similarity measure with an adaptation of Jaccard index:

Analysis

Table 1 shows the pairwise Jaccard indices of all bicluster algorithms. The Jaccard index is a measure of similarity between two

cluster results, zero means no concordance, one means that the results are identical. It can be seen that all algorithms ﬁnd very different sets of biclusters. This can be partly explained by different pre-processing steps which were necessary such that the data conform to the respective assumptions of the algorithms.

Another important aspect is that we selected the ﬁrst algorithms to implement to get a collection of algorithms which differ from

each other as much as possible. It is now very easy for practitioners to try various bicluster methods in R and choose the one which works best for given data set.

References

http://r-forge.r-project.org/projects/biclust/ for the newest developments.
http://www.statistik.lmu.de/~kaiser/bicluster.html for Papers and Links.
BARKOW, S., BLEULER, S., PRELIC, A., ZIMMERMANN, P., and ZITZLER, E. (2006): Bicat: a biclustering analysis toolbox. Bioinformatics, 22,1282–1283.
CHAMBERS, J. M. (1998): Programming with data: A guide to the S Language. Chapman & Hall, London.
CHENG, Y. and CHURCH, G. M. (2000): Biclustering of expression data. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 1,93–103.
GOVEART, G. and NADIF, M. (2003): Clustering with block mixture models. Pattern Recognition, 36, 463–473.
VAN MECHELEN, I. and SCHEPERS, J. (2006): A unifying model for biclustering. In: Compstat 2006 - Proceedings in Computational Statistics, 81–88.
MURALI, T. and KASIF, S. (2003): Extracting conserved gene expression motifs from gene expression. In: Paciﬁc Symposium on Biocomputing, 8,77–88.
PRELIC, A., BLEULER, S., ZIMMERMANN, P., WIL, A., BUHLMANN, P., GRUISSEM, W., HENNING, L., THIELE, L., and ZITZLER, E. (2006): A

systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9),1122–1129.

SANTAMARIA, R., THERON, R., and QUINTALES, L. (2007): A framework to analyze biclustering results on microarray experiments. In: 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL’07) ,Springer, Berlin, 770–779.
TURNER, H., BAILEY, T., and KRZANOWSKI, W. (2005): Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics and Data Analysis, 48,235–254.