Published by the Foundation for Open Access Statistics
Editors-in-chief: Bettina Grün, Torsten Hothorn, Edzer Pebesma, Achim Zeileis    ISSN 1548-7660; CODEN JSSOBK
Journal of Statistical Software
Authors: Arturo Medrano-Soto, J. Andres Christen, Julio Collado-Vides
Title: BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data
Abstract: We present BClass, a Bayesian method based on mixture models for classifying biological entities (e.g. genes) described by variables of heterogeneous nature. Various statistical distributions are used to model the continuous and categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability that each entry belongs to each component (group) of the mixture. In this way, the original set of heterogeneous variables is transformed into a set of homogeneous characteristics: the probabilities that each entry belongs to each group. The number of groups in the analysis is controlled dynamically by marking groups as 'alive' or 'dormant' depending on the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed the prediction of missing values for a number of genes.

BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
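
As a rough illustration of the grouping probabilities described in the abstract, the following Python sketch computes the posterior probability that one entry belongs to each group of a two-component mixture with one continuous and one categorical variable. It is not the BClass implementation (which is written in Lisp-Stat). The variable names (`expr`, `strand`), the parameter values, and the assumption that variables are conditionally independent given the group are all hypothetical and chosen only for the example. The group parameters are held fixed here, whereas BClass approximates them with a Metropolis-Hastings/Gibbs sampler.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution, used for the continuous variable."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(entry, groups):
    """Posterior probability that `entry` belongs to each group (Bayes' rule).

    Assumption for this sketch: given the group, the continuous and the
    categorical variable are independent, so their likelihoods multiply.
    """
    unnormalized = []
    for g in groups:
        likelihood = (
            gaussian_pdf(entry["expr"], g["mean"], g["sd"])
            * g["cat_probs"][entry["strand"]]
        )
        unnormalized.append(g["weight"] * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical, fixed group parameters (illustrative values only).
groups = [
    {"weight": 0.5, "mean": 0.0, "sd": 1.0, "cat_probs": {"+": 0.8, "-": 0.2}},
    {"weight": 0.5, "mean": 3.0, "sd": 1.0, "cat_probs": {"+": 0.3, "-": 0.7}},
]

# One hypothetical entry with a continuous and a categorical measurement.
entry = {"expr": 0.2, "strand": "+"}

probs = membership_probabilities(entry, groups)
print(probs)
```

Whatever the types of the input variables, the output is a vector of probabilities that sums to one, which is the sense in which heterogeneous variables are mapped to homogeneous characteristics.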

Page views: 6679. Submitted: 2004-04-29. Published: 2004-12-20.
Paper: BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data (PDF; downloads: 6770)
Supplements:
BClass.tar.gz: R source package (123 KB; downloads: 1767)

DOI: 10.18637/jss.v013.i02

This work is licensed under the following licenses:
Paper: Creative Commons Attribution 3.0 Unported License
Code: GNU General Public License (at least one of version 2 or version 3) or a GPL-compatible license.