Clustering genes using gene expression and text literature data

Publication: Article, Conference object

Date: 01 Jan 2005

Publisher: IEEE

Journal: 2005 IEEE Computational Systems Bioinformatics Conference (CSB'05)

Authors: Chengyong Yang; Erliang Zeng; Tao Li; Giri Narasimhan

doi: 10.1109/csb.2005.23

pmid: 16447990

Abstract

Clustering of gene expression data is a standard technique used to identify closely related genes. In this paper, we develop a new clustering algorithm, MSC (Multi-Source Clustering), to perform exploratory analysis using two or more diverse sources of data. In particular, we investigate the problem of improving the clustering by integrating information obtained from gene expression data with knowledge extracted from biomedical text literature. In each iteration of algorithm MSC, an EM-type procedure is employed to bootstrap the model obtained from one data source by starting with the cluster assignments obtained in the previous iteration using the other data sources. Upon convergence, the two individual models are used to construct the final cluster assignment. We compare the results of algorithm MSC for two data sources with the results obtained when the clustering is applied to each of the two sources separately. We also compare them with the results of the feature-level integration method, which performs the clustering after simply concatenating the features obtained from the two data sources. We show that the z-scores of the clustering results from MSC are better than those from the other methods. To better evaluate our clusters, function enrichment results are presented using terms from the Gene Ontology database. Finally, by investigating the success of motif detection programs that use the clusters, we show that our approach integrating gene expression data and text data reveals clusters that are biologically more meaningful than those identified using gene expression data alone.
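
The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the model, so the sketch assumes a hard-EM (k-means style) procedure for each source, a simple sum of scaled distances to build the final assignment, and an ad hoc rule for reseeding empty clusters. All function and parameter names (`multi_source_sketch`, `feature_level_baseline`, `em_refine`, `n_steps`, and so on) are hypothetical. The feature-level baseline is included only to show what "concatenating the features" means.

```python
import numpy as np

def fit_centroids(X, labels, k):
    """M-step: mean of each cluster. Empty clusters are reseeded from a
    data row (an ad hoc assumption, not taken from the paper)."""
    centroids = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]
        centroids[j] = members.mean(axis=0) if len(members) else X[j % len(X)]
    return centroids

def sq_distances(X, centroids):
    """Squared Euclidean distance from every row of X to every centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def em_refine(X, labels, k, n_steps=5):
    """Hard-EM refinement of one source, started from the given labels."""
    for _ in range(n_steps):
        centroids = fit_centroids(X, labels, k)
        labels = sq_distances(X, centroids).argmin(axis=1)
    return centroids, labels

def multi_source_sketch(X_expr, X_text, k, n_iter=20, seed=0):
    """Alternate between two sources, bootstrapping each model from the
    cluster assignments produced by the other source."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X_expr.shape[0])
    for _ in range(n_iter):
        previous = labels
        # Expression model starts from the text-side assignments.
        c_expr, labels_expr = em_refine(X_expr, previous, k)
        # Text model starts from the expression-side assignments.
        c_text, labels = em_refine(X_text, labels_expr, k)
        if np.array_equal(labels, previous):
            break  # assignments stopped changing
    # Final assignment from both models: sum of distances, each scaled by
    # its source-wide mean (this combination rule is an assumption).
    d_expr = sq_distances(X_expr, c_expr)
    d_text = sq_distances(X_text, c_text)
    eps = 1e-12  # guards against division by zero
    combined = d_expr / (d_expr.mean() + eps) + d_text / (d_text.mean() + eps)
    return combined.argmin(axis=1)

def feature_level_baseline(X_expr, X_text, k, seed=0):
    """Baseline: concatenate the features of both sources, then cluster once."""
    X = np.hstack([X_expr, X_text])
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])
    _, labels = em_refine(X, labels, k, n_steps=20)
    return labels
```

Both functions expect one row per gene, with the rows of `X_expr` and `X_text` in the same gene order. In practice the two sources would usually be normalized separately before either approach is applied, since expression values and text-derived weights live on different scales.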

Related Organizations

University System of Ohio, United States
Miami University, United States
Florida International University, United States

Keywords

Proteome, Artificial Intelligence, Gene Expression Profiling, Multigene Family, Cluster Analysis, Information Storage and Retrieval, Periodicals as Topic, Natural Language Processing, Oligonucleotide Array Sequence Analysis

Impact by BIP!

- Selected citations: 5. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.