
Genetics Databases By Martin J. Bishop. San Diego, CA: Academic Press (1999). 295 pp. $49.95

Biological research is now significantly influenced by the burgeoning amount of genomics data available on the internet. Genetics Databases describes some of the more important sources of this information and explains some of the underlying principles behind the commonly used bioinformatics tools. The book is not aimed at the specialist computational biologist and assumes little knowledge. Each chapter is independent of the others, making the book ideal for reference, and the topics chosen for review include many areas of current interest. As such it will make an ideal resource for the many scientists that are hoping to maximize the impact that this “free” data can have on their research. At the time of writing this review, 7 billion bases of DNA sequence had been deposited in GenBank: much of the book describes available methods and resources for analyzing this sequence data. Gene expression, protein structure, and RNA-related databases are also described, as are two more specialized databases on protein kinases and the biology of E. coli.

The past several years of genome research have been dominated by high-throughput sequencing efforts. These efforts fall into two categories: EST sequencing projects, which have, to date, contributed 3.7 million sequences of randomly cloned expressed cDNAs, and genomic sequence projects, which have so far produced entire genomic sequences for 30 organisms, including three eukaryotes. Of particular note, the draft quality human genomic sequence is 65% completed and the first finished human chromosome has recently been published (Dunham et al., Nature 402, 489–495, 1999). Between the human genomic sequence and the 95,000 UniGene clusters of human ESTs, well over half of all human genes are represented in some form in publicly available databases.
It is a sobering statistic that only 12% of the human UniGene clusters contain the sequence of a known gene, usually full length and possibly associated with a description in the literature of the properties of its encoded protein. Automated algorithms for annotation of the more anonymous genes depend largely on extrapolation of the information gathered on the small fraction of known genes.

One of the first tasks in annotating a stretch of genomic sequence is prediction of the potential coding regions. In an informative and well-written chapter, R. Guigo describes the strategies of the common gene prediction programs. These he divides into two groups: those that rely on model-based criteria derived from representative training sets of coding and noncoding sequences, and unbiased methods which assume only that a codon has three bases and look for deviations from random behavior based on that fact. The relative success of each algorithm is nicely demonstrated by example, and URLs (web page addresses) for the various gene prediction programs are provided. Identification of 3′ EST sequences downstream of predicted coding exons allows efficient assembly of sequence homology information (from the coding exons) with gene expression data (usually associated with EST sequences by their frequency of occurrence in cDNA libraries, or by array-based experiments). Thus, convergence of exon predictions with EST sequences not only cross-validates each data type but also provides two key entry points for predicting gene function: homology and expression pattern.

Many of the tools and databases available on the internet rely on the solution of a seemingly simple problem: how to align two or more sequences to maximize their homology. For example, both clustering of ESTs and determination of evolutionary relationships between new and known genes depend on sequence alignment algorithms. Three chapters in Genetics Databases cover this subject. M.
Gribskov describes pairwise alignment strategies, including the use of dot plots to visualize insertions and duplications and the assessment of statistical significance in BLAST alignments. Profile-based alignments, for example to regular expressions or Hidden Markov Models, are also mentioned. The discussion is extended to multiple sequence alignment by D. Higgins in a chapter that concentrates on the Clustal W and Clustal X programs. W. R. Taylor discusses the principles behind construction of the PAM amino acid substitution matrix and the effect of environment on such matrices, and provides an interesting description of the relationship between the amino acids and “gray coding” in codon usage.

Function prediction from primary sequence is covered in a comprehensive way by Ponting and Blake. In their chapter, the authors underline the importance of taking a bottom-up approach to function prediction, starting at the level of individual domains within a protein, then searching for protein homologs and finally pathway homologs. Two well-characterized proteins, SRC and dystrophin, are used to demonstrate their approach. As many years have passed since these genes were first described, the authors can measure the success of function predictions based on sequence alone against subsequent wet-lab experiments; in fact, many of these predictions could not be verified. More encouragingly, paralogous gene clusters within vertebrate genomes, reflective of genome duplications late in vertebrate evolution and well described for the Hox genes, hint that the functional complexity of the genome may be lower than the number of genes would suggest. This idea has recently been strengthened by the demonstration that paralogous Hox genes (Hoxa3 and Hoxd3) can have identical functions (Greer et al., Nature 403, 661–665, 2000).

Baldock and Davidson describe attempts at spatial description of gene expression by in situ techniques that measure RNA or protein levels.
The need to archive expression patterns in a queryable format is made clear and is exemplified by the Mouse Gene Expression Information Resource. Included in this database is an attempt to spatially map expression patterns onto standardized representations of mouse embryogenesis. The stereotypical nature of embryogenesis makes this ambitious goal feasible, but spatial mapping is unlikely to be possible with less reproducible processes. For example, the molecular and cellular pathology of tumorigenesis is quite heterogeneous, even within a tumor type. In these cases, there is high value in formalizing the language by which the pathology data are recorded so that spatial descriptions of pathological processes can be easily queried. Surprisingly, little space in this chapter is devoted to high-throughput methods of expression profiling, such as arrays, SAGE, and EST sequencing. While these methods lose the spatial resolution of in situ technologies, they can provide a much more comprehensive view of gene expression and are worthy of more emphasis.

What of the future? Sequencing and gene expression analysis have been in the “factory” stage for some years now, and other once labor-intensive procedures seem set to join them. In particular, direct experimental evidence of protein–protein interaction has recently moved into the high-throughput realm, and descriptions of such interactions are now coming in by the hundreds for two-hybrid experiments for both yeast (e.g., Uetz et al., Nature 403, 623–627, 2000) and worm (e.g., Walhout et al., Science 287, 116–122, 2000). More directed proteomics approaches have been applied to the parallel identification of the subunits of multiprotein complexes (e.g., snRNP, Neubauer et al., Proc. Natl. Acad. Sci. USA 94, 385–390, 1997) or their substrates (e.g., GroEL, Houry et al., Nature 402, 147–154, 1999).
In perhaps the most extreme example, libraries of mouse mutants are now being made and phenotyped in a manner once restricted to simpler organisms. Projects that once required the dedication of an obsessive post-doc or graduate student are now high throughput and routine. The implication is that entirely new data formats will be generated in sufficient quantities to be labeled as databases.

There are a few minor criticisms concerning organization of the book. The chapters vary somewhat in their format: some briefly describe the content of a catalog of URLs on a particular subject, while others are more theoretical. One of the chapters contains exercises, as if written for teaching purposes; the others do not. They are also presented in a rather confusing order with some redundancy. Another disappointment is the lack of any supporting internet content: at the least, a companion web page or bookmark folder would have been more useful than the six pages of URLs listed in an appendix. In fact, it is arguable that this collection of reviews would have been better published on the internet than on paper; these are fast-moving fields and their description would benefit from frequent updates and hyperlinking. It is probably a measure of how much we expect our internet content to be free that publishing books describing it is still popular.

That said, there is much in the book to commend. There are few resources, online or otherwise, that describe the diversity of online databases in this way. Most of the chapters are well written and the points raised are often illustrated by examples, making the arguments easy to follow for a nonexpert. Perhaps the most significant consequence for the average reader is that the “black box” approach to database searching will seem less tempting when simple explanations for the algorithms acting behind the scenes are available in such a compact volume. G.
Williams closes his chapter on nucleic acid and protein databases with the message “If you get anomalous results, stop and think for a while” (page 36). It will also help to pick up this book.
Biochemistry, Genetics and Molecular Biology(all)
