
As data banks increase their size, one of the current challenges in bioinformatics is to be able to query them in a sensible way. Information is contained in different databases, with various data representations or formats, making it very difficult to use a single query tool to search more than a single data source. Data mining is vital to bioinformatics as it allows users to go beyond simple browsing of genome browsers, such as Ensembl [1],[2] or the UCSC Genome Browser [3], to address questions; for example, the biological meaning of the results obtained with a microarray platform, or how to identify a short motif upstream of a gene, amongst others. There are a number of integrated approaches available, some of which are described below (Figure 1). Figure 1 Diagram depicting the way different applications interact with data mining tools. The Table Browser at UCSC [4] supports text-based batch queries to the UCSC Genome Browser, limiting the output to entries meeting the selected criteria. A disadvantage of this tool is that users need to be familiar with the underlying database schema in order to know where their data is stored. Similarly, performing complex queries might require multiple steps that can be burdensome with this tool. Galaxy [5] provides a set of tools that can retrieve data from the Table Browser (Table Browser and BioMart will be explained below), facilitating complex queries that require multiple joins (Figure 2). Figure 2 BioMart can join different datasets, in this case Reactome and UniProt to identify enzymes involved in carbohydrate metabolism. BioMart provides a query-oriented data management system to interact with different datasets (Ensembl [2], RGD [6],[7], and WormBase [8], among many others). This data “warehouse” was originally developed for Ensembl, creating EnsMart [9],[10]. From there, it was first deployed across the European Bioinformatics Institute (EBI), and now it has become a joint project between EBI and Cold Spring Harbor Laboratory (CSHL). The generic query system has shifted toward a federated approach that has been deployed for several biological databases, and has become a component of the Generic Model Organism Database (GMOD) project. In this contribution, we provide some solutions for data mining; we focus on advanced ways of interacting with BioMart using other applications to retrieve information through different platforms such as Galaxy [5] and the biomaRt package of BioConductor [11],[12]. Many of these tools also interact with the UCSC Table Browser and have similar approaches using the UCSC system. We also address programmatic access using BioMart's own implementation of Web services (MartService). For local deployment of BioMart, see Table 1. Table 1 URLs for additional information. BioMart Web Interface First we will focus on BioMart's Web interface (http://www.biomart.org) to illustrate how to join two different datasets: Reactome [13], a database of metabolic pathways, and UniProt [14], a catalogue of protein information. In this example, we need to obtain a catalogue of enzymes involved in carbohydrate metabolism in humans, as we are interested in a congenic disorder in this pathway. To ask this question without an integrated data mining tool, one would have to start with Reactome to find enzymes involved in reaction pathways in human and then compare those enzymes to a list of entries in UniProt. However, BioMart allows us to join the two databases. We can start our query by clicking on ‘MartView’ from the Web interface at http://www.biomart.org, and selecting the Reactome database. Now, select the reaction dataset. Filters applied will be simply ‘Limit to Species’ Homo sapiens. Attributes can be selected as “Reaction name” and “Gene ENSEMBL ID”. At this stage, 2,432 entries meet our criteria (i.e. we have asked for all human reaction pathways in the Reactome database). Click on the ‘count’ button at the top to obtain this number. Next, we can enrich our search for enzymes in the UniProt database. This will require the ‘linked’ or secondary dataset. Follow this description, or view the tutorials for use of the linked database at http://www.ensembl.org/common/Workshops_Onlineid117. Click on the second ‘Dataset’ option at the left of the page. Select ‘UniProt proteomes’ as the database. In this instance, we will add as a filter the Gene Ontology (GO) [15] term ‘GO:0005975’ (associated with carbohydrate metabolic processes); this will be under ‘EXTERNAL IDENTIFIERS’, ‘Limit to proteins…GO ID(s)’ in the secondary dataset. Also select, under ‘External references’: ‘Entries with EC ID(s)’, to limit our query to enzymes only, and ‘eukaryota’ along with ‘Homo sapiens’ under ‘SPECIES’ (Species and Proteome Name, respectively). This will give a count of 257 in the secondary dataset. The genome location can be displayed in the output by choosing the following Attributes: “Genome component name” for the chromosome, “Start Position” and “End Position” for the coordinates. Click ‘Results’ for the table in Figure 2. Now you have a list of enzymes in UniProt involved in carbohydrate metabolism in humans.
Internet, User-Computer Interface, Databases, Factual, QH301-705.5, Data Interpretation, Statistical, Databases, Genetic, Computational Biology, Genomics, Biology (General), Software, Education
Internet, User-Computer Interface, Databases, Factual, QH301-705.5, Data Interpretation, Statistical, Databases, Genetic, Computational Biology, Genomics, Biology (General), Software, Education
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 12 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
