Powered by OpenAIRE graph

EMBL - European Bioinformatics Institute

Country: United Kingdom

150 Projects, page 1 of 30
  • Funder: UKRI Project Code: BB/T01461X/1
    Funder Contribution: 249,661 GBP

    The number of species with sequenced genomes is rising rapidly, and will continue to do so now that projects to sequence all eukaryotic species in the UK (Darwin Tree of Life project) and on the planet (Earth BioGenome Project) are underway. To make sense of assembled genome data, important features, such as protein-coding and non-coding genes, need to be identified and described; this general process is called annotation. Despite major advances in methods to automatically annotate genomes, the most accurate annotations require human assessment. However, the prohibitive cost usually prevents manual annotation (with curated updates) from being performed on individual species. A scalable alternative is to direct manual effort towards reference datasets and to harvest contributions from the broader research community. The resulting high-quality annotations can then be projected across species based on inferred homology. It is essential that the software used for annotation is fast, flexible and easy to use by different communities of annotators (professional curators, bench biologists, or curious non-experts). Of the currently available software platforms to annotate genomes, Artemis and Apollo are the two most popular and have been in wide use for 20 years. Artemis, developed at the Sanger Institute, has been used primarily for viewing, annotating and analysing the genomes of prokaryotic and eukaryotic microbes. A major strength of Artemis is its companion, the 'Artemis Comparison Tool' (ACT), which allows gene structures to be created or edited in the context of discovering and exploring genome conservation. A major limitation of both Artemis and ACT is that the software performs badly on sequences larger than a few tens of megabases. Like Artemis, Apollo started as a desktop tool, but was redesigned as a web-based tool and now runs on a shared server so that multiple users can browse and create annotations across the same genome simultaneously. Apollo comfortably handles genomes of any size and scales well with multiple concurrent users. Development of Artemis and Apollo software has run in parallel for almost 20 years. The Berkeley-based Apollo team and the Sanger-based Artemis team have, in some cases, found alternative ways to view and annotate genome data but, more often, have converged in purpose and approach. The proposed application will integrate the best of Artemis and Apollo to create a single higher-performance annotation platform. The new Apollo will benefit from a modern and modular architecture, for collaborative development and improved sustainability. Apollo will also be enhanced with new data interfaces, developed in collaboration with the EMBL-EBI group, so that genome comparison data can be accessed across servers, and annotation performed in the context of exploring synteny. The new generation of annotation tool will replace the existing Artemis and Apollo projects and be integrated into major genome annotation projects while retaining its usability by individual small-scale users.

  • Funder: UKRI Project Code: BB/N019172/1
    Funder Contribution: 307,672 GBP

    The structure of a protein dictates the manner in which it interacts with other proteins and whether or how it binds and changes the compounds it is exposed to. Knowing a protein's structure can help rationalise the mechanism by which it performs its biological role. It is also important for understanding how genetic changes, such as mutations in the residues that make up the protein, can destroy or modify the way in which it performs that role. Revolutionary new technologies in biology, known as next generation sequencing, are now allowing biologists to collect vast amounts of genetic variation data. Examples include changes in the sequences of proteins collected from humans suffering from diseases such as cancer or heart disease, and sequences of proteins from agriculturally important species, such as strains of wheat that may be more resistant to frost or produce higher yields. However, it is much harder and more expensive to determine the 3D structure of a protein than its sequence. It is particularly difficult for human, mouse, chicken, plants and other eukaryotic organisms that we need to study to understand disease or ensure food security. Currently, on average less than 15% of proteins from these important model organisms have an experimentally determined 3D structure. To address this deficit of structural data, algorithms have been developed for predicting the structure of a protein. The most successful approaches identify a relative having a known structure and inherit 3D information by exploiting the known conservation of structural features between evolutionarily related proteins. Five of the top world-leading resources generating such annotations are based in the UK (SUPERFAMILY, Gene3D, Phyre, FUGUE, pDomTHREADER). These exploit structural relatives in the SCOP and CATH structural classifications, the two world-leading resources capturing information on domain structures, to use as templates for predicting structures of uncharacterised relatives. The Genome3D resource, which was launched in 2012, integrates domain structure predictions from all five resources for ten model organisms used to study biological systems and important for the study of human health (e.g. human, mouse) or agriculture and food security (e.g. plant). Although the algorithms used by the resources are powerful for recognising even very remote relationships and inheriting structural information between relatives, their accuracy is below 90%. However, by combining all the data in a single resource and identifying positions in the protein where all the methods agree, it is possible to provide much more reliable annotations. Since it is easier to find these consensus regions if equivalent sets of relatives (i.e. families) in SCOP and CATH have been identified, a large part of the project involves mapping between these resources. We now wish to continue this project, improving the mapping of SCOP and CATH and using this to increase the amount of reliable consensus data that Genome3D provides. We will include additional organisms important for health and agriculture. A major benefit from this project will be the integration of the Genome3D structural data with structurally uncharacterised sequences in InterPro, a world-leading resource that combines information on protein families from 11 different resources worldwide. By including Genome3D data for families in InterPro, we will be able to increase ten-fold the number of proteins for which we can provide structural data. In addition, we will provide a very intuitive web-based viewer for looking at the structures and assessing the likely impacts of any changes in the sequence on the function of the protein. Since many biologists are unfamiliar with the value of structural data in assessing genetic variations, we will develop web-based training material and arrange workshops both in our institutes and at international meetings.

  • Funder: UKRI Project Code: BB/D018358/1
    Funder Contribution: 444,800 GBP

    This application is for continued core support and further development of the EMBOSS project. EMBOSS (European Molecular Biology Open Software Suite) was started in 1996 by two bioinformatics developers (Rice and Bleasby), who have developed a set of over 200 applications for the analysis of DNA and protein sequences. It is 'Open Source' software: the source code is made available to anyone, who can change or extend it to meet their own needs. Users in industry have found that EMBOSS makes their lives easier compared to expensive commercial packages. The funding requested will support enough programmers to maintain the core of EMBOSS for 5 years, with many new features and new applications, and will provide the basis for expanding EMBOSS further in protein structure analysis and phylogenetics and advancing into new application areas such as gene expression, proteomics, biostatistics, chemistry and genetics. As the number of applications grows, and as the user base expands, we will need to work extensively on user interfaces and other ways to make it easier for users to find and use the programs they need. All this will be made available for free, and supported by rapid response to email and telephone requests, and by training courses, online tutorials, and documentation. EMBOSS has been installed by more than 20,000 sites worldwide, many of them in the UK. A survey of users (2004) indicated the need for more support for local installations of EMBOSS and the biological databases it uses, improvements to the Jemboss program, and new ways to pass data into programs and to return results so that we can better support use of EMBOSS to build long and complex 'workflows' for routine analysis tasks. EBI will provide rapid support through the external services team in Hinxton, with the developers providing fixes for bugs and adding requested features to future releases.

  • Funder: UKRI Project Code: BB/T000902/1
    Funder Contribution: 216,677 GBP

    As species diverge and new strains emerge, their proteins evolve through mutations in their sequences that alter functional properties. Very cheap and robust technologies have enabled the sequencing of genomes from many diverse bacterial communities, e.g. from different soils, oceans and human body sites. Proteins (encoded in the genomes) from these bacteria have enabled adaptation to different environments, e.g. extremes of temperature. Although we possess extensive information about protein sequences (UniProtKB contains >100 million sequences, but < 0.5% are experimentally characterised), the new sequence data from metagenomes is ten-fold larger, providing a valuable treasure trove to hunt for proteins with novel functionality. Yet it is challenging to predict protein function from sequence alone, which is why we will combine finer-grained prediction with high-throughput experimental testing. Handling this vast data is challenging, but our project benefits from outputs already produced by the MGnify metagenomics analysis platform. We will introduce new strategies to classify this data and focus additional analyses on biomes containing greater functional diversity. To unearth proteins whose functions are very different from any observed previously, we will classify related proteins into evolutionary families and then sub-classify into functional families (called FunFams). RF and CO already have methods for doing this, but they need to be adapted to handle the vast metagenomic data. By aligning sequences in a FunFam, you can find residue positions highly conserved throughout evolution, indicating they are important for function. Residue positions conserved in different ways between different FunFams are particularly interesting as these are sites that change to enable different functions. The massive metagenomic sequence data will facilitate easy discovery of these key functional determinants (FDs) as conservation patterns will be much clearer. We will develop new tools to characterise chemical features of these FDs and score differences in properties of FDs between FunFams, in order to find new FunFams in metagenomes that are very likely to have novel functions. The outcomes of experimental tests will give further insights, e.g. on whether specificity or efficiency can be ascribed to FDs, making our searches more likely to predict function successfully. Two exemplar classes of biomolecules will be investigated: (1) alpha/beta hydrolases, proteins used for making drugs and laundry detergents; (2) bacteriocins, small antibacterial peptides with valuable applications in novel antibiotic discovery and food preservation. Bacteriocins are more complicated as they are produced as part of a cluster of genes (and hence proteins) on the genome, involved in processing the bacteriocin and rendering the bacteria immune to their own bacteriocin. We will adapt our FD-based methods to analyse key sequence differences across multiple proteins to identify novel bacteriocin functionality. Unlike previous analyses of enzyme superfamilies and bacteriocins, we will test our predictions of functional novelty through novel experimental platforms that can verify the predictions on an unprecedented scale. We will exploit a microfluidic technology that screens the function of >1 million proteins in one afternoon in minute droplets and use it for functionally scanning the gene neighbourhood of predictions (after randomisation), e.g. for discovering mutants with better stability, specificity and evolvability. We will also test predictions for genes obtained 50-fold more cheaply than currently possible, via array-based gene assembly. We will thus be experimentally exploring protein sequence space from metagenome communities at an unprecedented scale. We will deliver powerful new computational and experimental technologies, tested on biomolecules important for industry and human health but applicable to many protein families and secondary metabolite gene clusters.

  • Funder: UKRI Project Code: BB/N023242/1
    Funder Contribution: 37,729 GBP

    The metabolism of a living organism reacts rapidly and sensitively to environmental change, disease conditions or simply the organism's age. Capturing how metabolism and metabolites change provides an exquisite insight into the health status of an individual. The discipline of metabolomics seeks to describe the entire population of metabolites in a cell or tissue. Its key challenge is to identify sometimes thousands of different molecules simultaneously. With its unmatched precision and sensitivity, mass spectrometry has become the tool of choice in this context. However, this technique requires ionized metabolites, so they can be accelerated and analysed in an electromagnetic field. While ionization techniques are well established, the diversity of charged molecular species generated in this process is poorly understood. As a result, many metabolites are not identified and only a fraction of the data a mass spectrometric experiment provides really informs the biological conclusions. For the first time, we have enough assets in our toolbox to assemble and optimise a new workflow; this is the TOOL we will construct, validate and apply to generate a new computational RESOURCE for the metabolomics community: publicly accessible software for performing metabolite annotation and calculating the statistical probability that an identification is correct. We will make this available via the BBSRC-funded MetaboLights database, to be supported long-term at the European Bioinformatics Institute in the UK. The new resource will be widely used, both nationally and internationally, by academic, government and industry scientists. All data will be free to access, training videos will be included and the resource will be widely publicised. This cost-effective proposal will develop a new TOOL and a new RESOURCE and, by embedding them at the European Bioinformatics Institute, will transform the metabolomics community's ability to turn data into new knowledge, allowing metabolomics to deliver on its promises to achieve impact.
