Biotea Dataset (Vr. July 2012)

Background

Information reported in the scientific literature remains locked up in discrete documents that are not always interconnected or machine-readable. The Semantic Web, together with approaches such as the Resource Description Framework (RDF) and the Linked Open Data (LOD) initiative, offers a connective tissue that can be used to support the generation of self-describing, machine-readable documents.

Results

Biotea is an approach to generating RDF from scholarly documents. Our RDF model makes extensive use of existing ontologies and semantic enrichment services. Our dataset comprises 270,834 articles from PubMed Open Central in RDF/XML, distributed in 404 zipped files. The RDFization process covers metadata (e.g., title, authors, and journal) as well as semantic annotations of biological entities throughout the full text. Biological entities are extracted using the NCBO Annotator and Whatizit.

We use the Bibliographic Ontology (BIBO), Dublin Core Metadata Initiative Terms (DCMI-terms), and the Provenance Ontology (PROV-O) to model the bibliographic metadata. Links to related pages, such as PubMed HTML articles, are provided via rdfs:seeAlso, while links to other semantic representations, such as Bio2RDF PubMed articles, are provided via owl:sameAs.

The NCBO Annotator is used to extract entities covering ChEBI for chemicals; Pathway and Functional Genomics Data Society (MGED) for genes and proteins; Master Drug Data Base (MDDB), NDDF, and NDFRT for drugs; SNOMED, SYMP, MedDRA, MeSH, MedlinePlus Health Topics (MedlinePlus), Online Mendelian Inheritance in Man (OMIM), FMA, ICD10, and Ontology for Biomedical Investigations (OBI) for diseases and medical terms; PO for plants; and MeSH, SNOMED, and NCIt for general terms. Whatizit is used for GO, UniProt proteins, UniProt Taxonomy, and diseases mapped to the UMLS; UniProt taxa are also mapped to the NCBI Taxon vocabulary.
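The bibliographic part of the model described above can be illustrated with a minimal sketch in Python that emits N-Triples for a single article. This is not the Biotea implementation. The base IRI, the choice of bibo:Article as the class, the helper names (`iri`, `literal`, `article_triples`, `to_ntriples`), and the example.org links are assumptions made for illustration. Only the namespace IRIs and the use of rdfs:seeAlso and owl:sameAs follow the description.

```python
# Published namespace IRIs of the vocabularies named in the description.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"

# Assumption: placeholder base IRI; the IRIs used by the dataset are not given here.
BASE = "http://example.org/biotea/article/"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def article_triples(article_id, title, html_url, same_as_url):
    """Return (subject, predicate, object) terms for one article.

    bibo:Article is an assumed class choice; rdfs:seeAlso and owl:sameAs
    play the roles the description gives them.
    """
    subject = iri(BASE + article_id)
    return [
        (subject, iri(RDF + "type"), iri(BIBO + "Article")),
        (subject, iri(DCTERMS + "title"), literal(title)),
        (subject, iri(RDFS + "seeAlso"), iri(html_url)),    # related web page
        (subject, iri(OWL + "sameAs"), iri(same_as_url)),   # other RDF representation
    ]


def to_ntriples(triples):
    """Serialize the triples, one statement per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


# Hypothetical article; all identifiers and links are placeholders.
example_triples = article_triples(
    "0000000",
    "A placeholder title",
    "http://example.org/html/0000000",
    "http://example.org/linked-data/0000000",
)
print(to_ntriples(example_triples), end="")
```

N-Triples is used here only because it can be produced with plain string handling. The dataset itself is distributed as RDF/XML, which carries the same triples in a different syntax.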
Conclusions

Biotea delivers models and tools for metadata enrichment and semantic processing of biomedical documents. Our dataset makes it easier to access the first batch of RDFized articles that follow the Biotea model. Our future plans include updating the dataset on a regular basis to incorporate the latest articles added to the PubMed Open Central collection; the next delivery is planned for the first half of 2017. Subsequent datasets will support a mapping to the Semanticscience Integrated Ontology (SIO) in order to comply with the guidelines set by Bio2RDF.

Notes

The Biotea approach is described in full at http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-S1-S5 (Garcia Castro, L.J., C. McLaughlin, and A. Garcia, Biotea: RDFizing PubMed Central in Support for the Paper as an Interface to the Web of Data. Journal of Biomedical Semantics, 2013. 4 Suppl 1: p. S5). The Biotea algorithms are publicly available at https://github.com/biotea

Keywords

Biotea, semantic web, semantic annotation, entity recognition, linked data
