Project: UKRI | Lexical Acquisition for t... (EP/G051070/1)
Background: Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how individual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to ''select'' co-occurrence with particular phrase types, which influences the interpretation of verbs and identification of verbal arguments in context. There are currently a limited number of biomedical resources containing information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation, or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, due to a lack of available gold standards for evaluation. Results: This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely-used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical sub-domains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically-acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available. Conclusion: The SCF resources we have evaluated still show considerably lower accuracy than that reported with general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences when annotators use the different definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.
Project: UKRI | From data to knowledge / ... (BB/F006039/1), UKRI | From data to knowledge / ... (BB/F006039/1)
Abstract Background Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP’09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP’09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events.
HESTIA (the Herodotus Encoded Space-Text-Imaging Archive) employs the latest digital technology to develop an innovative methodology to the study of spatial data in Herodotus’ Histories. Using a digital text of Herodotus, freely available from the Perseus on-line library, to capture all the place-names mentioned in the narrative, we construct a database to house that information and represent it in a series of mapping applications, such as GIS, GoogleEarth and GoogleMap Timeline. As a collaboration of academics from the disciplines of Classics, Geography, and Archaeological Computing, HESTIA has the twin aim of investigating the ways geography is represented in the Histories and of bringing Herodotus’ world into people’s homes.
The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of ‘events’, i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research.