Automating classification of free‐text electronic health records for epidemiological studies

Publication: Article, 24 Jan 2012, Netherlands, English

Publisher: Wiley

Journal: Pharmacoepidemiology and Drug Safety, volume 21, pages 651-658 (issn: 1053-8569, eissn: 1099-1557)

Funded by: EC | EU-ADR

Authors: Schuemie M.J.; Sen E.; 't Jong G.W.; Van Soest E.M.; Sturkenboom M.C.; Kors J.A.

doi: 10.1002/pds.3205

pmid: 22271492

Abstract

Purpose: Increasingly, patient information is stored in electronic medical records, which could be reused for research. Often these records comprise unstructured narrative data, which are cumbersome to analyze. The authors investigated whether text mining can make these data suitable for epidemiological studies and compared a concept recognition approach and a range of machine learning techniques that require a manually annotated training set. The authors show how this training set can be created with minimal effort by using a broad database query.

Methods: The approaches were tested on two data sets: a publicly available set of English radiology reports to which International Classification of Diseases, Ninth Revision, Clinical Modification codes needed to be assigned, and a set of Dutch GP records that needed to be classified as either liver disorder cases or noncases. Performance was tested against a manually created gold standard.

Results: The best overall performance was achieved by a combination of a manually created filter for removing negations and speculations and rule learning algorithms such as RIPPER, with high scores on both the radiology reports (positive predictive value = 0.88, sensitivity = 0.85, specificity = 1.00) and the GP records (positive predictive value = 0.89, sensitivity = 0.91, specificity = 0.76).

Conclusions: Although a training set still needs to be created manually, text mining can help reduce the amount of manual work needed to incorporate narrative data in an epidemiological study and will make the data extraction more reproducible. An advantage of machine learning is that it is able to pick up specific language use, such as abbreviations and synonyms used by physicians.

Copyright © 2012 John Wiley & Sons, Ltd.
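As a rough illustration of the workflow the abstract describes (filter out negated or speculative statements, classify what remains, then score against a gold standard), a minimal Python sketch follows. It is not the authors' implementation: the cue list, the single hand-written keyword rule (standing in for a rule that an algorithm such as RIPPER would learn), the function names, and the toy records are all assumptions made for this example.

```python
import re

# Hypothetical negation/speculation cues; the paper's actual filter is not
# specified in the abstract.
NEGATION_CUES = re.compile(r"\b(no|not|without|possible|rule out)\b", re.IGNORECASE)


def strip_negated_sentences(text):
    """Drop every sentence that contains a negation or speculation cue."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s for s in sentences if not NEGATION_CUES.search(s))


def classify(text, keyword="pneumonia"):
    """Toy rule: positive if the keyword survives the negation filter.

    A learned rule set would replace this single hand-written rule.
    """
    return keyword in strip_negated_sentences(text).lower()


def evaluate(predicted, gold):
    """Compute PPV, sensitivity and specificity from paired boolean labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)
    fp = sum(p and not g for p, g in pairs)
    fn = sum(not p and g for p, g in pairs)
    tn = sum(not p and not g for p, g in pairs)
    return {
        "ppv": tp / (tp + fp),          # positive predictive value
        "sensitivity": tp / (tp + fn),  # share of true cases found
        "specificity": tn / (tn + fp),  # share of noncases rejected
    }


# Invented records paired with invented gold-standard labels.
RECORDS = [
    ("Chest film shows pneumonia in the right lower lobe.", True),
    ("No evidence of pneumonia.", False),
    ("Possible pneumonia, follow-up advised.", False),
    ("Cough and fever. Pneumonia confirmed.", True),
    ("Normal chest.", False),
    ("Pneumonia not seen. Lungs clear.", False),
    ("Consolidation consistent with infection.", True),  # missed by the toy rule
]

predictions = [classify(text) for text, _ in RECORDS]
gold_labels = [label for _, label in RECORDS]
metrics = evaluate(predictions, gold_labels)
print(metrics)
```

In this toy set the last record is a case described without the keyword, so the rule misses it and sensitivity falls below 1 while positive predictive value and specificity stay at 1. Picking up such alternative wordings is the kind of gain the abstract attributes to machine learning.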

Country

Netherlands

Related Organizations

Erasmus University Rotterdam
Netherlands
Erasmus University Medical Center
Netherlands

Keywords

Electronic Data Processing, Epidemiologic Studies, Artificial Intelligence, International Classification of Diseases, EMC NIHES-03-77-01, Decision Trees, Electronic Health Records, Algorithms, Workflow

Impact by BIP!

selected citations: 33 (citations derived from selected sources; an alternative to the "influence" indicator)
popularity: Top 10% (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Top 10% (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Top 10% (the initial momentum of the article directly after its publication, based on the underlying citation network)