Text mining processing pipeline for semi structured data D3.3

Publication type: Project deliverable, Other literature type, Article. Date: 01 Jan 2021. Country: Netherlands. Language: English. Publisher: Zenodo. Funded by: EC | CINECA

Authors: Copara, Jenny; Naderi, Nona; Kellmann, Alexander; Gosal, Gurinder; Hsiao, William; Teodoro, Douglas

doi: 10.5281/zenodo.5795432, 10.5281/zenodo.5795433

handle: 11370/97a75a5b-a066-4b2d-a5cf-5cca25be88fb

Abstract

Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, medication reasons, which are often not available in structured formats. One of the challenges posed by medical free texts is that there can be several ways of mentioning a concept. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular, by facilitating its findability and interoperability across cohorts in the project. Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
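
As a rough illustration of what named entity normalization involves, the following Python sketch maps free-text mentions to concept identifiers, first by exact lookup in a synonym list and then by fuzzy string matching. It is not the method used by any of the tools described in the deliverable; the vocabulary, the `EX:` identifiers, the function names, and the 0.8 similarity cutoff are all assumptions made for this example.

```python
import difflib
import re

# Hypothetical toy vocabulary: the concept IDs and synonym lists are invented
# for illustration and do not come from a real ontology or from CINECA tools.
VOCABULARY = {
    "EX:0001": ["type 2 diabetes mellitus", "type 2 diabetes", "t2dm"],
    "EX:0002": ["hypertension", "high blood pressure"],
    "EX:0003": ["myocardial infarction", "heart attack"],
}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


# Reverse index: preprocessed synonym -> concept ID.
SYNONYM_INDEX = {
    preprocess(synonym): concept_id
    for concept_id, synonyms in VOCABULARY.items()
    for synonym in synonyms
}


def normalize(mention, cutoff=0.8):
    """Return the concept ID for a free-text mention, or None if no match.

    Tries an exact match on the preprocessed string first, then falls back
    to the closest synonym whose similarity ratio is at least `cutoff`
    (an assumed threshold, not a recommended value).
    """
    key = preprocess(mention)
    if key in SYNONYM_INDEX:
        return SYNONYM_INDEX[key]
    close = difflib.get_close_matches(key, list(SYNONYM_INDEX), n=1, cutoff=cutoff)
    return SYNONYM_INDEX[close[0]] if close else None
```

With this toy vocabulary, `normalize("High blood pressure")` and the misspelled `normalize("hypertention")` both resolve to the same identifier, which is the behavior the abstract describes: several ways of mentioning a concept collapse to one unambiguous descriptor. A mention with no sufficiently close synonym returns `None` rather than a guess.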

Country

Netherlands

Keywords

descriptors, normalization, Zooma, semi-structured data descriptors, LexMapr, unstructured data, text mining, L2N, semi-structured data, SORTA

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact or attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.