Publication . Conference object . 2019

Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

Luca Foppiano; Laurent Romary; Masashi Ishii; Mikiko Tanifuji
Open Access
Published: 23 Sep 2019
Publisher: HAL CCSD
Country: France
We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to interpret unstructured information and make it accessible, are the building blocks of large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture, in which each model is specialised in resolving a specific task. The models are trained using the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as those of length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science (NIMS) in Japan, it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
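The SI normalisation step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the Grobid-quantities implementation: the `Quantity` class, the `TO_SI` table and both functions are hypothetical names introduced here, and the table covers only a handful of units chosen for illustration. It assumes every supported unit converts to its SI counterpart with a linear scale factor plus an optional offset.

```python
from dataclasses import dataclass

# Hypothetical conversion table (illustration only, not taken from the tool):
# unit symbol -> (SI unit symbol, scale factor, offset)
# so that si_value = raw_value * factor + offset.
TO_SI = {
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "GPa": ("Pa", 1e9, 0.0),
    "°C": ("K", 1.0, 273.15),
    "K": ("K", 1.0, 0.0),
}


@dataclass(frozen=True)
class Quantity:
    """An atomic measurement: a numeric value and a unit symbol."""
    value: float
    unit: str


def normalise(q: Quantity) -> Quantity:
    """Convert an atomic quantity to its SI unit using the table above."""
    if q.unit not in TO_SI:
        raise ValueError(f"unsupported unit: {q.unit!r}")
    si_unit, factor, offset = TO_SI[q.unit]
    return Quantity(q.value * factor + offset, si_unit)


def normalise_interval(low: Quantity, high: Quantity) -> tuple[Quantity, Quantity]:
    """Normalise both bounds of an interval independently."""
    return normalise(low), normalise(high)


if __name__ == "__main__":
    print(normalise(Quantity(5, "km")))
    print(normalise_interval(Quantity(10, "cm"), Quantity(2, "km")))
```

Normalising each bound separately lets an interval whose bounds were written in different units (for example "10 cm to 2 km") end up in a single SI unit, which is what makes extracted values comparable across documents.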
Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Article 24, 1–4.
Subjects by Vocabulary

Microsoft Academic Graph classification: Information retrieval, Parsing, Context (language use), Conditional random field, Computer science, International System of Units, Scientific literature, Scientific notation, Identification (information), Materials informatics


Physical quantities, Units of measurements, Measurements, Text and data mining, Machine Learning, Document analysis, Applied computing, [INFO]Computer Science [cs], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Document metadata, TDM

13 references (entries 1 to 10 listed)

[1] Milan Agatonovic, Niraj Aswani, Kalina Bontcheva, Hamish Cunningham, Thomas Heitz, Yaoyong Li, Ian Roberts, and Valentin Tablan. 2008. Large-scale, parallel automatic patent annotation. In Proceedings of the 1st ACM workshop on Patent information retrieval. ACM, 1-8.

[2] Skopinava AM and Lobanov BM. 2013. Processing of quantitative expressions with units of measurement in scientific texts as applied to Belarusian and Russian text-to-speech synthesis. (2013).

[3] Hidir Aras, René Hackl-Sommer, Michael Schwantner, and Mustafa Sofean. 2014. Applications and Challenges of Text Mining with Patents. In IPaMin@KONVENS.

[4] Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthélemy, and Mathieu Roche. [n. d.]. How to Extract Unit of Measure in Scientific Documents?

[5] Contributors [n. d.]. Units of Measurement. unitsofmeasurement.

[6] Contributors 2008 - 2019. GROBID (GeneRation Of BIbliographic Data). swh:1:dir:6a298c1b2008913d62e01e5bc967510500f80710.

[7] André Dazy. 2014. ISTEX: a powerful project for scientific and technical electronic resources archives. Insights 27, 3 (2014).

[8] Thaer M Dieb, Masaharu Yoshioka, Shinjiro Hara, and Marcus C Newton. 2015. Framework for automatic information extraction from research papers on nanocrystal devices. Beilstein journal of nanotechnology 6, 1 (2015), 1872-1882.

[9] Luca Foppiano, M. Dieb Thaer, Akira Suzuki, and Masashi Ishii. 2019. Proposal for Automatic Extraction Framework of Superconductors Related Information from Scientific Literature. In Letters and Technology News, vol. 119, no. 66, SC2019-1 (no.66), Vol. 119. Tsukuba, 1-5. ISSN: 2432-6380.

[10] Kyle Hundman and Chris A Mattmann. 2017. Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

License: cc-by