publication . Conference object . 2020

Multilingual Epidemiological Text Classification: A Comparative Study

Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Adam Jatowt; Gaël Lejeune; Moses Odeo
Open Access English
  • Published: 08 Dec 2020
  • Publisher: Zenodo
  • Country: France
Abstract
In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across languages (low- or high-resource), particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of machine learning and deep learning text classification models using a dataset of news articles related to epidemic outbreaks in six languages, four low-resource and two high-resource, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of models based on fine-tuned language models exceeds that of the chosen baselines, which include a specialized epidemiological news surveillance system and several machine learning models, by more than 50%. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained and/or fine-tuned but also by their size. Furthermore, we find that the beginning and the end of documents provide the most salient features for this task and, as expected, that model performance was proportional to the training data size.
Subjects
free text keywords: [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, Context (language use), Salient, Natural language processing, Artificial intelligence, Language model, Structure (mathematical logic), Task (project management), Deep learning, Field (computer science), Computer science
Funded by
EC | EMBEDDIA: Cross-Lingual Embeddings for Less-Represented Languages in European News Media
  • Funder: European Commission (EC)
  • Project Code: 825153
  • Funding stream: H2020 | RIA
  • Validated by funder
EC | NewsEye: A Digital Investigator for Historical Newspapers
  • Funder: European Commission (EC)
  • Project Code: 770299
  • Funding stream: H2020 | RIA