
The classification of publications into disciplines has multiple applications in scientometrics, from studying the dynamics of research to enabling responsible use of research metrics. However, the most common ways to classify publications into disciplines rely on citation data, which are not always available. We therefore compare a set of algorithms that classify publications based on the textual data of their abstracts and titles. The algorithms learn from a training dataset of Web of Science (WoS) articles that, after mapping their subject categories to the OECD FORD classification scheme, have exactly one assigned discipline. We present different implementations of the Random Forest algorithm, evaluate a BERT-based classifier, and introduce a keyword-based methodology for comparison. We find that the BERT classifier performs best, with an accuracy of 0.7 when predicting the discipline and a top-3 accuracy of 0.91, i.e., the "real" discipline appears among the three most probable predictions. Additionally, we present confusion matrices indicating that misclassified publications are frequently assigned to disciplines similar to the "real" ones. We conclude that, overall, Random Forest-based methods offer a compromise between interpretability and performance, and are also the fastest to execute.
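As a rough illustration of the evaluation described above, the sketch below trains a Random Forest on TF-IDF features of short texts and computes both top-1 and top-3 accuracy. This is not the paper's implementation: the toy texts, discipline labels, and hyperparameters are invented for demonstration, and the real study uses WoS abstracts and titles mapped to OECD FORD disciplines.

```python
# Hypothetical sketch: Random Forest on TF-IDF text features with
# top-1 and top-3 accuracy, evaluated (for brevity) on the training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for titles/abstracts and their single assigned discipline.
texts = [
    "neural networks for image recognition and deep learning",
    "gradient descent optimization of deep neural models",
    "gene expression profiling in cancer cells",
    "protein folding and molecular biology of the cell",
    "monetary policy and inflation in open economies",
    "labor markets, wages and economic growth",
]
labels = [
    "Computer sciences", "Computer sciences",
    "Biological sciences", "Biological sciences",
    "Economics", "Economics",
]

# Bag-of-words TF-IDF representation of each text.
vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

proba = clf.predict_proba(X)                 # (n_samples, n_classes)
ranked = np.argsort(proba, axis=1)[:, ::-1]  # class indices, most probable first
class_index = {c: i for i, c in enumerate(clf.classes_)}
y = np.array([class_index[l] for l in labels])

top1_acc = float(np.mean(ranked[:, 0] == y))
top3_acc = float(np.mean([yi in row[:3] for yi, row in zip(y, ranked)]))
```

The same top-k evaluation applies unchanged to any classifier exposing per-class probabilities, which is what makes the Random Forest, BERT, and keyword-based methods directly comparable on this metric.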
discipline-classification, abstract-classification, content-based-classification, keyword-extraction, Documentation and information
