Publication: Article, Preprint, Other literature type, Conference object
Date: 23 May 2018
Embargo end date: 01 Jan 2018
Publisher: ACM
Journal: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
Funded by: EC | MOVING

Authors: Mai, Florian; Galke, Lukas; Scherp, Ansgar

doi: 10.1145/3197026.3197039, 10.48550/arxiv.1801.06717

arXiv: http://arxiv.org/abs/1801.06717

handle: 21.11116/0000-0009-F854-1, 21.11116/0000-0009-F856-F

Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text

Abstract

For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full text or the abstract. It is therefore desirable to have text mining and text classification algorithms that perform well on the title alone. So far, classification performance on titles has not been competitive with performance on full texts when the same number of training samples is used. However, title data is much easier to obtain in large quantities and to use for training than full-text data. In this paper, we investigate how models trained on increasing amounts of title data compare to models trained on a constant number of full texts. We evaluate this question on two large-scale datasets, one from the medical domain (PubMed) and one from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, and they outnumber the available full texts by factors of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three classifiers outperform their full-text counterparts by a large margin. The best title-based classifier outperforms the best full-text method by 9.4%. On the PubMed dataset, the best title-based method almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.

Presented at JCDL 2018, 10 pages, code and data at https://github.com/florianmai/Quadflor

Related Organizations

Kiel University
Germany

Keywords

FOS: Computer and information sciences, Computer Science - Digital Libraries, Digital Libraries (cs.DL)

Related research

Performance Comparison of Ad-Hoc Retrieval Models over Full-Text vs. Titles of Documents (2018), relation: IsAmongTopNSimilarDocuments
Quadflor software on GitHub, relation: IsRelatedTo

Impact by BIP!

Citations: 23. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 10%. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Top 10%. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Top 10%. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.