Publication · Other literature type · Conference object · Preprint · Article · 2018

Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text

Florian Mai
  • Published: 01 Jan 2018
  • Publisher: Association for Computing Machinery (ACM)
Abstract
For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have text mining and text classification algorithms that already operate well on the title of a publication alone. So far, the classification performance on titles is not competitive with the performance on full-texts when the same number of training samples is used. However, title data is much easier to obtain in large quantities for training than full-text data. In this paper, we investigate the question of how models ...
Subjects
free text keywords: Text Classification, Deep Learning, Digital Libraries, Computer Science - Digital Libraries, Subject indexing, Statistical classification, Information retrieval, Artificial intelligence, business.industry, business, Text mining, Classifier (linguistics), Computer science, Exploit, Metadata, Digital library
Related Organizations
Funded by
EC| MOVING
Project
MOVING
Training towards a society of data-savvy information professionals to enable open leadership innovation
  • Funder: European Commission (EC)
  • Project Code: 693092
  • Funding stream: H2020 | RIA