publication . Other literature type . Conference object . Preprint . 2018

BioSentVec: creating sentence embeddings for biomedical texts

Chen, Qingyu; Peng, Yifan; Lu, Zhiyong
Open Access
  • Published: 22 Oct 2018
  • Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Sentence embeddings have become an essential part of today's natural language processing (NLP) systems, especially together with advanced deep learning methods. Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained on over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better captu...
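As an illustration of what a sentence pair similarity task measures, the sketch below scores two sentences by the cosine similarity of their embedding vectors. It is a minimal example under stated assumptions, not the authors' evaluation code: the function name `cosine_similarity` and the 4-dimensional vectors are hypothetical stand-ins, and real sentence embeddings would come from a trained model and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat its similarity as 0.0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (made-up numbers, not model output).
emb_a = [0.2, 0.1, 0.7, 0.0]    # e.g. a sentence about a drug's effect
emb_b = [0.25, 0.05, 0.65, 0.1]  # e.g. a paraphrase of the same sentence
emb_c = [0.0, 0.9, 0.0, 0.4]    # e.g. an unrelated sentence

print(cosine_similarity(emb_a, emb_b))  # close to 1: similar direction
print(cosine_similarity(emb_a, emb_c))  # much lower: different direction
```

In a benchmark of this kind, such scores are typically compared against human similarity ratings for the same sentence pairs.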
Subjects
free text keywords: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning

