Named Entity Recognition System for the Biomedical Domain

Publication: Article, Other literature type, Conference object. 26 Sep 2022. Publisher: IEEE. Journal: Annals of Computer Science and Information Systems, volume 30, pages 837-840 (ISSN: 2300-5963).

Authors: Raghav Sharma; Deependra Singh; Raksha Sharma

doi: 10.15439/2022f63, 10.60692/dq15n-xj343, 10.60692/erzkf-rx023

Abstract

Recent advances in medical science have considerably accelerated the rate at which new information is published. The MEDLINE database is growing by 500,000 new citations each year. As a result of this exponential increase, it is not easy to keep up manually with the growing volume of information. Thus, there is a need for automatic information extraction systems to retrieve and organize information in the biomedical domain. Biomedical Named Entity Recognition (NER) is one such fundamental information extraction task, and it supports significant information management goals in the biomedical domain. Due to the complex vocabulary (e.g., mRNA) and free nomenclature (e.g., IL2), identifying named entities in the biomedical domain is more challenging than in any other domain and hence requires special attention. In this paper, we deploy two novel bi-directional encoder-based systems, viz., BioBERT and RoBERTa, to identify named entities in biomedical text. Owing to its domain-specific training, BioBERT gives reasonably good performance on the NER task in the biomedical domain. However, the structure of RoBERTa makes it more suitable for the task. We obtain a significant improvement in F-score with RoBERTa over BioBERT. In addition, we present a comparative study of the training loss attained with the ADAM and LAMB optimizers.
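The abstract reports results as an F-score but does not say here how it is computed. The sketch below assumes the common entity-level convention for NER: tokens carry BIO tags, a predicted entity counts as correct only if its type and exact boundaries match the gold annotation, and the F-score is the harmonic mean of precision and recall. The function names and the `protein` and `DNA` labels are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    A span opens at a B- tag and continues over I- tags of the same type.
    I- tags that do not continue an open span are ignored (an assumption
    of this sketch; other conventions treat them as span starts).
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans


def f_score(gold: List[str], pred: List[str]) -> float:
    """Entity-level F-score (harmonic mean of precision and recall)."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token example: the prediction finds one of two gold entities.
gold_tags = ["B-protein", "I-protein", "O", "B-DNA"]
pred_tags = ["B-protein", "I-protein", "O", "O"]
print(round(f_score(gold_tags, pred_tags), 3))
```

In this example precision is 1.0 and recall is 0.5, so a missed entity lowers the score even though every predicted entity is correct.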

Keywords

Artificial intelligence, Biomedical Ontologies and Text Mining, Information technology, Mathematical analysis, Biomedical Ontologies, Artificial Intelligence, Biochemistry, Genetics and Molecular Biology, FOS: Mathematics, Molecular Biology, Multilingual Neural Machine Translation, Natural Language Processing, Semantic Web, Domain (mathematical analysis), Natural language processing, Life Sciences, Biomedical Literature, QA75.5-76.95, Statistical Machine Translation and Natural Language Processing, T58.5-58.64, Named Entity Recognition, Computer science, Electronic computers. Computer science, Computer Science, Physical Sciences, Mathematics

Impact by BIP!

- selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.