ZENODO
Dataset . 2022
License: CC BY
Data sources: Datacite
Data sources: ZENODO

Biomedical Spanish CBOW Word Embeddings in Floret

Authors: Llop Palao, Joan

Abstract

The embeddings were trained on a biomedical Spanish corpus using floret with the following hyperparameters:

    mode: str = "floret"
    model: str = "cbow"
    dim: int = 300
    mincount: int = 10
    minn: int = 5
    maxn: int = 6
    neg: int = 10
    hashcount: int = 2
    bucket: int = 50000
    thread: int = 128

The training data is the concatenation of all corpora in the Spanish biomedical corpus, which combines Spanish data from various sources for a total of 1.1B tokens across 2.5M documents.

    Source                        No. tokens
    Medical crawler              903,558,136
    Clinical cases misc.         102,855,267
    EHRs documents*               95,267,204
    Scielo                        60,007,289
    BARR2 Background              24,516,442
    Wikipedia (Life Sciences)     13,890,501
    Patents                       13,463,387
    EMEA                           5,377,448
    Mespen (MedlinePlus)           4,166,077
    PubMed                         1,858,966

More information about the corpus can be found at https://aclanthology.org/2022.bionlp-1.19/ and https://arxiv.org/abs/2109.07765.

Training ran on an HPC node equipped with an AMD EPYC 7742 (@ 2.250GHz) processor with 128 threads.

How to use

First initialize the spaCy vectors from the floret table (.floret file):

    spacy init vectors es floret_embeddings_bio_es.floret floret_embeddings_bio_es --mode floret

Then load the vectors and compare words:

    import spacy

    # Load the floret vectors
    floret_embeddings = spacy.load("floret_embeddings_bio_es")

    # Get the embeddings of some words
    diabetes = floret_embeddings.vocab["diabetes"]
    insulina = floret_embeddings.vocab["insulina"]
    radiografia = floret_embeddings.vocab["radiografia"]

    # Get some similarities
    print(diabetes.similarity(insulina))
    print(diabetes.similarity(radiografia))
    # diabetes should be more similar to insulina than to radiografia

Intended Uses and Limitations

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased, since the corpora were collected by crawling multiple web sources. We intend to conduct research in these areas in the future, and if completed, this card will be updated.

Authors

The Text Mining Unit from Barcelona Supercomputing Center.

Contact Information

For further information, send an email to plantl-gob-es@bsc.es

Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Copyright

Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial
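
As a rough illustration of what the subword settings in the hyperparameter list (minn 5, maxn 6, hashcount 2, bucket 50000) imply, the sketch below enumerates the character n-grams of a word and maps each one to rows of a hash table. It is a simplified sketch under assumptions, not floret's implementation: the function names char_ngrams and bucket_rows are hypothetical, the < and > boundary markers follow the fastText convention, and the SHA-256 based hash is a stand-in for floret's own hash function, so the row indices will not match the rows in the released .floret table.

```python
import hashlib

# Values taken from the hyperparameter list in the abstract.
MINN, MAXN = 5, 6
HASH_COUNT, BUCKET = 2, 50000


def char_ngrams(word, minn=MINN, maxn=MAXN):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def bucket_rows(entry, hash_count=HASH_COUNT, bucket=BUCKET):
    """Map an entry to hash_count table rows (stand-in hash, NOT floret's)."""
    rows = []
    for seed in range(hash_count):
        digest = hashlib.sha256(f"{seed}:{entry}".encode("utf-8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % bucket)
    return rows


grams = char_ngrams("diabetes")
for gram in grams:
    print(gram, bucket_rows(gram))
```

Because every word is broken into n-grams that share a fixed-size table, a word that never appeared in the training corpus can still receive a vector from its n-grams, which is the practical reason for choosing the floret mode over a plain word-level table.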

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan-TL).

Keywords

subword, floret, spanish, embeddings
