PLoS ONE
Article
License: cc-by
Data sources: UnpayWall
Europe PubMed Central
Article . 2011
Data sources: PubMed Central
PLoS ONE
Article . 2011
Data sources: DOAJ-Articles

Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Kevin W. Boyack; David Newman; Russell J. Duhon; Richard Klavans; Michael Patek; Joseph R. Biberstine; Bob J. A. Schijvenaars; +3 Authors

Abstract

Background We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.
Methodology We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
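As an illustration of the first technique named above, the following is a minimal sketch of tf-idf cosine similarity followed by top-n filtering. It is not the authors' implementation: the function names, the tf × log(N/df) weighting, the toy documents and the value of n are all assumptions chosen for brevity, and a corpus of millions of records would need sparse matrices and an inverted index rather than the all-pairs loop used here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similarities(docs, n):
    """For each document index, keep only its n most similar other documents."""
    vectors = tfidf_vectors(docs)
    result = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by document index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in scored[:n]]
    return result


# Hypothetical toy corpus standing in for title-and-abstract words.
docs = [
    "gene expression in tumor cells".split(),
    "tumor gene mutation and expression".split(),
    "clinical trial of asthma inhaler".split(),
]
neighbors = top_n_similarities(docs, n=1)
```

In this toy corpus the first two documents share three terms and become each other's single retained neighbor, while the third shares no terms with either and has similarity zero to both. The filtered lists are what a clustering step would then consume.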
Conclusions PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
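The coherence ranking rests on the Jensen-Shannon divergence between word distributions. The sketch below shows the divergence itself (base-2 logarithm, so values lie between 0 and 1) and one plausible way to turn it into a per-cluster score, namely the mean divergence of each document from the pooled word distribution of its cluster. The helper names and this aggregation are assumptions for illustration; the abstract does not give the exact coherence formula used in the study.

```python
import math
from collections import Counter


def word_distribution(tokens):
    """Relative word frequencies of a token list as a dict: term -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; q must cover p's support."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_divergence_to_cluster(cluster_docs):
    """Assumed cluster score: average JSD of each document from the pooled cluster text.

    Lower values mean the documents look more like their cluster as a whole.
    """
    pooled = word_distribution([t for doc in cluster_docs for t in doc])
    scores = [jensen_shannon(word_distribution(doc), pooled) for doc in cluster_docs]
    return sum(scores) / len(scores)


# Hypothetical clusters: one textually tight, one mixing unrelated topics.
tight = [["gene", "tumor"], ["gene", "tumor"]]
mixed = [["gene", "tumor"], ["asthma", "inhaler"]]
tight_score = mean_divergence_to_cluster(tight)
mixed_score = mean_divergence_to_cluster(mixed)
```

Under this scoring the tight cluster gets a divergence of zero and the mixed cluster a strictly positive one, which matches the intuition that a coherent cluster is one whose documents share vocabulary.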

Country
United States
Subjects by Vocabulary

Microsoft Academic Graph classification: Set (abstract data type), Computer science, Latent semantic analysis, Similarity (network science), Topic model, Similarity matrix, Graph, Information retrieval, Cosine similarity, Subject (documents), Cluster analysis, Language model, Bioinformatics

Library of Congress Subject Headings: lcsh:Medicine, lcsh:R, lcsh:Science, lcsh:Q

Keywords

Research Article, Computer Science, Information Technology, Databases, Numerical Analysis, Text Mining, Science Policy, Research Assessment, Bibliometrics, Social and Behavioral Sciences, Information Science, Information Storage and Retrieval, Biomedical Research, Cluster Analysis, Documentation, Periodicals as Topic, Multidisciplinary, Life Sciences, Medicine and Health Sciences, latent semantic analysis, information-retrieval, science, search, decomposition, models, graph

  • Impact by BIP!
    citations: 197
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Substantial
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Average
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Average
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
Related to Research communities
Science and Innovation Policy Studies