
Finding data for secondary analysis is a challenge. While searching repositories and data catalogs can yield positive results, some researchers begin by consulting the literature and then trace references to datasets. Yet the process of locating data used in a publication can be arduous. The location and format of data citations are heterogeneous: data from an article may be described in a Data Availability Statement, in the References, in the Acknowledgments, or in the text itself. To address the challenges of locating data citations in publications and the corresponding data in repositories, we developed the Data Gatherer. The Data Gatherer is designed to extract dataset references from scientific articles, taking as input either an article's URL or the path to a local PDF file. It employs Large Language Models (LLMs) to identify references, extract information about the datasets, and reconstruct links to their dataset pages or dataset files. This dataset retrieval can be carried out through two main strategies: Full-Document Read (FDR) and Retrieve-Then-Read (RTR). Developing and testing the Data Gatherer required the creation of two benchmark datasets: one manually curated by a librarian and the other reverse-engineered from metadata exports from the ProteomeCentral and Gene Expression Omnibus (GEO) repositories. Additional performance evaluations were carried out using data from an institutional data catalog and a financial natural language processing dataset. A user interface was developed and refined with feedback from informatics students at a medical school. Future work includes improving performance across disciplines and investigating ways to integrate with other data and citation management tools. The Data Gatherer represents an attempt to address long-standing policy issues in research, including encouraging proper data citation and standardizing data citation practices. As we work to address these policy issues, tools like the Data Gatherer can help fill the current gap in making data more findable for reuse.
