In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration

Keywords: FOS: Computer and information sciences, Computer Science - Databases, Databases (cs.DB)

Jiajie Fu; Haitong Tang; Arijit Khan; Sharad Mehrotra; Xiangyu Ke; Yunjun Gao

arXiv.org e-Print Archive: Preprint, 2025. Data sources: arXiv.org e-Print Archive

Proceedings of the ACM on Management of Data: Article, 2025, Peer-reviewed. License: ACM Copyright Policies. Data sources: Crossref

https://dx.doi.org/10.48550/ar...: Article, 2025. License: CC BY. Data sources: Datacite

DBLP: Article. Data sources: DBLP

Publication: Article, Preprint, 22 Sep 2025
Embargo end date: 01 Jan 2025
Language: English
Publisher: Association for Computing Machinery (ACM)
Journal: Proceedings of the ACM on Management of Data, volume 3, pages 1-28 (eissn: 2836-6573)
Funded by: NNF | unidentified

doi: 10.1145/3749170, 10.48550/arxiv.2506.02509

arXiv: 2506.02509

Abstract

Entity Resolution (ER) is a fundamental data quality improvement task that identifies and links records referring to the same real-world entity. Traditional ER approaches often rely on pairwise comparisons, which can be costly regarding both time and monetary resources, especially when large datasets are involved. Recently, Large Language Models (LLMs) have demonstrated promising results in ER tasks. Still, existing methods typically focus on pairwise matching, missing the potential of LLMs to directly perform clustering in a more cost-effective and scalable manner. In this paper, we propose a novel in-context clustering approach for ER, where LLMs are used to cluster records directly, reducing both time complexity and monetary costs. We systematically investigate the design space for in-context clustering, analyzing the impact of factors such as set size, diversity, variation, and ordering of records on clustering performance. Based on these insights, we develop LLM-CER (LLM-powered Clustering-based ER) that obtains high-quality ER results while minimizing LLM API calls. Our approach addresses key challenges, including efficient cluster merging and LLM's hallucination, providing a scalable and effective solution for ER. Extensive experiments on nine real-world datasets demonstrate that our method significantly improves result quality, achieving up to 150% higher accuracy, 10% increase in the FP-measure, and reducing API calls by up to 5X, while maintaining a comparable monetary cost to the most cost-effective baseline.
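
The general idea in the abstract, clustering whole record sets per call instead of comparing pairs, can be illustrated with a minimal sketch. This is not the authors' LLM-CER algorithm: every name below (`toy_clusterer`, `repair`, `in_context_cluster`, `set_size`) is hypothetical, the LLM call is replaced by a deterministic stand-in, and the merge step assumes that one representative per cluster fits in a single call.

```python
from typing import Callable, Dict, List, Tuple

# A clusterer takes a list of records and returns groups of local indices.
# In practice this would wrap an LLM prompt; here it is an assumption.
Clusterer = Callable[[List[str]], List[List[int]]]


def toy_clusterer(records: List[str]) -> List[List[int]]:
    """Hypothetical stand-in for an LLM call: group by normalized text."""
    groups: Dict[str, List[int]] = {}
    for i, rec in enumerate(records):
        key = "".join(ch for ch in rec.lower() if ch.isalnum())
        groups.setdefault(key, []).append(i)
    return list(groups.values())


def repair(clusters: List[List[int]], n: int) -> List[List[int]]:
    """Guard against malformed output: drop out-of-range or repeated
    indices, and give every omitted record its own singleton cluster."""
    seen = set()
    fixed: List[List[int]] = []
    for group in clusters:
        kept = []
        for i in group:
            if isinstance(i, int) and 0 <= i < n and i not in seen:
                seen.add(i)
                kept.append(i)
        if kept:
            fixed.append(kept)
    fixed.extend([i] for i in range(n) if i not in seen)
    return fixed


def in_context_cluster(
    records: List[str], cluster_fn: Clusterer, set_size: int = 4
) -> Tuple[List[List[int]], int]:
    """Cluster records set by set, then merge clusters across sets.
    Returns (clusters of global record indices, number of calls made)."""
    calls = 0
    clusters: List[List[int]] = []
    # Step 1: one call per record set of at most `set_size` records.
    for start in range(0, len(records), set_size):
        batch = records[start:start + set_size]
        local = repair(cluster_fn(batch), len(batch))
        calls += 1
        clusters.extend([start + i for i in group] for group in local)
    # Step 2: one extra call over one representative per cluster
    # (simplifying assumption: all representatives fit in one call).
    reps = [records[c[0]] for c in clusters]
    merged = [
        sorted(i for k in group for i in clusters[k])
        for group in repair(cluster_fn(reps), len(reps))
    ]
    calls += 1
    return merged, calls


if __name__ == "__main__":
    data = ["Apple Inc.", "apple inc", "IBM", "I.B.M.",
            "Apple, Inc.", "ibm", "Dell"]
    print(in_context_cluster(data, toy_clusterer, set_size=3))
```

Under these assumptions the sketch issues one call per record set plus one merge call, whereas exhaustive pairwise matching over n records needs n(n-1)/2 comparisons. The `repair` step is one simple way to keep a hallucinated or incomplete clustering from losing or duplicating records.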

Related Organizations

Aalborg University, Denmark
Zhejiang Ocean University, China (People's Republic of)
University System of Ohio, United States
University of California, Irvine, United States
Bowling Green State University, United States

Impact by BIP!

selected citations: 1 (citations derived from selected sources; an alternative to the "influence" indicator)
popularity: Average (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)

Funded by

NNF | unidentified

Related to Research communities

UArctic