AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

Keywords: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Databases, Computer Science - Artificial Intelligence, Databases (cs.DB), Computation and Language (cs.CL)

Zhang, Zeyu; Groth, Paul; Calixto, Iacer; Schelter, Sebastian


arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

DBLP

Article

Data sources: DBLP

Pure Amsterdam UMC

Article . 2024

Data sources: Pure Amsterdam UMC


Publication: Article, Preprint . 01 Jan 2024
Embargo end date: 01 Jan 2024
Country: Netherlands
Publisher: arXiv
Journal: CoRR, volume abs/2409.04073

Authors: Zhang, Zeyu; Groth, Paul; Calixto, Iacer; Schelter, Sebastian

doi: 10.48550/arxiv.2409.04073

arXiv: 2409.04073


Abstract

Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
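
The abstract names the data selection steps but does not spell them out. The following is a minimal Python sketch of two of them, attribute-level example generation and label-imbalance control, under stated assumptions. All names (`serialize`, `attribute_level_examples`, `balance_labels`, `max_neg_per_pos`), the `name: value` serialization format, and the rule used to label attribute pairs are hypothetical illustrations, not the authors' implementation. The AutoML-based difficulty filter is left out because the abstract gives no detail on it.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]
# (left record, right record, label); label 1 = match, 0 = non-match
Pair = Tuple[Record, Record, int]


def serialize(record: Record) -> str:
    """Flatten a record into one string (hypothetical 'name: value' format)."""
    return ", ".join(f"{name}: {value}" for name, value in record.items())


def attribute_level_examples(pairs: List[Pair]) -> List[Pair]:
    """Derive single-attribute examples from labelled record pairs.

    Assumed labelling rule (not taken from the paper): every shared
    attribute of a matching pair becomes a positive example; a shared
    attribute of a non-matching pair becomes a negative example only
    when its two values differ, to avoid obviously wrong negatives.
    """
    examples: List[Pair] = []
    for left, right, label in pairs:
        for attr in left:
            if attr not in right:
                continue
            lv, rv = left[attr], right[attr]
            differs = lv.strip().lower() != rv.strip().lower()
            if label == 1 or differs:
                examples.append(({attr: lv}, {attr: rv}, label))
    return examples


def balance_labels(pairs: List[Pair], max_neg_per_pos: float, seed: int = 0) -> List[Pair]:
    """Downsample negatives to at most max_neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    keep = min(len(negatives), int(max_neg_per_pos * len(positives)))
    return positives + rng.sample(negatives, keep)
```

In this sketch, the balanced record-level pairs and the derived attribute-level pairs would each be passed through `serialize` before fine-tuning. The ratio `max_neg_per_pos` is a free parameter here, since the abstract does not state a target ratio.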

12 pages excluding references, 3 figures, and 5 tables

Country

Netherlands


Related research products (3)

- rein-benchmark software on GitHub (IsRelatedTo)
- mathGPT software on GitHub (IsRelatedTo)
- anymatch software on GitHub (IsRelatedTo)

Impact by BIP!

- selected citations: 0
- popularity: Average
- influence: Average
- impulse: Average


Related to Research communities

Netherlands Research Portal
