
We present a record linkage solution that scales to big data volumes and velocities. Our method trains a Siamese deep learning network to encode records so that matching can be done quickly and in a distributed fashion. Compared with current state-of-the-art methods that use similarity functions and blocking, our solution links 100x the data in roughly 60% of the time, with comparable precision and recall. We detail the design, training, and implementation of our method, and report model and runtime performance on a large US physician database and on streaming data, using an implementation built on Keras/TensorFlow and Spark.
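To make the encode-then-compare idea concrete, the following is a minimal sketch and not the implementation described here. It assumes a stand-in encoder (hashed character trigrams) in place of the trained Siamese network, and the names `encode`, `cosine`, `link`, and `DIM`, as well as the sample records, are hypothetical. The point it illustrates is only that once each record is mapped to a fixed-length vector, matching reduces to a cheap vector comparison that can be run independently per record.

```python
import hashlib
import math

DIM = 256  # embedding size; an arbitrary choice for this illustration


def encode(record, dim=DIM):
    """Stand-in encoder (assumption): hashed character trigrams, L2-normalised.

    A trained Siamese network would replace this function; only the
    fixed-length vector output matters for the matching step below.
    """
    text = "  " + record.lower().strip() + "  "
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        # Stable hash so the same trigram always lands in the same bucket.
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def link(query, candidates, threshold=0.5):
    """Return (best_candidate_or_None, best_score) for one query record."""
    q = encode(query)
    best, best_score = None, -1.0
    for cand in candidates:
        score = cosine(q, encode(cand))
        if score > best_score:
            best, best_score = cand, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


if __name__ == "__main__":
    # Hypothetical records, invented for illustration only.
    candidates = [
        "John Smith, Cardiology, Boston, MA",
        "Maria Gonzalez, Pediatrics, Austin, TX",
    ]
    match, score = link("Dr. John A. Smith, Cardiology, Boston MA", candidates)
    print(match, round(score, 3))
```

Because each record is encoded independently of the others, the encoding step can be applied per partition in a distributed engine, and the threshold (here an arbitrary 0.5) trades precision against recall.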
