PyCon Nigeria Annual Conference

A Tale of Three Fuzzy Matchers


David Mertz

David is Principal Engineer for Service Employees International Union. He created the data science training program for Anaconda Inc., among other companies. With the advent of deep neural networks, he has turned to training our robot overlords as well. Earlier, David helped build the world's fastest highly specialized supercomputer for performing molecular dynamics. David was a Director of the PSF for six years, and remains co-chair of its Trademarks Committee and of its Scientific Python Working Group. His columns, Charming Python and XML Matters, written in the 2000s, were the most widely read articles in the Python world. He has written books for Manning, Packt, O'Reilly, and Addison-Wesley, and has given keynote addresses at numerous international programming conferences.

Description

Heuristic matching, accompanied by scoring of candidate matches, is frequently needed to identify corresponding records within and across datasets. This talk looks at several heuristics, implemented with several algorithms, each with different performance characteristics and use cases, all developed for the same software system.

Abstract

Within a software system I developed for Service Employees International Union, we encountered three distinct contexts in which scored, heuristic matching was needed. The implementation of each approach illustrates trade-offs that similar systems often face.

In one context, the requirement was to quickly identify records which are "probably the same entity (person)" as one seen in a prior dataset. A single dataset might contain on the order of 100,000 records, but the historical data describes a few million entities. For this operation, it was essential to perform the record correlation quickly (i.e., at several thousand records evaluated per second), and it was desirable that any stored data omit all PII (personally identifiable information).
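The abstract does not specify the mechanism, but one common way to satisfy both constraints (high throughput, no stored PII) is to persist only cryptographic digests of normalized identity keys. The following minimal sketch assumes that approach; the identity_digest helper, the chosen fields, and the normalization are illustrative, not the system's actual design.

    import hashlib

    def identity_digest(first: str, last: str, dob: str) -> str:
        # Normalize, then reduce the identity key to a digest, so only
        # hashes (never raw PII) need to be persisted.
        key = "|".join(part.strip().lower() for part in (first, last, dob))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    # Digests of entities seen in prior datasets; no PII is stored.
    seen = {identity_digest("Ada", "Lovelace", "1815-12-10")}

    def probably_same_entity(first: str, last: str, dob: str) -> bool:
        # A set-membership test is O(1), so a throughput of several
        # thousand records per second is comfortably achievable.
        return identity_digest(first, last, dob) in seen

    print(probably_same_entity("ada ", "LOVELACE", "1815-12-10"))  # True

Exact digests alone are not fuzzy, of course; a real system would presumably hash several normalized key variants per record (say, first initial plus last name plus birth date) so that small differences still surface a candidate match.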

In another context, we wished to identify likely typos or spelling variations among company names. The number of underlying companies is merely in the thousands, not the hundreds of thousands or millions, and the speed requirement was less demanding. This permitted an algorithm that uses string similarity metrics directly, with scoring based partially on concordance of other fields.
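As a rough illustration of blending a string similarity metric with concordance of other fields, the sketch below uses the open-source RapidFuzz library; the field names, weights, and 70/30 blend are hypothetical stand-ins rather than the values the system actually uses.

    from rapidfuzz import fuzz, utils

    def company_match_score(a: dict, b: dict) -> float:
        # Token-sorted similarity (0-100) tolerates reordered words;
        # default_process lowercases and strips punctuation.
        name_score = fuzz.token_sort_ratio(
            a["name"], b["name"], processor=utils.default_process)
        # Concordance: the fraction of auxiliary fields that agree exactly.
        fields = ("city", "state", "zip")
        concordance = 100 * sum(a.get(f) == b.get(f) for f in fields) / len(fields)
        # Hypothetical blend: name similarity dominates, concordance nudges.
        return 0.7 * name_score + 0.3 * concordance

    a = {"name": "Acme Widgets Inc", "city": "Lagos", "state": "LA", "zip": "100001"}
    b = {"name": "ACME Widgets, Inc.", "city": "Lagos", "state": "LA", "zip": "100001"}
    print(round(company_match_score(a, b), 1))

With only thousands of companies, scoring all pairs this way (a few million comparisons) remains tractable, which is exactly the latitude the relaxed speed requirement provides.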

In a third context, we had access to a large external data source with information about nearly all United States voters (and many non-voters), amounting to 380 million person records (some of which share an identifier). The external source is largely canonical, but internal data commonly differs from it in nicknames, changed last names, changes or spelling variants in addresses, and other fields. Again the goal was to produce scored matches identifying "the person most likely referenced by the internal data" as efficiently as possible.
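One ingredient such scoring typically needs is canonicalization of known variants before comparison, nicknames especially. The sketch below shows the shape of that idea; the tiny nickname table, the weights, and the helper names are illustrative assumptions, not the system's real components.

    from rapidfuzz import fuzz

    # A toy nickname table; production lookup lists run to thousands
    # of entries.
    NICKNAMES = {"liz": "elizabeth", "bill": "william", "bob": "robert"}

    def canonical_first(name: str) -> str:
        name = name.strip().lower()
        return NICKNAMES.get(name, name)

    def score_candidate(internal: dict, external: dict) -> float:
        # Weights are illustrative; a real system would tune them.
        same_first = (canonical_first(internal["first"])
                      == canonical_first(external["first"]))
        first = 100.0 if same_first else 0.0
        last = fuzz.ratio(internal["last"].lower(), external["last"].lower())
        # token_set_ratio forgives extra or reordered address tokens.
        addr = fuzz.token_set_ratio(internal["address"].lower(),
                                    external["address"].lower())
        return 0.35 * first + 0.35 * last + 0.30 * addr

    internal = {"first": "Liz", "last": "Okafor", "address": "12 Marina Rd Apt 4"}
    candidates = [
        {"first": "Elizabeth", "last": "Okafor", "address": "12 Marina Road, Apt 4"},
        {"first": "Lisa", "last": "Okeke", "address": "98 Broad St"},
    ]
    best = max(candidates, key=lambda c: score_candidate(internal, c))
    print(best["first"], best["last"])  # Elizabeth Okafor

At the scale of 380 million external records, scoring like this would only ever run against a small candidate set retrieved by blocking (for example, on a last-name digest or postal code), never against the full file.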

Audience level: Intermediate or Advanced