Overview

ChoiceMaker software finds duplicate records in a database.  The figure below depicts how ChoiceMaker software works, by showing three examples of how ChoiceMaker might compare two records. 

In Example 1, the information in the two records is very similar, although not identical. The degree of similarity is high enough that ChoiceMaker software assigns the records a match probability of 93 percent, and on that basis, a customer application merges the records to eliminate the duplication.

In Example 2, there is little similarity between the records, so ChoiceMaker assigns a low match probability of just 12 percent, and the records are kept as distinct entries in the database.

Finally, in Example 3, it is not clear whether the records represent the same person, particularly if there's a chance that some of the data might be erroneous or out-of-date.  ChoiceMaker assigns an intermediate match probability to such a pair, so that the customer application can hold it aside for human review.

Figure: ChoiceMaker Record Matching Process
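
The way a customer application might act on the three examples can be sketched as a simple routing function. This is only an illustration: the function name triage and both threshold values are hypothetical, and the 55 percent score used for the intermediate case is an assumed figure, since the text gives no number for Example 3.

```python
def triage(match_probability, differ_below=0.20, match_above=0.80):
    """Route a record pair according to its match probability.

    Both thresholds are hypothetical; a real application would tune them
    to its own tolerance for false merges and missed duplicates.
    """
    if not 0.0 <= match_probability <= 1.0:
        raise ValueError("match_probability must be between 0 and 1")
    if match_probability >= match_above:
        return "merge"            # Example 1: likely duplicates
    if match_probability <= differ_below:
        return "keep separate"    # Example 2: likely distinct records
    return "human review"         # Example 3: too close to call


print(triage(0.93))  # Example 1
print(triage(0.12))  # Example 2
print(triage(0.55))  # Example 3 (assumed score)
```

Keeping the two thresholds as parameters lets an application widen or narrow the band of pairs that are held for review.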

Why

Consider a registry of children that is maintained by a state or local government.  Such a registry might be an immunization database that tracks inoculations or a student database that tracks academic records across school districts.

In the case of an immunization registry, it is important to avoid duplicate records for children so that each child receives a complete set of inoculations.  In the case of a state-wide student registry, it is important to avoid duplicates so that the state can comply with the mandates of the federal No Child Left Behind Act.

How

It can be difficult to find duplicates in a database.  Again, consider the case of children.  Social security numbers for children, particularly for young children, tend to be unreliable.  Other important information about a child -- such as the child's first, middle or last name, or the child's home address or phone number -- can change over time.  Even information that shouldn't change -- such as the child's birth date -- may be missing or unreliable.  In fact, any data entered for a child may contain errors, non-standard spellings or placeholder values.

Even when errors and incomplete or ambiguous data are present, human beings usually have a reliable intuition for whether two database records match each other.  That's because human beings can consider information in context, and factor out unimportant differences or similarities between records.  In the case of records about people, humans recognize that "Jim" is a nickname for "James"; that "Keanu Reeves" and "Reaves, Keenu" might be the same person with first and last name swapped and slightly misspelled; and that two persons named "Maria Garcia" with a common birthdate of "1/1/2001" might not be a particularly strong match to each other, because the names are common and the birthdates look like approximations.

ChoiceMaker software works by mimicking human intuition.  It does this in a straightforward way.  The software compares two records at a time.  For each field in the pair of records, the software applies simple true or false tests, called clues, to check whether the field values point toward a match.  For example, in the case of first names, the following tests might be applied:

Match clues:
  • Do the values of the two records match each other exactly?
  • Do the values match each other phonetically?
  • Do the values match each other if a few letters are transposed?
  • Are the values nicknames for each other?
Differ clues:
  • Are the values completely different?
Hold clues (insufficient information):
  • Is one (or both) of the values missing, invalid or a placeholder?
A collection of such clues is applied together as part of a single model for whether records match.  In a realistic, high-accuracy model, there might be hundreds of such clues for dozens of field values.
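
The first-name clues listed above can be sketched as small true-or-false functions. This is a minimal illustration, not ChoiceMaker's own clue definitions: every function name, the tiny nickname table and the placeholder list are hypothetical. The phonetic clue is left out for brevity, the transposition clue checks only a single swap of adjacent letters, and "completely different" is simplified to mean that no match clue fired.

```python
# Hypothetical lookup data; a real model would use much larger tables.
NICKNAMES = {"jim": "james", "jimmy": "james", "bill": "william"}
PLACEHOLDERS = {"", "unknown", "baby", "n/a"}


def is_missing(value):
    """Hold clue: the value is absent or a known placeholder."""
    return value is None or value.strip().lower() in PLACEHOLDERS


def canonical(name):
    """Map a nickname to its formal name; other names pass through."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)


def exact_match(a, b):
    return a.strip().lower() == b.strip().lower()


def nickname_match(a, b):
    return not exact_match(a, b) and canonical(a) == canonical(b)


def transposed_match(a, b):
    """True if swapping one adjacent pair of letters turns a into b."""
    a, b = a.strip().lower(), b.strip().lower()
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def first_name_clues(a, b):
    """Evaluate every first-name clue and return name -> True/False."""
    if is_missing(a) or is_missing(b):
        return {"exact": False, "nickname": False, "transposed": False,
                "different": False, "missing": True}
    exact = exact_match(a, b)
    nickname = nickname_match(a, b)
    transposed = transposed_match(a, b)
    return {
        "exact": exact,
        "nickname": nickname,
        "transposed": transposed,
        # Simplification: "different" means no match clue fired.
        "different": not (exact or nickname or transposed),
        "missing": False,
    }


print(first_name_clues("Jim", "James"))
print(first_name_clues("Brian", "Brain"))
print(first_name_clues("Maria", None))
```

Each clue stays deliberately simple; the judgment about how much a clue matters is left to the weights described next.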

After all the clues are evaluated for a pair of records, ChoiceMaker combines the results into an overall probability score.  It does this by assigning each clue a numerical weight that indicates its relative significance in the probability calculation.
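
One way to picture the weighted combination is a log-linear score, sketched below. This is an assumption made for illustration and not a description of ChoiceMaker's actual calculation: the clue names, the weights and the function match_probability are all hypothetical, and hold clues are ignored.

```python
import math

# Hypothetical clues: name -> (decision the clue points toward, weight).
WEIGHTS = {
    "first_name_exact": ("match", 2.0),
    "last_name_exact": ("match", 1.5),
    "birth_date_differs": ("differ", 2.5),
}


def match_probability(fired_clues, weights=WEIGHTS):
    """Combine the clues that fired into a probability between 0 and 1."""
    match_total = 0.0
    differ_total = 0.0
    for name in fired_clues:
        decision, weight = weights[name]
        if decision == "match":
            match_total += weight
        elif decision == "differ":
            differ_total += weight
    # Same as exp(match_total) / (exp(match_total) + exp(differ_total)).
    return 1.0 / (1.0 + math.exp(differ_total - match_total))


print(match_probability(["first_name_exact", "last_name_exact"]))
print(match_probability(["birth_date_differs"]))
print(match_probability([]))
```

In this form a clue with a larger weight moves the probability further, and a pair for which no clue fires lands at exactly one half.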

Machine learning

The magic of ChoiceMaker software is in the weights that it assigns to clues.  ChoiceMaker uses a patented method to compute clue weights so that ChoiceMaker probabilities mimic human intuition.  Data experts are asked to review a set of record pairs, and for each pair, they are asked to mark down one of three decisions:
  • Do the two records match each other (a match decision)?
  • Are the two records different from each other (a differ decision)?
  • Is there not enough information to decide (a hold decision)?
Using a machine-learning algorithm such as maximum entropy, ChoiceMaker software determines the clue weights that best reproduce the experts' decisions.  This process is called training a model.  When a trained model is subsequently applied to completely different pairs, one finds that ChoiceMaker probabilities closely predict how a data expert would mark the new pairs.
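
The idea of training can be illustrated with a small stand-in: a logistic-regression fit by gradient ascent, which is the two-class form of a maximum entropy model. It is not ChoiceMaker's patented method. The clue names, the four labeled pairs and the functions train and predict are hypothetical; weights here are signed (positive points toward a match, negative toward a differ), and pairs marked hold are simply left out.

```python
import math


def predict(weights, fired):
    """Match probability for a pair, given the set of clues that fired."""
    score = sum(weights[c] for c in fired)
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, clue_names, epochs=500, rate=0.5):
    """Fit clue weights to expert decisions.

    examples: list of (set of fired clue names, label), where label 1
    means the expert marked a match and 0 means a differ.
    """
    weights = {name: 0.0 for name in clue_names}
    for _ in range(epochs):
        for fired, label in examples:
            error = label - predict(weights, fired)
            # Nudge each clue that fired toward the expert's decision.
            for c in fired:
                weights[c] += rate * error
    return weights


# Hypothetical expert-marked pairs.
examples = [
    ({"name_exact", "dob_exact"}, 1),
    ({"name_exact"}, 1),
    ({"name_differs", "dob_exact"}, 0),
    ({"name_differs"}, 0),
]
clues = ["name_exact", "name_differs", "dob_exact"]
weights = train(examples, clues)

print(weights)
print(predict(weights, {"name_exact"}))
print(predict(weights, {"name_differs"}))
```

In this toy data the birth date clue fires for both a match and a differ, so it carries little information and its learned weight stays small compared with the two name clues.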

Generality

Registries for children are not unique in the difficulties they face with duplicate records.  Similar issues can arise anywhere that database records need to be correlated with real world people, businesses, places and things.  ChoiceMaker software can help with almost any collection of structured data.  If human beings can tell duplicate records apart from distinct ones, ChoiceMaker software can be trained to mimic their decisions.