Overview |
||||||
ChoiceMaker software finds duplicate records
in a database. The figure below depicts
how ChoiceMaker software works by showing three examples
of how ChoiceMaker might compare two records. In Example 1, the information in the records is very similar, although not identical. The degree of similarity is high enough that ChoiceMaker software assigns the records a match probability of 93 percent, and on that basis the customer application merges the records to eliminate the duplication. In Example 2, there is little similarity between the records, so ChoiceMaker assigns a low match probability of just 12 percent, and the records are kept as distinct entries in the database. Finally, in Example 3, it is not clear whether the records represent the same person, particularly if there is a chance that some of the data is erroneous or out of date. ChoiceMaker assigns an intermediate match probability to such pairs, so that the customer application can hold them aside for human review. |
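The three outcomes described above can be sketched as a simple thresholding step in the customer application. This is only an illustration: the names `Decision` and `decide` and the two cutoff values are assumptions chosen for the example, not part of ChoiceMaker, and a real deployment would pick its own cutoffs.

```python
from enum import Enum


class Decision(Enum):
    MATCH = "match"    # merge the records
    HOLD = "hold"      # set aside for human review
    DIFFER = "differ"  # keep as distinct entries


# Hypothetical cutoffs, chosen only for illustration.
DIFFER_AT_OR_BELOW = 0.20
MATCH_AT_OR_ABOVE = 0.80


def decide(probability: float) -> Decision:
    """Map a match probability in [0, 1] to one of three outcomes."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= MATCH_AT_OR_ABOVE:
        return Decision.MATCH
    if probability <= DIFFER_AT_OR_BELOW:
        return Decision.DIFFER
    return Decision.HOLD
```

With these assumed cutoffs, the 93 percent pair from Example 1 would be merged, the 12 percent pair from Example 2 would be kept distinct, and anything in between would be held for review.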
||||||
Why |
||||||
Consider a registry of children that is
maintained by a state or local government. Such a
registry might be an immunization database that tracks
inoculations or a student database that tracks academic
records across school districts. In the case of an immunization registry, it is important to avoid duplicate records for children so that each child receives a complete set of inoculations. In the case of a state-wide student registry, it is important to avoid duplicates so that the state can comply with the mandates of the federal No Child Left Behind Act. |
||||||
How |
||||||
It can be difficult to find duplicates in a database.
Again, consider the case of children. Social security
numbers for children, particularly for young children, tend
to be unreliable. Other important information about a
child -- such as the child's first, middle or last name, or
the child's home address or phone number -- can change over
time. Even information that shouldn't change -- such as
the child's birth date -- may be missing or
unreliable. In fact, any data entered for a child may
contain errors, non-standard spellings or placeholder
values.
Even when errors and incomplete or ambiguous data are present, human beings usually have a reliable intuition for whether two database records match each other. That's because human beings can consider information in context, and factor out unimportant differences or similarities between records. In the case of personnel records, humans recognize that "Jim" is a common nickname for "James"; that "Keanu Reeves" and "Reaves, Keenu" might be the same person with first and last name swapped and slightly misspelled; and that two persons named "Maria Garcia" with a common birth date of "1/1/2001" might not be a particularly strong match to each other, because the name is common and the birth date looks like an approximation.
ChoiceMaker software works by mimicking human intuition. It does this in a straightforward way. The software compares two records at a time. For each field in the pair of records, the software applies simple true-or-false tests, called clues, to check whether the field values point toward a match. For example, in the case of first names, the following tests might be applied: Match clues:
After all the clues are evaluated for a pair of records, ChoiceMaker combines the results into an overall probability score. It does this by assigning each clue a numerical weight that indicates its relative significance in the probability calculation. |
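As a rough illustration of clues and weights, the sketch below evaluates a few true-or-false tests on a pair of records and combines the weights of the clues that fire into a single probability. Everything in it is an assumption made for the example: the clue functions, the nickname table, the weight values, and the use of a logistic function to turn the weighted sum into a probability. The text does not specify ChoiceMaker's actual clues or its combination formula.

```python
import math

# Hypothetical nickname table; a real system would use a much larger one.
NICKNAMES = {"jim": "james", "bill": "william", "liz": "elizabeth"}


def canonical(name: str) -> str:
    """Lower-case a first name and expand it if it is a known nickname."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)


# Each clue is a true-or-false test on a pair of records (plain dicts here).
def same_first_name(a, b):
    return a["first"].strip().lower() == b["first"].strip().lower()


def nickname_first_name(a, b):
    return not same_first_name(a, b) and canonical(a["first"]) == canonical(b["first"])


def different_first_name(a, b):
    return canonical(a["first"]) != canonical(b["first"])


def same_birth_date(a, b):
    return a.get("dob") is not None and a.get("dob") == b.get("dob")


# Illustrative weights: positive values point toward a match,
# negative values point away from one.
CLUES = [
    (same_first_name, 2.0),
    (nickname_first_name, 1.5),
    (different_first_name, -2.0),
    (same_birth_date, 2.5),
]
BIAS = -1.0  # assumed baseline score when no clue fires


def match_probability(a, b) -> float:
    """Sum the weights of the clues that fire, then squash into (0, 1)."""
    score = BIAS + sum(weight for clue, weight in CLUES if clue(a, b))
    return 1.0 / (1.0 + math.exp(-score))
```

For instance, `match_probability({"first": "Jim", "dob": "2001-03-04"}, {"first": "James", "dob": "2001-03-04"})` fires the nickname and birth-date clues and yields a high probability, while a pair with unrelated first names and different birth dates yields a low one.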
||||||
Machine learning |
||||||
The magic of ChoiceMaker software is in the weights that it
assigns to clues. ChoiceMaker uses a patented
method to compute clue weights so that ChoiceMaker probabilities
mimic human intuition. Data experts are asked to review a set of
record pairs, and for each pair, they are asked to mark
down one of three decisions:
|
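To give a feel for how weights could be derived from reviewed pairs, here is a minimal sketch that fits clue weights by plain logistic regression with gradient descent. It is not ChoiceMaker's patented method, which the text does not describe. It also assumes, purely for illustration, that each reviewed pair has been reduced to a binary label (1 for match, 0 for differ) and to a list recording which clues fired. The function name `fit_weights` and its parameters are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_weights(examples, n_clues, epochs=500, learning_rate=0.5):
    """Fit one weight per clue plus a bias from labeled record pairs.

    examples: list of (fired, label) tuples, where `fired` is a list of
    0/1 flags (one per clue) and `label` is 1 for match, 0 for differ.
    """
    weights = [0.0] * n_clues
    bias = 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = [0.0] * n_clues
        grad_b = 0.0
        for fired, label in examples:
            predicted = sigmoid(bias + sum(w * f for w, f in zip(weights, fired)))
            error = predicted - label  # gradient of the log loss w.r.t. the score
            grad_b += error
            for i, f in enumerate(fired):
                grad_w[i] += error * f
        # Step against the average gradient.
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias
```

In this sketch, a clue that fires mostly on pairs the reviewers marked as matches ends up with a positive weight, and a clue that fires mostly on pairs marked as different ends up with a negative one.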
||||||
Generality |
||||||
Registries for children are not unique in the
difficulties they face with duplicate records. Similar
issues can arise anywhere that database records need to
be correlated with real-world people, businesses, places and
things. ChoiceMaker software can help with almost any
collection of structured data. If human beings can tell
duplicate records from distinct ones, ChoiceMaker software can
be trained to mimic their decisions. |
||||||