- Remove things irrelevant to matching, e.g., font style, extra whitespace, articles (the, a, an), etc.
- Standardize different versions of the same word. One common problem is abbreviations, e.g., Township vs. Twp., Saint vs. St., etc. Common misspellings can be handled similarly. But do all this with care: St., for example, may stand for Street or Saint depending on context.
- Remove duplicates if many-to-one (or one-to-many or many-to-many) matches are not allowed. Note that our ability to detect duplicates will be limited by how dirty the data are within each dataset.
- Often, a few recurring problems account for most of the missed matches. For instance, say you are matching on country, state, and city; noise in how the countries are spelled across datasets may explain most of the misses. Fix those first.
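The cleanup steps above can be sketched as follows. This is a minimal stdlib-only sketch; the article list, abbreviation map, and helper names are illustrative assumptions, not from the text, and would need to be extended for real data.

```python
import re

# Illustrative normalization pipeline. The article list and abbreviation
# map are assumptions; extend them for your data. Beware ambiguous
# abbreviations: "st." may mean Street or Saint, so a flat map like this
# can over-expand and may need context-sensitive rules.
ARTICLES = {"the", "a", "an"}
EXPANSIONS = {"twp.": "township", "twp": "township", "st.": "saint"}

def normalize(name: str) -> str:
    """Lowercase, drop articles, expand abbreviations, collapse spaces."""
    tokens = re.split(r"\s+", name.strip().lower())
    tokens = [EXPANSIONS.get(t, t) for t in tokens if t not in ARTICLES]
    return " ".join(tokens)

def dedupe(rows):
    """Keep the first row per normalized key, so the key is unique
    (useful when many-to-one matches are not allowed)."""
    seen, out = set(), []
    for row in rows:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

print(normalize("The Grand  Twp."))           # -> "grand township"
print(dedupe(["Acme Corp", " acme  corp "]))  # -> ["Acme Corp"]
```

Note that deduplication here is only as good as the normalization: duplicates that differ in ways `normalize` does not handle will survive, which is the "limited by how dirty the data are" caveat above.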
To match the rest, the options are:
- Do a fuzzy outer join with a distance threshold large enough to capture most matches. (If one table is the natural left table, do a fuzzy left join instead.) Add a column that stores the distance, i.e., the quality of the match.
- Manually review the resulting matches.
- If manually reviewing all the rows isn't an option, code a small random sample and use it to pick the distance threshold that gives acceptable false-positive and false-negative rates.
- Often, keeping only the matches with the smallest distance is sufficient. For instance, say the joined data looks like this:
```
id, key1, key2,  dist
1,  ram,  rama,  1
2,  ram,  bama,  1
3,  ram,  drama, 2
```
Only keep rows 1 and 2 and drop row 3. You can validate this assumption with a random sample.
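A fuzzy left join that keeps only the smallest-distance match(es) per key can be sketched as below. This is a hypothetical stdlib-only implementation: it uses Levenshtein edit distance as the metric, though the `dist` column above could come from any string metric, and the function names are mine, not from the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_left_join(left, right, max_dist=2):
    """Pair each left key with right keys within max_dist, recording the
    distance; keep only the smallest-distance match(es) per left key.
    Unmatched left keys are kept with None (left-join behavior)."""
    rows = []
    for l in left:
        cands = [(r, edit_distance(l, r)) for r in right]
        cands = [(r, d) for r, d in cands if d <= max_dist]
        if not cands:
            rows.append((l, None, None))
            continue
        best = min(d for _, d in cands)
        rows.extend((l, r, d) for r, d in cands if d == best)
    return rows

# "john" and "joan" are both at distance 1, so both survive;
# "jane" (distance 2) is dropped.
print(fuzzy_left_join(["jon"], ["john", "joan", "jane"]))
```

For anything beyond small data, a library with indexing/blocking would be preferable to this all-pairs comparison, which is quadratic in the table sizes.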
If you are building software for manual review:
- Arrange candidate matches intelligently, for example, string + number first, followed by plain strings, etc. When there are more than 5 candidates, arranging them alphabetically works well as a default.
- If you have lots of data, consider crowdsourcing the review, e.g., via Mechanical Turk or a CAPTCHA-style task.
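The ordering heuristic above can be sketched as follows. The pattern rule (letter+digit tokens first) and the threshold of 5 are taken from the text; the function name and the digit-based test are my assumptions.

```python
import re

def review_order(candidates):
    """Order candidates for manual review: entries mixing letters and
    digits (e.g., street addresses) first, then plain strings; fall back
    to plain alphabetical order when there are more than 5 candidates."""
    if len(candidates) > 5:
        return sorted(candidates)
    return sorted(candidates, key=lambda s: (not bool(re.search(r"\d", s)), s))

print(review_order(["elm st", "12 elm st", "oak ave"]))
# -> ["12 elm st", "elm st", "oak ave"]
```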