Say you want to learn about the average number of potholes per unit of paved street in a city. To estimate that quantity, the following sampling plan can be employed:
1. Get all the streets in a city from Google Maps or OSM
2. Starting from one end, split each street into 0.5 km segments until you reach the other end. The last segment (or the only segment, if the street is shorter than 0.5 km) may be shorter than 0.5 km.
3. Get the lat/long of the start and end of each segment.
4. Create a database of all the segments: segment_id, street_name, start_lat, start_long, end_lat, end_long
5. Sample rows from the database
6. Produce a CSV of the sampled segments (subset of step 4)
7. Plot the sampled segments on Google Maps, filling in the full area within each segment.
8. Collect data on the highlighted segments.
For a Python package that implements this plan, see https://github.com/soodoku/geo_sampling.
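The steps above can be sketched in plain Python. The street data below (a single "Main St" polyline), the column names, and the sample size are illustrative stand-ins, not the linked geo_sampling package's actual interface; for simplicity, the sketch cuts segments at polyline vertices near the 0.5 km mark rather than at exact 0.5 km points:

```python
import csv
import math
import random

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def split_street(name, points, seg_km=0.5):
    """Walk a street polyline and emit ~seg_km segments (step 2).
    Cuts fall on polyline vertices, so the last (or only) segment
    may be shorter than seg_km."""
    segments, start, dist = [], points[0], 0.0
    for prev, cur in zip(points, points[1:]):
        dist += haversine_km(prev, cur)
        if dist >= seg_km:
            segments.append((name, start, cur))
            start, dist = cur, 0.0
    if dist > 0:  # trailing partial segment
        segments.append((name, start, points[-1]))
    return segments

# Hypothetical street: three points spaced ~0.5 km apart along a meridian.
streets = {"Main St": [(12.9700, 77.590), (12.9745, 77.590), (12.9790, 77.590)]}

# Steps 3-5: build the segment database.
rows = []
for street, pts in streets.items():
    for i, (_, s, e) in enumerate(split_street(street, pts)):
        rows.append({"segment_id": f"{street}-{i}", "street_name": street,
                     "start_lat": s[0], "start_long": s[1],
                     "end_lat": e[0], "end_long": e[1]})

# Steps 6-7: sample rows and write the CSV of sampled segments.
random.seed(1)
sample = random.sample(rows, k=min(2, len(rows)))
with open("sampled_segments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(sample)
```

In practice, step 1 would populate `streets` from OSM extracts rather than a hand-typed dictionary.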
Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing.
Some analysts simply listwise-delete observations with missing values. Others impute, replacing missing values with the mean or median. Yet others use more sophisticated imputation methods. None of these approaches, however, acknowledges in the resulting visualizations that any data were missing.
It is important to acknowledge missing data.
One way to do it is to provide, in a small table within the graph, a tally of how much data are missing on each of the variables. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple. Create a dummy vector that tallies missing values, and plot that dummy vector in addition to the data. For instance, see:
(The script to produce the graph can be downloaded from the following GitHub Gist.)
In cases where missing values are imputed, the same dummy vector can also be used to ‘color’ the points that were imputed.
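A minimal sketch of the dummy-vector idea, on hypothetical toy data in which y is more likely to be missing at large x. Observed points are plotted as a scatter, and the dummy vector drives a rug of ticks along the bottom showing where (in x) y is missing; the plotting step is optional so the dummy-vector logic stands on its own:

```python
import random

# Toy bivariate data (hypothetical): y is more likely to be missing when x > 7.
random.seed(42)
x = [random.uniform(0, 10) for _ in range(200)]
y = [None if (xi > 7 and random.random() < 0.6) else 2 * xi + random.gauss(0, 1)
     for xi in x]

# Dummy vector tallying missingness: 1 if y is missing, else 0.
missing = [1 if yi is None else 0 for yi in y]

try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    fig, ax = plt.subplots()
    ax.scatter([p[0] for p in obs], [p[1] for p in obs], s=10, label="observed")
    # Rug of ticks at the bottom marks the x-positions of missing y values.
    miss_x = [xi for xi, m in zip(x, missing) if m]
    ax.plot(miss_x, [min(p[1] for p in obs)] * len(miss_x), "|",
            color="red", label="missing y")
    ax.legend()
    fig.savefig("missing_rug.png")
except ImportError:
    pass  # matplotlib not installed; the dummy vector still summarizes missingness
```

Here the rug makes the systematic pattern visible at a glance: the red ticks cluster at high x.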
- Remove content that is irrelevant to matching: case and font style, extra spaces, articles (the, a, an), and the like.
- Standardize different versions of the same word. For example, Township and Twp. may be converted to twp, and Saint and St. to saint. Common misspellings can be handled in a similar manner. But do all this with care: St. may stand for State, in addition to Saint.
- Remove duplicates if many-to-one (or one-to-many or many-to-many) matches are not allowed. Note that our ability to detect duplicates will be limited by how dirty the data are within each dataset.
- What is to be matched may be a single column that nonetheless contains multiple identifiers (say, a string and a number). Match on those identifiers individually. This provides additional leverage when the identifiers appear in different orders across datasets. If order does not matter (string + number equals number + string), code the matching accordingly.
- If data are dirty in similar ways across datasets, use complete automation to coerce the identifiers to be the same across datasets.
- If data are dirty in different ways (misspelled in different ways, abbreviated in different ways, etc.), produce an additional column that records how close each match was. Similarity between strings can be computed using string distance measures, for example, Levenshtein distance, and judgments can be based on those distances. But establish baseline error rates using training data.
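The preprocessing and distance steps above can be sketched as follows. The substitution table and the article list are illustrative stand-ins (build your own with care, since, as noted, "st" is ambiguous); the distance function is the standard dynamic-programming Levenshtein implementation:

```python
import re

ARTICLES = {"the", "a", "an"}
# Illustrative substitution table; extend with care ("st" can mean
# Saint or State, so context matters).
SUBS = {"township": "twp", "saint": "st"}

def normalize(name):
    """Lowercase, strip punctuation, drop articles, standardize variants."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(SUBS.get(t, t) for t in tokens if t not in ARTICLES)

def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]
```

A similarity column can then be derived, for example, as `1 - levenshtein(a, b) / max(len(a), len(b))` on the normalized strings.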
One has two choices: semi-automation or complete automation. If the misclassification penalty is high and one wants as many matches as one can get, then semi-automated solutions likely provide the best route. Resources can dictate which option one chooses.
Semi-automation: show the best matches, among which people can choose.
- Produce the list of possible matches via liberal criteria, so as not to miss (m)any matches. Matches that are missed can eventually be handled manually, so there is a trade-off between the number of candidate matches shown per item and the number of items that come up without any match.
- Arrange matches intelligently, for example, string + number matches first, followed by string-only matches, and so on. When there are more than five candidate matches, arrange them alphabetically.
- If there is a lot of data, consider using Mechanical Turk or a Captcha-style task to distribute the choosing.
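The candidate-generation step can be sketched as below. It uses `difflib.SequenceMatcher` from the standard library as the similarity score (Levenshtein-based scores would work too), with a deliberately liberal threshold to keep recall high; the function ranks by score rather than by the string + number ordering described above, and the example pool names are made up:

```python
import difflib

def candidate_matches(query, pool, threshold=0.4, top_k=5):
    """Return up to top_k candidates from pool whose similarity to query
    clears a liberal threshold, best matches first, each paired with its
    score so reviewers can judge match quality."""
    scored = [(difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in pool]
    hits = sorted(((s, c) for s, c in scored if s >= threshold), reverse=True)
    return [(c, round(s, 2)) for s, c in hits[:top_k]]
```

A reviewer would then be shown, say, `candidate_matches("Saint Marys Twp", pool)` and pick the true match, with items that return no candidates routed to fully manual matching.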