Unlisted False Negatives: Are 11% of Americans Unlisted?

21 Aug

A recent study by Simon Jackman and Bradley Spahn claims that 11% of Americans are ‘unlisted.’ (The paper has since been picked up by liberal media outlets like Think Progress.)

When I first came across the paper, I thought that the number was much too high for it to have any reasonable chance of being right. My suspicions were roused further by the fact that the paper provided no bounds on the number: no note about measurement error in matching people across imperfect lists. That is a galling omission when the finding hinges on the name-matching procedure, the details of which are left to another paper. What makes it to the paper is this incredibly vague line: “ANES collects …. bolstering our confidence in the matches of respondents to the lists.” I take that to mean that the matching procedure was designed to reduce false positives. If so, some listed people will have gone unmatched (false negatives), and the estimate is merely an upper bound on the percentage of Americans who could be unlisted. That isn’t a very useful number.
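To see how much the number could move, here is a minimal sketch of the standard misclassification correction. It assumes the observed unmatched share is a mix of truly unlisted people and listed people the procedure failed to match. The error rates below are hypothetical; the paper does not report them.

```python
def corrected_unlisted_share(observed_unmatched, false_negative_rate,
                             false_positive_rate=0.0):
    """Back out the true unlisted share from the observed unmatched share.

    false_negative_rate: P(not matched | actually listed)   (assumed)
    false_positive_rate: P(matched | actually unlisted)     (assumed)

    observed = true * (1 - fp) + (1 - true) * fn, solved for `true`.
    """
    denom = 1.0 - false_positive_rate - false_negative_rate
    if denom <= 0:
        raise ValueError("error rates too high to identify the true share")
    return (observed_unmatched - false_negative_rate) / denom


# 0.11 is the paper's headline number; the false negative rates are made up.
for fn in (0.0, 0.02, 0.05):
    print(fn, round(corrected_unlisted_share(0.11, fn), 4))
```

With no false positives, any positive false negative rate pulls the estimate below 11%, which is the sense in which the headline number is an upper bound.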

But reality is a bit worse. To my questions about false positive and negative rates, Bradley Spahn responded on Twitter, “I think all of the contentious cases were decided by me. What are my decision-theoretic properties? Hard to say.” That line covers one of the most essential details of the matching procedure, a detail they say the readers can find “in a companion paper.” The primary issue is subjectivity. But not taking adequate account of the relevance of ‘decision theoretic’ properties to the results in the paper grates.

Optimal Cost Function When Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, making lending decisions optimally means minimizing the total of (number of prediction errors × cost per error) and the cost of making predictions (keeping things simple here). The cost of any one particular error, especially the denial of a loan to an eligible applicant, is typically small for the bank but very consequential for the applicant. So the applicant may be willing to pay the bank to increase the accuracy of its decisions, say, by compensating the bank for the cost of getting a person to take a closer look at the file. If customers are willing to pay the cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may well be able to increase profits.) Lending institutions typically do not exploit customers’ willingness to pay for increased accuracy. It may be well worth exploring.
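A toy version of the argument, with made-up numbers. The function names, error rates, and dollar amounts are all assumptions for illustration; only the structure (error cost plus prediction cost, with the applicant optionally covering the review) comes from the note above.

```python
def bank_expected_cost(error_rate, cost_per_error, review_cost,
                       fee_paid_by_applicant=0.0):
    """Expected cost to the bank per application."""
    return error_rate * cost_per_error + review_cost - fee_paid_by_applicant


def applicant_max_fee(error_rate_before, error_rate_after, cost_of_wrong_denial):
    """Most an eligible applicant should pay: the drop in expected loss."""
    return (error_rate_before - error_rate_after) * cost_of_wrong_denial


# Hypothetical: a manual review costs 20 and cuts wrongful denials from 10% to 4%.
no_review = bank_expected_cost(0.10, cost_per_error=50.0, review_cost=0.0)
review_bank_pays = bank_expected_cost(0.04, cost_per_error=50.0, review_cost=20.0)
review_applicant_pays = bank_expected_cost(
    0.04, cost_per_error=50.0, review_cost=20.0, fee_paid_by_applicant=20.0
)
max_fee = applicant_max_fee(0.10, 0.04, cost_of_wrong_denial=5000.0)

print(no_review, review_bank_pays, review_applicant_pays, max_fee)
```

In this setup the bank would not pay for the review on its own, because the review costs more than the errors it prevents. The applicant, for whom a wrongful denial is far costlier, would happily cover it.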

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, three years ago, the software for recreating shredded documents that won the DARPA award used “human[s] [to] verify what the computer was recommending.” The same insight is used in character recognition tasks. I have used it to create software for matching dirty data; the software was used to merge shapefiles with electoral returns at the precinct level.

The class of problems for which human input proves useful has one essential attribute — humans produce unbiased, if error-prone, estimates for these problems. So for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)

Estimating Hillary’s Missing Emails

11 Apr

Note:

55000/(365*4) ~ 37.7. That seems a touch low for Sec. of state.

Caveats:
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other state department employees

Lower bound for missing emails from Clinton:

  1. Take a small weighted random sample (weighting seniority more) of top state department employees.
  2. Go through their email accounts on the state dep. server and count # of emails from Clinton to their state dep. addresses.
  3. Compare it to # of emails to these employees from the Clinton cache.
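A minimal sketch of the comparison in steps 2 and 3, assuming we already have per-employee counts from both sources. The employee labels and counts are invented, and the sampling weights from step 1 are left out.

```python
def lower_bound_missing(server_counts, cache_counts):
    """Sum, over sampled employees, of Clinton emails found on the state dep.
    server that are not accounted for in the Clinton cache.

    Both arguments map employee -> number of emails from Clinton.
    Surpluses in the cache are ignored (floored at zero), so they cannot
    offset shortfalls elsewhere.
    """
    missing = 0
    for employee, on_server in server_counts.items():
        in_cache = cache_counts.get(employee, 0)
        missing += max(on_server - in_cache, 0)
    return missing


# Hypothetical counts for three sampled employees.
server = {"employee_a": 120, "employee_b": 45, "employee_c": 10}
cache = {"employee_a": 100, "employee_b": 45, "employee_c": 12}
print(lower_bound_missing(server, cache))
```

The result is a lower bound because it covers only the sampled employees and only their official addresses, which is where the two caveats above bite.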

To propose amendments, go to the GitHub gist.

(No) Value Added Models

6 Jul

This note is in response to some of the points raised in the Angoff Lecture by Ed Haertel.

The lecture makes two big points:
1) Teacher effectiveness ratings based on current Value Added Models are ‘unreliable.’ They are actually much worse than just unreliable; see below.
2) Simulated counterfactuals of gains that can be got from ‘firing bad teachers’ are upwardly biased.

Three simple tricks (one discussed; two not) that may solve some of the issues:
1) Estimating teaching effectiveness: Where possible, randomly assign children to classes. I would only do within-school comparisons. Inference will still not be clean (SUTVA violations, though they can be dealt with), just cleaner.

2) Experiment with teachers. Teach some teachers some skills. Estimate the impact. Rather than teacher level VAM, do a skill level VAM. Teachers = sum of skills + idiosyncratic variation.

3) For current VAMs: To create better student-level counterfactuals, use modern ML techniques (SVMs, neural networks, etc.) and lots of data (past student outcomes, past classmate outcomes, etc.), and cross-validate to tune. Have a good idea about how good the prediction is. The strategy may be applicable to other venues.

Other points:
1) Haertel says, “Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much — at least not much of the prescribed curriculum.” A better comparison perhaps would be to self-guided technology. My sense is that as technology evolves, teachers will come up short in a comparison between teachers and advanced learning tools. In most of the third world, I think it is already true.

2) It appears no model for calculating teacher effectiveness scores yields identified estimates. And it appears we have no clear understanding of the nature of the bias. Pooling biased estimates over multiple years doesn’t recommend itself to me as a natural fix to this situation. And I don’t think calling this situation ‘unreliability’ of scores is right. These scores aren’t valid. The fact that pooling across years ‘works’ may suggest the issues are smaller. But then again, bad things may be happening to some kinds of teachers, especially if people are doing cross-school comparisons.

3) The fade-out concern is important given the earlier 5*5 = 25 analysis. My suspicion is that the attenuation of effects varies with the timing of the shock. My hunch is that shocks at an earlier age matter more: they decay more slowly.

Impact of selection bias in experiments where people treat each other

20 Jun

Selection biases in the participant pool generally have limited impact on inference. One way to estimate population treatment effect from effects estimated using biased samples is to check if treatment effect varies by ‘kinds of people’, and then weight the treatment effect to population marginals. So far so good.
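The reweighting step can be sketched in a few lines. The strata, effects, and shares below are hypothetical, and the sketch assumes the treatment effect is constant within each ‘kind of people.’

```python
def weighted_effect(effect_by_stratum, share_by_stratum):
    """Average stratum-level treatment effects using the given shares."""
    total_share = sum(share_by_stratum.values())
    return sum(
        effect_by_stratum[s] * share_by_stratum[s] for s in effect_by_stratum
    ) / total_share


# Hypothetical: the sample over-represents the young, who respond less.
effects = {"young": 2.0, "old": 4.0}
sample_shares = {"young": 0.8, "old": 0.2}
population_shares = {"young": 0.5, "old": 0.5}

sample_ate = weighted_effect(effects, sample_shares)
population_ate = weighted_effect(effects, population_shares)
print(sample_ate, population_ate)
```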

When people treat each other, selection biases in participant pool change the nature of the treatment. For instance, in a Deliberative Poll, a portion of the treatment is other people. Naturally then, the exact treatment depends on the pool of people. Biases in the initial pool of participants mean treatment is different. For inference, one may exploit across group variation in composition.

Sampling on M-Turk

13 Oct

In many of the studies that use M-Turk, there appears to be little strategy to sampling. A study is posted (and reposted) on M-Turk till a particular number of respondents take the study. If the pool of respondents reflects true population proportions, if people arrive in no particular order, and all kinds of people find the monetary incentive equally attractive, the method should work well. There is reasonable evidence to suggest that at least points 1 and 3 are violated. One costly but easy fix for the third point is to increase payment rates. We can likely do better.

If we are agnostic about the variable on which we want precision, here’s one way to sample: Start with a list of strata and their proportions in the population of interest. If the population of interest is US adults, the proportions are easily known. Set up screening questions, and recruit. Raise the price to get people in cells that are running short. Take simple precautions. For one, to prevent gaming, do not change the recruitment prompt to let people know that you want X kinds of people.
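Here is one way the bookkeeping could look. This is a sketch, not a recipe: the rule for deciding which cells are ‘running short’ (fill rate below the overall fill rate), the price step, and the strata are all assumptions of mine.

```python
def cell_targets(population_shares, total_n):
    """Target number of respondents per stratum."""
    return {cell: round(share * total_n) for cell, share in population_shares.items()}


def update_prices(prices, targets, recruited, step=0.25):
    """Raise the payment for lagging cells and close cells that are full.

    A cell is 'lagging' if its fill rate is below the overall fill rate.
    A price of None means stop recruiting into that cell.
    """
    overall_fill = sum(recruited.get(c, 0) for c in targets) / sum(targets.values())
    new_prices = {}
    for cell, target in targets.items():
        got = recruited.get(cell, 0)
        if got >= target:
            new_prices[cell] = None
        elif got / target < overall_fill:
            new_prices[cell] = round(prices[cell] + step, 2)
        else:
            new_prices[cell] = prices[cell]
    return new_prices


# Hypothetical strata and counts part-way through recruitment.
targets = cell_targets({"18-29": 0.2, "30-64": 0.6, "65+": 0.2}, total_n=100)
recruited = {"18-29": 20, "30-64": 36, "65+": 2}
prices = update_prices({"18-29": 1.0, "30-64": 1.0, "65+": 1.0}, targets, recruited)
print(targets, prices)
```

The price change stays on the researcher’s side; the recruitment prompt itself stays the same, per the precaution above.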

On balance, let there be imbalance on observables

27 Sep

For whatever reason, some people are concerned with imbalance when analyzing data from randomized experiments. The concern may be more general, but its fixes devolve into reducing imbalance on observables. Such fixes may fix things or break things. More generally, it is important to keep in mind what one experiment can show. If randomization is done properly, and other assumptions hold, the most common estimator of experimental effects, the difference in means, is unbiased. We also have a good idea of how often the true effect will lie within the estimated bounds. For tightening those bounds, relying on sample size is the way to go.

General rules apply: larger is better. But there are some refinements to that general rule. When everyone/thing is the same, for instance, neutrons* (in most circumstances), and if measurement error isn’t a concern, samples of 1 will do just fine. The point holds for a case that is potentially easier to obtain than everybody/thing being the same: when the treatment effect is constant across things/people. When everyone is different w.r.t. the treatment effect, randomization won’t help, though one can always try to quantify the difference. More generally, the greater the heterogeneity, the larger the sample size required for a particular level of balance. Stratified random assignment (or blocking) will help. This isn’t to say that the raw difference estimator will be biased. It won’t. Just that its variance will be higher.

* Based on discussion in Gerber and Green on why randomization is often not necessary in physics.
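A small simulation of the argument above, under assumptions I have made up for illustration: 20 units in two very different strata and a constant treatment effect of 1. It compares the difference in means under complete randomization and under blocking on the stratum.

```python
import random
import statistics


def diff_in_means(y0, y1, treated):
    """Difference in means given potential outcomes and the treated set."""
    t = [y1[i] for i in range(len(y0)) if i in treated]
    c = [y0[i] for i in range(len(y0)) if i not in treated]
    return statistics.mean(t) - statistics.mean(c)


def complete_assignment(n, rng):
    """Treat a random half of all units."""
    return set(rng.sample(range(n), n // 2))


def blocked_assignment(blocks, rng):
    """Treat a random half of the units within each block."""
    treated = set()
    for members in blocks:
        treated.update(rng.sample(members, len(members) // 2))
    return treated


def simulate(n_sims=2000, seed=0):
    rng = random.Random(seed)
    n = 20
    stratum = [0] * 10 + [1] * 10
    # Hypothetical potential outcomes: strata differ a lot; effect is constant.
    y0 = [10 * stratum[i] + (i % 5) for i in range(n)]
    tau = 1.0
    y1 = [y + tau for y in y0]
    blocks = [list(range(10)), list(range(10, 20))]
    complete = [diff_in_means(y0, y1, complete_assignment(n, rng))
                for _ in range(n_sims)]
    blocked = [diff_in_means(y0, y1, blocked_assignment(blocks, rng))
               for _ in range(n_sims)]
    return tau, complete, blocked


tau, complete, blocked = simulate()
print("mean (complete, blocked):", statistics.mean(complete), statistics.mean(blocked))
print("variance (complete, blocked):",
      statistics.variance(complete), statistics.variance(blocked))
```

Both estimators should center on the true effect; the blocked one should just be much less variable.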

Why Were the Polls so Accurate?

16 Nov

The Quant. Interwebs have overflowed with joy since the election. Poll aggregation works. And so indeed does polling, though you won’t hear as much about it on the news, which is likely biased towards celebrity intellects rather than the hardworking many. But why were the polls so accurate?

One potential explanation: because they do some things badly. For instance, most fail at collecting “random samples” these days, because of a fair bit of nonresponse bias. This nonresponse bias, if correlated with the propensity to vote, may actually push up the accuracy of the vote choice means. There are a few ways to check this theory.

One way to check this hypothesis: were the results from polls using Likely Voter screens different from those not using them? If not, why not? From the Political Science literature, we know that people who vote (not just those who say they vote) do vary a bit from those who do not vote, even on things like vote choice. For instance, there is just a larger proportion of ‘independents’ among them.

Other kinds of evidence will be in the form of failure to match population or other benchmarks. For instance, election polls would likely fare poorly when predicting how many people voted in each state. Or when tallying up Spanish-language households or the number of registered voters. Another way of saying this is that the bias will vary by what parameter we aggregate from these polling data.

So let me reframe the question: how do polls get election numbers right even when they undercount Spanish speakers? One explanation is that there is a positive correlation between selection into polling, and propensity to vote, which makes vote choice means much more reflective of what we will see come election day.

The other possible explanation for all this: post-stratification or other post hoc adjustment of the numbers, or innovations in how sampling is done (matching, stratification, etc.). Doing so uses additional knowledge about the population and can shrink s.e.s and improve accuracy. One sign of such non-randomness: overly tight confidence bounds. Many polls tend to do wonderfully on multiple uncorrelated variables, for instance, census region proportions, gender, etc., something random samples cannot regularly produce.
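A rough sketch of that check. Under simple random sampling, a poll’s estimate of a known benchmark proportion should miss by about one standard error on average, so the mean squared z-score across polls should be near 1. Values far below 1 suggest the numbers were adjusted to hit the benchmark. The poll estimates below are invented.

```python
import math


def srs_z_scores(estimates, sample_sizes, benchmark):
    """Z-scores of poll estimates against a known benchmark proportion,
    using the standard error a simple random sample would have."""
    zs = []
    for p_hat, n in zip(estimates, sample_sizes):
        se = math.sqrt(benchmark * (1 - benchmark) / n)
        zs.append((p_hat - benchmark) / se)
    return zs


def mean_squared_z(zs):
    """Roughly 1 under simple random sampling; much less if too tight."""
    return sum(z * z for z in zs) / len(zs)


# Hypothetical: five polls of 1,000 each reporting the share of women,
# against an assumed benchmark of 0.51.
zs = srs_z_scores([0.510, 0.512, 0.508, 0.511, 0.509], [1000] * 5, benchmark=0.51)
print(mean_squared_z(zs))
```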

Randomly Redistricting More Efficiently

25 Sep

In a forthcoming article, Chen and Rodden estimate the effect of ‘Unintentional gerrymandering’ on the number of seats that go to a particular party. To do so, they pick a precinct at random, and then add (randomly chosen) adjacent precincts to it till the district is of a certain size (decided by the total number of districts one wants to create). Then they go about creating a new district in the same manner, randomly selecting a precinct bordering the first district. This goes on till all the precincts are assigned to a district. There are some additional details, but they are immaterial to the point of the note. A smarter way to do the same thing would be to just create one district over and over again (starting with a randomly chosen precinct). This would reduce the computational burden (memory for storing edges, differencing shapefiles, etc.) while leaving estimates unchanged.
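A minimal sketch of the ‘one district at a time’ idea on a toy map. The grid adjacency, equal precinct populations, and function names are my assumptions; none of the additional details of Chen and Rodden’s procedure are reproduced here.

```python
import random


def grid_adjacency(rows, cols):
    """Toy map: precincts on a grid, adjacent if they share an edge."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.append((rr, cc))
            adj[(r, c)] = nbrs
    return adj


def grow_one_district(adjacency, population, target_pop, rng):
    """Start from a random precinct and keep adding a randomly chosen
    bordering precinct until the district reaches the target population."""
    start = rng.choice(sorted(adjacency))
    district = {start}
    total = population[start]
    while total < target_pop:
        frontier = sorted({n for p in district for n in adjacency[p]} - district)
        if not frontier:
            break  # ran out of map
        nxt = rng.choice(frontier)
        district.add(nxt)
        total += population[nxt]
    return district, total


adjacency = grid_adjacency(4, 4)
population = {p: 100 for p in adjacency}
rng = random.Random(1)
# Draw many single districts instead of partitioning the whole map each time.
draws = [grow_one_district(adjacency, population, 400, rng) for _ in range(1000)]
```

Each draw only needs the adjacency list and a running population total, so nothing about the rest of the map has to be stored or updated between draws.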