Correcting for Differential Measurement Error in Experiments

14 Feb

Differential measurement error, whether across control and treatment groups or, in a within-subjects experiment, across pre- and post-treatment measurement waves, can vitiate estimates of the treatment effect. One reason for differential measurement error in surveys is differential motivation. For instance, if participants in the control group (or the pre-treatment survey) are less motivated to respond accurately than participants in the treatment group (or the post-treatment survey), the difference-in-means estimator will be a biased estimator of the treatment effect. For example, in Deliberative Polls, participants acquiesce more during the pre-treatment survey than during the post-treatment survey (Weiksner, 2008). To correct for it, one may want to replace agree/disagree questions with construct-specific questions (Weiksner, 2008). Perhaps a better solution would be to incentivize all (or a random subset of) responses to the pre-treatment survey. Possible incentives include monetary rewards, a preface on the survey screens telling people how important accurate responses are to research, etc. This is the same strategy that I advocate for dealing with satisficing more generally (see here): minimize errors, rather than follow the more common, suboptimal strategy of “balancing errors” by randomizing the response order.
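To see the size of the problem, here is a minimal simulation sketch. It assumes the differential error is a simple additive acquiescence shift that appears only in the pre-treatment wave. The function name and all the numbers (a true effect of 0.5, a shift of 0.3) are hypothetical and chosen only for illustration.

```python
import random
import statistics

def simulate_diff_in_means(n=20000, true_effect=0.5, pre_bias=0.3, seed=0):
    """Post-minus-pre difference in means when only the pre-treatment
    wave carries an additive measurement error (pre_bias)."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true pre-treatment attitudes
    # Pre wave: latent attitude + acquiescence shift + random noise.
    pre = [x + pre_bias + rng.gauss(0.0, 0.5) for x in latent]
    # Post wave: latent attitude + true effect + random noise, no shift.
    post = [x + true_effect + rng.gauss(0.0, 0.5) for x in latent]
    return statistics.fmean(post) - statistics.fmean(pre)

estimate = simulate_diff_in_means()
# The estimate centres on true_effect - pre_bias, not on true_effect.
print(round(estimate, 2))
```

Setting pre_bias to 0 recovers the true effect, which is the case incentives for accurate responding are meant to approximate.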

Against Proxy Variables

23 Dec

Lacking direct measures of the theoretical variable of interest, some rely on “proxy variables.” For instance, some have used years of education as a proxy for cognitive ability. However, using “proxy variables” can be problematic for two reasons: (1) proxy variables may not track the theoretical variable of interest very well, and (2) they may track confounding variables other than the theoretical variable of interest. In the case of years of education as a proxy for cognitive ability, the concerns manifest themselves as follows:

1) Cognitive ability causes, and is a consequence of, what courses you take and what school you go to, in addition, of course, to years of education. The GSS, for instance, contains more granular measures of education, such as whether the respondent took a science course in college. And nearly always the variable proves significant when predicting knowledge, etc. All of this is somewhat surmountable, as it can be seen as measurement error.

2) More problematically, years of education may track other confounding variables: diligence, education of parents, economic strata, etc. And education endows people with more than cognitive ability; it also causes potentially confounding variables such as civic engagement, knowledge, etc.

Conservatively, we can only attribute the effect of a variable to the variable itself. That is, we only have the variables we enter. If one does rely on proxy variables, then one may want to address the two points mentioned above.

Re-conceptualizing the effect of the Deliberative Poll

6 Sep

A Deliberative Poll proceeds as follows: respondents are surveyed, provided ‘balanced’ briefing materials, randomly assigned to moderated small group discussions, allowed the opportunity to quiz experts or politicians in plenary sessions, and re-interviewed at the end. The “effect” is conceptualized as the average post minus pre difference across all participants.

The effect of the Deliberative Poll is contingent upon a particular random assignment to small groups. This isn’t an issue if small group composition doesn’t matter. If it does, then the counterfactual imagination of the ‘informed public’ is somewhat particularistic. Under those circumstances, one may want to come up with a distribution of what opinion change may look like if the assignment of participants to small groups were different. One can do this by estimating the impact of small group composition on the dependent variable of interest, and then predicting the dependent variable of interest under simulated alternate assignments.
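The estimation step might look something like the sketch below. It assumes, purely for illustration, that opinion change is linear in one composition measure (the share of women in the group). The data are simulated and every number is hypothetical.

```python
import random
import statistics

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: 30 groups of 10 people each, with opinion change
# rising with the group's share of women (assumed slope 0.6, intercept 0.1).
rng = random.Random(1)
group_share, change = [], []
for _ in range(30):
    share = sum(rng.random() < 0.5 for _ in range(10)) / 10
    for _ in range(10):
        group_share.append(share)
        change.append(0.1 + 0.6 * share + rng.gauss(0.0, 0.3))

slope, intercept = ols_slope_intercept(group_share, change)
print(round(slope, 2), round(intercept, 2))
```

Because people in the same group share the same composition value, a fuller analysis would also account for clustering within groups when judging how precisely the slope is estimated.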

See also: Adjusting for covariate imbalance in experiments with SUTVA violations

Size matters, significantly

26 Aug

Achieving statistical significance is entirely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except, of course, if the samples are exactly the same). On the other hand, we may fail to reject even large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-order correlation is .71). This sharp correlation is suggestive evidence that the average effect is roughly equal across polls (and hence that power matters).
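A short sketch of the point: hold the difference fixed and vary only the sample size. It assumes a two-sample z-test with equal group sizes and a known common standard deviation. The difference of 0.1 standard deviations and the sample sizes are hypothetical.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value of a z-test for a difference in means,
    assuming equal group sizes and a known common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same difference throughout; only the sample size changes.
for n in (50, 200, 1000, 5000):
    print(n, round(two_sided_p(diff=0.1, sd=1.0, n_per_group=n), 4))
```

The same small difference goes from nowhere near significant at 50 per group to overwhelmingly significant at 5000 per group.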

When the conservative thing to do is to reject the null, for example, in “representativeness” analysis designed to see if the experimental sample is different from control, one may want to go for large sample sizes, say something about the substantive size of the differences, or ‘adjust’ results for the differences. If we don’t do that, samples can look more ‘representative’ as sample size shrinks. So, for instance, the rank-order correlation between the proportion of significant differences between non-participants and participants, and the size of the smaller sample (the participant sample), for the 13 polls is .5. The somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities: (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC

Adjusting for covariate imbalance in experiments with SUTVA violations

25 Aug

Consider the following scenario: control group is 50% female while the participant sample is 60% female. Also assume that this discrepancy is solely a matter of chance, and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. It is expected that the pool matters only in small group deliberation. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of women in a group affects attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments outside the usual are needed. If, however, there is a strong relationship, one may want to adjust as follows: predict attitudes under groups simulated from a weighted sample, with the probability of selection proportional to the weight. This will give us a distribution, which is the correct output, as women may be allocated to small groups in a variety of ways.
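One way to sketch the adjustment is below. It assumes a model relating attitudes to own gender and the group's share of women has already been fitted. The predict function, its coefficients, the weights, and the group size are all hypothetical. The model includes a gender-by-composition interaction because, under a purely linear composition effect, the overall mean would not depend on how people are allocated to groups.

```python
import random
import statistics

def simulate_mean_attitude(is_female, weights, group_size, predict,
                           n_sims=500, seed=0):
    """Distribution of the predicted mean attitude over simulated pools.

    Each simulation (1) resamples participants with probability
    proportional to their weight, (2) randomly assigns the resampled pool
    to small groups, and (3) predicts each member's attitude from own
    gender and the group's share of women.
    """
    rng = random.Random(seed)
    n = len(is_female)
    results = []
    for _ in range(n_sims):
        pool = rng.choices(is_female, weights=weights, k=n)  # weighted resample
        rng.shuffle(pool)                                     # random assignment
        preds = []
        for i in range(0, n, group_size):
            group = pool[i:i + group_size]
            share = sum(group) / len(group)
            preds.extend(predict(f, share) for f in group)
        results.append(statistics.fmean(preds))
    return results

# Hypothetical inputs: a participant sample that is 60% women, reweighted to 50%.
is_female = [1] * 60 + [0] * 40
weights = [0.5 / 0.6] * 60 + [0.5 / 0.4] * 40

def predict(female, share_female):
    # Hypothetical fitted model with a gender-by-composition interaction.
    return 0.2 + 0.3 * share_female + 0.4 * female * share_female

draws = simulate_mean_attitude(is_female, weights, group_size=10, predict=predict)
print(round(min(draws), 3), round(statistics.fmean(draws), 3), round(max(draws), 3))
```

The spread of draws is the distribution the paragraph above refers to; its width reflects both who ends up in the reweighted pool and how they happen to be allocated to groups.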

There are many caveats, beginning with limitations of data in estimating impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where proportions of subgroups are somewhat small, inadequate variation across small groups can result.

This procedure can be generalized to a variety of cases where effect is determined by the participant pool except where each participant interacts with the entire sample (or a large proportion of it). Reliability of the generalization will depend on getting good estimates.

Cricket: An Unfairly Random Game?

7 May

In many cricket matches, it is claimed that there is a clear advantage to bowling (batting) first. The advantage is pointed to by commentators, and by captains of the competing teams in the pre-toss interview. And sometimes in the post-match interview.

The opportunity to bowl or bat first is decided by a coin toss. While this method of deciding who is advantaged is fair on average, the system isn’t fair in any one game. At first glance, the imbalance seems inevitable. After all, someone has to bat first. One can, however, devise a baseball-like system where short innings are interspersed. If that violates the nature of the game too much, one can create pitches that don’t deteriorate appreciably over the course of a game. Or, one can come up with an estimate of the advantage and adjust scores accordingly (something akin to the adjustment issued when matches are shortened due to rain).

But before we move to seriously consider these solutions, one may ask about the evidence.

Data are from nearly five thousand one-day international matches.

The team that wins the toss wins the match approximately 49.3% of the time. With 5335 matches, we cannot rule out that the true proportion is 50%. Thus, counter to intuition, the effect of winning the toss is, on average, at best minor. This may be so because it is impossible to predict well in advance the advantage of bowling or batting first. Or it may simply be because teams are bad at predicting it, perhaps because they use bad heuristics.
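The claim can be checked with a one-sample test of a proportion. This is a sketch using the normal approximation and the two figures quoted above (49.3% and 5335 matches). Since 49.3% is itself rounded, the results are approximate.

```python
import math

n = 5335          # matches, from the text
p_hat = 0.493     # share of matches won by the toss winner, from the text
p_null = 0.5      # a toss that confers no advantage

se_null = math.sqrt(p_null * (1 - p_null) / n)   # standard error under the null
z = (p_hat - p_null) / se_null
p_value = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided, normal approximation

# Approximate 95% confidence interval around the observed proportion.
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width

print(round(z, 2), round(p_value, 2))
print(round(lower, 3), round(upper, 3))
```

The z-statistic is about -1 and the two-sided p-value is about 0.3, so 50% sits comfortably inside the interval.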


The absence of an effect across the entire sample may hide some subgroup effects. It is often claimed that the toss is more crucial in day and night matches, due to dew and lower visibility of the white ball under lights. And the data show as much.


It may well be the case that the toss is more important in tests than in one-day matches.