Size Matters, Significantly

26 Aug

Achieving statistical significance is entirely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except, of course, if the samples are exactly the same). On the other hand, we may fail to reject the null even for large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-order correlation is .71). This sharp correlation is suggestive evidence that the average effect is roughly equal across polls (and hence that power matters).
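To see how a fixed difference moves in and out of significance as the sample grows, here is a minimal sketch. It assumes a two-sample z-test on means with a known common standard deviation; the effect size (0.1 SD) and the sample sizes are illustrative and are not taken from the polls.

```python
import math

def two_sample_z_pvalue(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # 2 * (1 - Phi(|z|)) written via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical effect of 0.1 standard deviations, held fixed across sample sizes
sample_sizes = (50, 200, 800, 3200)
p_values = {n: two_sample_z_pvalue(0.1, 1.0, n) for n in sample_sizes}

for n, p in p_values.items():
    print(f"n per group = {n:5d}  p = {p:.4f}")
```

The same difference fails to reach the .05 level at the two smaller sample sizes and clears it at the two larger ones, which is the sense in which power, not the effect, drives the result.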

When the conservative thing to do is to reject the null, for example, in “representativeness” analysis designed to see if the experimental sample is different from control, one may want to go for large sample sizes, say something about the substantive size of the differences, or ‘adjust’ results for the differences. If we don’t do that, samples can look more ‘representative’ as sample size shrinks. So, for instance, the rank-order correlation between the proportion of significant differences between non-participants and participants and the size of the smaller sample (the participant sample) for the 13 polls is .5. The somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC

Adjusting for Covariate Imbalance in Experiments with SUTVA Violations

25 Aug

Consider the following scenario: the control group is 50% female while the participant sample is 60% female. Also, assume that this discrepancy is solely a matter of chance and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. It is expected that the pool matters only in small group deliberation. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of females in a group affects attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments outside the usual are needed. If, however, there is a strong relationship, one may want to adjust as follows: predict attitudes under simulated groups drawn from a weighted sample, with the probability of selection proportional to the weight. This will give us a distribution of predicted attitudes, which is as it should be, since women may be allocated to small groups in a variety of ways.
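The simulation described above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the function name `simulate_group_predictions`, the pool of 300 participants, the group size of 15, and above all the `predict` function, which stands in for whatever relationship between group composition and attitudes has been estimated from the data.

```python
import random
import statistics

def simulate_group_predictions(is_female, weights, group_size, predict,
                               n_sims=500, seed=0):
    """Mean predicted attitude in each of n_sims simulated allocations."""
    rng = random.Random(seed)
    n = len(is_female)
    sim_means = []
    for _ in range(n_sims):
        # Weighted resample: selection probability proportional to the weight
        sample = rng.choices(is_female, weights=weights, k=n)
        # Random assignment to small groups (shuffle, then cut into blocks)
        rng.shuffle(sample)
        total = 0.0
        for start in range(0, n, group_size):
            group = sample[start:start + group_size]
            prop_female = sum(group) / len(group)
            total += sum(predict(member, prop_female) for member in group)
        sim_means.append(total / n)
    return sim_means

# Hypothetical pool: 60% female, weighted back to the 50% seen in the control group
pool = [1] * 180 + [0] * 120
weights = [0.5 / 0.6 if female else 0.5 / 0.4 for female in pool]

def predict(member_is_female, prop_female):
    # Hypothetical nonlinear group effect; replace with the estimated model
    return 0.5 + 0.2 * prop_female ** 2

sims = simulate_group_predictions(pool, weights, 15, predict)
print(statistics.mean(sims), statistics.stdev(sims))
```

If the estimated relationship were linear in the proportion of women, the allocation to groups would not change the average prediction, only the weighting would. The simulation earns its keep when the relationship is nonlinear or interacts with the participant's own gender.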

There are many caveats, beginning with limitations of data in estimating the impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where proportions of subgroups are somewhat small, inadequate variation across small groups can result.

This procedure can be generalized to a variety of cases where the effect is determined by the participant pool except where each participant interacts with the entire sample (or a large proportion of it). Reliability of the generalization will depend on getting good estimates.

Poor Browsers and Internet Surveys

14 Jul


  1. older browsers are likelier to display the survey incorrectly.
  2. type of browser can be a proxy for respondent’s proficiency in using computers, and speed of the Internet connection.

People using older browsers may abandon surveys at higher rates than those using more modern browsers.

Using data from a large Internet survey, we test whether people who use older browsers abandon surveys at higher rates, and whether their surveys have larger amounts of missing data.

Cricket: An Unfairly Random Game?

7 May

In many cricket matches, it is claimed that there is a clear advantage to bowling (batting) first. The advantage is pointed to by commentators, and by captains of the competing teams in the pre-toss interview. And sometimes in the post-match interview.

The opportunity to bowl or bat first is decided by a coin toss. While this method of deciding who is advantaged is fair on average, the system isn’t fair in any one game. At first glance, the imbalance seems inevitable. After all, someone has to bat first. One can, however, devise a baseball-like system in which short innings are interspersed. If that violates the nature of the game too much, one can create pitches that don’t deteriorate appreciably over the course of a game. Or, one can come up with an estimate of the advantage and adjust scores accordingly (something akin to the adjustment issued when matches are shortened due to rain).

But before we move to seriously consider these solutions, one may ask about the evidence.

Data are from more than five thousand one-day international matches.

The team that wins the toss wins the match approximately 49.3% of the time. With 5335 matches, we cannot rule out that the true proportion is 50%. Thus, counter to intuition, the effect of winning the toss is, on average, at best minor. This may be so because it is impossible to predict well in advance the advantage of bowling or batting first. Or it may simply be because teams are bad at predicting it, perhaps because they use bad heuristics.
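The claim that 50% cannot be ruled out can be checked with a one-sample test of a proportion. The sketch below uses the normal approximation and takes the two figures quoted above (49.3% and 5335 matches) at face value; how ties and no-results were handled is not stated, so treating all 5335 matches as decided is an assumption.

```python
import math

def one_sample_prop_test(p_hat, n, p0=0.5):
    """z statistic and two-sided p-value for H0: true proportion equals p0."""
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Figures from the text: toss winner wins about 49.3% of 5335 matches
z, p_value = one_sample_prop_test(0.493, 5335)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

This gives a z of about -1.02 and a two-sided p-value of about 0.31, well short of conventional thresholds for rejecting 50%.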


No effect across the entire sample may hide some subgroup effects. It is often claimed that the toss is more crucial in day-night matches, due to dew and the lower visibility of the white ball under lights. And data show as much.


It may well be the case that the toss is more important in Tests than in one-day matches.

GSS and ANES: Alike Yet Different

1 Jan

The General Social Survey (GSS), run out of the National Opinion Research Center at the University of Chicago, and the American National Election Studies (ANES), which until recently ran out of the University of Michigan’s Institute for Social Research, are two preeminent surveys tracking over-time trends in the social and political attitudes, beliefs, and behavior of the US adult population.

Outside of their shared Midwestern roots, GSS and ANES also share sampling design: both use a stratified random sample, with the selection of primary sampling units (PSUs) affected by the necessities of in-person interviewing, and, during the 1980s and 1990s, they shared a sampling frame. However, in spite of this relatively close coordination in sampling and a common mode of interview, responses to the few questions asked identically in the two surveys diverge systematically.

In 1996, 2000, 2004, and 2008, GSS and ANES included the exact same questions on racial trait ratings. Limiting the sample to just White respondents, the mean difference in trait ratings of Whites and Blacks was always greater in ANES; for ratings of hard work and intelligence, almost always statistically significantly so.

Separately, the difference in the proportion of self-identified Republicans estimated by ANES and GSS is declining over time.

This unexplained directional variance poses a considerable threat to inference. The problem takes on additional gravity given that the surveys are the bedrock of important empirical research in social science.

The Perils of Balancing Scales

15 Nov

Randomization of scale order (balancing) across respondents is common practice. It is done to ‘cancel’ errors generated by ‘satisficers’ who presumably pick the first option on a scale without regard to content. The practice is assumed to have no impact on the propensity of satisficers to pick the first option, or on other respondents, both somewhat unlikely assumptions.

A far more reasonable hypothesis is that reversing scale order does have an impact on respondents, on both non-satisficers and satisficers. Empirically, people take longer to fill out reverse ordered scales, and it is conceivable that they pay more attention to filling out the responses — reducing satisficing and perhaps boosting the quality of responses, either way not simply ‘canceling’ errors among a subset, as hypothesized.

Within satisficers, without randomization, correlated bias may produce artificial correlations across variables where none exist. For example, if satisficers are, say, less educated, then on a scale that runs from ‘love candy’ to ‘hate candy’, the less educated will appear to love candy. Such a calamity ought to be avoided. However, in the minority of cases where satisficers’ true preferences are those expressed in the first choice, randomization will artificially produce null results. Randomization may be more sub-optimal still if there are indeed effects on the rest of the respondents.

Within survey experiments, where balancing randomization is “orthogonal” (typically just separate) to the main randomization, it has to be further assumed that manipulation has equal impact on “satisficers” in either reverse or regularly ordered scale, again a somewhat tenuous assumption.

The entire exercise of randomization is devoted not to finding out the true preferences of the satisficers, a more honorable purpose, but to eliminating them from the sample. There are better ways to catch ‘satisficers’ than randomizing across the entire sample. One possibility is to randomize within a smaller set of likely satisficers. On knowledge questions, ability estimated over multiple questions can be used to inform the probability that the first option (if correct and if chosen) was not a guess. Response latency can be used as well to inform judgments. For attitude questions, follow-up questions measuring the strength of attitude, etc., can be used to weight responses.

If we are interested in getting true attitudes from ‘satisficers,’ we may want to motivate respondents either by interspersed exhortations that their responses matter, or by providing financial incentives.

Lastly, it is important to note that combining two kinds of systematic error doesn’t make it a ‘random’ error. And no variance in data can be a conservative attribute of data (with hardworking social scientists around).

Flawed Analyses in Deliberative Polls

14 Jan

A Deliberative Poll works as follows: A random sample of people is surveyed. Out of the initial sample, a random subset is invited to deliberate, given balanced briefing materials, randomly assigned to small groups moderated by trained moderators, allowed the opportunity to quiz experts, and in the end surveyed again.

Reports and papers on Deliberative Polls often carry comparisons of participants to non-participants on a host of attitudinal and demographic variables (e.g. see here, and here). The analysis purports to answer whether people who came to the Deliberative Poll were different from those who didn’t, and to compare the participant sample to the population. This sounds about right, except that the comparison is made between participants and a pool of two likely distinct sub-populations: people who were never invited (probably a representative, random set), and people who were invited but never came. Under plausible and probable assumptions, such pooling biases against finding a difference.
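A small worked example shows why pooling shrinks the observed gap. It assumes the never-invited respondents are a random subset, so their expected mean equals that of the whole invited pool (participants plus those who declined). The function name and all the numbers are hypothetical.

```python
def pooled_vs_direct_gap(mean_participants, mean_decliners, accept_rate,
                         n_never_invited, n_decliners):
    """Gap between participants and decliners, with and without pooling."""
    # Never invited = random subset, so it mirrors the full invited pool
    mean_never_invited = (accept_rate * mean_participants
                          + (1 - accept_rate) * mean_decliners)
    # Pooled comparison group: never invited plus invited-but-did-not-come
    n_pool = n_never_invited + n_decliners
    mean_pooled = (n_never_invited * mean_never_invited
                   + n_decliners * mean_decliners) / n_pool
    direct_gap = mean_participants - mean_decliners
    pooled_gap = mean_participants - mean_pooled
    return direct_gap, pooled_gap

# Hypothetical: 60% of participants and 50% of decliners hold some attitude,
# 30% of invitees accept, 1000 never invited, 700 invited but did not come
direct_gap, pooled_gap = pooled_vs_direct_gap(0.60, 0.50, 0.3, 1000, 700)
print(direct_gap, pooled_gap)
```

Because the never-invited group already contains would-be participants, the pooled comparison understates the self-selection gap whenever participants and decliners differ.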

The key thing we want to measure is self-selection bias: was there a difference between people who accepted the invitation and those who did not? The right way to estimate the bias would be as follows:

(Participant/Didn't come) ~ socio-demographics (gender, education, income, party id, age) + knowledge + attitude extremity

Effect sizes can be provided to summarize the extent of bias. This kind of analysis can account for the fact that bias may not occur at first marginals (gender), but at second marginals (less educated women). (This all can be theory-driven, or more descriptive in purpose). The analysis also allows for smaller effects to be seen, as variance within cells is reduced.

p values
When the conservative thing to do is to reject the null hypothesis, think a little less about p-values.

Assuming initial survey approximates a ‘representative’ sample of the entire population and assuming we want to infer how ‘representative’ participants are of the entire population, it makes sense just to report mean differences without the p-values.

The survey sample estimates stand in for the entire population. Census numbers for the entire population come without standard errors, or with very low ones, so comparisons against them are almost always significant.

By comparing to an uncertain estimate of the population, one cannot say whether the participant sample was representative of the entire population. The estimate is without bias but suffers from the following problem: the more uncertain the population estimate, the less likely one is to reject the null, and the more likely one is to conclude that the participant sample is representative. One way to deal with this is to take the 95% confidence band of the sample estimate of the population, calculate the maximum and minimum difference between it and the participant sample, and report those.
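The max-and-min reporting suggested above might look like the sketch below. It is one reading of the proposal: it uses absolute differences, a normal-approximation band, and treats the participant mean as fixed, ignoring its own sampling error. The function name and the numbers are hypothetical.

```python
def difference_bounds(participant_mean, survey_mean, survey_se, z=1.96):
    """Smallest and largest absolute gap between the participant mean and
    any value inside the 95% band around the survey estimate."""
    lower = survey_mean - z * survey_se
    upper = survey_mean + z * survey_se
    gap_to_lower = abs(participant_mean - lower)
    gap_to_upper = abs(participant_mean - upper)
    # If the participant mean lies inside the band, the smallest gap is zero
    if lower <= participant_mean <= upper:
        min_gap = 0.0
    else:
        min_gap = min(gap_to_lower, gap_to_upper)
    max_gap = max(gap_to_lower, gap_to_upper)
    return min_gap, max_gap

# Hypothetical: 60% among participants, 50% (s.e. 2 points) in the initial survey
min_gap, max_gap = difference_bounds(0.60, 0.50, 0.02)
print(min_gap, max_gap)
```

Reporting both ends makes the cost of an imprecise population estimate visible: a wider band raises the maximum possible gap instead of quietly making the participant sample look representative.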

Name calling
Calling the analysis ‘representativeness’ analysis seems misleading on two counts:

  1. While a clear representation question can be answered by some analysis, no such question is, or can be, answered by the analysis presented. Moreover, it isn’t clear if it relates to some larger politically meaningful variable. For example, one question that can be posed is whether the participant sample resembles the population at large. To answer such a question, one would want to compare sample estimates to census estimates (which have near-zero variance, so t-tests, etc. would be pointless).
  2. In a series of papers in the 1970s, Kruskal and Mosteller (citations at the end) rightly excoriate the use of ‘representativeness’, which is fuzzy and open to abuse.

Kruskal, W; Mosteller, F. (1979) Representative sampling I: non-scientific literature. Intern Stat Rev. 47:13-24.
Kruskal, W; Mosteller, F. (1979) Representative sampling II: scientific literature. Intern Stat Rev. 47:111-127.
Kruskal, W; Mosteller, F. (1979) Representative sampling III: the current statistical literature. Intern Stat Rev. 47:245-265.