Against Proxy Variables

23 Dec

Lacking direct measures of the theoretical variable of interest, some researchers rely on “proxy variables.” For instance, some have used years of education as a proxy for cognitive ability. Relying on proxy variables, however, can be problematic for two reasons: (1) the proxy may not track the theoretical variable of interest very well, and (2) it may also track other, confounding variables. In the case of years of education as a proxy for cognitive ability, the concerns manifest themselves as follows:

  1. Cognitive ability causes, and is a consequence of, which courses you take and which school you attend, in addition, of course, to years of education. The GSS, for instance, contains more granular measures of education, such as whether the respondent took a science course in college, and such variables nearly always prove significant when predicting knowledge, etc. This concern is somewhat surmountable, as it can be treated as measurement error.
  2. More problematically, years of education may track other confounding variables: diligence, parental education, economic strata, etc. And education endows people with more than cognitive ability; it also produces potentially confounding outcomes such as civic engagement, knowledge, etc.

Conservatively, we can only attribute the effect of a variable to the variable itself. That is, we only have the variables we enter. If one does rely on proxy variables, one may want to address the two points mentioned above.
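To make the two concerns concrete, here is a minimal simulation (the data-generating process and all numbers are made up for illustration): the proxy both attenuates the coefficient on the variable of interest and picks up a confounder.

```python
# A minimal simulation (illustrative assumptions only): the education proxy
# tracks ability imperfectly (concern 1) and also tracks a confounder,
# parental education (concern 2), so its coefficient mixes the two.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

ability = rng.normal(size=n)            # theoretical variable of interest
parent_educ = rng.normal(size=n)        # confounder that also drives schooling
# Proxy: years of education reflects ability, the confounder, and noise.
years_educ = 0.5 * ability + 0.5 * parent_educ + rng.normal(size=n)
# Outcome depends on ability and, directly, on parental education.
knowledge = 1.0 * ability + 0.3 * parent_educ + rng.normal(size=n)

def ols_slope(x, y):
    """Simple bivariate OLS slope."""
    x = x - x.mean()
    return (x @ (y - y.mean())) / (x @ x)

print("true effect of ability:        1.00")
print("slope on ability itself:      ", round(ols_slope(ability, knowledge), 2))
print("slope on the education proxy: ", round(ols_slope(years_educ, knowledge), 2))
```

The slope on the proxy lands well away from the true effect of ability, and for two different reasons at once, which is exactly why the proxy's coefficient is hard to interpret.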

Impact of Menu on Choices: Choosing What You Want Or Deciding What You Should Want

24 Sep

In Predictably Irrational, Dan Ariely discusses the clever (ex-)subscription menu of The Economist that purportedly manipulates people into subscribing to a pricier plan. In an experiment based on the menu, Ariely shows that the addition of an item that very few choose can cause a preference reversal over other items on the menu.

Let’s consider a minor variation of Ariely’s experiment. Assume there are two different menus that look as follows:
1. 400 cal, 500 cal.
2. 400 cal, 500 cal, 800 cal.

Assume that all items cost the same and taste the same. When given the first menu, say 20% choose the 500-calorie item. When selecting from the second menu, the percentage of respondents selecting the 500-calorie item is likely to be significantly greater.

Now, why may that be? One reason may be that people do not have absolute preferences, here for a specific number of calories, and that they make judgments about what is a reasonable number of calories based on the menu. For instance, they decide that they do not want the item with the maximum calorie count. And when presented with a menu with more than two distinct calorie choices, another consideration comes to mind: they do not want too little food either. More generally, they may let the options on the menu anchor what counts as ‘too much’ and what counts as ‘too little.’
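Here is a toy simulation of that anchoring logic under an assumed decision rule (diners sometimes rule out the largest item, sometimes also the smallest, and otherwise pick at random). The probabilities are invented, not Ariely's.

```python
# A toy "avoid the extremes" heuristic (assumptions mine, not Ariely's):
# some diners rule out the largest item, some also rule out the smallest,
# and everyone chooses at random from whatever remains.
import random

def choose(menu, p_avoid_max=0.6, p_avoid_min=0.4):
    """Pick an item after possibly ruling out the menu's extremes."""
    options = list(menu)
    if random.random() < p_avoid_max and len(options) > 1:
        options.remove(max(options))
    if random.random() < p_avoid_min and len(options) > 1:
        options.remove(min(options))
    return random.choice(options)

def share_500(menu, trials=100_000):
    return sum(choose(menu) == 500 for _ in range(trials)) / trials

random.seed(0)
print("Share choosing 500 cal from (400, 500):     ", round(share_500((400, 500)), 2))
print("Share choosing 500 cal from (400, 500, 800):", round(share_500((400, 500, 800)), 2))
```

Under these made-up parameters, adding the 800-calorie item roughly doubles the share choosing the 500-calorie item, even though nothing about the 400- and 500-calorie items has changed.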

If this is true, it can have potentially negative consequences. For instance, McDonald’s has on its menu a Bacon Angus Burger that is about 1,360 calories (calories are now displayed on McDonald’s menus courtesy of Richard Thaler). It is possible that people choose higher-calorie items when they see this option on the menu than when they do not.

More generally, people’s reliance on the menu to discover their own preferences means that marketers can manipulate what is seen as the middle (and hence ‘reasonable’). This also translates to some degree to politics where what is considered the middle (in both social and economic policy) is sometimes exogenously shifted by the elites.

That is but one way a choice on the menu can impact the preference order over other choices. Separately, a choice can sometimes prime people about how to judge other choices. For instance, in a paper exploring the effect of Nader on preferences over Bush and Kerry, researchers find that “[W]hen Nader is in the choice set all voters’ choices are more sharply aligned with their spatial placements of the candidates.”

All this means that assumptions of IIA (independence of irrelevant alternatives) need to be rethought. Adverse conclusions about human rationality are best withheld (see Sen).

Further Reading:

1. R. Duncan Luce and Howard Raiffa. Games and Decisions. John Wiley and Sons, Inc., 1957.
2. Amartya Sen. Internal consistency of choice. Econometrica, 61(3):495–521, May 1993.
3. Amartya Sen. Is the idea of purely internal consistency of choice bizarre? In J.E.J. Altham and Ross Harrison, editors, World, Mind, and Ethics. Essays on the ethical philosophy of Bernard Williams. Cambridge University Press, 1995.

Reconceptualizing the Effect of the Deliberative Poll

6 Sep

A Deliberative Poll proceeds as follows: respondents are surveyed, provided ‘balanced’ briefing materials, randomly assigned to moderated small-group discussions, given the opportunity to quiz experts or politicians in plenary sessions, and re-interviewed at the end. The “effect” is conceptualized as the average of Post minus Pre across all participants.

The effect of the Deliberative Poll is contingent upon a particular random assignment to small groups. This isn’t an issue if small-group composition doesn’t matter. If it does, then the counterfactual imagination of the ‘informed public’ is somewhat particularistic. Under those circumstances, one may want to come up with a distribution of what opinion change would look like if the assignment of participants to small groups were different. One can do this by estimating the impact of small-group composition on the dependent variable of interest and then predicting that variable under simulated alternate assignments.
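A rough sketch of what that might look like. The column names, the linear model, and the use of the group's mean pre-deliberation opinion as the composition summary are all assumptions made for illustration, not features of the Deliberative Poll itself.

```python
# Sketch (assumed column names and model form): re-randomize participants to
# small groups many times and recompute the predicted average post - pre
# change, yielding a distribution of "effects" rather than the single number
# tied to one realized assignment.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulated_effects(df, n_groups, n_sims=1000, seed=0):
    """df needs columns: pre, post, group (the realized assignment)."""
    rng = np.random.default_rng(seed)

    # Step 1: model post-opinion as a function of own pre-opinion and the
    # group's mean pre-opinion (one simple summary of group composition).
    df = df.copy()
    df["group_mean_pre"] = df.groupby("group")["pre"].transform("mean")
    model = smf.ols("post ~ pre + group_mean_pre", data=df).fit()

    # Step 2: predict the average change under alternative random assignments.
    effects = []
    for _ in range(n_sims):
        sim = df.copy()
        labels = np.repeat(np.arange(n_groups), len(df) // n_groups + 1)[: len(df)]
        sim["group"] = rng.permutation(labels)
        sim["group_mean_pre"] = sim.groupby("group")["pre"].transform("mean")
        effects.append((model.predict(sim) - sim["pre"]).mean())
    return np.array(effects)
```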

See also: Adjusting for covariate imbalance in experiments with SUTVA violations

Weighting to Multiple Datasets

27 Aug

Say there are two datasets: one that carries attitudinal variables and demographic variables (dataset 1), and another that carries just demographic variables (dataset 2). Also assume that dataset 2 is the larger and more accurate dataset for demographics (e.g., the CPS). Our goal is to weight a third dataset (dataset 3) so that it is “closest” to the population at large on both socio-demographic characteristics and attitudinal variables. We can proceed in the following manner: weight dataset 1 to dataset 2, and then weight dataset 3 to dataset 1. This means multiplying the weights. One may also impute attitudes for the larger dataset (dataset 2) using a prediction model built on dataset 1, and then use the larger dataset to generalize to the population.
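A minimal sketch of the chaining, assuming a single categorical demographic (education) and a single categorical attitude. The column names, the toy data, and the cell-based weighting on one margin at a time are simplifications of what one would do with real raking.

```python
# Sketch under stated assumptions: weight dataset 1 to dataset 2's demographic
# margins, weight dataset 3 to dataset 2's demographics and to dataset 1's
# (weighted) attitude margins, and multiply the two sets of weights.
import pandas as pd

def cell_weights(sample_col, target_shares):
    """Weights so the sample's shares on one categorical variable match target_shares."""
    sample_shares = sample_col.value_counts(normalize=True)
    return sample_col.map(target_shares / sample_shares)

# Toy stand-ins for the three datasets.
d2 = pd.DataFrame({"educ": ["hs"] * 6 + ["college"] * 4})            # e.g., CPS
d1 = pd.DataFrame({"educ": ["hs"] * 4 + ["college"] * 6,
                   "attitude": ["agree", "disagree"] * 5})
d3 = pd.DataFrame({"educ": ["hs"] * 5 + ["college"] * 5,
                   "attitude": ["agree"] * 6 + ["disagree"] * 4})

# Step 1: weight dataset 1 (demographics + attitudes) to dataset 2's margins.
demo_target = d2["educ"].value_counts(normalize=True)
d1["w"] = cell_weights(d1["educ"], demo_target)

# Step 2: weight dataset 3 to demographics and to the weighted attitude margins.
att_target = d1.groupby("attitude")["w"].sum() / d1["w"].sum()
d3["w_demo"] = cell_weights(d3["educ"], demo_target)
d3["w_att"] = cell_weights(d3["attitude"], att_target)
d3["w"] = d3["w_demo"] * d3["w_att"]
```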

Size Matters, Significantly

26 Aug

Achieving statistical significance is entirely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except, of course, if the samples are exactly the same). On the other hand, we may fail to reject even large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-ordered correlation is .71). This strong correlation is suggestive evidence that the average effect is roughly equal across polls (and hence power matters).
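A toy simulation makes the point (the effect size and sample sizes are arbitrary): with the same true difference in means, the share of comparisons that reach significance is driven by n.

```python
# Toy power simulation (arbitrary parameters): a fixed true difference in
# means is "significant" far more often as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff, sims = 0.2, 2000

for n in (50, 200, 800, 3200):
    rejections = 0
    for _ in range(sims):
        pre = rng.normal(0, 1, n)
        post = rng.normal(true_diff, 1, n)
        if stats.ttest_ind(pre, post).pvalue < 0.05:
            rejections += 1
    print(f"n = {n:4d}: significant in {rejections / sims:.0%} of simulations")
```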

When the conservative thing to do is to reject the null, for example, in “representativeness” analyses designed to see whether the experimental sample differs from the control, one may want to use large samples, say something about the substantive size of the differences, or ‘adjust’ the results for the differences. If we don’t, samples can look more ‘representative’ simply because the sample size is smaller. For instance, the rank-ordered correlation between the proportion of significant differences between non-participants and participants and the size of the smaller (participant) sample, across the 13 polls, is .5. The somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities: (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC

Adjusting for Covariate Imbalance in Experiments with SUTVA Violations

25 Aug

Consider the following scenario: the control group is 50% female while the participant sample is 60% female. Also assume that this discrepancy is solely a matter of chance and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. It is expected that the pool matters only in the small-group deliberations. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of females in a group affects attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments beyond the usual are needed. If, however, there is a strong relationship, one may want to adjust as follows: predict attitudes under simulated groups drawn from a weighted sample, with the probability of selection proportional to the weight. This gives us a distribution, which is appropriate, as women may be allocated to small groups in a variety of ways.
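A sketch of that adjustment under stated assumptions (the column names, the linear model, and the group size are mine, not taken from any Deliberative Poll data):

```python
# Sketch: estimate how a group's proportion female relates to the outcome
# using the natural variation across realized groups, then predict outcomes
# for simulated groups drawn from the weighted sample with selection
# probability proportional to the weight.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulated_outcomes(df, group_size=15, n_sims=1000, seed=0):
    """df needs columns: outcome, female (0/1), group, weight."""
    rng = np.random.default_rng(seed)

    # Exploit natural variation across realized groups.
    df = df.copy()
    df["prop_female"] = df.groupby("group")["female"].transform("mean")
    fit = smf.ols("outcome ~ female + prop_female", data=df).fit()

    # Predict the mean outcome for many simulated groups.
    probs = df["weight"] / df["weight"].sum()
    means = []
    for _ in range(n_sims):
        idx = rng.choice(df.index, size=group_size, replace=False, p=probs)
        draw = df.loc[idx].copy()
        draw["prop_female"] = draw["female"].mean()
        means.append(fit.predict(draw).mean())
    return np.array(means)   # distribution of predicted group means
```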

There are many caveats, beginning with the limitations of the data in estimating the impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where the proportions of subgroups are small, there may be inadequate variation across small groups.

This procedure can be generalized to a variety of cases where the effect is determined by the participant pool except where each participant interacts with the entire sample (or a large proportion of it). Reliability of the generalization will depend on getting good estimates.

Poor Browsers and Internet Surveys

14 Jul

Given,

  1. older browsers are likelier to display the survey incorrectly.
  2. the type of browser can be a proxy for the respondent’s proficiency in using computers and the speed of their Internet connection,

People using older browsers may abandon surveys at higher rates than those using more modern browsers.

Using data from a large Internet survey, we test whether people who use older browsers abandon surveys at higher rates, and whether their surveys have a larger amount of missing data.
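A sketch of the kind of test described, on simulated data (the variable names, the cutoff for an ‘older’ browser, and the toy data-generating process are all placeholders, not the actual study's):

```python
# Sketch: logistic regression of an abandonment indicator on whether the
# respondent used an older browser, with a simple connection-speed control.
# Data below are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "old_browser": rng.binomial(1, 0.2, n),   # 1 if browser predates some cutoff
    "broadband": rng.binomial(1, 0.7, n),     # 1 if on a fast connection
})
# Toy outcome: abandonment more likely with an old browser, less with broadband.
logit_p = -2 + 0.8 * df["old_browser"] - 0.5 * df["broadband"]
df["abandoned"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("abandoned ~ old_browser + broadband", data=df).fit()
print(fit.summary())
# A parallel model could use the share of item nonresponse as the outcome.
```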

Cricket: An Unfairly Random Game?

7 May

In many cricket matches, it is claimed that there is a clear advantage to bowling (or batting) first. The advantage is pointed to by commentators and by the captains of the competing teams in the pre-toss interview, and sometimes in the post-match interview.

The opportunity to bowl or bat first is decided by a coin toss. While this method of deciding who is advantaged is fair on average, it isn’t fair in any one game. At first glance, the imbalance seems inevitable; after all, someone has to bat first. One can, however, devise a baseball-like system in which short innings are interspersed. If that violates the nature of the game too much, one can instead create pitches that don’t deteriorate appreciably over the course of a game. Or one can come up with an estimate of the advantage and adjust scores accordingly (akin to the adjustment issued when matches are shortened due to rain).

But before we move to seriously consider these solutions, one may ask about the evidence.

Data are from nearly five thousand one-day international matches.

The team that wins the toss wins the match approximately 49.3% of the time. With 5335 matches, we cannot rule out that the true proportion is 50%. Thus, counter to intuition, the effect of winning the toss is, on average, at best minor. This may be because it is impossible to predict well in advance the advantage of bowling or batting first, or simply because teams are bad at predicting it, perhaps because they use bad heuristics.
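A back-of-the-envelope check using the numbers above (normal approximation to the binomial):

```python
# With 5335 matches and a 49.3% win rate for toss winners, the 95% confidence
# interval comfortably includes 50%.
import math

n = 5335
p_hat = 0.493
se = math.sqrt(p_hat * (1 - p_hat) / n)          # normal approximation
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI for P(toss winner wins): [{lo:.3f}, {hi:.3f}]")
```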

[Figure: proportion of matches won by the toss winner, over time]

No effect across the entire sample may hide subgroup effects. It is often claimed that the toss is more crucial in day-and-night matches, due to dew and the lower visibility of the white ball under lights. And the data show as much.

[Figure: proportion of matches won by the toss winner in day-and-night matches]

It may well be the case that the toss is more important in Test matches than in one-day matches.

GSS and ANES: Alike Yet Different

1 Jan

The General Social Survey (GSS), run out of the National Opinion Research Center at the University of Chicago, and the American National Election Studies (ANES), which until recently ran out of the University of Michigan’s Institute for Social Research, are two preeminent surveys tracking over-time trends in the social and political attitudes, beliefs, and behavior of the US adult population.

Outside of their shared Midwestern roots, the GSS and the ANES also share a sampling design: both use a stratified random sample, with the selection of PSUs shaped by the necessities of in-person interviewing, and, during the 1980s and 1990s, a sampling frame. However, in spite of this relatively close coordination in sampling and a common mode of interview, responses to the few questions asked identically in the two surveys diverge systematically.

In 1996, 2000, 2004, and 2008, the GSS and the ANES included the exact same questions on racial trait ratings. Limiting the sample to just White respondents, the mean difference between trait ratings of Whites and Blacks (on hard work and intelligence) was always greater in the ANES, almost always statistically significantly so.

Separately, the difference between the proportions of self-identified Republicans estimated by the ANES and the GSS is declining over time.

This unexplained directional variance poses a considerable threat to inference. The problem takes on additional gravity given that the two surveys are the bedrock of much important empirical research in social science.

The Perils of Balancing Scales

15 Nov

Randomizing scale order (balancing) across respondents is common practice. It is done to ‘cancel’ errors generated by ‘satisficers,’ who presumably pick the first option on a scale without regard to content. The practice is assumed to have no impact on the propensity of satisficers to pick the first option, or on other respondents; both assumptions are somewhat unlikely.

A far more reasonable hypothesis is that reversing scale order does have an impact on respondents, both non-satisficers and satisficers. Empirically, people take longer to fill out reverse-ordered scales, and it is conceivable that they pay more attention to their responses, reducing satisficing and perhaps boosting the quality of responses; either way, the errors are not simply ‘canceled’ among a subset, as hypothesized.

Among satisficers, without randomization, correlated bias may produce artificial correlations across variables where none exist. For example, satisficers (say, the less educated) appear to love candy simply because ‘love candy’ comes first on a love-to-hate scale. Such a calamity ought to be avoided. However, in the minority of cases where satisficers’ true preferences are those expressed in the first choice, randomization will artificially produce null results. Randomization may be more suboptimal still if there are indeed effects on the rest of the respondents.
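A toy simulation of the correlated-bias point (all parameters are invented): satisficers always pick whichever option appears first, so with a fixed scale order a trait correlated with satisficing shows a spurious correlation with the reported attitude; randomizing the order averages that away.

```python
# Toy simulation (assumptions mine): satisficers pick whatever option appears
# first. With a fixed order, a trait tied to satisficing (low education) shows
# a spurious correlation with "loving candy"; balancing removes it.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

low_educ = rng.binomial(1, 0.5, n)
satisficer = rng.binomial(1, 0.1 + 0.4 * low_educ)      # satisficing tied to education
true_love_candy = rng.binomial(1, 0.5, n)                # true attitude, unrelated

def observed(reverse_order):
    # On a "love candy ... hate candy" scale, the first option is "love candy"
    # unless the order is reversed for that respondent.
    first_is_love = 1 - reverse_order
    return np.where(satisficer == 1, first_is_love, true_love_candy)

fixed = observed(np.zeros(n, dtype=int))                 # everyone sees the same order
balanced = observed(rng.binomial(1, 0.5, n))             # order randomized per respondent

print("corr(low educ, love candy), true:       ", round(np.corrcoef(low_educ, true_love_candy)[0, 1], 3))
print("corr(low educ, love candy), fixed order:", round(np.corrcoef(low_educ, fixed)[0, 1], 3))
print("corr(low educ, love candy), balanced:   ", round(np.corrcoef(low_educ, balanced)[0, 1], 3))
```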

Within survey experiments, where the balancing randomization is “orthogonal” (typically just separate) to the main randomization, it has to be further assumed that the manipulation has an equal impact on “satisficers” whether the scale is reverse or regularly ordered, again a somewhat tenuous assumption.

The entire exercise of randomization is devoted not to finding out the true preferences of the satisficers, a more honorable purpose, but to eliminating them from the sample. There are better ways to catch ‘satisficers’ than randomizing across the entire sample. One possibility is to randomize within a smaller set of likely satisficers. On knowledge questions, ability estimated over multiple questions can be used to inform the propensity that the first option (if correct and if chosen) was not a guess. Response latency can be used to inform these judgments as well. For attitude questions, follow-up questions measuring the strength of the attitude, etc., can be used to weight responses.

If we are interested in getting true attitudes from ‘satisficers,’ we may want to motivate respondents, either by interspersing exhortations that their responses matter or by providing financial incentives.

Lastly, it is important to note that combining two kinds of systematic error doesn’t make the result a ‘random’ error. And no variance in the data can be considered a conservative attribute of the data (with hardworking social scientists around).