Against Proxy Variables

Lacking direct measures of the theoretical variable of interest, some scholars rely on ‘proxy variables’. For instance, some have used years of education as a proxy for cognitive ability. However, using proxy variables can be problematic for two reasons: (1) a proxy may not track the theoretical variable of interest very well, and (2) it may track confounding variables other than the theoretical variable of interest. In the case of years of education as a proxy for cognitive ability, the concerns manifest themselves as follows:

1) Cognitive ability causes, and is a consequence of, which courses you take and which school you go to, in addition to, of course, years of education. The GSS, for instance, contains more granular measures of education, such as whether the respondent took a science course in college, and such a variable nearly always proves significant when predicting knowledge, etc. This is all somewhat surmountable, as it can be seen as measurement error.

2) More problematically, years of education may track other confounding variables: diligence, education of parents, economic strata, etc. Education also endows people with more than cognitive ability; it causes potentially confounding variables such as civic engagement, knowledge, etc.

Conservatively, we can attribute the effect of a variable only to the variable itself. That is, we only have the variables we enter. If one does rely on ‘proxy variables’, one may want to address the two points mentioned above.

Measuring Impact of Media

Measuring the impact of media accurately has proven a challenge. Findings of minimal effects abound when intuition tells us that an activity that an average American engages in over forty hours a week is likely to have a larger impact. These insignificant findings have been typically attributed to frailty of survey self-reports of media exposure, though debilitating error in dependent variables has also been noted as a culprit. Others have noted weaknesses in research design, inadequate awareness of analytic techniques that allow one to compensate for error in measures, etc. as stumbling blocks.

Here are a few of the methods that have been used to overcome some of the problems in media research, along with some modest new proposals of my own –

  • Measurement
    Since measures are error prone, one strategy has been to combine multiple measures. Multiple measures of a single latent concept can be combined using latent variable models, factor analysis, or even simple averaging. Precaution must be taken to check that errors across measures aren’t heavily correlated, for under such conditions improvements from combining multiple measures are likely to be weak or non-existent. In fact deleterious effects are possible.

    Another point of worry is that measurement error can be correlated with irrelevant respondent characteristics. For instance, women guess less than men on knowledge questions. Hence responses to knowledge questions are a function of both ability and the propensity to guess when one doesn’t know (proxied here by gender). By conditioning on gender, we can recover better estimates of ‘ability’. Another application would be in handling satisficing.

  • Measurement of exposure
    Rather than use self-assessments of exposure, which have been shown to be correlated with confounding variables, one may want to track incidental consequences of exposure as a measure of exposure. Examples include knowledge of the words of a campaign jingle, the attributes of a character in a campaign commercial, the source (~channel) on which the campaign was shown, the program, etc. These measures factor in attention, in addition to exposure, which is useful. Unobtrusive monitoring of consumption is of course likely to be even more effective.

  • Measurement of Impact
    1. Increased exposure to positive images ought to change procedural memory and implicit associations. One can use IAT or AMP to assess the effect.
    2. Tracking Twitter and Facebook feeds for relevant information. These measures can be calibrated to opinion poll data to get a sense of what they mean.
  • Data Collection
    1. Data collection efforts need to reflect the half-life of the effect. Recent research indicates that some of the impact of the media may be short-lived. Short-term effects may be increasingly consequential as people increasingly have the ability to act on their impulses, be it buying something, donating to a campaign, or finding more information about the product. Behavioral measures (e.g. website hits) corresponding to ads may thus be one way to track impact.
    2. Future ‘panels’ may contain solely passive monitoring of media use (both input and output) and consumption behavior.
  • Estimating recipient characteristics via secondary data
    1. Geocoded IP addresses can be used to harvest secondary demographic data (race, income, etc.) from the census.
    2. Paradata, such as which browser and operating system the customer uses, are reasonable indicators of tech savvy, and these data are readily harvested.
    3. Datasets can be merged via ‘matching’ or by exploiting correlation across items and by calibrating.
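
The caution under Measurement about correlated errors can be made concrete with a little arithmetic. What follows is a minimal sketch, not a recommended estimator. It assumes each of k measures equals the latent trait plus an error, that all errors share one variance and one pairwise correlation `rho`, and that errors are unrelated to the trait. The function names are hypothetical.

```python
def error_variance_of_mean(k, error_var=1.0, rho=0.0):
    """Error variance of the simple average of k measures.

    Assumes every measure has the same error variance and every pair
    of errors has the same correlation rho (illustrative assumptions).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return error_var * (1 + (k - 1) * rho) / k


def reliability_of_mean(k, true_var=1.0, error_var=1.0, rho=0.0):
    """Share of the averaged score's variance that is due to the latent trait."""
    err = error_variance_of_mean(k, error_var, rho)
    return true_var / (true_var + err)


# Compare how much averaging helps as the error correlation grows.
for rho in (0.0, 0.5, 0.9):
    gains = [round(reliability_of_mean(k, rho=rho), 3) for k in (1, 2, 4, 8)]
    print(f"rho={rho}: reliability for k=1,2,4,8 -> {gains}")
```

Under these assumptions the error variance of the average never falls below `rho * error_var`, however many measures are added, which is why heavily correlated errors leave little to gain from combining.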

Re-conceptualizing the effect of the Deliberative Poll

A Deliberative Poll proceeds as follows: respondents are surveyed, provided ‘balanced’ briefing materials, randomly assigned to moderated small group discussions, allowed the opportunity to quiz experts or politicians in plenary sessions, and re-interviewed at the end. The “effect” is conceptualized as the average post-minus-pre difference across all participants.

The effect of the Deliberative Poll is contingent upon a particular random assignment to small groups. This isn’t an issue if small group composition doesn’t matter. If it does, then the counterfactual imagination of the ‘informed public’ is somewhat particularistic. Under those circumstances, one may want to come up with a distribution of what opinion change would look like if the assignment of participants to small groups were different. One can do this by estimating the impact of small group composition on the dependent variable of interest, and then predicting the dependent variable under simulated alternate assignments.
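
The simulation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in any actual poll. `predicted_change` stands in for a model that would be estimated from the observed small groups, its coefficients are made up, and gender is used as the only compositional feature purely for simplicity.

```python
import random
import statistics


def predicted_change(is_female, share_female):
    """Hypothetical fitted model of individual opinion change.

    The coefficients are invented for illustration; in practice they
    would come from a model estimated on the observed small groups.
    """
    if is_female:
        return 0.2 + 0.5 * share_female
    return 0.2 - 0.3 * share_female


def simulate_mean_change(participants, n_groups, n_sims=2000, seed=1):
    """Average predicted change under repeated random regroupings."""
    rng = random.Random(seed)
    people = list(participants)
    means = []
    for _ in range(n_sims):
        rng.shuffle(people)
        # Deal the shuffled participants into n_groups groups.
        groups = [people[g::n_groups] for g in range(n_groups)]
        total = 0.0
        for group in groups:
            share = sum(group) / len(group)
            total += sum(predicted_change(p, share) for p in group)
        means.append(total / len(people))
    return means


# 60 women (True) and 40 men (False) split into 10 small groups.
sample = [True] * 60 + [False] * 40
draws = sorted(simulate_mean_change(sample, n_groups=10))
print("mean of simulated effects:", round(statistics.mean(draws), 3))
print("central 95% of simulations:", round(draws[50], 3), round(draws[-51], 3))
```

Note that if the composition effect were the same linear function of group share for everyone, the average would be identical under every assignment; the spread here comes from the effect differing by the participant's own gender.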

See also: Adjusting for covariate imbalance in experiments with SUTVA violations

Weighting to multiple datasets

Say there are two datasets: one that carries both attitudinal and demographic variables (dataset 1), and another that carries just demographic variables (dataset 2). Also assume that dataset 2 is the larger and more accurate dataset for demographics (e.g. CPS). Our goal is to weight a third dataset (dataset 3) so that it is “closest” to the population at large on both socio-demographic characteristics and attitudinal variables. We can proceed in the following manner: weight dataset 1 to dataset 2, and then weight dataset 3 to dataset 1. This will mean multiplying the weights. One may also impute ‘attitudes’ for the larger dataset (dataset 2), using a prediction model built on dataset 1, and then use the larger dataset to generalize to the population.
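
One reading of the two-step procedure is sketched below, using simple cell-based post-stratification. The choice of post-stratification is an assumption made here for illustration, since no particular weighting method is specified, and the toy data and function names are hypothetical.

```python
from collections import Counter


def cell_shares(cells, weights=None):
    """Weighted share of each cell (e.g. a demographic category)."""
    if weights is None:
        weights = [1.0] * len(cells)
    totals = Counter()
    for cell, w in zip(cells, weights):
        totals[cell] += w
    grand = sum(totals.values())
    return {cell: t / grand for cell, t in totals.items()}


def poststratify(sample_cells, target_shares):
    """Weight = target share of the unit's cell / sample share of that cell."""
    sample_shares = cell_shares(sample_cells)
    return [target_shares[c] / sample_shares[c] for c in sample_cells]


# Toy data (made up). Dataset 2: demographics only, treated as the benchmark.
d2_demo = ["young"] * 30 + ["old"] * 70
# Dataset 1: demographics plus an attitude.
d1_joint = ([("young", "favor")] * 40 + [("young", "oppose")] * 10
            + [("old", "favor")] * 20 + [("old", "oppose")] * 30)
d1_demo = [demo for demo, _ in d1_joint]
# Dataset 3: the dataset we ultimately want to weight.
d3_joint = ([("young", "favor")] * 25 + [("young", "oppose")] * 25
            + [("old", "favor")] * 25 + [("old", "oppose")] * 25)

# Step 1: weight dataset 1 to dataset 2 on demographics.
weights1 = poststratify(d1_demo, cell_shares(d2_demo))
# Step 2: weight dataset 3 to the weighted dataset 1 on demographics and attitudes.
weights3 = poststratify(d3_joint, cell_shares(d1_joint, weights1))
```

The sketch requires every cell in dataset 3 to appear in dataset 1; sparse or empty cells would call for raking or a model-based alternative.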

Size matters, significantly

Achieving statistical significance is entirely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except of course if the samples are exactly the same). On the other hand, we may fail to detect even large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-order correlation is .71). This sharp correlation is suggestive evidence that the average effect is roughly equal across polls (and hence that power matters).
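
The dependence on sample size is easy to see with a fixed, small difference. The sketch below uses a two-sample z-test with a known common standard deviation, which is a simplifying assumption; the effect size of 0.1 standard deviations is arbitrary and the function name is hypothetical.

```python
import math


def two_sample_z_pvalue(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, using the normal approximation and a known common sd."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = abs(diff) / se
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# The same 0.1 sd difference, evaluated at increasing sample sizes.
for n in (25, 100, 400, 1600, 6400):
    p = two_sample_z_pvalue(diff=0.1, sd=1.0, n_per_group=n)
    print(f"n per group = {n:5d}  p = {p:.4f}")
```

The difference never changes, yet the p-value falls steadily as n grows.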

When the conservative thing to do is to reject the null, for example, in “representativeness” analysis designed to see if the experimental sample is different from control, one may want to go for large sample sizes, or say something about the substantive size of the differences, or ‘adjust’ results for the differences. If we don’t do that, samples can look more ‘representative’ as sample size shrinks. So for instance, the rank-order correlation between the proportion of significant differences between non-participants and participants, and the size of the smaller sample (the participant sample), for the 13 polls is .5. The somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities: (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC

Adjusting for covariate imbalance in experiments with SUTVA violations

Consider the following scenario: the control group is 50% female while the participant sample is 60% female. Also assume that this discrepancy is solely a matter of chance, and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.
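
As a worked illustration of such an adjustment, the sketch below stratifies on gender, which is one simple alternative to matching or regression. All the numbers are invented, and treating the control group's 50/50 split as the target is an assumption made for the example.

```python
def overall_mean(means_by_group, shares):
    """Mean outcome implied by subgroup means and subgroup shares."""
    return sum(means_by_group[g] * shares[g] for g in shares)


def stratified_effect(treated_means, control_means, target_shares):
    """Average the within-subgroup effects using a common set of shares."""
    return sum(share * (treated_means[g] - control_means[g])
               for g, share in target_shares.items())


# Hypothetical subgroup means (made up for illustration).
treated_means = {"female": 0.80, "male": 0.50}
control_means = {"female": 0.50, "male": 0.40}

treated_shares = {"female": 0.6, "male": 0.4}   # participant sample
control_shares = {"female": 0.5, "male": 0.5}   # control group

# Naive estimate: difference in overall means, ignoring the imbalance.
naive = (overall_mean(treated_means, treated_shares)
         - overall_mean(control_means, control_shares))
# Adjusted estimate: within-gender effects averaged at the control shares.
adjusted = stratified_effect(treated_means, control_means, control_shares)
print("naive:", round(naive, 3), "adjusted:", round(adjusted, 3))
```

The naive estimate mixes the treatment effect with the baseline gap between women and men; the stratified estimate removes that part.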

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. It is expected that the pool matters only in small group deliberation. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of women in a group affects attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments outside the usual are needed. If however there is a strong relationship, one may want to adjust as follows: predict attitudes under simulated groups drawn from a weighted sample, with probability of selection proportional to the weight. This gives us a distribution, which is the correct result, since women may be allocated to small groups in a variety of ways.
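
The first step, estimating how group composition relates to attitudes, might look like the sketch below. It fits a bivariate least-squares line to group-level data; the data are invented, the function name is hypothetical, and a real analysis would likely use individual-level models with more covariates.

```python
def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with 2+ points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("no variation in x across groups")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up small-group data: share of women and mean post-deliberation attitude.
share_women = [0.30, 0.40, 0.50, 0.60, 0.70]
mean_attitude = [0.42, 0.47, 0.49, 0.55, 0.58]

slope, intercept = ols_slope_intercept(share_women, mean_attitude)
print("estimated slope:", round(slope, 3))
```

A slope near zero would suggest that the usual adjustments suffice; a sizeable slope is the case in which simulating alternative groupings becomes worthwhile.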

There are many caveats, beginning with limitations of data in estimating impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where proportions of subgroups are somewhat small, inadequate variation across small groups can result.

This procedure can be generalized to a variety of cases where effect is determined by the participant pool except where each participant interacts with the entire sample (or a large proportion of it). Reliability of the generalization will depend on getting good estimates.

Measuring Affect Coldly

There are many ways of measuring how people feel about another group: explicit questions (feeling thermometers, like/dislike scales, favorability ratings, etc.), explicit measures asked using mechanisms designed to overcome or attenuate social desirability concerns (bogus pipeline, ACASI, etc.), and a plethora of implicit measures (affect misattribution, IAT, etc.). Beyond these, there exist a few other interesting ways of measuring affect:

  • Games as measures – Jeremy Weinstein uses ‘games’ like the dictator game to measure (inter-ethnic) affect. One can use prisoner’s dilemma, among other games, to do the same.
  • Systematic bias in responding to factual questions when ignorant about the correct answer – For example, in most presidential election years since 1988, ANES has posed a variety of retrospective evaluative and factual questions, including assessments of the state of the economy and of whether inflation/unemployment/crime rose, remained the same, or declined in the past year (or some other time frame). Analyses of these questions have revealed significant ‘partisan bias’, but these questions have yet to be used as a measure of the ‘partisan affect’ that is the likely cause of the observed ‘bias’.

GSS and ANES: alike yet different

The General Social Survey (GSS), run out of the National Opinion Research Center at the University of Chicago, and the American National Election Studies (ANES), which until recently ran out of the University of Michigan’s Institute for Social Research, are two preeminent surveys tracking over-time trends in the social and political attitudes, beliefs and behavior of the US adult population.

Outside of their shared Midwestern roots, GSS and ANES also share a sampling design: both use a stratified random sample, with selection of PSUs affected by the necessities of in-person interviewing, and during the 1980s and 1990s they shared a sampling frame. However, in spite of this relatively close coordination in sampling and a common mode of interview, responses to a few questions asked identically in the two surveys diverge systematically.

In 1996, 2000, 2004, and 2008, GSS and ANES included the exact same questions on racial trait ratings. Limiting the sample to just White respondents, the mean difference in trait ratings of Whites and Blacks (ratings of hard work and intelligence) was always greater in ANES, and almost always statistically significantly so.

Separately, difference in proportion of self-identified Republicans estimated by ANES and GSS is declining over time.

This unexplained directional variance poses a considerable threat to inference. The problem takes on additional gravity given that the surveys are the bedrock of important empirical research in social science.

Another coding issue in the ANES Cumulative File

Technology has made it easy to analyze data. However, we have paid inadequate attention to developing automation in data analysis software that flags potential problems with the data itself. For example, I was recently using the ANES cumulative file to explore how interviewer-rated political knowledge varied by respondent’s level of education within each year. It was only when I plotted the confidence bounds (not earlier) that I found that in 2004 the 7-category education variable (VCF0140a) had fewer than 7 levels, a highly unlikely scenario. To verify, I checked the unique levels for education in 2004, and indeed there were only 5:
unique(nes$vcf0140a[nes$vcf0004=="2004"])
[1] 6 5 2 3 1

The variable from which the 7-category variable is ostensibly constructed (V043254) has 8 levels in 2004. Since the plot looks reasonable for 2004, the problem was more likely an (unwarranted) collapsing of adjacent categories than something more careless, such as a switched category order. Tallying raw counts revealed that categories 6 and 7, 0 and 1, and 4 and 5 had been collapsed.

On to the point about developing software that automatically flags potential problems: it would be nice if the software flagged a differing number of levels of the same variable across years. However, this suggestion is piecemeal, and more careful thinking ought to be brought to bear on such design issues.
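
A check of this kind takes only a few lines. The sketch below is one possible version, not a feature of any existing package: it assumes the data arrive as a list of dictionaries, treats the most common number of levels as the norm, and uses toy rows in place of the real cumulative file. The function names are hypothetical.

```python
from collections import defaultdict


def levels_by_year(rows, year_key, var_key):
    """Map each year to the set of distinct non-missing values of a variable."""
    levels = defaultdict(set)
    for row in rows:
        value = row.get(var_key)
        if value is not None:
            levels[row[year_key]].add(value)
    return dict(levels)


def flag_odd_years(rows, year_key, var_key):
    """Return years whose number of levels differs from the most common count."""
    levels = levels_by_year(rows, year_key, var_key)
    counts = [len(v) for v in levels.values()]
    if not counts:
        return {}
    typical = max(set(counts), key=counts.count)
    return {yr: sorted(v) for yr, v in levels.items() if len(v) != typical}


# Toy rows (made up): two years with 7 levels, one year with only 5.
toy_rows = [{"vcf0004": yr, "vcf0140a": lvl}
            for yr in (2000, 2002) for lvl in range(1, 8)]
toy_rows += [{"vcf0004": 2004, "vcf0140a": lvl} for lvl in (6, 5, 2, 3, 1)]

print(flag_odd_years(toy_rows, "vcf0004", "vcf0140a"))
```

Running such a check over every categorical variable before analysis would surface this kind of silent recoding without relying on a plot.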

Perils of balancing scales

Randomization of scale order (balancing) across respondents is common practice. It is done to ‘cancel’ errors generated by ‘satisficers’ who presumably pick the first option on a scale without regard to content. The practice is assumed to have no impact on propensity of satisficers to pick the first option, or on other respondents, both somewhat unlikely assumptions.

A far more reasonable hypothesis is that reversing scale order does have an impact on respondents, on both non-satisficers and satisficers. Empirically, people take longer to fill out reverse ordered scales, and it is conceivable that they pay more attention to filling out the responses – reducing satisficing and perhaps boosting quality of responses, either way not simply ‘cancelling’ errors among a subset, as hypothesized.

Among satisficers, without randomization, correlated bias may produce artificial correlations across variables where none exist. For example, satisficers (say, the uneducated) would appear to love candy (on a scale running from love candy to hate candy). Such calamity ought to be avoided. However, in the minority of cases where satisficers’ true preferences are those expressed in the first choice, randomization will artificially produce null results. Randomization may be more sub-optimal still if there indeed are effects on the rest of the respondents.

Within survey experiments, where balancing randomization is “orthogonal” (typically just separate) to the main randomization, it has to be further assumed that the manipulation has equal impact on “satisficers” under both reverse and regularly ordered scales, again a somewhat tenuous assumption.

The entire exercise of randomization is devoted not to finding out the true preferences of the satisficers, a more honorable purpose, but to eliminating them from the sample. There are better ways to catch ‘satisficers’ than randomizing across the entire sample. One possibility is to randomize within a smaller set of likely satisficers. On knowledge questions, ability estimated over multiple questions can be used to estimate the probability that the first option (if correct and if chosen) was not a guess. Response latency can be used as well to inform judgments. For attitude questions, follow-up questions measuring strength of attitude etc. can be used to weight responses.

If we are interested in getting true attitudes from ‘satisficers’ we may want to motivate respondents either by interspersed exhortations that their responses matter, or by providing financial incentives.

Lastly, it is important to note that combining two kinds of systematic error doesn’t make it a ‘random’ error. And no variance in data can be a conservative attribute of data (with hardworking social scientists around).