Weighting to multiple datasets

Say there are two datasets – one that carries both attitudinal and demographic variables (dataset 1), and another that carries just demographic variables (dataset 2). Also assume that dataset 2 is the larger and more accurate dataset for demographics (e.g., the CPS). Our goal is to weight a third dataset (dataset 3) so that it is “closest” to the population at large on both socio-demographic characteristics and attitudinal variables. We can proceed as follows: weight dataset 1 to dataset 2, and then weight dataset 3 to dataset 1. This amounts to multiplying the weights. One may also impute ‘attitudes’ for the larger dataset (dataset 2) using a prediction model built on dataset 1, and then use the larger dataset to generalize to the population.
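The chained weighting can be sketched as follows. A Python sketch with purely hypothetical margins and a single weighting variable (gender); a real application would rake over many demographic cells:

```python
from collections import Counter

def cell_weights(sample_cells, target_cells):
    # Post-stratification weights: target share / sample share, per cell
    n_s, n_t = len(sample_cells), len(target_cells)
    s, t = Counter(sample_cells), Counter(target_cells)
    return {c: (t[c] / n_t) / (s[c] / n_s) for c in s}

# Hypothetical gender margins for the three datasets
d2 = ['F'] * 52 + ['M'] * 48  # large, accurate benchmark (e.g., CPS)
d1 = ['F'] * 60 + ['M'] * 40  # carries attitudes and demographics
d3 = ['F'] * 45 + ['M'] * 55  # the sample we want to generalize from

w1 = cell_weights(d1, d2)            # weight dataset 1 to dataset 2
w2 = cell_weights(d3, d1)            # weight dataset 3 to dataset 1
w = {c: w1[c] * w2[c] for c in w2}   # chained weight: the product

# With the chained weights, dataset 3 reproduces dataset 2's margin
total = sum(w[g] for g in d3)
share_f = sum(w[g] for g in d3 if g == 'F') / total  # 0.52, matching d2
```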

Star Trek: Trekking uncertainly between utopia and twentieth century earth

Star Trek (and its spin-offs) is justly applauded for including socially progressive ideas in both the themes of its stories and the cultural fabric of its imagined future. For instance, women and minorities command positions of responsibility; those working for the ‘Federation’ take ethical questions seriously; and both ‘Data’ (an android) and empathy (via a ‘Betazoid’ counselor) play a central role in command decisions (at least in one of the series).

There are other pleasant aspects of the show. The background hum of a ship replaces the cacophonous noise that passes for a background score on many shows; order prevails; professionalism and intelligence are shown being rewarded; backroom machinations are absent; and the thrill of exploration and discovery is elevated to a virtue.

However, there are a variety of places where either insufficient thought or distinctly twentieth-century considerations intrude. For one, the central protagonists belong to ‘Star Fleet’, the military (and peacekeeping) arm of the ‘Federation.’ More distressingly, this military arm seems to be run internally on many of the same time-worn principles as on twentieth-century Earth, including an extremely hierarchical code, uniform clothing, etc. The saving grace is that most members of Star Fleet are technical personnel. Still, the choice to conceptualize the protagonists as belonging to the military wing (of an arguably peaceful organization) is somewhat troubling.

There are other ‘backward’ aspects. Inter-species stereotyping is common. For instance, the Ferengi are mostly shown as irredeemably greedy, the Romulans and Klingons as devoted to war, and the Borg and the Dominion as simply evil. While some episodes make attempts at dealing with the issue, attributing psychological traits to entire cultures and worlds is relatively common. Further, and regrettably, the uniforms of women in some of the series are noticeably tighter.

More forgivably perhaps, there is an almost exclusive focus on people in command. This is perhaps necessitated by the demands of creating drama beyond the interpersonal, most easily achieved by focusing on consequential situations that affect the fate of many – the kinds of situations that mostly people in command confront (in the hierarchical institutional format shown). The hierarchical structure and the need for drama often combine to create some absurdity. Since those in command have to be shown ‘commanding’, the captain of the ship theatrically gives the largely superfluous order of ‘engage’ (akin to asking the driver to ‘drive’ when he knows he has to drive you to the destination). Similarly, given the level of automation and technological sophistication shown, opportunities for displaying heroism often have to be contrived. Hence many of the ‘missions’ are tremendously low-tech.

Where does this leave us? Nowhere in particular, except perhaps with a slightly better appreciation of some of the ‘tensions’ between how the show is often imagined by ‘nerds’ (as a vision of utopia) and what the show really is.

Size matters, significantly

Achieving statistical significance is largely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except, of course, if the samples are exactly the same). On the other hand, we may fail to reject even large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-order correlation is .71). This sharp correlation is suggestive evidence that the average effect is roughly equal across polls (and hence power matters).
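The dependence of significance on sample size can be illustrated with the standard closed-form power approximation for a two-sided, two-sample test of means; the effect size of 0.2 SD below is purely illustrative, not estimated from the polls:

```python
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_sample(d, n, z_crit=1.96):
    # Approximate power of a two-sided, two-sample test of means with
    # effect size d (in SD units) and n observations per group
    shift = d * sqrt(n / 2)
    return (1 - norm_cdf(z_crit - shift)) + norm_cdf(-z_crit - shift)

# The same modest effect (0.2 SD) is rarely detected with 50 per group
# but almost always detected with 1000 per group
small_n_power = power_two_sample(0.2, 50)
large_n_power = power_two_sample(0.2, 1000)
```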

When the conservative thing to do is to reject the null – for example, in “representativeness” analyses designed to see whether the experimental sample differs from the control – one may want to go for large sample sizes, say something about the substantive size of the differences, or ‘adjust’ results for the differences. If we don’t, samples will look more ‘representative’ as sample size falls. For instance, the rank-order correlation between the proportion of significant differences between non-participants and participants and the size of the smaller (participant) sample, across the 13 polls, is .5. This somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities: (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC

Adjusting for covariate imbalance in experiments with SUTVA violations

Consider the following scenario: the control group is 50% female while the participant sample is 60% female. Also assume that this discrepancy is solely a matter of chance, and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. The pool is expected to matter only in small-group deliberation. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of females in a group impacts attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments beyond the usual are needed. If, however, there is a strong relationship, one may want to adjust as follows: predict attitudes under simulated groups drawn from a weighted sample, with the probability of selection proportional to the weight. This yields a distribution – correctly so – as women may be allocated in a variety of ways to small groups.
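A minimal Python sketch of the simulation step, with a made-up participant pool, made-up weights, and an assumed (estimated elsewhere) linear relationship between a group’s proportion female and attitudes:

```python
import random

random.seed(0)

# Hypothetical participant pool (1 = female, 0 = male) and weights that
# down-weight the over-represented group
genders = [1] * 60 + [0] * 40
weights = [0.8] * 60 + [1.3] * 40

def predicted_attitude(prop_female, intercept=0.5, slope=0.3):
    # Assumed group-level relationship, estimated elsewhere from the
    # natural variation across randomly composed small groups
    return intercept + slope * prop_female

def simulate_group_predictions(n_sims=2000, group_size=15):
    # Draw simulated small groups from the weighted sample (selection
    # probability proportional to weight) and predict attitudes for each
    preds = []
    for _ in range(n_sims):
        group = random.choices(genders, weights=weights, k=group_size)
        preds.append(predicted_attitude(sum(group) / group_size))
    return preds  # a distribution, reflecting the many possible allocations

sims = simulate_group_predictions()
mean_pred = sum(sims) / len(sims)
```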

There are many caveats, beginning with the limitations of the data in estimating the impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where the proportions of subgroups are small, there may be inadequate variation across small groups.

This procedure can be generalized to a variety of cases where the effect is determined by the participant pool, except where each participant interacts with the entire sample (or a large proportion of it). The reliability of the generalization will depend on getting good estimates.

Poor Browsers and Internet Surveys


Browsers may matter for Internet surveys in at least two ways:

  1. Older browsers are likelier to display the survey incorrectly.
  2. The type of browser can serve as a proxy for a respondent’s proficiency with computers and the speed of their Internet connection.

People using older browsers may abandon surveys at higher rates than those using more modern browsers.

Using data from a large Internet survey, we test whether people who use older browsers abandon surveys at higher rates, and whether their surveys contain a larger amount of missing data.
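One simple way to run such a test is a two-proportion z-test of abandonment rates across browser types; the counts below are hypothetical, not the survey’s actual figures:

```python
from math import erf, sqrt

def two_prop_z(x1, n1, x2, n2):
    # Two-sided z-test for the equality of two proportions
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: abandonments among old- vs. modern-browser users
z, p = two_prop_z(120, 800, 90, 900)  # 15% vs. 10% abandonment
```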


(Based on data from the 111th Congress)

Law is the most popular degree on Capitol Hill (as it has been for a long time) – nearly 52% of senators and 36% of House members have a degree in law. There are some differences across parties and across houses: in the Senate, Republicans are likelier to have a law degree than Democrats (58% to 48%), while the reverse holds in the House, where more Democrats have law degrees than Republicans (40% to 32%). Less than 10% of members of Congress have a degree in the natural sciences or engineering. Nearly 8% have a degree from Harvard, making Harvard’s the largest alumni contingent on Capitol Hill. Yale is a distant second, with less than half the number that went to Harvard.

Does children’s gender cause partisanship?

More women identify as Democrats than as Republicans. The disparity is even greater among single women. It is possible (perhaps even likely) that this difference in partisan identification is due to the (perceived) policy positions of Republicans and Democrats.

Now let’s do a thought experiment. Imagine a couple about to have a kid, and assume that the couple doesn’t engage in sex selection. Two things can happen: the couple can have a son or a daughter. It is possible that having a daughter pushes a parent’s policy preferences in a direction perceived as more congenial to women. It is also possible that having a son has the opposite impact, pushing parents toward more male-congenial political preferences. Overall, it is possible that the gender of the child makes a difference to the parents’ policy preferences. With panel data, one can identify both movements. With cross-sectional data, one can only identify the difference between those who had a son and those who had a daughter.

Let’s test this using cross-sectional data from Jennings and Stoker’s “Study of Political Socialization: Parent-Child Pairs Based on Survey of Youth Panel and Their Offspring, 1997”.

Let’s assume that a couple’s partisan affiliation doesn’t impact the gender of their kid.

The number of kids, however, is determined by personal choice, which in turn may be affected by ideology, income, etc. For example, it is likely that conservatives have more kids, as they are less likely to believe in contraception; this is also supported by the data. (Ideology is a post-treatment variable. This may not matter if the impact of having a daughter is the same in magnitude as the impact of having a son, and if there are similar numbers of each across people.)

Hence one may conceptualize the ‘treatment’ as the gender of the kids, conditional on the number of kids.

Understandably, we only study people who have one or more kids.

Conditional on the number of kids, the more daughters a respondent has, the less likely the respondent is to identify as a Republican (b = -.342, p < .01), with the dependent variable curtailed to a dichotomous Republican/Democrat variable. The relationship holds – indeed becomes stronger – if the dependent variable is coded as an ordinal trichotomous variable (Republican, Independent, Democrat) and an ordered multinomial model is estimated.

Future –

If what we observe is true, then as party stances evolve, the impact of a child’s gender on a parent’s policy preferences should vary as well. One should also be able to test this cross-nationally.

Some other findings –

  1. The probability of having a son (limiting to live births in the U.S.) is about .51. This ‘natural rate’ varies slightly by income – daughters are more likely to be born in lower-income families. However, the effect of income is extremely modest in the U.S., to the point of being ignorable. The live-birth ratio is marginally rebalanced by the higher child mortality rate among males. As a result, among those aged 0–21 in the U.S., the ratio of men to women is about equal.

    In the sample, there are significantly more daughters than sons. The female/male ratio is 1.16, which is ‘significantly’ unusual.

  2. If families are less likely to have kids after the birth of a boy, the number of kids will be negatively correlated with the proportion of sons. Among people with just one kid, the number of sons is indeed greater than the number of daughters, though the difference is insignificant. The overall correlation between the proportion of sons and the number of kids is also very low (corr. = -.041).
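How unusual a female/male ratio of 1.16 is can be checked with an exact binomial test; since the sample size is not reported here, the n in this Python sketch is purely hypothetical:

```python
from math import comb

def binom_sf(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p): exact upper tail
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical n of 1000 kids with the observed 1.16 female/male ratio
# (~537 daughters); the natural rate of daughters is about 1 - .51 = .49
n_kids, daughters = 1000, 537
p_value = binom_sf(daughters, n_kids, 0.49)  # well below .05
```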

Reducing Errors in Survey Analysis

Analysis of survey data is hard to automate because of the immense variability across survey instruments – different variables, coded differently, and named in ways that often defy even the most fecund imagination. What often replaces complete automation is ad-hoc automation – quickly coded functions (to recode a variable to lie within a particular range, etc.) applied by intelligent people frustrated by the lack of complete automation and bored by the repetitiveness of the task. Ad-hoc automation attracts mistakes, for the functions are often coded without rigor, and useful alerts and warnings are not included.

One way to reduce mistakes is to prevent them from happening. Carefully coded functions – with robust error checking and handling, and alerts (and passive verbose outputs) that are cognizant of our biases and bounded attention – can reduce mistakes. The functions applied most routinely typically need the most attention.

Let’s use the example of recoding a variable to lie between 0 and 1 (in R) to illustrate how one may think about coding such a function. A few things one may want to consider –

  1. Data type: Is the variable numeric, ordinal, or categorical? Let’s say we want to constrain our function to handling only numeric variables. Some numeric variables may be coded as ‘character’. We may want to deal with these issues seamlessly, and possibly issue warnings (or passive outputs) when improper data types are used.
  2. Range: The range that the variable takes in the data may not span its entire domain. We may want to account for that, perhaps by printing out the range the variable takes in the data while also allowing the user to input the true range.
  3. Missing values: A variety of functions we may rely on when recoding, for example range(x), may ‘fail’ (quietly) when confronted with missing values. We may want to alert the user to the issue, but still handle missing values seamlessly.
  4. A user may not see the actual data, so we may want to show the user some of the data by default. Efficient summaries of the data (fivenum, mean, median, etc.) or displaying a few initial items may be useful.

A function that addresses some of the issues:

zero1 <- function(x, minx=NA, maxx=NA){
  # Coerce character vectors to numeric; stop if x cannot be made numeric
  if(typeof(x)=='character') x <- suppressWarnings(as.numeric(x))
  stopifnot(identical(typeof(as.numeric(x)), 'double'))
  if(any(is.na(x))) warning("x contains missing values; they are carried through.")
  print(head(x)) # display the first few items
  print(paste("Range:", paste(range(x, na.rm=TRUE), collapse=" "))) # observed range
  # Use user-supplied bounds where given; otherwise the observed min/max
  if(is.na(minx)) minx <- min(x, na.rm=TRUE)
  if(is.na(maxx)) maxx <- max(x, na.rm=TRUE)
  (x - minx)/(maxx - minx)
}

These tips also apply to canned functions available in R (and to those writing them), and to functions in other statistical packages that routinely do not display alerts or other secondary information that may help reduce mistakes. One can always build on top of canned functions. For example, the recode function (in the car package) can be wrapped so that it passively displays the correlation between the recoded variable and the original variable by default.
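The idea of a recode wrapper that passively reports a sanity check can be sketched in Python as well; the function name and interface below are illustrative, not the car package’s API:

```python
def recode(x, mapping):
    # Recode values of x per mapping; passively display the correlation
    # between the original and recoded variable as a sanity check
    y = [mapping.get(v, v) for v in x]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    print(f"corr(original, recoded) = {cov / (sx * sy):.3f}")
    return y

# Reversing a 1-5 scale: the printed correlation should be -1.000
flipped = recode([1, 2, 3, 4, 5], {1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
```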

In addition to writing better functions, one may also want to do some post-hoc checking. Before we move to some ways of doing post-hoc checks, a caveat: post-hoc methods are only good at detecting aberrations among the variables you test, and they are costly and incomprehensive.

  1. Using prior knowledge:
    1. Identify beforehand how some variables relate to each other. For example, education is typically correlated with political knowledge, race with partisan preferences, etc. Test these ‘hypotheses’. In some cases, these checks can also be diagnostic of sampling biases.
    2. In an experiment, you may have hypotheses about how variables change over time. For example, ‘constraint’ typically increases across attitude indices over the course of a treatment designed to produce learning. Test these priors.
  2. Characteristics of the coded variable: If using multiple datasets, check whether the number of ‘levels’ of a categorical variable is the same across datasets. If not, investigate. Cross-tabulations across merged data are a quick way to diagnose problems, which can range from varying codes for missing data to missing ‘levels’.
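The levels check can be sketched as follows; the datasets and the stray missing-data code are hypothetical:

```python
def check_levels(name, datasets):
    # Compare the levels of a categorical variable across datasets and
    # flag any level that does not appear in every dataset
    levels = [set(d[name]) for d in datasets]
    mismatched = set.union(*levels) - set.intersection(*levels)
    if mismatched:
        print(f"'{name}': levels not shared by all datasets: {mismatched}")
    return mismatched

# Hypothetical merged surveys: the second codes missing data as 9
d_a = {"party": ["D", "R", "I", "D"]}
d_b = {"party": ["D", "R", "I", 9]}
bad = check_levels("party", [d_a, d_b])
```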

Sort of sorted but definitely cold

By now, students of American politics have become accustomed to seeing graphs of DW-NOMINATE scores showing ideological polarization in Congress. Here are the equivalent graphs (assuming two dimensions) at the mass level.

Data are from the 2004 ANES. Social and cultural preferences are from a confirmatory factor analysis over the relevant items.

Here’s how to interpret the graphs -

1) There is a large overlap in preference profiles of Rs and Ds.

2) Conditional on the same preferences, there is a large gap in thermometer ratings. Absent partisan bias, the same preferences should yield about the same R–D thermometer ratings. And this gap is not particularly responsive to changes in preferences within parties.