Source: 2004 ANES. Cultural and Social Welfare Preferences are from Confirmatory Factor Analysis over relevant items.
Reducing Errors in Survey Analysis
Analysis of survey data is hard to automate because of the immense variability across survey instruments – different variables, differently coded, and named in ways that often defy even the most fecund imagination. What often replaces complete automation is ad-hoc automation – quickly coded functions (say, to recode a variable to lie within a particular range) applied by intelligent people frustrated by the lack of complete automation, and bored by the repetitiveness of the task. Ad-hoc automation attracts mistakes, for the functions are often coded without rigor, and useful alerts and warnings are not included.
One way to reduce mistakes is to prevent them from happening. Carefully coded functions with robust error checking and handling, and alerts (and passive verbose outputs) that are cognizant of our own biases, and bounded attention, can reduce mistakes. Functions applied most routinely typically need the most attention.
Let’s use the example of recoding a variable to range from 0 to 1 (in R) to illustrate how one may think about coding a function. A few things one may want to consider –
- Data type: Is the variable numeric, ordinal, or categorical? Let’s say we want to constrain our function to handle only numeric variables. Some numeric variables may be stored as ‘character’. We may want to deal with these seamlessly, and issue warnings (or passive outputs) when improper data types are used.
- Range: The range that the variable takes in the data may not span its entire domain. We want to account for that, perhaps by printing the range that the variable takes in the data, and by allowing the user to input the true range.
- Missing Values: A variety of functions we may rely on when recoding our variable ‘fail’ (quietly) when confronted with missing values. For example, range(x) returns NA for both ends if x contains a missing value. We may want to alert the user to the issue, but still handle missing values seamlessly.
- Visibility: A user may not see the actual data, so we may want to show the user some of the data by default. Efficient summaries of the data (fivenum, mean, median, etc.) or a display of the first few items may be useful.
A function that addresses some of the issues:
zero1 <- function(x, minx=NA, maxx=NA){
  # Accept numeric vectors, or character vectors that hold numbers; stop on anything else (e.g. factors)
  stopifnot(is.numeric(x) || is.character(x))
  if(is.character(x)){
    message("Note: x is of type character; converting to numeric")
    x <- as.numeric(x)
  }
  # Alert the user to missing values, but carry on
  if(anyNA(x)) message("Note: x has ", sum(is.na(x)), " missing value(s)")
  print(head(x)) # displays first few items
  print(paste("Range:", paste(range(x, na.rm=TRUE), collapse=" "))) # shows the range the variable takes in the data
  # Use the range in the data unless the user supplies the true range
  if(is.na(minx)) minx <- min(x, na.rm=TRUE)
  if(is.na(maxx)) maxx <- max(x, na.rm=TRUE)
  (x - minx)/(maxx - minx)
}
These tips also apply to canned functions available in R (and to those writing them), and to functions in other statistical packages, which routinely do not display alerts or other secondary information that may help reduce mistakes. One can always build on top of canned functions. For example, the recode function (car package) can be wrapped so that it passively displays the correlation between the recoded variable and the original variable by default.
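As a rough sketch of building on top of an existing recoding function, the wrapper below calls whatever function it is given and then passively reports how the recoded values relate to the original ones. The names recode_verbose and recode_fun are made up for illustration; this is not part of car or any other package.

```r
# Hypothetical wrapper: apply any recoding function, then passively report
# how the recoded values relate to the original ones.
recode_verbose <- function(x, recode_fun, ...) {
  out <- recode_fun(x, ...)
  if (is.numeric(x) && is.numeric(out)) {
    # Pairwise deletion so that missing values do not turn the check into NA
    r <- suppressWarnings(cor(x, out, use = "pairwise.complete.obs"))
    message("Correlation between original and recoded: ", round(r, 3))
  } else {
    # For non-numeric inputs or outputs, a cross-tab is the more natural check
    print(table(original = x, recoded = out, useNA = "ifany"))
  }
  out
}

# Reverse-code a 1-5 scale; a strongly negative correlation confirms the flip
flipped <- recode_verbose(c(1, 2, 3, 4, 5, NA), function(v) 6 - v)
```

The check is deliberately passive: it prints a message rather than stopping, so it can be left on by default without getting in the way.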
In addition to writing better functions, one may also want to do some post-hoc checking. Before we move to some ways of doing post-hoc checks, a caveat – post-hoc methods are only good at detecting aberrations among the variables you test, and they are costly and far from comprehensive.
- Using prior knowledge:
  - Identify beforehand how some variables relate to each other. For example, education is typically correlated with political knowledge, race with partisan preferences, etc. Test these ‘hypotheses’. In some cases, these can also be diagnostic of sampling biases.
  - In an experiment, you may have hypotheses about how variables change over time. For example, ‘constraint’ typically increases across attitude indices over the course of a treatment designed to produce learning. Test these priors.
- Characteristics of the coded variable: If using multiple datasets, check to see if the number of ‘levels’ of a categorical variable is the same across datasets. If not, investigate. Cross-tabulations across merged data are a quick way to diagnose problems, which can range from varying codes for missing data to missing ‘levels’.
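The checks above can be sketched roughly as follows. The helper names (check_levels, check_prior_sign) and the toy data are invented for illustration.

```r
# Hypothetical helper: compare the values a variable takes in two datasets
check_levels <- function(x, y) {
  lx <- unique(as.character(x))
  ly <- unique(as.character(y))
  list(only_in_x = setdiff(lx, ly), only_in_y = setdiff(ly, lx))
}

# Hypothetical helper: does a correlation have the sign prior knowledge suggests?
check_prior_sign <- function(a, b, expected_sign = 1) {
  r <- cor(a, b, use = "pairwise.complete.obs")
  message("Correlation: ", round(r, 3))
  sign(r) == expected_sign
}

# Toy data: the second wave codes missing data as "-9" and lacks one level
wave1 <- c("Dem", "Rep", "Ind", NA)
wave2 <- c("Dem", "Rep", "-9")
mismatch <- check_levels(wave1, wave2)

# Cross-tabulating the merged data shows the same problem at a glance
merged <- data.frame(wave = rep(c("w1", "w2"), c(4, 3)), pid = c(wave1, wave2))
print(table(merged$wave, merged$pid, useNA = "ifany"))

# Toy prior: education should be positively related to political knowledge
education <- c(8, 12, 16, 20)
knowledge <- c(1, 2, 2, 4)
prior_holds <- check_prior_sign(education, knowledge, expected_sign = 1)
```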
Sort of sorted but definitely cold
By now students of American Politics have all become accustomed to seeing graphs of DW-NOMINATE scores showing ideological polarization in Congress. Here are the equivalent graphs (we assume two dimensions) at the mass-level.
Data are from the 2004 ANES. Social and Cultural Preferences are from Confirmatory Factor Analysis over relevant items.
Here’s how to interpret the graphs -
1) There is a large overlap in preference profiles of Rs and Ds.
2) Conditional on the same preferences, there is a large gap in thermometer ratings. Absent partisan bias, the same preferences should yield about the same R-D thermometer ratings. And this gap is not particularly responsive to changes in preferences within parties.
Sharing Information about Sharing Misinformation
The Internet has revolutionized the dissemination of misinformation. Easy availability of incorrect information, gullible and eager masses, and the ease of ‘sharing’ have created fertile conditions for misinformation epidemics.
While a fair proportion of misinformation is likely created deliberately, it may well spread inadvertently. Misinformation that people carry is often no different than fact to them. People are likely to share misinformation with the same enthusiasm as they would fact.
Attitude congenial misinformation is more likely to be known (and accepted as fact), and more likely to be enthusiastically shared with someone who shares the same attitude (for social, and personal rewards). Misinformation considered ‘useful’ is also more likely to be shared, e.g. (mis)-information about health related topics.
The chance of acceptance of misinformation may be greater still if people know little about the topic, or if they have no reason to think that the information is motivated. Lastly, these epidemics are more likely to take place among those less familiar with technology.
Cricket: An Unfairly Random Game?
In sports competitions, a variety of measures are often taken to make conditions about equal for all competitors. In tennis, for example, players must change sides every other game so as to neutralize the impact of the angle of the sun, among other ‘side’-specific problems. In basketball, due precaution is taken to balance home court advantage. In cricket, however, a curious thing happens – conditions are made randomly unequal.
In many cricket matches, there is a clear advantage in bowling or batting first. This fact is often pointed out by commentators, and by captains of the competing teams in the pre-toss interview. However, the opportunity to bowl or bat first is decided by a coin toss. While this may seem ‘fair’, it really just means that one team is randomly handed the short end of the stick. Hence games are not decided on ability alone. One can derive estimates of the advantage by comparing results in cases where teams won the coin toss, and when they lost it.
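One simple way to estimate the advantage is to take the share of decided matches won by the team that won the toss and compare it to 50%, the share expected if the toss did not matter. A minimal sketch follows, assuming the data come as one logical value per match. The function name toss_advantage and the example outcomes are made up for illustration and are not real match results.

```r
# Hypothetical helper: toss_winner_won is a logical vector, one entry per
# decided match, TRUE if the team that won the toss also won the match.
toss_advantage <- function(toss_winner_won) {
  won <- sum(toss_winner_won, na.rm = TRUE)
  n <- sum(!is.na(toss_winner_won))
  # Exact binomial test against 50%, the win rate expected if the toss did not matter
  test <- binom.test(won, n, p = 0.5)
  list(share = won / n,
       conf_int = as.numeric(test$conf.int),
       p_value = test$p.value)
}

# Made-up outcomes for illustration only, not real match results
example <- c(rep(TRUE, 55), rep(FALSE, 45))
res <- toss_advantage(example)
print(res$share)
print(res$conf_int)
```

Ties and abandoned matches would need to be dropped (or coded as NA) first, and the same function can be applied within subsets, say by venue or format.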
At first glance, the imbalance may seem inevitable – after all, someone has to bat first. One can, however, devise a baseball-like system where short innings are interspersed throughout the day. If that violates the nature of the game too much, one can create pitches that don’t deteriorate heavily over the course of a game, or come up with an estimate of the advantage and adjust the target for the team by that estimated amount (something akin to the adjustment issued when matches are shortened due to rain).
Empirical Analysis
Data are from nearly five thousand one-day international matches, and all international test-matches.
The toss likely plays a more crucial role on sub-continent pitches, as the typically dry, dusty pitches deteriorate faster under the harsh sunlight. It is also likely that the toss is more crucial in day and night matches, due to dew and lower visibility of the white ball under the lights. It may well be the case that the toss is more important in tests than in one-day matches.
The team that wins the toss wins the match approximately 46% of the time. This brings up the question of whether the teams are choosing wisely.
[More soon..]
Nudging
Nudging the mood?
Important, consequential decisions in life are hostage to our mood. What we intend to do (and actually do) often varies by mood. Mood, in turn, can vary for a variety of exogenous reasons – negative swings can be caused by ill-health (a headache, or allergies) and positive swings by something nice said by someone you meet by accident. This variation is ‘proof’ of our ‘irrationality’. The irrational aspect is not just the misattribution of ill-health to mood, but that mood affects our decisions at all. Being aware of the relationship between mood and decisions can allow one to choose better. Given the central place mood occupies in decision making, it is likely that a nudge to affect the mood would be powerful.
End of a nudge
One of the paper-towel dispensers I use has the following sticker – ‘These come from trees’. This is a famous ‘nudge’ (in Sunstein/Thaler terminology). So far so good. Till perhaps a few months ago, I always read the sticker when I used the dispenser. Yesterday I noticed that I had stopped noticing the sticker. This contrasts with my behavior towards the hotel notes about saving water, which I still read. I think that is partly because there is so much time in a hotel room. ‘Nudges’ for quick everyday decisions perhaps need to change over time.
On (Modest) Differences In Racial Distribution of Voting Eligible Population and Registered Voters in California
Each election cycle many hands are waved, and spit launched in the air, when the topic of registration rates of Latinos (and other minorities) comes up. And indeed registration rates of Latinos substantially lag those of Whites – in California, 62.8% of eligible Latinos are registered, whereas approximately 72.9% of eligible Whites are registered to vote.
This somewhat large difference in registration rates doesn’t automatically translate to (equally) wide differences between the racial distribution of the eligible population and that of the registered voter population. For example, while self-identified Whites constitute 62.8% of the VEP, they constitute marginally more – 64.2% – of the voting eligible respondents who self-identify as having registered to vote.
Here’s the math –
Assume VEP Pop. = 100
Whites = 63/100; of these 72% register = 45
Latinos = 23/100; of these 62% register = 14
Rest = 14/100; of these 62% register = 9
New Registered Population = 45 + 14 + 9 = 68
Registered: Whites = 45/68 = 66.2%; Latinos = 14/68 = 20.6%
Source: PPIC Survey (September 2010).
Note: CPS 2008, Secretary of State data confirm this. Voting day population estimates from Exit Poll also show no large distortions.
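The back-of-the-envelope math above can be reproduced in a few lines of R. This is only a sketch using the rounded shares and rates from the calculation above; the variable names are made up for illustration.

```r
# VEP shares and registration rates as used in the back-of-the-envelope math
vep_share <- c(Whites = 0.63, Latinos = 0.23, Rest = 0.14)
reg_rate  <- c(Whites = 0.72, Latinos = 0.62, Rest = 0.62)

registered <- vep_share * reg_rate          # registered, as a share of the VEP
reg_share  <- registered / sum(registered)  # composition of registered voters

# Side-by-side comparison of the two distributions, in percent
print(round(100 * cbind(vep_share, reg_share), 1))
```

Because the code does not round the intermediate counts, the shares it prints can differ from the hand calculation by a few tenths of a point.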
Some simple math:
For a two category case, say proportion category a = pa
Proportion category b = 1 – pa
Assume response rates for category a = qa, and for category b = qb = c*qa
Initial Ratio = pa/(1 -pa)
Final Ratio = (pa*qa)/((1-pa)*qb)
So between time 1 and time 2, the ratio changes by a factor of qa/qb, or 1/c
T1 Diff. = pa – (1- pa) = 2pa – 1
T2 Diff. = (pa*qa – qb + pa*qb)/(pa*qa + (1-pa)*qb)
= (pa(qa + qb) – qb)/(pa(qa – qb) + qb)
= [pa*qa (1 + c) - c*qa]/[pa*qa(1-c) + c*qa]
T2 Diff. – T1 Diff. = [pa*qa (1 + c) - c*qa]/[pa*qa(1-c) + c*qa] – (2pa -1)
= [pa*qa (1 + c) - c*qa + pa*qa(1-c) + c*qa - 2pa (pa*qa(1-c) + c*qa)]/[pa*qa(1-c) + c*qa]
= [pa*qa + pa*qa*c - c*qa + pa*qa - pa*qa*c + c*qa - 2pa*pa*qa + 2pa*pa*qa*c - 2pa*c*qa]/[pa*qa(1-c) + c*qa]
= [2pa*qa - 2pa*pa*qa + 2pa*pa*qa*c - 2pa*c*qa]/[pa*qa(1-c) + c*qa]
= [2pa*qa(1- pa + pa*c -c)]/[pa*qa(1-c) + c*qa]
= [2pa*qa((1- c) - pa(1-c))]/[pa*qa(1-c) + c*qa]
= [2pa*qa(1-pa)(1-c)]/[pa*qa(1-c) + c*qa]
Diff. in response rates = qa – qb
When will the diff. in response rates be greater than T2 Diff. - T1 Diff.? Assume c < 1 (category b responds at a lower rate than category a), so that (1-c) > 0.
qa - qb > [2pa*qa(1-pa)(1-c)]/[pa*qa(1-c) + c*qa]
qa(1-c)[pa*qa(1-c) + c*qa] > 2pa*qa(1-pa)(1-c)
qa(1-c)[pa*qa - pa*qa*c + c*qa] - 2pa*qa(1-pa)(1-c) > 0
(1-c)qa[pa*qa - pa*qa*c + c*qa - 2pa(1-pa)] > 0
(1-c) and qa are both greater than 0. Let's take them out.
pa*qa - pa*qa*c + c*qa - 2pa(1-pa) > 0
pa*qa + c*qa(1-pa) - 2pa(1-pa) > 0
c*qa(1-pa) > 2pa(1-pa) - pa*qa
c > pa[2(1-pa) - qa]/[qa(1-pa)] [dividing by qa(1-pa), which is greater than 0]
Equivalently, qa > 2pa(1-pa)/[pa + c(1-pa)]
For example, with pa = 0.5 and qa = 0.8, the condition is c > 0.25.
When will diff. in response rates + initial diff. > T2 Diff.? Note that this inequality is just qa - qb > T2 Diff. - T1 Diff. with T1 Diff. moved to the left-hand side. Again assume c < 1.
qa - qa*c + 2pa - 1 > [pa*qa(1+c) - c*qa]/[pa*qa(1-c) + c*qa]
[pa*qa(1-c) + c*qa][qa - qa*c + 2pa - 1] - [pa*qa(1+c) - c*qa] > 0
-pa*qa + pa*qa*c - c*qa + [pa*qa(1-c) + c*qa][qa - qa*c + 2pa] - pa*qa - pa*qa*c + c*qa > 0
-2pa*qa + [pa*qa - pa*qa*c + c*qa][qa - qa*c + 2pa] > 0
-2pa*qa + pa*qa[qa - qa*c + 2pa] - pa*qa*c[qa - qa*c + 2pa] + c*qa[qa - qa*c + 2pa] > 0
-2pa*qa + pa*qa^2 - pa*qa^2*c + 2pa^2*qa - pa*qa^2*c + pa*qa^2*c^2 - 2pa^2*qa*c + c*qa^2 - c^2*qa^2 + 2pa*c*qa > 0
2pa*qa(-1 + pa + c - pa*c) + pa*qa^2(1 - 2c + c^2) + c*qa^2(1 - c) > 0
-2pa*qa(1-pa)(1-c) + pa*qa^2(1-c)^2 + c*qa^2(1-c) > 0
Dividing by qa(1-c), which is greater than 0:
pa*qa(1-c) + c*qa - 2pa(1-pa) > 0
c*qa(1-pa) > 2pa(1-pa) - pa*qa
c > pa[2(1-pa) - qa]/[qa(1-pa)]
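Algebra of this kind is easy to get wrong, so it helps to check it numerically. The sketch below compares the closed form for T2 Diff. - T1 Diff. against a direct computation over a grid of values, and then compares the gap in response rates (qa - qb) with that change directly. The function names and the grid are made up for illustration.

```r
# pa: share of category a; qa: response rate of a; c: ratio qb / qa
t1_diff <- function(pa) 2 * pa - 1
t2_diff <- function(pa, qa, c) {
  qb <- c * qa
  (pa * qa - (1 - pa) * qb) / (pa * qa + (1 - pa) * qb)
}

# Closed form for T2 Diff. - T1 Diff. from the derivation
change_closed <- function(pa, qa, c) {
  2 * pa * qa * (1 - pa) * (1 - c) / (pa * qa * (1 - c) + c * qa)
}

# Check the closed form against the direct computation on a grid of values
grid <- expand.grid(pa = seq(0.1, 0.9, 0.1),
                    qa = seq(0.1, 0.9, 0.2),
                    c  = seq(0.1, 1, 0.1))
direct <- with(grid, t2_diff(pa, qa, c) - t1_diff(pa))
closed <- with(grid, change_closed(pa, qa, c))
stopifnot(isTRUE(all.equal(direct, closed)))

# Is the gap in response rates (qa - qb) larger than the change in the difference?
grid$gap_larger <- with(grid, qa * (1 - c)) > direct
print(head(grid[!grid$gap_larger, ]))
```

Rows where gap_larger is FALSE are cases in which the distribution shifts by more than the gap in response rates.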
Idealog: Creating A Leaky Internet
The recent Wikileaks episode has highlighted the immense control national governments and private companies have over what content can be hosted. Within days of being identified by the U.S. government as a problem, private companies in charge of hosting and providing banking services to Wikileaks withdrew support, largely neutering the organization’s ability to raise funds and host content.
Successful attempts to cut off the Internet in Egypt and Libya also pose questions of a similar nature.
So two questions follow – should anything be done about it? And if so, what? The answer to the first is not as clear, but on balance, perhaps such (what is effectively) absolute discretionary control over the fate of ‘hostile’ information or technology should not be allowed. As to the second question – given that many of the hosting companies, banking companies, etc. essential to disseminating content are privately held, and susceptible to both government and market pressures, the dissemination engine ought to be independent of those as much as possible (bottlenecks remain: most pipes are owned by governments or corporations). Here are three ideas –
1) Create an international server farm on which content can be hosted by anyone but only removed after due process, set internationally. (NGO supported farms may work as well.)
2) We already have ways to disseminate content without centralized hosting – P2P – but these systems lack a browser that collates torrents and builds a webpage in real time. Such a ‘torrent’-based browser can vastly improve the ability of P2P networks to host content.
3) For Libya/Egypt etc. the problem is of a different nature. We need applications like ‘Twitter’ to continue to function even if the artery to central servers goes down. This can be handled by building applications in a manner that they can be run on edge servers with local data. I believe this kind of redundancy can also be useful for businesses.
Measuring Affect Coldly
Outside of the variety of ways of explicitly asking people how they feel about another group – feeling thermometers, like/dislike scales, favorability ratings etc., explicit measures asked using mechanisms designed to overcome or attenuate social desirability concerns – bogus pipeline, ACASI, etc., and a plethora of implicit measures – affect misattribution, IAT, etc., there exist a few other interesting ways of measuring affect –
- Games as measures – Jeremy Weinstein uses ‘games’ like the dictator game to measure (inter-ethnic) affect. One can use prisoner’s dilemma, among other games, to do the same.
- Systematic bias in responding to factual questions when ignorant about the correct answer – For example, in most presidential election years since 1988, ANES has posed a variety of retrospective evaluative and factual questions, including assessments of the state of the economy, and whether inflation/unemployment/crime rose, remained the same, or declined in the past year (or some other time frame). Analyses of these questions have revealed significant ‘partisan bias’, but these questions have yet to be used as a measure of the ‘partisan affect’ that is the likely cause of the observed ‘bias’.
‘Fairly’ Random
A lottery is a way to assign disproportionate rewards (or punishments) ‘fairly’. Procedural fairness – equal chance of selection – provides ‘legitimacy’ to this system of disproportionate allocation.
Given the purpose of a lottery is unequal allocation, it is important that informed consent be sought from the participants, and that it be used in consequential arenas only when necessary.
Fairness over the longer term
One particular use of a lottery is in the fair assignment of scarce indivisible resources. For example, think of a good school with only a hundred open seats that receives a thousand applications from candidates who are indistinguishable (or only weakly distinguishable), given limitations of data, from each other in matters of ability. One fair way of assigning seats would be to do it randomly.
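A minimal sketch of such a draw in R, assuming a thousand applicants and a hundred seats as in the example above. The applicant IDs and the seed are made up; fixing a seed only serves to make the draw reproducible so that it can be audited.

```r
set.seed(2011)  # arbitrary seed, only so the draw can be reproduced and audited
applicants <- sprintf("applicant_%04d", 1:1000)  # hypothetical applicant IDs

# Simple random sample without replacement: every applicant has the same
# 100-in-1000 chance of being selected
admitted <- sample(applicants, size = 100)
print(head(admitted))
```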
One may choose to consider the matter closed at this point. However, this means making peace with disproportional outcomes. Alternatives exist to this option. For example, one may ask the winners of the lottery to give back to those who didn’t win – say by sharing the portion of their income attributable to going to a good school, or by producing public goods, or by some other mutually agreed mechanism.
Fair Selection
Random selection is a fair method of selection over objects where we have little or no ‘reason’ to prefer one over the other. When objects are ‘observably’ (as much as data can tell us) the same, or similar (the same within some margin), random selection is fair.
One may extend it to objects that are different but for no discretionary action of theirs, say people with physical or mental disabilities, though competing concerns, such as lower efficiency etc., exist. More generally, selection based on some commonly agreed metric – say maximal increase in public good – may also be considered fair.
As is clear, those who aren’t selected don’t ‘deserve’ less, and indeed adequate compensation ought to be the formal basis of selection, unless of course rewards once earned cannot be transferred (say lottery to get a liver transplant, which leaves others dead, and hence unable to receive any compensation, though one can imagine rewards being transferred to relatives, etc.).





