Measuring Selective Exposure

24 Jul

Ideally, we would like to be able to place the ideology of each bit of information consumed in relation to the ideological location of the person consuming it. And we would like a time-stamped distribution of the bits consumed. We could then summarize various moments of that distribution (or of the distribution of ideological distances), and that would be that. (If we were worried about dimensionality, we would do it by topic.)

But lack of data means we must change the estimand. We must code each bit of information as merely congenial or uncongenial. This takes directionality out of the equation: for a Republican at a 6 on a 1-to-7 liberal-to-conservative scale, consuming a bit of information at 5 is the same as consuming a bit at 7.

The conventional estimand, then, is a pair of ratios: (bits of politically congenial information consumed)/(all political information consumed) and (bits of uncongenial information consumed)/(all political information consumed). Other reasonable formalizations exist, including the difference between congenial and uncongenial bits consumed. (Note that in the difference the denominator is absent, and reasonably so.)
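For concreteness, a minimal sketch of how these ratios might be computed once each consumed bit has been coded; the coding labels and the example browsing history are hypothetical.

```python
# Minimal sketch: computing the conventional selective exposure ratios.
# Assumes each consumed "bit" has already been coded as "congenial",
# "uncongenial", or "apolitical" relative to the person's ideology.
from collections import Counter

def exposure_ratios(coded_bits):
    """Return (congenial share, uncongenial share) among political bits."""
    counts = Counter(coded_bits)
    political = counts["congenial"] + counts["uncongenial"]
    if political == 0:
        return None, None
    return counts["congenial"] / political, counts["uncongenial"] / political

# Hypothetical browsing history for one person
bits = ["congenial", "apolitical", "uncongenial", "congenial", "congenial"]
print(exposure_ratios(bits))  # (0.75, 0.25)
```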

To estimate these quantities, we must often make further assumptions. First, we must decide on the domain of political information. That domain is likely vast, and increasing by the minute. We are all producers of political information now. (We always were, but today we can easily access the political opinions of thousands of lay people.) But see here for some thoughts on how to come up with the relevant domain of political information from passive browsing data.

Next, people generally code ideology at the level of the 'source.' The New York Times is 'independent' or 'liberal,' and Fox simply 'conservative' or, perhaps more accurately, 'Republican-leaning.' (Continuous measures of ideology, such as those estimated by Groseclose and Milyo or Gentzkow and Shapiro, are also assigned at the source level.) This is fine except that it means coding all bits of information consumed from a source the same way. This is an ecological inference, and it carries some attendant risks. We know that not all NYT articles are 'liberal.' In fact, much of what the NYT publishes is not even political news. A toy example of how such measures can mislead:

Take a Republican with 20 page views: 10 on Fox, 10 on CNN. Source-level coding puts the congenial share at 10/20.
But say the 10 Fox pages split 7 Republican-leaning, 3 Democratic-leaning, and the 10 CNN pages split 5R, 5D.
The page-level congenial share is then (7 + 5)/20 = 12/20.

If the measure of ideology is continuous, there are still risks. If we code all page views as the mean ideology of the source, we assume that the person views a random sample of pages on the source (or some version of that). But that is an implausible assumption. It is much more likely that a liberal reading the NYT stays away from David Brooks' columns. If you account for such within-source self-selection, selective exposure measures based on source-level coding are going to be downwardly biased; that is, they will find people to be less selective than they are.
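A toy simulation of that downward bias; all of the numbers and the self-selection rule are made up, purely to illustrate the direction of the bias.

```python
# Toy simulation (numbers are made up) of the downward bias from coding all
# page views at a source's mean ideology when readers self-select within the source.
import random
random.seed(0)

reader_ideology = 2.0   # a liberal on a 1 (liberal) to 7 (conservative) scale
source_mean = 4.0       # a nominally centrist source
pages = [random.gauss(source_mean, 1.5) for _ in range(10_000)]  # page-level ideologies

def reads(page):
    # Reader is more likely to open pages closer to their own ideology.
    return random.random() < 1 / (1 + abs(page - reader_ideology))

read_pages = [p for p in pages if reads(p)]

source_level_distance = abs(source_mean - reader_ideology)
page_level_distance = sum(abs(p - reader_ideology) for p in read_pages) / len(read_pages)

print(f"source-level coding implies a mean ideological distance of {source_level_distance:.2f}")
print(f"page-level coding gives a mean ideological distance of {page_level_distance:.2f}")
# The page-level distance is smaller: the reader is more selective than source-level coding suggests.
```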

The discussion so far has focused on passive browsing data, eliding survey measures. Survey measures carry two additional problems. One is about the denominator: measures based on limited-choice experiments like the ones used by Iyengar and Hahn (2009) are poor measures of real-life behavior, because in real life we have far more choices, and inferences from such experiments can at best recover ordinal rankings. The second big problem with survey measures is 'expressive responding': Republicans indicating that they watch Fox News not because they do but because they want to convey that they do.

Reviewing the Peer Review with Reviews as Data

24 Jul

Science is a process. And for a good deal of time, peer review has been an essential part of that process. Looked at independently by people with no experience of it, it makes a fair bit of sense, for there is only one well-known way of increasing the quality of an academic paper: additional independent thinking.

But this seemingly sound part of the process is creaking. Today, you can't bring two academics together without them venting their frustration about the broken review system. Given how critical peer review is to the production of science, it deserves closer attention, preferably with good data.

But reviews aren't available to be analyzed. So, some anecdotes as a way of making the case that these data should be made broadly available. Of the 80 or so reviews that I have filed and for which editors have been kind enough to share the other reviewers' comments, two things have jumped out at me: a) hefty variation in the quality of reviews, and b) equally hefty variation in recommendations for the final disposition. It would be good to quantify both. The latter is easy enough to quantify, and the (lack of) reliability has implications for how many reviewers we need to reliably accept or reject the same article. Low reliability also has important implications for the number of submissions: partly because everyone knows the review process is so noisy, scholars submit articles they think aren't good enough to top journals, because there is a reasonable chance of an(y) article getting accepted at a good place. At any rate, these data ought to be publicly released. Alongside, editors should consider experimenting with the number of reviewers to collect more data on the point.
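As a rough sketch of how little it would take to quantify the variation in recommendations: a minimal calculation of exact agreement among reviewers. The data format is hypothetical, and more refined statistics (e.g., Krippendorff's alpha) would be natural next steps.

```python
# Minimal sketch: exact pairwise agreement among reviewers' recommendations.
# Input format is hypothetical: one list of recommendations per manuscript.
from itertools import combinations

def pairwise_agreement(recommendations_by_ms):
    """Share of within-manuscript reviewer pairs giving the same recommendation."""
    agree, pairs = 0, 0
    for recs in recommendations_by_ms:
        for a, b in combinations(recs, 2):
            pairs += 1
            agree += (a == b)
    return agree / pairs if pairs else float("nan")

reviews = [
    ["reject", "major revision", "reject"],
    ["accept", "major revision", "minor revision"],
    ["reject", "reject"],
]
print(round(pairwise_agreement(reviews), 2))  # 0.29
```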

Quantifying the quality of reviews is a much harder problem. What do we mean by a good review? A review that points to important problems in the manuscript and, where possible, suggests solutions? Likely so. But this is much trickier to code. Then again, perhaps there isn't as much of a point to quantifying quality. What is needed, perhaps, is guidance. Much like child-rearing, there is no manual for reviewing. There really should be. What should reviewers attend to? What are they missing? And how do we incentivize this process?

Data

Easy to Release Data for Each Manuscript:

  • Race, gender, school ranking of each author
  • Whether manuscript was desk rejected or not
  • How many reviewers were invited
  • Time taken by each reviewer to accept the invitation (NA for those you never heard from)
  • Total time in review for each article (till R&R or reject) (and a separate set of columns for each revision)
  • Time taken by each reviewer to submit the review
  • Recommendation by each reviewer
  • Length of each review
  • How many reviewers did the author(s) suggest?
  • How often were suggested reviewers followed up on?

In fact, much of the data submitted in multiple-choice question format can probably be released easily.
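A hypothetical sketch of what a flat-file release might look like; the column names simply mirror the list above, and nothing about the format is prescribed by any journal.

```python
# Hypothetical flat-file schema mirroring the list above; one row per manuscript
# (reviewer-level fields would repeat, e.g., reviewer_1_*, reviewer_2_*, ...).
import csv

COLUMNS = [
    "author_race", "author_gender", "author_school_rank",
    "desk_rejected", "n_reviewers_invited",
    "reviewer_days_to_accept", "total_days_in_review",
    "reviewer_days_to_review", "reviewer_recommendation", "review_length_words",
    "n_reviewers_suggested_by_authors", "n_suggested_reviewers_used",
]

with open("manuscript_review_data.csv", "w", newline="") as f:
    csv.writer(f).writerow(COLUMNS)
```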

More Ambitious:

We can crowdsource the collection of review data. People can deposit their reviews and the associated manuscript in a specific format to a server. To maintain confidentiality, we can sandbox these data, allowing scholars to run a variety of pre-screened scripts on them. Or journals can institute similar mechanisms.

Suggestions for Experiments
In economics, people have tried instituting shorter deadlines for reviewers, to the effect of reducing review times. We can try that out. We can also try fiddling with the number of reviewers. In terms of incentives, it may be a good idea to try out cash, but also perhaps to experiment with a system in which reviewers are told that their comments will be made public. I, for one, think it would lead to more responsible reviewing.

If you have additional thoughts on the issue, please propose them at: https://gist.github.com/soodoku/b20e6d31d21e83ed5e39

Here's to making advances in the production of science and our pursuit of truth.

Where’s the Porn? Classifying Porn Domains Using a Calibrated Keyword Classifier

23 Jul

Aim: Given a very large list of unique domains, find domains carrying adult content.

In the 2004 comScore browsing data, for instance, there are about a million unique domains. Comparing a million unique domain names against a large database is doable. But access to such databases doesn't often come cheap. So, a hack.

Start with an exhaustive keyword search using porn-related keywords. Here's my list:

breast, boy, hardcore, 18, queen, blowjob, movie, video, love, play, fun, hot, gal, pee, 69, naked, teen, girl, cam, sex, pussy, dildo, adult, porn, mature, sex, xxx, bbw, slut, whore, tit, pussy, sperm, gay, men, cheat, ass, booty, ebony, asian, brazilian, fuck, cock, cunt, lesbian, male, boob, cum, naughty

For the 2004 comScore data, this gives about 140k potential porn domains. Compare this list to the approximately 850k porn domains in the Shallalist. This leaves us with a list of 68k domains of uncertain status. For those, use one of the many URL classification APIs. Using the Trusted Source API, I get about 20k porn and 48k non-porn domains.
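A sketch of that pipeline; the file names, the abbreviated keyword list, and the classify_via_api() call are placeholders, with the API standing in for a service like Trusted Source.

```python
# Sketch of the filtering pipeline described above. File names and the
# abbreviated keyword list are placeholders; classify_via_api() stands in for
# a URL-classification service.
import re

KEYWORDS = ["porn", "xxx", "sex", "adult", "hardcore"]  # abbreviated; use the full list above
pattern = re.compile("|".join(map(re.escape, KEYWORDS)))

def load_domains(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

domains = load_domains("comscore_domains.txt")       # ~1M unique domains (hypothetical file)
known_porn = load_domains("shallalist_porn.txt")     # known porn domains (hypothetical file)

candidates = {d for d in domains if pattern.search(d)}  # keyword hits (~140k)
uncertain = candidates - known_porn                     # ~68k domains of uncertain status

def classify_via_api(domain):
    """Placeholder for a call to a URL-classification API."""
    raise NotImplementedError

# With API access, classify the uncertain domains and form the lower bound:
# api_porn = {d for d in uncertain if classify_via_api(d) == "porn"}
# porn_domains = (candidates & known_porn) | api_porn
```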

This gives us a lower bound on the number of adult domains. But perhaps one that is much too low.

To estimate the error rates, take a large random sample (say 10,000 unique domains). Run all 10k through the API, and compare the results to those from the keyword-search-plus-API pipeline. This gives you estimates of the false positive and false negative rates. You can then learn from the list of false negatives to improve your keyword search, and redo everything. A couple of iterations can produce a sufficiently low false negative rate (the false positive rate is always ~0). (For the 2004 comScore data, a false negative rate of 5% is easily achieved.)
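And a rough sketch of one calibration iteration; the sample size, the api_label() call, and the token heuristic for surfacing new keywords are all placeholders.

```python
# Sketch of one calibration iteration on a random sample of domains.
# api_label() is a placeholder for querying the API for every sampled domain;
# `pattern` is a compiled regex of the current keyword list (as in the sketch above).
import random
import re
from collections import Counter

def calibration_round(domains, pattern, api_label, sample_size=10_000):
    sample = random.sample(sorted(domains), sample_size)
    truth = {d: api_label(d) for d in sample}            # API label for every sampled domain
    flagged = {d for d in sample if pattern.search(d)}   # what the keyword search catches

    porn = {d for d, lab in truth.items() if lab == "porn"}
    false_negatives = porn - flagged
    fn_rate = len(false_negatives) / len(porn) if porn else 0.0

    # Tokens common in missed domains suggest keywords to add in the next round.
    tokens = Counter(t for d in false_negatives
                     for t in re.split(r"[^a-z]+", d) if len(t) > 2)
    return fn_rate, tokens.most_common(20)
```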

Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP, dmoz.org), a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kinds of media. Using these lists to estimate consumption raises three obvious concerns: 1) even sites classified as 'news sites,' such as the NYT, carry a fair bit of non-news; 2) (speaking categorically) there is the danger of 'false positives'; and 3) (speaking categorically again) there is the danger of 'false negatives.'

FGR address the first concern by exploiting the URL structure. They exploit the fact that the URL of a NYT story contains information about the section (see the sketch after this paragraph). (The classifier is assumed to be perfect. It likely isn't. False positive and false negative rates for this kind of classification can be estimated using the raw article data.) This leaves us with concerns about false positives and negatives at the domain level. Lists like those published by DMOZ appear to be curated well enough not to contain too many false positives. The real question is how to calibrate false negatives. Here's one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large, comprehensive database like the Shallalist. For the domains that aren't in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against the Shallalist is just to reduce the amount of querying.) Using the results, estimate the proportion of missing domain names (the net number of missing domain names is likely much, much larger). Also estimate the missed visitation time, page views, etc.
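A minimal illustration of what exploiting the URL structure can look like; the example URL, the section whitelist, and the parsing rule are assumptions, not FGR's actual classifier.

```python
# Minimal illustration of section extraction from a URL path. The example URL,
# the section whitelist, and the parsing rule are assumptions, not FGR's classifier.
from urllib.parse import urlparse

NEWS_SECTIONS = {"politics", "us", "world", "business", "opinion"}  # illustrative whitelist

def url_section(url):
    """Return the first path segment that matches a known section label, if any."""
    for segment in urlparse(url).path.lower().split("/"):
        if segment in NEWS_SECTIONS:
            return segment
    return None

print(url_section("http://www.nytimes.com/2015/07/23/us/politics/example-story.html"))  # 'us'
```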

Towards a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn't take into account that someone in a hurry may be willing to pay more for quicker service. By the same token, Uber doesn't allow drivers to distinguish themselves on price (AirBnB, for instance, allows this). This makes for a simpler interface, but it produces some inefficiency: some needs go unmet, some drivers are underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.

Rape in India

16 Jun

According to crime reports, rape in India is about 15 times less common than in the US. A variety of concerns apply. For one, the definition of rape varies considerably across the two countries. But the differences aren't always in the expected direction. For instance, the Indian Penal Code considers sex under the following circumstance to be rape: "With her consent, when the man knows that he is not her husband, and that her consent is given because she believes that he is another man to whom she is or believes herself to be lawfully married." In 2013, however, the definition of rape under the Indian Penal Code was updated to be generally about on par with international definitions, with two major exceptions: the above clause, and, more materially, the continued exclusion of marital rape. For two, there are genuine fears that rape is yet more severely underreported in India.

Evidence from Surveys
Given that rape is underreported, anonymous surveys of people are better indicators of the prevalence of rape. In the US, comparing CDC National Intimate Partner and Sexual Violence Survey data to FBI crime reports suggests that only about 6.6% of rapes are reported. (Though see also a comparison to the National Crime Victimization Survey (NCVS), which suggests that one of every three rapes is reported.) In India, according to Lancet (citing numbers from the American Journal of Epidemiology (Web Table 4)), "1% of victims of sexual violence report the crime to the police." Another article, based on data from the 2005 National Family Health Survey (NFHS), finds that the corresponding figure for India is .7%. (One odd thing about the article caught my eye: the proportion of marital rapes reported ought to be zero, given the Indian Penal Code doesn't recognize marital rape. But perhaps reports can still be made.) This pegs the rate of rape in India at either about 60% of the US rate (if the CDC numbers are more commensurable with the NFHS numbers) or at about three times as much (if the NCVS numbers are more commensurable with the NFHS numbers).
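A back-of-the-envelope version of that calculation, using only the figures cited above (reported rape being roughly 15 times less common in India, the .7% NFHS-based reporting rate, and the 6.6% (CDC) and one-in-three (NCVS) US reporting rates):

```python
# Back-of-the-envelope calculation using the figures cited above.
reported_ratio_india_to_us = 1 / 15   # rape ~15x less common in India per crime reports

reporting_rate_india = 0.007          # NFHS-based estimate (~.7% of rapes reported)
reporting_rate_us_cdc = 0.066         # CDC-based estimate (~6.6% reported)
reporting_rate_us_ncvs = 1 / 3        # NCVS-based estimate (~1 in 3 reported)

# Implied ratio of actual prevalence = reported ratio * (US reporting rate / India reporting rate)
print(reported_ratio_india_to_us * reporting_rate_us_cdc / reporting_rate_india)   # ~0.63
print(reported_ratio_india_to_us * reporting_rate_us_ncvs / reporting_rate_india)  # ~3.2
```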

The article based on the NFHS data also estimates that nearly 98% of rapes are committed by husbands. This compares to 26% in the US, according to the NCVS. So one startling finding is that the risk of rape for unmarried women appears to be remarkably low in India. (Or the chances of being raped as a married woman, astonishingly high. The data also show that rapes by husbands in India are especially unlikely to be reported.) The low risk of rape for unmarried women may be a consequence of something equally abhorrent: fearful of sexual harassment, or as a consequence of patriarchy at home, Indian women may be much less likely to be outdoors than men (it would be good to quantify this).

Partisan Retrospection?: Partisan Gaps in Retrospection are Highly Variable

11 Jun

The difference between partisans' responses on retrospection items is highly variable, ranging from over 40 points to nearly 0. For instance, in 1988 nearly 30% fewer Democrats than Republicans reported that the inflation rate between 1980 and 1988 had declined. (It had.) However, similar proportions of Republicans and Democrats got questions about changes in the size of the budget deficit and defense spending between 1980 and 1988 right. The mean partisan gap across 20 items asked in the NES over 5 years (1988, 1992, 2000, 2004, and 2008) was about 15 points (the median was about 12 points), and the standard deviation was about 13 points. (See the tables.) This much variation suggests that the observed bias in partisans' perceptions depends on a variety of conditioning variables. For one, there is some evidence to suggest that during severe recessions, partisans do not differ much in their assessments of economic conditions. (See here.) Even when there are partisan gaps, however, they may not be real (see paper (pdf)).

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act is vital for holding the government to account. But little effort has gone into building tools that enable the fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for requesters. Here, I discuss a couple of alternatives. One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer: a tool that detects proper nouns, addresses, telephone numbers, etc., and blurs them. This is easily achieved using modern machine learning methods (see the sketch below). To ensure 100% accuracy, humans can quickly vet the algorithm's suggestions along with 'suspect words.' Or a captcha-like system that asks people in developing countries to label suspect words as nouns, etc. can be created to further reduce costs. This covers one conventional solution.
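A minimal sketch of the anonymizer idea using an off-the-shelf named-entity recognizer; spaCy and its small English model are one possible choice, and the set of entity types redacted here is just an illustration (regexes for phone numbers and addresses would complement it).

```python
# Minimal sketch of an anonymizer built on an off-the-shelf NER model.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
REDACT = {"PERSON", "GPE", "LOC", "ORG", "DATE", "CARDINAL"}  # illustrative label set

def redact(text):
    """Replace detected entities with their label so a human can quickly vet them."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in REDACT:
            out.append(text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(redact("John Smith of Topeka called the office on March 3."))
```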

Another way to solve the problem would be to create a sandbox architecture. Rather than giving requesters stacks of redacted documents, which is often unnecessary, one could allow people to run certain queries on the data. The results of these queries can be vetted internally. A classic example would be allowing people to query government emails and logs for the total number of times porn domains were accessed via servers at government offices.

Optimal Cost Function When Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, making lending decisions optimally means trading off the expected cost of prediction errors (errors × cost per error) against the cost of making better predictions (keeping things simple here). The cost of any particular error, especially denial of a loan to an eligible applicant, is typically small for the bank but very consequential for the applicant. So the applicant may be willing to pay the bank to increase the accuracy of its decisions, say, by compensating the bank for the cost of having a person take a closer look at the file. If customers are willing to pay this cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may well be able to increase profits.) Customers' willingness to pay for increased accuracy is typically not exploited by lending institutions. It may well be worth exploring.
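A stylized sketch of the arithmetic with made-up numbers: when the applicant's expected benefit from a closer (human) look exceeds the bank's cost of providing it, there is room for a mutually beneficial fee.

```python
# Stylized arithmetic with made-up numbers: is a paid 'closer look' worth it?
p_wrongful_denial = 0.05        # chance the model wrongly denies an eligible applicant
error_reduction = 0.6           # share of such errors a human review would catch
cost_to_applicant = 2000.0      # applicant's cost of a wrongful denial
human_review_cost = 40.0        # bank's cost of having a person re-check the file

expected_benefit_to_applicant = p_wrongful_denial * error_reduction * cost_to_applicant
print(expected_benefit_to_applicant)   # 60.0

# The applicant gains in expectation as long as the fee is below 60,
# and the bank breaks even as long as the fee is at least 40.
print(expected_benefit_to_applicant > human_review_cost)  # True: room for a mutually beneficial fee
```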

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the optimal solution. For instance, three years ago, the software to recreate shredded documents that won the DARPA award used "human[s] [to] verify what the computer was recommending." The same insight is used in character recognition tasks. I have used it to create software for matching dirty data; the software was used to merge shapefiles with precinct-level electoral returns.

The class of problems for which human input proves useful has one essential attribute: humans produce unbiased, if error-prone, estimates for these problems. So, for instance, it would be unwise to use humans for making the 'last mile' of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)
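A minimal sketch of the semi-automated pattern the post describes; the model, the confidence threshold, and the review function are all placeholders.

```python
# Minimal sketch of the semi-automated pattern: the model handles confident cases,
# and low-confidence cases are routed to a human reviewer.

def human_review(item):
    """Placeholder: a person labels the item (assumed unbiased but error-prone)."""
    raise NotImplementedError

def semi_automated_label(items, model_predict, threshold=0.9):
    """Label items automatically when the model is confident; otherwise ask a human."""
    labels = {}
    for item in items:
        label, confidence = model_predict(item)   # e.g., (predicted class, probability)
        labels[item] = label if confidence >= threshold else human_review(item)
    return labels
```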