Getting a Measure of a Measure: Measuring Selective Exposure

24 Jul

Ideally, we would like to place the ideology of each bit of information consumed relative to the ideological location of the person consuming it. And we would like a time-stamped distribution of the bits consumed. We could then summarize various moments of that distribution (or of the distribution of ideological distances). And that would be that. (If we were worried about dimensionality, we would do it by topic.)

But lack of data means we must change the estimand. We must code each bit of information as merely congenial or uncongenial. This means taking directionality out of the equation. For a Republican at a 6 on a 1 to 7 liberal to conservative scale, consuming a bit of information at 5 is the same as consuming a bit at 7.

The conventional estimand then is a set of two ratios: (Bits of politically congenial information consumed)/(All political information consumed) and (Bits of uncongenial information consumed)/(All political information consumed). Other reasonable formalizations exist, including the difference between congenial and uncongenial bits consumed. (Note that the denominator is absent from the difference, and reasonably so.)

To estimate these quantities, we must often make further assumptions. First, we must decide on the domain of political information. That domain is likely vast, and increasing by the minute. We are all producers of political information now. (We always were but today we can easily access political opinions of thousands of lay people.) But see here for some thoughts on how to come up with the relevant domain of political information from passive browsing data.

Next, generally, people code ideology at the level of ‘source.’ New York Times is ‘independent’ or ‘liberal’ and ‘Fox’ simply ‘conservative’ or perhaps more accurately ‘Republican leaning.’ (Continuous measures of ideology – as estimated by Groseclose and Milyo or Gentzkow and Shapiro – are also assigned at the source level.) This is fine except that it means coding all bits of information consumed from a source as the same. This is called ecological inference. And there are some attendant risks. We know that not all NYT articles are ‘liberal.’ In fact, we know much of it is not even political news. A toy example of how such measures can mislead:

Page views: 10 Fox, 10 CNN. Source-level est.: 10/20
But say the Fox pages are 7R, 3D and the CNN pages are 5R, 5D
Page-level est.: (7 + 5)/20 = 12/20
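
The toy example can be reproduced in a few lines. This is only a sketch of the arithmetic: it assumes that source-level coding labels Fox as R and CNN as D, and the names `page_views`, `source_level_share`, and `page_level_share` are made up for illustration.

```python
from fractions import Fraction

# Toy data from the example above. "label" is the (assumed) source-level
# coding; "R" and "D" are the counts of pages actually viewed of each type.
page_views = {
    "Fox": {"label": "R", "R": 7, "D": 3},
    "CNN": {"label": "D", "R": 5, "D": 5},
}


def total_views(views):
    """Total number of pages viewed across all sources."""
    return sum(v["R"] + v["D"] for v in views.values())


def source_level_share(views, side):
    """Share of views coded as `side` when every page inherits its source's label."""
    coded = sum(v["R"] + v["D"] for v in views.values() if v["label"] == side)
    return Fraction(coded, total_views(views))


def page_level_share(views, side):
    """Share of views that are `side` when each page is coded individually."""
    return Fraction(sum(v[side] for v in views.values()), total_views(views))


source_est = source_level_share(page_views, "R")  # 10/20
page_est = page_level_share(page_views, "R")      # 12/20
print(source_est, page_est, page_est - source_est)
```

In this example the source-level measure understates the share of R pages consumed by 2/20, purely because the source label is applied to every page view.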

If the measure of ideology is continuous, there are still some risks. If we code all page views at the mean ideology of the source, we assume that the person views a random sample of pages on the source. (Or some version of that.) But that is an implausible assumption. It is much more likely that a liberal reading the NYT stays away from David Brooks' columns. Given such within-source self-selection, selective exposure measures based on source-level coding are going to be downwardly biased; that is, they will find people to be less selective than they are.

The discussion until now has focused on passive browsing data, eliding survey measures. There are two additional problems with survey measures. One is about the denominator. Measures based on limited choice experiments, like the ones used by Iyengar and Hahn 2009, are bad measures of real-life behavior. In real life, we just have far more choices. And inferences from such experiments can at best recover ordinal rankings. The second big problem with survey measures is 'expressive responding': Republicans indicating they watch Fox News not because they do but because they want to convey that they do.

Where’s the Porn? Classifying Porn Domains Using a Calibrated Keyword Classifier

23 Jul

Aim: Given a very large list of unique domains, find domains carrying adult content.

In the 2004 comScore browsing data, for instance, there are about a million unique domains. Comparing a million unique domain names against a large database is doable. But access to such databases doesn’t often come cheap. So a hack.

Start with an exhaustive keyword search using porn-related keywords. Here's mine:

breast, boy, hardcore, 18, queen, blowjob, movie, video, love, play, fun, hot, gal, pee, 69, naked, teen, girl, cam, sex, pussy, dildo, adult, porn, mature, sex, xxx, bbw, slut, whore, tit, pussy, sperm, gay, men, cheat, ass, booty, ebony, asian, brazilian, fuck, cock, cunt, lesbian, male, boob, cum, naughty

For the 2004 comScore data, this gives about 140k potential porn domains. Compare this list to the approximately 850k porn domains in the Shallalist. Domains that match are confirmed as porn, which leaves us with a list of 68k domains with uncertain status. To resolve these, use one of the many URL classification APIs. Using the Trusted Source API, I get about 20k porn and 48k non-porn.

This gives us a lower bound on the number of adult domains, though perhaps one that is much too low.
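
The steps so far (keyword search, lookup against a known list, API check of the remainder) can be sketched as below. This is a rough illustration, not the code used for the comScore data: it assumes keywords are matched as substrings of the domain name, `known_adult` stands in for a list such as Shallalist, and `classify` is a hypothetical callable standing in for whatever URL classification API is used. All example domains and labels are made up.

```python
def keyword_candidates(domains, keywords):
    """Domains whose name contains at least one keyword as a substring."""
    lowered = [k.lower() for k in keywords]
    return {d for d in domains if any(k in d.lower() for k in lowered)}


def triage(domains, keywords, known_adult, classify):
    """Split keyword candidates into confirmed adult and rejected domains.

    `known_adult` plays the role of a blocklist such as Shallalist.
    `classify` is a hypothetical callable standing in for a URL
    classification API; it is only called on domains the list cannot settle.
    """
    candidates = keyword_candidates(domains, keywords)
    confirmed = candidates & known_adult
    uncertain = candidates - known_adult
    via_api = {d for d in uncertain if classify(d)}
    return confirmed | via_api, uncertain - via_api


# Made-up inputs for illustration only.
domains = {"adultfilms.example", "webcamshop.example",
           "xxxclips.example", "news.example"}
keywords = ["adult", "cam", "xxx"]
known_adult = {"adultfilms.example"}
api_labels = {"xxxclips.example": True, "webcamshop.example": False}

adult, rejected = triage(domains, keywords, known_adult,
                         lambda d: api_labels.get(d, False))
print(sorted(adult), sorted(rejected))
```

Only the domains that the known list cannot settle are sent to `classify`, which is the point of the lookup step: it keeps the number of API queries down.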

To estimate the false negatives, take a large random sample (say 10,000 unique domains). Compare the results from the keyword search followed by API elimination to the results from an API search of all 10k domains. This will give you an estimate of the false negative rate. You can also learn from the list of false negatives to improve your keyword search, and then redo everything. A couple of iterations can produce a sufficiently low false negative rate (the false positive rate is always ~ 0). (For the 2004 comScore data, a false negative rate of 5% is easily achieved.)
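
A minimal sketch of that calibration step follows. It assumes `flagged` holds the domains the keyword-plus-API pipeline labeled as adult, and `is_adult` is a hypothetical stand-in for running the classification API on every sampled domain. The function names and the example data are invented for illustration.

```python
import random


def draw_sample(domains, k, seed=0):
    """Random sample of up to k unique domains (seeded for reproducibility)."""
    pool = sorted(set(domains))
    return random.Random(seed).sample(pool, min(k, len(pool)))


def false_negative_rate(sample, flagged, is_adult):
    """Return (rate, missed) for the sampled domains.

    rate   -- share of truly adult domains in the sample that were not flagged,
              or None if the sample contains no adult domains
    missed -- the false negatives themselves, useful for finding new keywords
    """
    truly_adult = {d for d in sample if is_adult(d)}
    missed = sorted(truly_adult - set(flagged))
    rate = len(missed) / len(truly_adult) if truly_adult else None
    return rate, missed


# Made-up data: four adult domains, of which the pipeline caught three.
truth = {"a.example", "b.example", "c.example", "d.example"}
all_domains = truth | {"e.example", "f.example"}
flagged = {"a.example", "b.example", "c.example"}

sample = draw_sample(all_domains, 6)
rate, missed = false_negative_rate(sample, flagged, truth.__contains__)
print(rate, missed)
```

The `missed` list is what feeds the next iteration: inspect it for recurring strings, add them to the keyword list, and rerun.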

Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP), a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kinds of media. Using these lists to estimate consumption raises three obvious concerns: 1) even sites classified as 'news sites,' such as the NYT, carry a fair bit of non-news; 2) (speaking categorically) there is a danger of 'false positives'; and 3) (speaking categorically again) there is a danger of 'false negatives.'

FGR address the first concern by exploiting the URL structure. They exploit the fact that the URL of a NY Times story contains information about the section. (The classifier is assumed to be perfect. But it likely isn't. False positive and negative rates for this kind of classification can be estimated using raw article data.) This leaves us with the concern about false positives and negatives at the domain level. Lists like those published by DMOZ appear to be curated well enough to not contain too many false positives. The real question is how to calibrate false negatives. Here's one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large comprehensive database like Shallalist. For the domains that aren't in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against Shallalist is to reduce the amount of querying.) Using the results, estimate the proportion of missing domain names (the net number of missing domain names in the full data is likely much larger). Also estimate missed visitation time, page views, etc.
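
To make the URL-structure idea concrete, here is a minimal sketch of reading a section out of a story URL. It is not FGR's classifier. It assumes a path layout of the form /YYYY/MM/DD/section/.../slug, and both the example URLs and the `NEWS_SECTIONS` set are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of section names treated as news; a real list would
# have to be built per site.
NEWS_SECTIONS = {"us", "world", "politics", "business"}


def url_section(url):
    """Return the first non-numeric path segment (assumed to be the section)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for seg in segments:
        if not seg.isdigit():  # skip date components such as 2014/07/23
            return seg.lower()
    return None


def looks_like_news(url):
    """True if the URL's section is in the (assumed) news section list."""
    return url_section(url) in NEWS_SECTIONS


news_url = "https://www.nytimes.com/2014/07/23/us/politics/some-story.html"
other_url = "https://www.nytimes.com/2014/07/23/sports/some-game.html"
print(url_section(news_url), looks_like_news(news_url))
print(url_section(other_url), looks_like_news(other_url))
```

A rule like this inherits whatever errors the site's own sectioning has, which is why its false positive and negative rates still need to be checked against article text.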

Liberal politicians are referred to more often in news

8 Jul

The median Democrat referred to in television news is to the left of the House Democratic Median, and the median Republican politician referred to is to the left of the House Republican Median.

Click here for the aggregate distribution.

And here’s a plot of the top 50 politicians cited in news. The plot shows a strongly right-skewed distribution with a bias towards executives.

News data: UCLA Television News Archive, which includes closed-caption transcripts of all national, cable and local (Los Angeles) news from 2006 to early 2013. In all, there are 155,814 transcripts of news shows.

Politician data: Database on Ideology, Money in Politics, and Elections (see Bonica 2012).

Taking out data from local news channels or removing Obama does little to change the pattern in the aggregate distribution.

Reliving some of the high points of the 2008 presidential campaign

13 Nov

November 10, 2007: One of the first scandals to break out during the campaign was about planted questions in Hillary’s townhall meetings. “They asked me if I would ask the senator a question. I said, ‘Sure, you know,'” Gallo-Chasanoff told CNN. “He showed me in his binder, he had a piece of paper that had typed out questions on it. And the top one was planned specifically for a college student. It said ‘college student.'” ‘A video on MSNBC shows Gallo-Chasanoff reading the question word for word, and then winking when she was done.’ ABC News

November 10, 2007: “I love my wife and my five sons and their five wives. Wait a second. Let me clarify that. They each have one.” Mitt Romney (The Economist gave this quip the title ‘Best Freudian slip.’)

December 12, 2007: In kindergarten, Senator Obama wrote an essay titled ‘I Want to Become President.’ “Iis Darmawan, 63, Senator Obama’s kindergarten teacher, remembers him as an exceptionally tall and curly haired child who quickly picked up the local language and had sharp math skills. He wrote an essay titled, ‘I Want To Become President,’ the teacher said.”
From: Clinton campaign’s press-release.

December 13, 2007: “It’ll be, ‘When was the last time? Did you ever give drugs to anyone? Did you sell them to anyone?'” Shaheen on Obama
Bill Shaheen (husband of NH Senator-elect Jeanne Shaheen; national co-chairman of Clinton’s campaign at that point)

February 24, 2008: Bill Clinton speaking about Hillary’s inability to win caucus states – “the caucuses aren’t good for her. They disproportionately favor upper-income voters who, who, don’t really need a president but feel like they need a change.” Audacity of Hopelessness by Frank Rich

March 8, 2008: “She is a monster, too – that is off the record – she is stooping to anything,” Samantha Power; Obama’s foreign policy adviser.

March 10, 2008: Hillary Clinton chief spokesman Howard Wolfson declared Monday that Clinton does not consider Obama qualified to be vice president.

March 11, 2008: “I will not be discriminated against because I’m white.” Geraldine Ferraro

“If we can’t trust Mitt Romney on Ronald Reagan, how can we trust him to lead America?”
From John McCain’s attack ad on Romney

“The Clintons will be there when they need you,” said a Carter friend. (Maureen Dowd, NY Times)

May 3, 2008: When asked, at the Republican presidential primary debate at Simi Valley, whether any of the candidates did not believe in evolution, three candidates – Tancredo, Brownback, and Huckabee – raised their hands.

May 9, 2008: “Senator Obama’s support among working, hard-working Americans, white Americans, is weakening again.” (Hillary Clinton, Interview with USA Today)

[Figure: Google News Archives timeline graph of citations of 'Clinton Accuses Obama' between August 2007 and August 2008]

August 21, 2008: “I think – I’ll have my staff get to you. It’s condominiums where – I’ll have them get to you.” (John McCain unsure about the number of houses he owns.)

A special tribute to Palin:

September 24, 2008: “As Putin rears his head and comes into the air space of the United States of America, where– where do they go? It’s Alaska. It’s just right over the border.” (Interview with Katie Couric, CBS News)

In defense of Palin, she never said that she could see Russia from her house. (Time)

September 25, 2008: Couric: And when it comes to establishing your worldview, I was curious, what newspapers and magazines did you regularly read before you were tapped for this to stay informed and to understand the world?
Palin: I’ve read most of them, again with a great appreciation for the press, for the media.
Couric: What, specifically?
Palin: Um, all of them, any of them that have been in front of me all these years.
Couric: Can you name a few?
Palin: I have a vast variety of sources where we get our news, too.
CBS News

October 1, 2008: “Well, let’s see. There’s — of course — in the great history of America rulings there have been rulings.” Sarah Palin (When asked by Couric to name a Supreme Court decision, other than Roe vs. Wade, that she disagreed with; CBS News)

Graphical analyses of news coverage of Sarah Palin

12 Nov

News Coverage of Sarah Palin and Joseph Biden

Ratio of News stories by day covering Sarah Palin and Joseph Biden

Coverage of Sarah Palin's interview with Gibson, Couric, and Tina Fey's impersonation on SNL

Ratio of stories citing John McCain that also cited Sarah Palin vis-a-vis Ratio of stories citing Barack Obama that also mentioned Joe Biden

Palin's coverage: MSNBC and Fox
