Good NYT: Provision of ‘Not News’ in the NYT Over Time

29 Nov

The mainstream American news media is under siege from the political right, but for the wrong reasons. To the hyper-partisan political elites, small inequities in slant matter a lot. But hyperventilating about small issues doesn’t magically turn them into serious problems. It just makes them loom larger. And takes the focus away from the much more serious problems. The big afflictions of ‘MSM’ are: vacuity, sensationalism, a preference for opinions over facts, poor understanding of important issues, and disinterest in covering them diligently, disinterest in what goes on outside American borders, a poor understanding of numbers, policy myopia—covering events rather than issues, and fixation with breaking news.

Here we shed light on a small piece of the real estate: the provision of ‘not news.’ By ‘not news’ we mean news about cooking, travel, fashion, home decor, and such. We are very conservative in what we code as ‘not news,’ leaving news about health, anything local, and such in place.

We analyze provision of ‘not news’ in The New York Times (NYT), the nation’s newspaper of record. NYT is both well-regarded and popular. It has won more Pulitzer awards than any other newspaper. And it is the 30th most visited website in the U.S. (as of October 2017).

So, has the proportion of news stories about topics unrelated to politics or the economy, such as cooking, travel, fashion, music, etc., in NYT gone up over time? (For issues related to measurement, see here.) As the figure shows, the net provision of not news increased by 10 percentage points between 1987 and 2007.

Interested in learning about more ways in which NYT has changed over time? See here.

Confirmation Bias: Confirming Media Bias

31 Oct

It used to be that searches for ‘Fox News Bias’ were far more common than searches for ‘CNN bias’. Not anymore. The other notable thing—correlated peaks around presidential elections.

Note also that searches around midterm elections barely rise above the noise.

Town Level Data on Cable Operators and Cable Channels

12 Sep

I am pleased to announce the release of TV and Cable Factbook Data (1997–2002; 1998 coverage is modest). Use of the data is restricted to research purposes.


In 2007, Stefano DellaVigna and Ethan Kaplan published a paper that used data from Warren’s Factbook to identify the effect of the introduction of Fox News Channel on Republican vote share (link to paper). Since then, a variety of papers exploiting the same data and identification scheme have been published (see, for instance, Hopkins and Ladd, Clinton and Enamorado, etc.)

In 2012, I embarked on a similar such project—trying to use the data to study the impact of the introduction of Fox News Channel on attitudes and behaviors related to climate change. However, I found the original data to be limited—DellaVigna and Kaplan had used a team of research assistants to manually code a small number of variables for a few years. So I worked on extending the data. I planned on extending the data in two ways: adding more years, and adding ‘all’ the data for each year. To that end, I developed custom software. The data collection and parsing of a few thousand densely packed, inconsistently formatted, pages (see below) to a usable CSV (see below) finished sometime early in 2014. (To make it easier to create a crosswalk with other geographical units, I merged the data with Town lat/long (centroid) and elevation data from

Sample Page
Snapshot of the Final CSV

Soon after I finished the data collection, however, I became aware of a paper by Martin and Yurukoglu. They found some inconsistencies between the Nielsen data and the Factbook data (see Appendix C1 of paper), tracing the inconsistencies to delays in updating the Factbook data—“Updating is especially poor around [DellaVigna and Kaplan] sample year. Between 1999 and 2000, only 22% of observations were updated. Between 1998 and 1999, only 37% of observations were updated.” Based on their paper, I abandoned the plan to use the data, though I still believe the data can be used for a variety of important research projects, including estimating the impact of the introduction of Fox News. Based on that belief, I am releasing the data.

Ipso Facto: Analysis of Complaints to IPSO

11 Jun

Independent Press Standards Agency (IPSO) handles complaints about accuracy etc. in the media in the U.K. Against which media organization are most complaints filed? And against which organization are the complaints most often upheld? We answer these questions using data from the IPSO website. (The data and scripts behind the analysis are posted on GitHub.)

Between its formation in September, 2014 and May 20th, 2016, IPSO received 371 complaints. Expectedly, tabloid newspapers are well represented. Of the 371 complaints, The Telegraph alone received 35 complaints, or about 9.4% of the total complaints. It was followed closely by The Mail with 31 complaints. The Times had 25 complaints filed against it, The Mirror and The Express 22 each, and The Sun, 19 complaints.

Generally, less than half the number of complaints were completely or partly upheld. Topping the list was The Express and The Telegraph with 10 upheld complaints each. And following close behind was The Times with 8 complaints, The Mail with 6, and The Sun and the Daily Star with 4 each.

See also the plot of batting average of media organizations with most complaints against them.

Some Facts About PolitiFact

27 May

I assessed PolitiFact on:

1. Imbalance in scrutiny: Do they vet statements by Democrats or Democratic-leaning organizations more than statements Republicans or Republican-leaning organizations?

2. Batting average by party: Roughly n_correct/n_checked, but instantiated here as mean Politifact rating.

To answer the questions, I scraped the data from PolitiFact and independently coded and appended data on the party of the person or organization covered. (Feel free to download the script for scraping and analyzing the data, scraped data and data linking people and organizations to party from the GitHub Repository.)

Until now, Politifact has checked veracity 3,859 statements by 703 politicians and organizations. Of these, I was able to establish the partisanship of 554 people and organizations. I restrict the analysis to 3,396 statements by organizations and people whose partisanship I could establish and who lean either towards the Republican or Democratic party. I code the Politifact 6-point True to Pants on Fire scale (true, mostly-true, half-true, barely-true, false, pants-fire) linearly so that it lies between 0 (pants-fire) and 1 (true).

Of the 3,396 statements, about 44% (n = 1506) of the statements checked by PolitiFact are by Democrats or Democratic-leaning organizations. Rest of the roughly 56% (n = 1890) are by Republicans or Republican-leaning organizations. The average PolitiFact rating of statements by Democrats or Democratic-leaning organizations (batting average) is .63; it is .49 for statements by Republicans or Republican-leaning organizations.

To check whether the results are driven by some people receiving a lot of scrutiny, I tallied the total number of statements investigated for each person. Unsurprisingly, there is a large skew, with a few prominent politicians receiving a bulk of the attention. For instance, PolitiFact investigated more than 500 claims by Barack Obama alone. The figure below plots the total number of statements investigated for thirty politicians receiving the most scrutiny.

If you take out Barack Obama, the percentage of Democrats receiving scrutiny reduces to 33.98%. More generally, limiting ourselves to the bottom 90% of the politicians in terms of scrutiny received, the share of Democrats is about 42.75%.

To analyze whether there is selection bias in covering politicians who say incorrect things more often, I estimated the correlation between the batting average and the total number of statements investigated. The correlation is very weak and does not appear to vary systematically by party. Accounting for the skew by taking the log of the total statements or by estimating a rank-ordered correlation has little effect. The figure below plots batting average as a function of total statements investigated.


Caveats About Interpretation

To interpret the numbers, you need to make two assumptions:

1. The number of statements made by Republicans and Republican-leaning persons and organizations is the same as that made by people and organizations on the left.

2. Truthiness of statements by Republican and Republican-leaning persons and organizations is the same as that of left-leaning people and organizations.

Getting a Measure of Measures of Selective Exposure

24 Jul

Ideally, we would like to be able to place ideology of each bit of information consumed in relation to the ideological location of the person. And we would like a time-stamped distribution of the bits consumed. We can then summarize various moments of that distribution (or the distribution of ideological distances). And that would be that. (If we were worried about dimensionality, we would do it by topic.)

But lack of data means we must change the estimand. We must code each bit of information as merely uncongenial or uncongenial. This means taking directionality out of the equation. For a Republican at a 6 on a 1 to 7 liberal to conservative scale, consuming a bit of information at 5 is the same as consuming a bit at 7.

The conventional estimand then is a set of two ratios: (Bits of politically congenial information consumed)/(All political information) and (Bits of uncongenial information)/(All political information consumed). Other reasonable formalizations exist, including the difference between congenial and uncongenial. (Note that the denominator is absent, and reasonably so.)

To estimate these quantities, we must often make further assumptions. First, we must decide on the domain of political information. That domain is likely vast and increasing by the minute. We are all producers of political information now. (We always were but today we can easily access political opinions of thousands of lay people.) But see here for some thoughts on how to come up with the relevant domain of political information from passive browsing data.

Next, people often code ideology at the level of ‘source.’ The New York Times is ‘independent’ or ‘liberal’ and ‘Fox’ simply ‘conservative’ or perhaps more accurately ‘Republican-leaning.’ (Continuous measures of ideology — as estimated by Groseclose and Milyo or Gentzkow and Shapiro — are also assigned at the source level.) This is fine except that it means coding all bits of information consumed from a source as the same. And there are some attendant risks. We know that not all NYT articles are ‘liberal.’ In fact, we know much of it is not even political news. A toy example of how such measures can mislead:

Page Views: 10 Fox, 10 CNN. Est: 10/20
But say Fox Pages 7R, 3D and CNN 5R, 5D
Est: 7/10 + 5/10 = 12/20

If the measure of ideology is continuous, there are still some risks. If we code all page views as the mean ideology of the source, we assume that the person views a random sample of pages on the source. (Or some version of that.) But that is too implausible an assumption. It is much more likely that a liberal reading the NYT likely stays away from the David Brooks’ columns. If you account for such within source self-selection, selective exposure measures based on source level coding are going to be downwardly biased — that is find people as less selective than they are.

Discussion until now has focused on passive browsing data, eliding over survey measures. There are two additional problems with survey measures. One is about the denominator. Measures based on limited choice experiments like ones used by Iyengar and Hahn 2009 are bad measures of real-life behavior. In real life, we just have far more choices. And inferences from such experiments can at best recover ordinal rankings. The second big problem with survey measures is ‘expressive responding.’ Republicans indicating they watch Fox News not because they do but because they want to convey they do.

Where’s the Porn? Classifying Porn Domains Using a Calibrated Keyword Classifier

23 Jul

Aim: Given a very large list of unique domains, find domains carrying adult content.

In the 2004 comScore browsing data, for instance, there are about a million unique domains. Comparing a million unique domain names against a large database is doable. But access to such databases doesn’t often come cheap. So a hack.

Start with an exhaustive key word search containing porn-related keywords. Here’s mine

breast, boy, hardcore, 18, queen, blowjob, movie, video, love, play, fun, hot, gal, pee, 69, naked, teen, girl, cam, sex, pussy, dildo, adult, porn, mature, sex, xxx, bbw, slut, whore, tit, pussy, sperm, gay, men, cheat, ass, booty, ebony, asian, brazilian, fuck, cock, cunt, lesbian, male, boob, cum, naughty

For the 2004 comScore data, this gives about 140k potential porn domains. Compare this list to the approximately 850k porn domains in the shallalist. This leaves us with a list of 68k domains with uncertain status. Use one of the many URL classification APIs. Using Trusted Source API, I get about 20k porn, and 48k non-porn.

This gives us the lower bound of adult domains. But perhaps much too low.

To estimate the false positives, take a large random sample (say 10,000 unique domains). Compare results from keyword search and eliminate using API to API search of all 10k domains. This will give you an estimate of false positive rate. But you can learn from the list of false negatives to improve your keyword search. And redo everything. A couple of iterations can produce a sufficiently low false negative rate (false positive rate is always ~ 0). (For 2004 comScore data, false negative rate of 5% is easily achieved.)

Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP,, a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and
collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kind of media. Using these lists to estimate consumption raises three obvious concerns – 1) Even sites classified as ‘news sites,’ such as the NYT, carry a fair bit of non-news 2) (speaking categorically) There is the danger of ‘false positives’ 3) And (speaking categorically again) there is a danger of ‘false negatives.’

FGR address the first concern by exploiting the URL structure. They exploit the fact that the URL of NY Times story contains information about the section. (The classifier is assumed to be perfect. But likely isn’t. False positive and negative rates for this kind of classification can be estimated using raw article data.) This leaves us with concern about false positives and negatives at the domain level. Lists like those published by DMOZ appear to be curated well-enough to not contain too many false-positives. The real question is about how to calibrate false negatives. Here’s one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large comprehensive database like Shallalist. Of the domains that aren’t in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against Shallalist is to reduce the amount of querying.) Using the results, estimate the proportion of missing domain names (the net number of missing domain names is likely much much larger). Also estimate missed visitation time, page views etc.

Liberal Politicians are Referred to More Often in News

8 Jul

The median Democrat referred to in television news is to the left of the House Democratic Median, and the median Republican politician referred to is to the left of the House Republican Median.

Click here for the aggregate distribution.

And here’s a plot of top 50 politicians cited in news. The plot shows a strong right skewed distribution with a bias towards executives.

News data: UCLA Television News Archive, which includes closed-caption transcripts of all national, cable and local (Los Angeles) news from 2006 to early 2013. In all, there are 155,814 transcripts of news shows.

Politician data: Database on Ideology, Money in Politics, and Elections (see Bonica 2012).

Taking out data from local news channels or removing Obama does little to change the pattern in the aggregate distribution.