## Ipso Facto: Analysis of Complaints to IPSO

11 Jun

The Independent Press Standards Organisation (IPSO) handles complaints about accuracy and related editorial standards in the U.K. press. Against which media organization are the most complaints filed? And against which organization are complaints most often upheld? We answer these questions using data from the IPSO website. (The data and scripts behind the analysis are posted on GitHub.)

Between its formation in September 2014 and May 20, 2016, IPSO received 371 complaints. As expected, tabloid newspapers are well represented. Of the 371 complaints, The Telegraph alone received 35, or about 9.4% of the total. It was followed closely by The Mail with 31 complaints. The Times had 25 complaints filed against it, The Mirror and The Express 22 each, and The Sun 19.

In general, fewer than half of the complaints were fully or partly upheld. Topping the list were The Express and The Telegraph, with 10 upheld complaints each. Following close behind were The Times with 8, The Mail with 6, and The Sun and the Daily Star with 4 each.

See also the plot of batting average of media organizations with most complaints against them.
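The batting average mentioned above follows directly from these counts. A minimal sketch in Python, using the numbers reported above (the Daily Star's complaint total is not reported, so it is omitted):

```python
# Upheld share per outlet: (complaints filed, complaints fully or partly upheld),
# using the counts reported in the text.
counts = {
    "The Telegraph": (35, 10),
    "The Mail": (31, 6),
    "The Times": (25, 8),
    "The Express": (22, 10),
    "The Sun": (19, 4),
}
upheld_share = {outlet: round(upheld / filed, 2)
                for outlet, (filed, upheld) in counts.items()}
print(upheld_share)
```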

## Clustering and then Classifying: Improving Prediction for Crudely-labeled and Mislabeled Data

8 Jun

Mislabeled and crudely labeled data are common problems in data science. Supervised prediction on such data expectedly yields poor results: coefficients are biased, and accuracy with regard to the true label is poor. One solution is to hand-code labels for an 'adequate' sample and infer true labels using a model trained on that data.

Another solution relies on the intuition (assumption) that the distance between rows (covariates) sharing a label will be lower than the distance between rows with different labels. One way to leverage that intuition is to cluster the data within each label, infer true labels from the erroneous labels, and then predict the inferred true labels. For a class of problems, the method can be shown to always improve accuracy. (You can also predict just the cluster labels.)

Here’s one potential solution for a scenario where we have a binary dependent variable:

Assume a mislabeled vector, mis_label, that codes some true 0s as 1s and some true 1s as 0s.

1. For each value of mis_label (1 and 0), run k-means with k = 2 to get two clusters within each label, for a total of four cluster labels.
2. Assuming the mislabeling rate is below 50%, create a new column, est_true_label, which takes: 1 when mis_label = 1 and the row falls in the majority cluster (the cluster containing more than 50% of the rows with mis_label = 1), otherwise 0; and 0 when mis_label = 0 and the row falls in the majority cluster (more than 50% of the rows with mis_label = 0), otherwise 1.
3. Predict est_true_label using a logistic regression, and produce accuracy estimates based on the true labels and bias estimates for the coefficients (compared to the coefficients from a logistic regression on the true labels).
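The steps above can be sketched end to end. A minimal pure-Python illustration on one-dimensional toy data (the data, the 20% flip rate, and the kmeans2 helper are all illustrative; step 3's logistic regression is replaced by a simple accuracy check against the true labels):

```python
import random

random.seed(0)

def kmeans2(xs, iters=50):
    """Lloyd's algorithm with k = 2 on a 1-D list; returns 0/1 cluster ids."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        assign = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, a in zip(xs, assign) if a == 0]
        g1 = [x for x, a in zip(xs, assign) if a == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return assign

# Toy data: true class 0 centered at 0, true class 1 centered at 5.
true_label = [0] * 100 + [1] * 100
x = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]

# Flip roughly 20% of each class (mislabeling rate < 50%, as assumed above).
mis_label = [1 - t if random.random() < 0.2 else t for t in true_label]

# Steps 1 and 2: cluster within each label, keep the majority cluster's
# label, flip the minority cluster's label.
est_true_label = list(mis_label)
for lab in (0, 1):
    idx = [i for i, m in enumerate(mis_label) if m == lab]
    assign = kmeans2([x[i] for i in idx])
    majority = 1 if sum(assign) > len(assign) / 2 else 0
    for i, a in zip(idx, assign):
        est_true_label[i] = lab if a == majority else 1 - lab

acc_before = sum(m == t for m, t in zip(mis_label, true_label)) / len(true_label)
acc_after = sum(e == t for e, t in zip(est_true_label, true_label)) / len(true_label)
print(acc_before, acc_after)
```

With well-separated clusters, the estimated labels recover nearly all of the flipped labels, so accuracy against the true labels improves.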

## The Missing Plot

27 May

Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing.

Some analysts simply list-wise delete rows with missing values. Others impute, replacing missing values with the mean or median. Yet others use more sophisticated methods to impute missing values. None of these methods, however, acknowledges in the resulting visualizations that any of the data are missing.

It is important to acknowledge missing data.

One way to do it is to provide a tally of how much data are missing on each of the variables in a small table in the graph. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple: create a dummy vector that tallies missing values, and plot the dummy vector in addition to the data. For instance, see:

(The script to produce the graph can be downloaded from the following GitHub Gist.)

In cases where missing values are imputed, the dummy vector can also be used to 'color' the points that were imputed.
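The dummy vector can be sketched in a few lines of Python (toy data; variable names are illustrative, and the missing_x positions are what one would plot, say as a rug along the x-axis):

```python
# Toy bivariate data with missing y values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, None, 3.9, None, 6.2, 7.0]

# Dummy vector: 1 where y is missing, 0 otherwise.
y_missing = [1 if v is None else 0 for v in y]

# Points to plot normally, and x-positions where the missing-value
# indicator would be drawn alongside the data.
observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
missing_x = [xi for xi, m in zip(x, y_missing) if m == 1]
print(observed, missing_x)
```

The same dummy vector can then drive the coloring of imputed points.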

27 May

I assessed PolitiFact on:

1. Imbalance in scrutiny: Do they vet statements by Democrats or Democratic-leaning organizations more than statements by Republicans or Republican-leaning organizations?

2. Batting average by party: Roughly n_correct/n_checked, but instantiated here as the mean PolitiFact rating.

To answer the questions, I scraped the data from PolitiFact and independently coded and appended data on the party of the person or organization covered. (Feel free to download the script for scraping and analyzing the data, scraped data and data linking people and organizations to party from the GitHub Repository.)

To date, PolitiFact has checked the veracity of 3,859 statements by 703 politicians and organizations. Of these, I was able to establish the partisanship of 554 people and organizations. I restrict the analysis to the 3,396 statements by organizations and people whose partisanship I could establish and who lean toward either the Republican or the Democratic party. I code PolitiFact's six-point True to Pants on Fire scale (true, mostly-true, half-true, barely-true, false, pants-fire) linearly so that it lies between 0 (pants-fire) and 1 (true).
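The linear recode of the six-point scale can be sketched as follows (the rating strings are the slugs listed above):

```python
# Map the six-point scale linearly onto [0, 1]: pants-fire -> 0, true -> 1.
scale = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]
rating_to_score = {r: i / (len(scale) - 1) for i, r in enumerate(scale)}
print(rating_to_score["pants-fire"], rating_to_score["half-true"], rating_to_score["true"])
```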

Of the 3,396 statements, about 44% (n = 1,506) were made by Democrats or Democratic-leaning organizations. The rest, roughly 56% (n = 1,890), were made by Republicans or Republican-leaning organizations. The average PolitiFact rating of statements by Democrats or Democratic-leaning organizations (their batting average) is .63; it is .49 for statements by Republicans or Republican-leaning organizations.

To check whether the results are driven by some people receiving a lot of scrutiny, I tallied the total number of statements investigated for each person. Unsurprisingly, there is a large skew, with a few prominent politicians receiving the bulk of the attention. For instance, PolitiFact investigated more than 500 claims by Barack Obama alone. The figure below plots the total number of statements investigated for the thirty politicians receiving the most scrutiny.

If you take out Barack Obama, the share of statements by Democrats drops to 33.98%. More generally, limiting ourselves to the bottom 90% of politicians in terms of scrutiny received, the share of Democrats is about 42.75%.

To analyze whether there is selection bias in covering politicians who say incorrect things more often, I estimated the correlation between the batting average and the total number of statements investigated. The correlation is very weak and does not appear to vary systematically by party. Accounting for the skew by taking the log of the total statements or by estimating a rank-ordered correlation has little effect. The figure below plots batting average as a function of total statements investigated.
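A rank-ordered correlation of the kind described above can be sketched in pure Python (the counts and batting averages below are illustrative, and ties are assumed away rather than handled with average ranks):

```python
def rank(xs):
    """Rank of each value (1 = smallest); assumes distinct values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))

n_checked = [520, 40, 12, 7, 3]          # skewed scrutiny counts (illustrative)
batting   = [0.55, 0.6, 0.5, 0.62, 0.48]  # batting averages (illustrative)
print(round(spearman(n_checked, batting), 2))
```

Because Spearman works on ranks, it is unaffected by taking the log of the scrutiny counts, which is why the two robustness checks in the text behave similarly.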

To interpret the numbers, you need to make two assumptions:

1. The number of statements made by Republicans and Republican-leaning persons and organizations is the same as the number made by people and organizations on the left.

2. The truthfulness of statements by Republican and Republican-leaning persons and organizations is the same as that of statements by left-leaning people and organizations.

## About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

26 Apr

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At first blush, using data from users in this way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And a variety of systems do precisely this. In the context of cyber data (and a broad category of similar data), however, it may not be the right thing to do, for two big reasons. First, a low false-positive rate is much easier to achieve if we do not care about the false-negative rate, and there are good reasons to worry a lot about false-negative rates in cyber data. Second, and perhaps more importantly, incorporating user input on complex tasks (or where the data are insufficiently rich) reduces to the following: given a complex task and inadequate time, users fall back on cheap heuristics to label the data, and the supervised aspect of the algorithm reduces to learning the cheap heuristics that humans use.
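The first point, that a low false-positive rate is cheap if false negatives are ignored, is easy to see with toy numbers (all counts below are illustrative, not from the MIT study):

```python
# Illustrative counts for a conservative detector over ~1M events:
# it flags only the most obvious cases.
flagged_attacks = 85       # true attacks caught
missed_attacks = 15        # true attacks missed (false negatives)
false_alarms = 50          # benign events flagged (false positives)
benign = 1_000_000         # benign events seen

fpr = false_alarms / benign                                # tiny
fnr = missed_attacks / (flagged_attacks + missed_attacks)  # large by comparison
print(fpr, fnr)
```

A headline figure like "85% of attacks predicted" is thus compatible with a detector that looks excellent on false positives while quietly missing a substantial share of attacks.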

## The Case for Ending Closed Academic Publishing

21 Mar

A few commercial publishers publish a large chunk of top-flight academic research. And they earn a pretty penny doing so. The publishers' standard operating model is as follows: pay the editorial board no more than \$70-\$100k, pay for typesetting and publishing, and in turn get the copyrights to academic papers. Then charge already locked-in institutional customers (university and government libraries) and ordinary scholars extortionate rates. The model is gratuitously dysfunctional.

Assuming there are no long-term contracts with the publishers, the system ought to be rapidly dismantled. But if dismantling is easy, creating something better may not be. It just happens to be easy as well. A majority of the cost of publishing is in printing on paper, and the twenty-first century has made printing large organized bundles on paper largely obsolete; those who need paper can print at home. Beyond that, open-source software for administering a journal already exists. And the model of a single editor with veto powers seems anachronistic; editing duties can be spread around, much like peer review. Unpaid peer review can survive as it always has, though better mechanisms can be thought up. If some money is still needed for administration, it could be raised easily by charging a nominal submission tax, waived where the author self-identifies as being unable to pay.

## Interpreting Clusters and ‘Outliers’ from Clustering Algorithms

19 Feb

Assume that the data from the dominant data generating process are structured so that they occupy a few small portions of a high-dimensional space. Say we use a hard partition clustering algorithm to learn the structure of the data. And say that it does—learn the structure. Anything that lies outside the few narrow pockets of high-dimensional space is an ‘outlier,’ improbable (even impossible) given the dominant data generating process. (These ‘outliers’ may be generated by small malicious data generating processes.) Even points on the fringes of the narrow pockets are suspicious. If so, one reasonable measure of the suspiciousness of a point is its distance from the centroid of the cluster to which it is assigned; the further the point from the centroid, the more suspicious it is. (Distance can be some multivariate distance metric, or the proportion of points assigned to the cluster that are further away from the cluster centroid than the point whose score we are tallying.)

How can we interpret an outlier (score), tautological explanations aside (it is improbable given the dominant data generating process)?

Simply providing the distance to the centroid doesn’t give enough context. And for obvious reasons, for high-dimensional vectors, providing the distance on each feature isn’t reasonable either. A better approach involves some feature selection. This can be done in various ways, all of which take the same general form: find the distance to the centroid on the features on which the points assigned to the cluster have the least variation, or on the features that discriminate the cluster from other clusters the best, or on the features that best predict distance from the cluster centroid. Limit the features to a small set. On this limited feature set, calculate cluster means and standard deviations, and give the standardized distance (for categorical variables, just provide ) to the centroid.
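The second distance metric mentioned above, the share of a cluster's points that sit further from the centroid than a given point, can be sketched directly (pure Python, Euclidean distance, toy 2-D cluster; the smaller the share, the more suspicious the point):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def prop_further(point, cluster):
    """Share of cluster points further from the centroid than `point`."""
    centroid = [sum(c) / len(cluster) for c in zip(*cluster)]
    d = dist(point, centroid)
    return sum(dist(p, centroid) > d for p in cluster) / len(cluster)

# Toy cluster: four points near the origin and one far-off point.
cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(prop_further((10, 10), cluster))  # nothing is further: very suspicious
print(prop_further((1, 1), cluster))    # most points are further: not suspicious
```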

Read More (pdf with pseudo code)

## Sampling With Coprimes

1 Jan

Say you want to sample from a sequence of length n. Multiples of a number that is relatively prime (coprime) to n cover the entire sequence, with the property that the entire sequence is covered before any number repeats. This is a known result from number theory. We can use the result to sequentially sample from a series (see below for what I mean).

For instance, if the sequence is 1, 2, 3, …, 9, the number 5 is one such number (5 and 9 are coprime). Taking multiples of 5 and reducing them mod 9 (reading 0 as 9), we visit the positions in the order 5, 1, 6, 2, 7, 3, 8, 4, 9: every position is covered exactly once before any number repeats.
If the length of the sequence is odd, then 2 will do. But not all even numbers will do. For instance, for the same length of 9, choosing 6 results in 6, 3, 9, and then 6 again.

Some R code:

```r
# Multiples of a number coprime to seq_length cover every position once
seq_length = 6
rel_prime  = 5  # 5 and 6 are coprime
multiples  = rel_prime * (1:seq_length)
# Map each multiple into 1..seq_length (0 maps to seq_length)
multiples  = ifelse(multiples > seq_length, multiples %% seq_length, multiples)
multiples  = ifelse(multiples == 0, seq_length, multiples)
length(unique(multiples))  # 6: the whole sequence is covered
```

Where can we use this? It makes passes over an address space less discoverable.
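The coverage property, and the failure case for non-coprime steps, can also be checked in a few lines (a Python sketch; the function name is illustrative):

```python
from math import gcd

def coverage(n, step):
    """Positions visited by step, 2*step, ..., n*step, taken mod n (0 read as n)."""
    return [((step * k - 1) % n) + 1 for k in range(1, n + 1)]

assert gcd(5, 9) == 1                # 5 and 9 are coprime
print(sorted(coverage(9, 5)))        # covers all of 1..9
assert gcd(6, 9) == 3                # 6 and 9 are not coprime
print(coverage(9, 6)[:4])            # 6, 3, 9, then 6 again
```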

## Clarifai(ng) the Future of Clarifai

31 Dec

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of the Clarifai API tries to do two things at once: a) learn various recognizable patterns in images, and b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai will have to split these two things over time and allow people to specify which abstract dimensions are ‘appropriate’ for them. As the idiom goes, a picture is worth a thousand words. An image can carry information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of day, etc. Clarifai should allow people to pick the dimensions appropriate for the task. Defining dimensions would be hard, but that shouldn’t stymie the effort, and ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors; another could be the more ‘human’ dimension, etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning whether a photo features a man or a woman, and wants to build a Clarifai-based classifier. (That person is me. The task is inferring the gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all the other information in the images (jewelry, color, perspective, etc.) and provide an out-of-sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting them manage it on their end. To that end, again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go in. A product that automatically sections an image into multiple images and tags each section would be useful; it would allow one, for instance, to count the number of women in a photo. Another direction would be to provide the ‘best’ set of tags that collectively describe a set of images. (This may seem to violate the spirit of what I noted above, but it needn’t; a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines: a list of tags that distinguishes image(s) the best.

Beyond this, the obvious: Clarifai could provide synonyms of tags to make tags easier to use, and it could allow users to specify whether they want, say, tags in ‘UK English,’ etc.

## Congenial Invention and the Economy of Everyday (Political) Conversation

31 Dec

Communication comes from the Latin word communicare, which means ‘to make common.’ We communicate not only to transfer information but also to establish and reaffirm identities, mores, and meanings. (From my earlier note on a somewhat different aspect of the economy of everyday conversation.) Hence, there is often a large incentive for loyalty. More generally, there are three salient aspects to most private interpersonal communication about politics: shared ideological (or partisan) loyalties, little knowledge of and prior thinking about political issues, and a premium on cynicism. The second of these points, ignorance, cuts both ways. It allows for the possibility of getting away with saying something that doesn’t make sense (or isn’t true). And it also means that people need to invent stuff if they want to sound smart. (Looking stuff up is often still too hard. I am often puzzled by that.)

But don’t people know that they are making stuff up? And doesn’t that stop them? A defining feature of humans is overconfidence. People oftentimes aren’t aware of the depth of the wells of their own ignorance. And if it sounds right, well, it is right, they reason. The act of speaking is many a time an act of invention (or discovery). And we aren’t sure of, and don’t actively control, how we create. (The underlying mechanisms behind how we create, such as the use of ‘gut,’ are well known.) Unless we are very deliberate in speech. Most people aren’t. (There generally aren’t incentives to be.) And so they find it hard to vet the veracity of the invention (or discovery) in the short time between invention and vocalization.