Sample this: Sampling randomly from the streets

23 Jun

Say you want to learn about the average number of potholes per unit paved street in a city. To estimate that quantity, the following sampling plan can be employed:

1. Get all the streets in a city from Google Maps or OSM
2. Starting from one end of the street, split each street into .5 km segments till you reach the end of the street. The last segment (or the only segment, if the street is shorter than .5 km) can be shorter than .5 km.
3. Get the lat/long of start/end of the segments.
4. Create a database of all the segments: segment_id, street_name, start_lat, start_long, end_lat, end_long
5. Sample from rows of the database
6. Produce a CSV of the sampled segments (subset of step 4)
7. Plot the lat/long on Google Map — filling all the area within the segment.
8. Collect data on the highlighted segments.
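
A minimal Python sketch of steps 2 through 6, under loud assumptions: each street is treated as a straight line between two endpoints (real street geometry from OSM or Google Maps would be a polyline), points along it are found by linear interpolation of lat/long, and the two streets below are made up for illustration. The function names are hypothetical.

```python
import csv
import io
import math
import random

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def split_street(name, start, end, seg_km=0.5):
    """Split one straight street into segments of at most seg_km (steps 2-3)."""
    length = haversine_km(start[0], start[1], end[0], end[1])
    n = max(1, math.ceil(length / seg_km))
    rows = []
    for i in range(n):
        # Fractions of the way along the street; the last segment is capped at 1.
        f0 = min(i * seg_km / length, 1.0) if length else 0.0
        f1 = min((i + 1) * seg_km / length, 1.0) if length else 0.0
        rows.append({
            "segment_id": f"{name}-{i + 1}",
            "street_name": name,
            "start_lat": start[0] + f0 * (end[0] - start[0]),
            "start_long": start[1] + f0 * (end[1] - start[1]),
            "end_lat": start[0] + f1 * (end[0] - start[0]),
            "end_long": start[1] + f1 * (end[1] - start[1]),
        })
    return rows

# Made-up streets: (name, (start_lat, start_long), (end_lat, end_long)).
streets = [
    ("Main St", (40.0, -75.0), (40.0117, -75.0)),   # about 1.3 km
    ("Short Ln", (40.0, -75.0), (40.002, -75.0)),   # about 0.22 km
]

segments = [row for s in streets for row in split_street(*s)]   # step 4
sampled = random.Random(0).sample(segments, 2)                  # step 5

out = io.StringIO()                                              # step 6
writer = csv.DictWriter(out, fieldnames=list(segments[0].keys()))
writer.writeheader()
writer.writerows(sampled)
sampled_csv = out.getvalue()
```

Sampling rows uniformly gives a short final segment the same chance of selection as a full .5 km segment; whether to weight by segment length instead is a design choice the plan above leaves open.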

Ipso Facto: Analysis of Complaints to IPSO

11 Jun

The Independent Press Standards Organisation (IPSO) handles complaints about accuracy etc. in the media in the U.K. Against which media organization are the most complaints filed? And against which organization are complaints most often upheld? We answer these questions using data from the IPSO website. (The data and scripts behind the analysis are posted on GitHub.)

Between its formation in September 2014 and May 20th, 2016, IPSO received 371 complaints. As expected, tabloid newspapers are well represented. The Telegraph alone received 35 complaints, or about 9.4% of the total. It was followed closely by The Mail with 31 complaints. The Times had 25 complaints filed against it, The Mirror and The Express 22 each, and The Sun 19.

Generally, fewer than half of the complaints were completely or partly upheld. Topping the list were The Express and The Telegraph with 10 upheld complaints each. Following close behind were The Times with 8 upheld complaints, The Mail with 6, and The Sun and the Daily Star with 4 each.

See also the plot of batting average of media organizations with most complaints against them.
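
As a rough illustration, the counts quoted above are enough to compute a batting average for the five outlets that appear in both lists. This Python sketch assumes that "batting average" means upheld complaints divided by complaints received; the linked plot may define it differently.

```python
# Counts taken from the text above; only outlets with both figures are included.
TOTAL_COMPLAINTS = 371
complaints = {"The Telegraph": 35, "The Mail": 31, "The Times": 25,
              "The Express": 22, "The Sun": 19}
upheld = {"The Telegraph": 10, "The Mail": 6, "The Times": 8,
          "The Express": 10, "The Sun": 4}

# Share of all complaints received by each outlet.
share_of_complaints = {k: v / TOTAL_COMPLAINTS for k, v in complaints.items()}

# Assumed definition of batting average: upheld / received.
upheld_rate = {k: upheld[k] / complaints[k] for k in complaints}

for name in sorted(upheld_rate, key=upheld_rate.get, reverse=True):
    print(f"{name}: {upheld_rate[name]:.2f}")
```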

Clustering and then Classifying: Improving Prediction for Crudely-labeled and Mislabeled Data

8 Jun

Mislabeled and crudely labeled data are common problems in data science. Supervised prediction of such data expectedly yields poor results. To alleviate the problem, one simple solution is to cluster the data within each label, and then, instead of predicting original labels, predict cluster labels. For a class of problems, the method can be shown to always improve both comprehensibility and accuracy.
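
A toy Python sketch of the idea, not the method from the forthcoming note. It assumes one-dimensional data, a nearest-centroid classifier, a bare-bones k-means, and a made-up crude label "A" that lumps together two groups sitting on either side of label "B". Accuracy is measured on the training points themselves, so it only illustrates the mechanism.

```python
def mean(xs):
    return sum(xs) / len(xs)

def kmeans_1d(xs, k=2, iters=20):
    """Very small 1-D k-means (k >= 2); centroids start evenly spread over the range."""
    lo, hi = min(xs), max(xs)
    cents = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest_j = min(range(k), key=lambda j: abs(x - cents[j]))
            groups[nearest_j].append(x)
        # Keep the old centroid if a cluster ends up empty.
        cents = [mean(g) if g else c for g, c in zip(groups, cents)]
    return cents

def nearest(x, centroids):
    """Label of the closest centroid; centroids maps label -> value."""
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

# Made-up data: the crude label "A" covers two well-separated groups.
A = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.5]
B = [4.0, 5.0, 6.0]
data = [(x, "A") for x in A] + [(x, "B") for x in B]

# Option 1: predict the original labels directly.
crude_centroids = {"A": mean(A), "B": mean(B)}

# Option 2: cluster within each label, predict cluster labels, map back.
cluster_centroids = {f"A/{i}": c for i, c in enumerate(kmeans_1d(A, k=2))}
cluster_centroids["B/0"] = mean(B)

def accuracy(centroids):
    hits = 0
    for x, label in data:
        predicted = nearest(x, centroids).split("/")[0]  # "A/1" -> "A"
        hits += predicted == label
    return hits / len(data)

acc_crude = accuracy(crude_centroids)
acc_clustered = accuracy(cluster_centroids)
```

On this toy data the two crude centroids nearly coincide, so the crude classifier does poorly, while the cluster-level centroids separate every point.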

Detailed research note coming soon!

The Missing Plot

27 May

Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing. Some analysts simply list-wise delete points with missing values. Others impute, replacing missing values with mean or median. Others use yet more sophisticated methods to impute missing values. None of the methods, however, automatically acknowledge that any of the data are missing in the visualizations.

It is important to acknowledge missing data. One way to do it is by providing a tally of how much data are missing on each of the variables in a small table in the graph. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple. Create a dummy vector that flags missing values. And plot the dummy vector in addition to the data. For instance, see:
[Figure: missing]

(The script to produce the graph can be downloaded from the following GitHub Gist.)

In cases where missing values are imputed, the dummy vector can also be used to ‘color’ the points that were imputed.
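
A small Python sketch of the dummy-vector idea, using made-up data and no plotting library. It is not the script from the Gist. It builds the missingness dummy and tallies the share missing within bins of the covariate, which is the quantity one would then draw alongside the data.

```python
# Made-up bivariate data: x is the covariate, y has missing values (None).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, None, 4.2, 5.1, None, 7.0, None, None, None]

# Dummy vector: 1 where y is missing, 0 otherwise.
missing = [1 if v is None else 0 for v in y]

def share_missing(x, missing, edges):
    """Share of missing y within each covariate bin (lo, hi]."""
    shares = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        flags = [m for xi, m in zip(x, missing) if lo < xi <= hi]
        if flags:
            shares[(lo, hi)] = sum(flags) / len(flags)
    return shares

shares = share_missing(x, missing, edges=[0, 5, 10])

# The same dummy can tag imputed points, for example to color them in a plot.
observed = [v for v in y if v is not None]
fill = sum(observed) / len(observed)  # mean imputation, as one simple choice
points = [(xi, fill if m else v, bool(m)) for xi, v, m in zip(x, y, missing)]
```

Here missingness rises with the covariate, which a complete-case plot of y against x would hide.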

Some Facts About PolitiFact

27 May

I assessed PolitiFact on:

1. Imbalance in scrutiny: Do they vet statements by Democrats or Democratic-leaning organizations more than statements by Republicans or Republican-leaning organizations?

2. Batting average by party: Roughly n_correct/n_checked, but instantiated here as mean PolitiFact rating.

To answer the questions, I scraped the data from PolitiFact and independently coded and appended data on the party of the person or organization covered. (Feel free to download the script for scraping and analyzing the data, scraped data and data linking people and organizations to party from the GitHub Repository.)

Until now, PolitiFact has checked the veracity of 3,859 statements by 703 politicians and organizations. Of these, I was able to establish the partisanship of 554 people and organizations. I restrict the analysis to 3,396 statements by organizations and people whose partisanship I could establish and who lean either towards the Republican or Democratic party. I code the PolitiFact 6-point True to Pants on Fire scale (true, mostly-true, half-true, barely-true, false, pants-fire) linearly so that it lies between 0 (pants-fire) and 1 (true).

Of the 3,396 statements, about 44% (n = 1506) are by Democrats or Democratic-leaning organizations. The remaining roughly 56% (n = 1890) are by Republicans or Republican-leaning organizations. The average PolitiFact rating of statements by Democrats or Democratic-leaning organizations (batting average) is .63; it is .49 for statements by Republicans or Republican-leaning organizations.
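
To make the coding concrete, here is a minimal Python sketch of the linear 0 to 1 scale and the per-party batting average. The five statements are made up for illustration and the variable names are hypothetical; this is not the analysis script from the repository.

```python
# The six ratings, ordered from worst to best, mapped linearly onto [0, 1].
SCALE = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]
SCORE = {rating: i / (len(SCALE) - 1) for i, rating in enumerate(SCALE)}

# Made-up (party, rating) pairs standing in for the scraped statements.
statements = [
    ("D", "true"), ("D", "half-true"),
    ("R", "false"), ("R", "mostly-true"), ("R", "pants-fire"),
]

def batting_average(statements):
    """Mean coded rating per party."""
    totals, counts = {}, {}
    for party, rating in statements:
        totals[party] = totals.get(party, 0.0) + SCORE[rating]
        counts[party] = counts.get(party, 0) + 1
    return {party: totals[party] / counts[party] for party in totals}

def share_checked(statements):
    """Share of checked statements attributed to each party."""
    n = len(statements)
    return {p: sum(1 for q, _ in statements if q == p) / n
            for p in {q for q, _ in statements}}

averages = batting_average(statements)
shares = share_checked(statements)
```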

To check whether the results are driven by some people receiving a lot of scrutiny, I tallied the total number of statements investigated for each person. Unsurprisingly, there is a large skew, with a few prominent politicians receiving a bulk of the attention. For instance, PolitiFact investigated more than 500 claims by Barack Obama alone. The figure below plots the total number of statements investigated for thirty politicians receiving the most scrutiny.
[Figure: total number of statements investigated for the thirty politicians receiving the most scrutiny]

If you take out Barack Obama, the percentage of Democrats receiving scrutiny reduces to 33.98%. More generally, limiting ourselves to the bottom 90% of the politicians in terms of scrutiny received, the share of Democrats is about 42.75%.

To analyze whether there is selection bias toward covering politicians who say incorrect things more often, I estimated the correlation between batting average and the total number of statements investigated. The correlation is very weak and does not appear to vary systematically by party. Accounting for the skew by taking the log of the total statements or estimating a rank-ordered correlation similarly has little effect. The figure below plots batting average as a function of total statements investigated.

[Figure: batting average as a function of total statements investigated]

Caveats About Interpretation

To interpret the numbers, you need to make two assumptions:

1. The number of statements made by Republicans and Republican-leaning persons and organizations is the same as that made by people and organizations on the left.

2. Truthiness of statements by Republican and Republican-leaning persons and organizations is the same as that of left-leaning people and organizations.

About 85% Problematic: The trouble with predicting 85 percent of cyber-attacks using input from human experts

26 Apr

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar such data), however, it may not be the right thing to do. There are two big reasons for that. First, a low false positive rate can be much more easily achieved if we do not care about the false negative rate. And there are good reasons to worry a lot about false negative rates in cyber data. Second, and perhaps more importantly, incorporating user input on complex tasks (or where data are insufficiently rich) reduces to the following: given a complex task with inadequate time, the users use cheap heuristics to label the data, and the supervised aspect of the algorithm reduces to learning the cheap heuristics that humans use.
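
The first point can be seen in a few lines of Python. The scores and labels below are made up and have nothing to do with the MIT system; the sketch only shows that raising the alert threshold drives the false positive rate to zero while the false negative rate climbs.

```python
def error_rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a given threshold."""
    tp = fp = tn = fn = 0
    for score, is_attack in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_attack:
            tp += 1
        elif flagged and not is_attack:
            fp += 1
        elif not flagged and is_attack:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), fn / (fn + tp)

# Made-up anomaly scores; label 1 marks a true attack.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

rates = {t: error_rates(scores, labels, t) for t in (0.5, 0.75, 0.95)}
for t, (fpr, fnr) in rates.items():
    print(f"threshold={t}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

At the highest threshold nothing is flagged, so there are no false positives at all, and every attack is missed.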

The Case for Ending Closed Academic Publishing

21 Mar

A few commercial publishers publish a large chunk of top-flight academic research. And earn a pretty penny doing so. The standard operating model of the publishers is as follows: pay the editorial board no more than $70-$100k, pay for typesetting and publishing, and in turn get copyrights to academic papers. And then go on and charge already locked-in institutional customers (university and government libraries) and ordinary scholars extortionate rates. The model is gratuitously dysfunctional.

Assuming there are no long-term contracts with the publishers, the system ought to be rapidly dismantled. But if dismantling is easy, creating something better may not be. It just happens to be. A majority of the cost of publishing is in printing on paper. The twenty-first century has made printing large organized bundles on paper largely obsolete; those who need it can print on paper at home. Beyond that, open source software for administering a journal already exists. And the model of a single editor with veto powers seems anachronistic. Editing duties can be spread around much like peer review. Unpaid peer review can survive as it always has, though better mechanisms can be thought about. If some money is still needed for administration, it could be raised easily by charging a nominal submission tax, waived where the author self-identifies as being unable to pay.

Interpreting Clusters and ‘Outliers’ from Clustering Algorithms

19 Feb

Assume that the data from the dominant data generating process are structured so that they occupy a few small portions of a high-dimensional space. Say we use a hard partition clustering algorithm to learn the structure of the data. And say that it does learn the structure. Anything that lies outside the few narrow pockets of high-dimensional space is an ‘outlier,’ improbable (even impossible) given the dominant data generating process. (These ‘outliers’ may be generated by a small malicious data generating process.) Even points on the fringes of the narrow pockets are suspicious. If so, one reasonable measure of suspiciousness of a point is its distance from the centroid of the cluster to which it is assigned; the further the point from the centroid, the more suspicious it is. (Distance can be some multivariate distance metric, or the proportion of points assigned to the cluster that are further away from the cluster centroid than the point whose score we are tallying.)

Tautological explanations (it is improbable given the dominant data generating process) aside, how can we interpret an outlier (score)?

Simply providing distance to the centroid doesn’t give enough context. And for obvious reasons, for high-dimensional vectors, providing distance on each feature isn’t reasonable either. A better approach involves some feature selection. This can be done in various ways, all of which take the same general form. Find distance to the centroid on features on which the points assigned to the cluster have the least variation. Or, on features that discriminate the cluster from other clusters the best. Or, on features that predict distance from the cluster centroid the best. Limit the features arbitrarily to a small set. On this limited feature set, calculate cluster means and standard deviations, and give standardized distance (for categorical variable, just provide ) to the centroid.
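
A minimal Python sketch of one of the options above, selecting the features on which the cluster varies least. It is not the pseudo code from the linked PDF. The cluster, the new point, the function names, and the choice of two features are all made up for illustration, and only numeric features are handled.

```python
import math
from statistics import mean, pstdev

# Made-up cluster of 3-feature points; features 0 and 2 are tight, feature 1 is loose.
cluster = [
    (1.0, 10.0, 5.0),
    (1.1, 12.0, 5.2),
    (0.9, 8.0, 4.8),
    (1.0, 11.0, 5.0),
    (1.0, 9.0, 5.0),
]
columns = list(zip(*cluster))
centroid = [mean(col) for col in columns]
spread = [pstdev(col) for col in columns]   # within-cluster standard deviations

def distance(p):
    """Euclidean distance from a point to the cluster centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, centroid)))

def share_further(p):
    """Proportion of cluster points that lie further from the centroid than p."""
    d = distance(p)
    return sum(distance(q) > d for q in cluster) / len(cluster)

def explain(p, k=2):
    """Standardized distance to the centroid on the k least-variable features."""
    tight = sorted(range(len(centroid)), key=lambda j: spread[j])[:k]
    return {j: (p[j] - centroid[j]) / spread[j] for j in tight if spread[j] > 0}

new_point = (1.5, 10.0, 5.0)
score = share_further(new_point)     # raw-distance view
explanation = explain(new_point)     # feature index -> standardized distance
```

In this toy case the raw distance is dominated by the loose feature, so the new point does not look unusual overall, while the standardized distance on feature 0 is several standard deviations. That is the kind of context a bare distance to the centroid does not give.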

Read More (pdf with pseudo code)

Sampling (or Enumerating) with Coprimes

1 Jan

Say you want to sample from a sequence of length n. Multiples of a number that is relatively prime to the length of the sequence (n), taken modulo n, cover the entire sequence, and the entire sequence is covered before any number is repeated. This is a known result from number theory. We could use the result to sample sequentially from a series (see below for what I mean).

For instance, if the sequence is 1,2,3,…9, the number 5 is one such number (5 and 9 are coprime). Using multiples of 5, we get:

1 2 3 4 5 6 7 8 9
        X
X         X
  X         X
    X         X
      X         X

If the length of the sequence is odd, then we all know that 2 will do. But not all even numbers will do. For instance, for the same length of 9, if you were to choose 6, it would result in 6, 3, 9, and 6 again, because 6 and 9 share the factor 3.

Some R code:


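seq_length = 6   # length of the sequence, n
rel_prime  = 5   # step size; must be coprime with seq_length
# Multiples of the step, wrapped back into 1..seq_length
multiples  = rel_prime*(1:seq_length)
multiples  = ifelse(multiples > seq_length, multiples %% seq_length, multiples)
# A remainder of 0 corresponds to the last element of the sequence
multiples  = ifelse(multiples == 0, seq_length, multiples)
# Equals seq_length when every position is visited exactly once
length(unique(multiples))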

Where can we use this? It makes passes over an address space less discoverable.

Clarifai(ng) the Future of Clarifai; Some Thoughts

31 Dec

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images, and b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai would have to split these two things over time and allow people to input what abstract dimensions are ‘appropriate’ for them. As the idiom goes, a picture is worth a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of the day etc. And Clarifai should allow people to pick dimensions appropriate for the task. Defining dimensions would be hard, though. But that shouldn’t stymie the efforts. And ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning about whether the photo features a man or a woman. And the user wants to build a Clarifai based classifier. (That person is me. Task is inferring gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all other information in the images, including jewelry, color, perspective, etc. and provide an out of sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting users manage it to their ends. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go into. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow one, for instance, to count the number of women in a photo. Another direction to go would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem like violating the spirit of what I note above, but it needn’t; a user could want just this.) By the same token, Clarifai could build general purpose discrimination engines: a list of tags that distinguishes image(s) the best.

Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say, tags in ‘UK English’ etc.