What is the trend over the last X months? One estimate of the ‘trend’ over the last k time periods is what I call the ‘hold up the ends’ method. Look at t_k and t_0, get the difference between the two, and divide by the number of time periods. If t_k > t_0, you say that things are going up. If t_k < t_0, you say things are going down. And if they are the same, then you say that things are flat. But this method can elide over important non-linearity. For instance, say unemployment went down in the first 9 months and then went up over the last 3 but ended with t_k < t_0. What is the trend? If by trend, we mean average slope over the last t time periods, and if there is no measurement error, then 'hold up the ends' method is reasonable. If there is measurement error, we would want to smooth the time series first before we hold up the ends. Often people care about 'consistency' in the trend. One estimate of consistency is the following: the proportion of times we get a number of the same sign when we do pairwise comparison of any two time consecutive time periods. Often people also care more about later time periods than earlier time periods. And one could build on that intuition by weighting later changes more.

# Category Archives: Statistics

## Targeting 101

Say that there is a company that makes more than one product. The company can run an ad in one of its products about the one or more other products it produces that a user doesn’t use. Should it consider targeting—not showing the same ad to all users? There are six things to consider:

**Opportunity Cost**: Could the company make more profit by showing an ad for something else?**Cost of Showing an Ad to an Additional User**: The cost of serving an ad; it is close to zero in the digital economy.**Cost of Worse Product**: An ad for an irrelevant product lowers the user’s welfare. (The magnitude of the reduction depends on how disruptive the ad is and how irrelevant it is.) As a result of seeing an irrelevant ad in the product, the user likes the product less.**Cost of Not Learning About the Relevant Product Sooner and Investment in Learning About an Irrelevant Product**: the cost of not learning about a product they could use sooner. Plus the investment a user makes in learning about a product that is not relevant to them.**Poisoning the Well**: Showing an irrelevant ad means that people are more likely to skip whatever ad you present next. It reduces your ability to monetize future ads.**Profits**: On the flip side of the ledger are expected profits. What are the expected profits from showing an ad? If you show a user an ad for a relevant product, they may not just buy and use the other product, but may also become less likely to switch from your stack. Further, they may even proselytize your product, netting you more users.

I formalize the problem here (pdf).

## Estimating Bias and Error in Perceptions of Group Composition

People’s reports of perceptions of the share of various groups in the population are typically biased. The bias is generally greater for smaller groups. The bias also appears to vary by how people feel about the group—they are likelier to think that the groups they don’t like are bigger—and by stereotypes about the groups (see here and here).

A new paper makes a remarkable claim: “explicit estimates are not direct reflections of perceptions, but systematic transformations of those perceptions. As a result, surveys and polls that ask participants to estimate demographic proportions cannot be interpreted as direct measures of participants’ (mis)information, since a large portion of apparent error on any particular question will likely reflect rescaling toward a more moderate expected value…”

The claim is supported by a figure that takes the form of plotting a curve over averages. (It also reports results from other papers that base their inferences on similar such figures.)

The evidence doesn’t seem right for the claim. Ideally, we want to plot curves within people and show that the curves are roughly the same. (I doubt it to be the case.)

Second, it is one thing to claim that the reports of perceptions follow a particular rescaling formula, and another to claim that people are aware of what they are doing. I doubt that people are.

Third, if the claim that ‘a large portion of apparent error on any particular question will likely reflect rescaling toward a more moderate expected value’ is true, then presenting people correct information ought not to change how people think about groups, for e.g., perceived threat from immigrants. The calibrated error should be a much better moderator than raw error. Again, I doubt it.

But I could be proven wrong about each. And I am ok with that. The goal is to learn the right thing, not to be proven right.

## Measuring Segregation

Dissimilarity index is a measure of segregation. It runs as follows:

where:

is population of in the *i*th area

is population of in the larger area

from which dissimilarity is being measured against

The measure suffers from a couple of issues:

- Concerns about lumpiness. Even in a small area, are black people at one end, white people at another?
- Choice of baseline. If the larger area (say a state) is 95\% white (Iowa is 91.3% White), dissimilarity is naturally likely to be small.

One way to address the concern about lumpiness is to provide an estimate of the spatial variance of the quantity of interest. But to measure variance, you need local measures of the quantity of interest. One way to arrive at local measures is as follows:

- Create a distance matrix across all addresses. Get latitude and longitude. And start with Euclidean distances, though smart measures that take account of physical features are a natural next step. (For those worried about computing super huge matrices, the good news is that computation can be parallelized.)
- For each address, find
*n*closest addresses and estimate the quantity of interest. Where multiple houses are similar distance apart, sample randomly or include all. One advantage of*n*closest rather than addresses in a particular area is that it naturally accounts for variations in density.

But once you have arrived at the local measure, why just report variance? Why not report means of compelling common-sense metrics, like the proportion of addresses (people) for whom the closest house has people of another race?

As for baseline numbers (generally just a couple of numbers): they are there to help you interpret. They can be brought in later.

## Sample This: Sampling Randomly from the Streets

Say you want to learn about the average number of potholes per unit paved street in a city. To estimate that quantity, the following sampling plan can be employed:

1. Get all the streets in a city from Google Maps or OSM

2. Starting from one end of the street, split each street into .5 km segments till you reach the end of the street. The last segment, or if the street is shorter than .5km, the only segment, can be shorter than .5 km.

3. Get the lat/long of start/end of the segments.

4. Create a database of all the segments: segment_id, street_name, start_lat, start_long, end_lat, end_long

5. Sample from rows of the database

6. Produce a CSV of the sampled segments (subset of step 4)

7. Plot the lat/long on Google Map — filling all the area within the segment.

8. Collect data on the highlighted **segments**.

For Python package that implements this, see https://github.com/soodoku/geo_sampling.

## Clustering and then Classifying: Improving Prediction for Crudely-labeled and Mislabeled Data

Mislabeled and crudely labeled data are common problems in data science. Supervised prediction of such data expectedly yields poor results—coefficients are biased, and accuracy with regards to the true label is poor. One solution to the problem is to hand code labels of an `adequate’ sample and infer true labels based on a model trained on that data.

Another solution relies on the intuition (assumption) that distance between rows (covariates) of a label will be lower than the distance between the distance between rows of different labels. One way to leverage that intuition is to cluster the data within each label, infer true labels from erroneous labels and then predict inferred true labels. For a class of problems, the method can be shown to *always* improve accuracy. (You can also predict just the cluster labels.)

Here’s one potential solution for a scenario where we have a binary dependent variable:

Assume a mis_labeled vector called mis_label that codes some true 0s as 1 and some true 1s as 0s.

- For each mis_label (1 and 0), use k-means with k = 2 to get 2 clusters within each label for a total of 4 labels
- Assuming mislabeling rate < 50%, create a new col. = est_true_label, which takes: 1 when mis_label = 1 and cluster label is of the majority class (that is cluster label class is more than 50% of the mis_label = 1), otherwise 0. 0 when mis_label = 0 and cluster label is of the majority class (that is cluster label class is more than 50% of the mis_label = 0), otherwise 1.
- Predict est_true_label using a logistic regression and produce accuracy estimates based on true_labels and bias estimates in coefficient estimates (compared to coefficients from logistic regression coefficients from true labels)

## The Missing Plot

Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing.

Some analysts simply list-wise delete points with missing values. Others impute, replacing missing values with mean or median. Yet others use sophisticated methods to impute missing values. None of the methods, however, automatically acknowledge that any of the data are missing in the visualizations.

It is important to acknowledge missing data.

One can do it is by providing a tally of how much data are missing on each of the variables in a small table in the graph. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple. Create a dummy vector that tallies missing values. And plot the dummy vector in addition to the data. For instance, see:

(The script to produce the graph can be downloaded from the following GitHub Gist.)

In cases, where missing values are imputed, the dummy vector can also be used to ‘color’ the points that were imputed.

## About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At the first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar such data), however, it may not be the right thing to do. There are two big reasons for that. A low false positive rate can be much more easily achieved if we do not care about the false negative rate. And there are good reasons to worry a lot about false negative rates in cyber data. And second, and perhaps more importantly, incorporating user input on complex tasks (or where data is insufficiently rich) reduces to the following: given a complex task with inadequate time, the users use cheap heuristics to label the data, and supervised aspect of the algorithm reduces to learning cheap heuristics that humans use.

## Interpeting Clusters and ‘Outliers’ from Clustering Algorithms

Assume that the data from the dominant data generating process are structured so that they occupy a few small portions of a high-dimensional space. Say we use a hard partition clustering algorithm to learn the structure of the data. And say that it does—learn the structure. Anything that lies outside the few narrow pockets of high-dimensional space is an ‘outlier,’ improbable (even impossible) given the dominant data generating process. (These ‘outliers’ may be generated by a small malicious data generating processes.) Even points on the fringes of the narrow pockets are suspicious. If so, one reasonable measure of suspiciousness of a point is its distance from the centroid of the cluster to which it is assigned; the further the point from the centroid, the more suspicious it is. (Distance can be some multivariate distance metric, or proportion of points assigned to the cluster that are further away from the cluster centroid than the point whose score we are tallying.)

How can we interpret an outlier (score)? Tautological explanations—it is improbable given the dominant data generating process—aside.

Simply providing distance to the centroid doesn’t give enough context. And for obvious reasons, for high-dimensional vectors, providing distance on each feature isn’t reasonable either. A better approach involves some feature selection. This can be done in various ways, all of which take the same general form. Find distance to the centroid on features on which the points assigned to the cluster have the least variation. Or, on features that discriminate the cluster from other clusters the best. Or, on features that predict distance from the cluster centroid the best. Limit the features arbitrarily to a small set. On this limited feature set, calculate cluster means and standard deviations, and give standardized distance (for categorical variable, just provide ) to the centroid.

## Sampling With Coprimes

Say you want to sample from a sequence of length *n*. Multiples of a number that is *relatively prime* to the length of the sequence (n) cover the entire sequence and have the property that the entire sequence is covered before any number is repeated. This is a known result from number theory. We could use the result to (sequentially) (see below for what I mean) sample from a series.

For instance, if the sequence is 1, 2, 3,…, 9, the number 5 is one such number (5 and 9 are coprime). Using multiples of 5, we get:

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|

X | ||||||||

X | X | |||||||

X | X | |||||||

X | X | |||||||

X | X |

If the length of the sequence is odd, then we all know that 2 will do. But not all even numbers will do. For instance, for the same length of 9, if you were to choose 6, it would result in 6, 3, 9, and 6 again.

Some R code:

```
seq_length = 6
rel_prime = 5
multiples = rel_prime*(1:seq_length)
multiples = ifelse(multiples > seq_length, multiples %% seq_length, multiples)
multiples = ifelse(multiples ==0, seq_length, multiples)
length(unique(multiples))
```

Where can we use this? It makes passes over an address space less discoverable.