## Conscious Uncoupling: Separating Compliance from Treatment

18 Sep

Let’s say that we want to measure the effect of a phone call encouraging people to register to vote. Let’s define compliance as a person taking the call. And let’s assume that the compliance rate is low. The traditional way to estimate the effect of the phone call is via an RCT: randomly split the sample into Treatment and Control, call everyone in the Treatment Group, wait till after the election, and calculate the difference between two groups in the proportion who voted. Assuming that the treatment doesn’t affect non-compliers, etc., we can also estimate the Complier Average Treatment Effect.
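To make the last step concrete, here is a minimal sketch of the arithmetic. It assumes one-sided non-compliance (nobody in Control gets the call) and that the call does nothing for people who don't pick up. The function name and the toy numbers are made up for illustration.

```python
def itt_and_cace(treated_outcomes, control_outcomes, took_call):
    """Return (ITT, CACE) from 0/1 outcome lists.

    treated_outcomes, control_outcomes: 0/1 outcomes for each group.
    took_call: 0/1 compliance flags for the Treatment group.
    Assumes one-sided non-compliance and no effect on non-compliers.
    """
    itt = (sum(treated_outcomes) / len(treated_outcomes)
           - sum(control_outcomes) / len(control_outcomes))
    compliance_rate = sum(took_call) / len(took_call)
    if compliance_rate == 0:
        raise ValueError("nobody complied; CACE is undefined")
    # Under the stated assumptions, the whole ITT comes from compliers.
    return itt, itt / compliance_rate


# Toy data: 20% of Treatment takes the call, 32% vs. 30% vote.
treated = [1] * 32 + [0] * 68
control = [1] * 30 + [0] * 70
took_call = [1] * 20 + [0] * 80
itt, cace = itt_and_cace(treated, control, took_call)
```

With a 20% compliance rate, a 2-point difference between the groups corresponds to a 10-point effect among those who took the call.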

Non-compliance in the example above means: “Buddy, you need to reach these people using another way.” That is a super useful thing to know, but it is an observational point. You can fit a predictive model for who picks up phone calls and who doesn’t. The experiment is useful in answering how much you can persuade the people you reach on the phone. And you can learn that by randomizing conditional on compliance.

For such cases, here’s what we can do:

1. Call a reasonably large random sample of people. Learn a model for who complies.
2. Use it to target people who are likelier to comply, and randomize after a person picks up.

Average Treatment Effect is useful for global rollouts of one policy. But why is that a good counterfactual to learn? Tautologically, when that is all you can do or when it is the optimal thing to do. If we are not in that world, why not learn about—and I am using the example to be concrete—a) what is a good way to reach me, b) what message do you want to show me. For instance, for political campaigns, the optimal strategy is to estimate the cost of reaching people by phone, mail, f2f, etc., estimate the probability of reaching each using each of the media, estimate the payoff for different messages for different kinds of people, and then target using the medium and the message that delivers the greatest benefit. (For a discussion about targeting, see here.)
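As a rough sketch of that last step for a single person: given estimated costs, reach probabilities, and message payoffs, pick the medium and message with the highest expected payoff per dollar. All names and numbers below are hypothetical, and the sketch assumes the payoff of a message does not depend on the medium, which may well be false.

```python
def best_contact_plan(cost, p_reach, payoff):
    """Pick the (medium, message) pair with the highest expected payoff per dollar.

    cost[medium]: dollars per contact attempt.
    p_reach[medium]: estimated probability the attempt reaches the person.
    payoff[message]: estimated gain (e.g., change in turnout probability)
                     if the person is reached with that message.
    """
    options = {
        (medium, message): p_reach[medium] * gain / cost[medium]
        for medium in cost
        for message, gain in payoff.items()
    }
    best = max(options, key=options.get)
    return best, options[best]


# Hypothetical estimates for one person.
cost = {"phone": 1.0, "mail": 0.5, "f2f": 10.0}
p_reach = {"phone": 0.2, "mail": 0.6, "f2f": 0.5}
payoff = {"civic_duty": 0.05, "social_pressure": 0.08}

plan, value_per_dollar = best_contact_plan(cost, p_reach, payoff)
```

Here mail wins despite being less persuasive per contact than a conversation might be, simply because it is cheap and likely to arrive.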

All this is easier said than done. Technically, a message could have the greatest payoff for the person who is least likely to comply. And the optimal strategy could still be to call everyone.

To learn treatment effects among people who are unlikely to comply (using a particular method), you may need to build experiments around compliance to increase compliance so that there is enough power to estimate the effects precisely. But the aim of the enterprise should be to compare these payoffs to those from reaching the person another way. Perhaps the bottom line is that we need to be fixated on an optimization objective, e.g., the additional number of people persuaded to turn out to vote per dollar, rather than the calculation of ATE.

## Prediction Errors: Using ML For Measurement

1 Sep

Say you want to measure how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10% and the False Negative rate (FN) is 7%. Below, we discuss some of the concerns with using scores from such a model and discuss ways to address the issues.

Let’s get some notation out of the way. Let’s say that we have $n$ users and that we can iterate over them using $i$. Let’s denote the total number of unique domains (domains visited by any of the $n$ users at least once during the observation window) by $k$. And let’s use $j$ to iterate over the domains. Let’s denote the number of visits to domain $j$ by user $i$ by $c_{ij} \in \{0, 1, 2, \ldots\}$. And let’s denote the total number of unique domains a person visits ($\sum_{j} \mathbb{1}(c_{ij} \geq 1)$) using $t_i$. Lastly, let’s denote predicted labels about whether or not each domain hosts pornography by $p$, so we have $p_1, \ldots, p_j, \ldots, p_k$.

Let’s start with a simple point. Say there are 5 domains with $p$: $\{1_1, 1_2, 1_3, 1_4, 1_5\}$. Let’s say user one visits the first three sites once and let’s say that user two visits all five sites once. Given 10% of the predictions are false positives, the expected total measurement error in user one’s score is $3 \times 0.10 = 0.3$ and the expected total measurement error in user two’s score is $5 \times 0.10 = 0.5$. The general point is that the total number of false positives increases with the number of predicted 1s. And the total number of false negatives increases with the number of predicted 0s.
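The arithmetic above can be sketched as follows. Like the example, the sketch treats the 10% and 7% figures as the share of predicted 1s and predicted 0s that are wrong (strictly speaking, that is a false discovery rate and a false omission rate rather than FP and FN rates). The function and variable names are hypothetical.

```python
def expected_errors(visited, predicted, share_wrong_1s=0.10, share_wrong_0s=0.07):
    """Expected false positives and false negatives among a user's visited domains.

    visited: iterable of domain names the user visited at least once.
    predicted: dict mapping domain name -> predicted label (1 or 0).
    share_wrong_1s / share_wrong_0s: assumed share of predicted 1s / 0s
        that are mislabeled, as in the example in the text.
    """
    n_pred_1 = sum(1 for d in visited if predicted[d] == 1)
    n_pred_0 = sum(1 for d in visited if predicted[d] == 0)
    return n_pred_1 * share_wrong_1s, n_pred_0 * share_wrong_0s


# Five domains, all predicted to host pornography.
predicted = {f"d{j}": 1 for j in range(1, 6)}
user_one = ["d1", "d2", "d3"]
user_two = ["d1", "d2", "d3", "d4", "d5"]

fp_one, fn_one = expected_errors(user_one, predicted)
fp_two, fn_two = expected_errors(user_two, predicted)
```

The user who visits more predicted 1s carries more expected false positives, so the error is not constant across users.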

## Comparing Ad Targeting Regimes

30 Aug

Ad targeting regimes are essential when we have multiple things to sell (opportunity cost) or when the cost of running an ad is non-trivial or when the cost to user’s welfare of a mistargeted ad is non-trivial or any combination of the above. I am leaving this purposely vague because all this is well known.

Say that we have built two different models for selecting people to whom we should show ads—Model A and Model B. Now say that we want to compare which model is better. And by better, we mean better CTR. How do we compare the models? Some people have run an RCT to compare the efficacy of the two models. We don’t need an RCT. All we need to know for each user is whether or not they have been selected to see an ad under each model. We can have 4 potential scenarios:

| model_a | model_b |
| --- | --- |
| 0 | 0 |
| 1 | 0 |
| 0 | 1 |
| 1 | 1 |

For CTR, 0-0 doesn’t add any information: CTR is a conditional probability, the probability of a click given that the user is shown the ad, and 0-0 users are shown the ad under neither model. To measure which of the models is better, draw a fixed size random sample of users picked by model_a and another random sample of the same size from users picked by model_b and compare CTR. (The same user can be picked twice. It doesn’t matter.)
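A minimal sketch of that comparison, assuming we already know for each sampled user whether they clicked when shown the ad. The user ids, click data, and function name are all made up.

```python
import random


def sampled_ctr(picked_users, clicked, sample_size, rng):
    """CTR in a random sample of users picked by one model.

    picked_users: list of user ids the model selected.
    clicked: dict mapping user id -> 1 if the user clicked when shown the ad, else 0.
    """
    sample = rng.sample(picked_users, sample_size)
    return sum(clicked[u] for u in sample) / sample_size


rng = random.Random(0)

# Hypothetical selections: the two models overlap on users 50-99.
picked_a = list(range(0, 100))
picked_b = list(range(50, 150))
# Hypothetical click outcomes.
clicked = {u: int(u % 5 == 0 and u < 100) for u in range(150)}

# Here the sample covers everyone each model picked, so the result
# does not depend on the seed; in practice the sample would be smaller.
ctr_a = sampled_ctr(picked_a, clicked, 100, rng)
ctr_b = sampled_ctr(picked_b, clicked, 100, rng)
```

Users picked by both models simply show up in both samples, which is why the overlap doesn't matter.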

Now that we know what to do, let’s understand why experiments are wasteful. The heuristic account is as follows: experiments are there to compare ‘similar people.’ When comparing targeting regimes, we are tautologically comparing different people. That is the point of the comparison.

## What’s Relevant? Learning from Organic Growth

26 Aug

Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you don’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.

Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.

There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use the product. Many of the people who try it find that it is not useful or find the price too high and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it, not those who never tried the product. Generating the labels this alternative way (persistent users as 1s, people who tried and abandoned the product as 0s) also allows you to mitigate one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one found it useful and another did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.
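A minimal sketch of the alternative labeling rule, with months as integers. The thresholds (`min_tenure`, `churn_gap`) and the function name are hypothetical; where to draw these lines is a judgment call.

```python
def label_user(first_use_month, last_use_month, current_month,
               min_tenure=6, churn_gap=3):
    """Label a user for learning from organic growth.

    Returns 1 for a persistent adopter, 0 for someone who tried the product
    and abandoned it, and None for users to leave out (never tried, or
    adopted too recently to tell).
    """
    if first_use_month is None:
        return None  # never tried: not a good 0
    months_since_last_use = current_month - last_use_month
    tenure = current_month - first_use_month
    if months_since_last_use > churn_gap:
        return 0  # tried it, didn't persist
    if tenure >= min_tenure:
        return 1  # adopted a while ago and still using it
    return None  # recent adopter: may or may not stick


labels = [
    label_user(None, None, 24),  # never tried
    label_user(10, 24, 24),      # long-time user
    label_user(10, 12, 24),      # tried and abandoned
    label_user(23, 24, 24),      # adopted last month
]
```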

## Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

• From Where, How Much, and Such
  • Provenance: how was each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
  • How frequently are the data updated?
  • Cost per unit of data, e.g., a cell in rectangular data.

  Both the frequency with which data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (person) may start managing data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
• What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
  1. Information about each of the columns in plain language.
  2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where in the sequence of questions the question appeared, whether it was alone on the screen, etc.
  3. Data type
  4. How (if at all) are missing values generated?
  5. For numeric columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical columns, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
  6. The number of duplicates in the data, whether they are allowed, and a reason for why you would see them.
  7. Number of rows and columns
  8. Sampling
  9. For supervised models, store the correlation of y with key x_vars
• What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate against them. Produce the JSON for each update. You can flag when data are some number of s.d. above or below the last ingest.

Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.
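Here is a minimal sketch of what that could look like for one numeric column: summarize each ingest, store the summary as JSON, and flag the next ingest if its mean moves too far. The field names, the two-s.d. threshold, and the choice to compare means are assumptions for illustration, not a prescribed schema.

```python
import json
import statistics


def summarize_column(values):
    """Univariate summary of a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(present),
        "n_0s": sum(1 for v in present if v == 0),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),
    }


def drifted(previous, current, n_sd=2.0):
    """Flag when the current mean is more than n_sd previous s.d. from the previous mean."""
    return abs(current["mean"] - previous["mean"]) > n_sd * previous["sd"]


# Last ingest: summarize and store as JSON alongside the data.
previous = summarize_column([1, 2, 3, 4, 5, None])
stored = json.dumps(previous)

# New ingest: summarize again and validate against the stored summary.
current = summarize_column([11, 12, 13, 14, 15])
flag = drifted(json.loads(stored), current)
```

The same summaries can feed the auto-generated report for each ingest.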

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data).

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be able to.

## The Base ML Model

12 Jul

The days of the artisanal ML model are mostly over. The artisanal model builds off domain “knowledge” (it can often be considerably less than that, bordering on misinformation). The artisan has long discussions with domain experts about what variables to include and how to include them in the model, often making idiosyncratic decisions about both. Or the artisan thinks deeply and draws on his own well. And then applies a couple of methods to the final feature set of 10s of variables, and out pops “the” model. This is borderline farcical when the datasets are both long and wide. For supervised problems, the low cost, scalable, common sense thing to do is to implement the following workflow:

1. Get good univariate summaries of each column in the data: mean, median, min., max, sd, n_missing for numerics, and the number of unique values, n_missing, frequency count for categories, etc. Use this to diagnose and understand the data. What stuff is common? On what variables do we have bad data? (see pysum.)

2. Get good bivariate summaries. Correlations for continuous variables and differences in means for categorical variables are reasonable. Use this to understand how the variables are related. Use this to understand the data.

3. Create a dummy vector for missing values for each variable.

4. Subset on non-sparse columns.

5. Regress on all non-sparse columns, ideally using NN, so that you are not in the business of creating interactions and such.
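Steps 3 and 4 can be sketched in a few lines. This is a bare-bones illustration on a dict of columns rather than a real dataframe. The function names and the definition of sparsity (share of rows that are non-missing and non-zero) are assumptions.

```python
def add_missing_dummies(columns):
    """Add a 0/1 '<name>_missing' column for every column; None marks missing."""
    out = dict(columns)
    for name, values in columns.items():
        out[name + "_missing"] = [int(v is None) for v in values]
    return out


def non_sparse(columns, min_share=0.05):
    """Keep columns where at least min_share of rows are non-missing and non-zero."""
    kept = {}
    for name, values in columns.items():
        filled = sum(1 for v in values if v is not None and v != 0)
        if filled / len(values) >= min_share:
            kept[name] = values
    return kept


# Hypothetical data: one useful column and one that is all zeros.
columns = {"age": [34, None, 51, 29], "rare_flag": [0, 0, 0, 0]}
with_dummies = add_missing_dummies(columns)
kept = non_sparse(with_dummies, min_share=0.25)
kept_names = sorted(kept)
```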

I have glossed over a lot of detail. So let’s take a more concrete example. Say you are predicting whether someone will be diagnosed with diabetes in year y given the claims they make in year y-1, y-2, y-3, etc. Say the claim for each service and medicine has a unique code. Tokenize all the claim data so that each unique code gets its own column, and filter on the non-sparse codes. How much information about time you want to preserve depends on you. But for the first cut, roll up the data so that code X made in any year is treated equally. Voila! You have your baseline model.
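A rough sketch of the tokenize-and-roll-up step, with made-up claim codes and a hypothetical sparsity threshold. Each person's claims from all prior years are pooled, each code becomes a 0/1 column, and codes that too few people have are dropped.

```python
from collections import Counter


def build_features(claims_by_person, min_share=0.05):
    """Turn pooled claim codes into 0/1 columns, keeping only non-sparse codes.

    claims_by_person: dict mapping person id -> list of claim codes from
        all prior years (year information is deliberately thrown away).
    Returns (kept_codes, rows), where rows[person] is a 0/1 list aligned
    with kept_codes.
    """
    n = len(claims_by_person)
    # Count how many people have each code at least once.
    people_with_code = Counter(
        code for codes in claims_by_person.values() for code in set(codes)
    )
    kept_codes = sorted(c for c, k in people_with_code.items() if k / n >= min_share)
    rows = {
        person: [int(code in set(codes)) for code in kept_codes]
        for person, codes in claims_by_person.items()
    }
    return kept_codes, rows


# Hypothetical claims; the codes are placeholders, not real billing codes.
claims = {
    "a": ["code_A", "code_B"],
    "b": ["code_B"],
    "c": ["code_B", "code_C"],
    "d": [],
}
kept_codes, rows = build_features(claims, min_share=0.5)
```

The resulting rows are what you would feed to the model in step 5.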

## Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent,” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11 and Table OA 2.23.

Key points:

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.

## Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations with y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.”

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.

## Talking On a Tangent

22 Jun

What is the trend over the last X months? One estimate of the ‘trend’ over the last k time periods is what I call the ‘hold up the ends’ method. Look at t_k and t_0, get the difference between the two, and divide by the number of time periods. If t_k > t_0, you say that things are going up. If t_k < t_0, you say things are going down. And if they are the same, then you say that things are flat. But this method can gloss over important non-linearity. For instance, say unemployment went down in the first 9 months and then went up over the last 3 but ended with t_k < t_0. What is the trend? If by trend, we mean the average slope over the last k time periods, and if there is no measurement error, then the ‘hold up the ends’ method is reasonable. If there is measurement error, we would want to smooth the time series before we hold up the ends. Often people care about ‘consistency’ in the trend. One estimate of consistency is the following: the proportion of times we get a number of the same sign when we do pairwise comparisons of consecutive time periods. Often people also care more about later time periods than earlier time periods. And one could build on that intuition by weighting later changes more.
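The three ideas can be sketched as follows, using the unemployment example. Two choices here are my assumptions rather than part of the definitions above: consistency is measured as the share of consecutive changes with the same sign as the end-to-end change, and later changes are up-weighted with a geometric decay.

```python
def sign(x):
    return (x > 0) - (x < 0)


def hold_up_the_ends(series):
    """Average slope: last value minus first, divided by the number of periods."""
    return (series[-1] - series[0]) / (len(series) - 1)


def consistency(series):
    """Share of consecutive changes with the same sign as the end-to-end change."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    overall = sign(series[-1] - series[0])
    return sum(1 for d in diffs if sign(d) == overall) / len(diffs)


def weighted_slope(series, decay=0.5):
    """Average change, with the most recent change weighted 1 and each
    earlier change weighted `decay` times the one after it."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    weights = [decay ** (len(diffs) - 1 - i) for i in range(len(diffs))]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


# Hypothetical unemployment rate: down for 9 months, up for the last 3.
unemployment = [6.0, 5.8, 5.6, 5.4, 5.2, 5.0, 4.8, 4.6, 4.4, 4.2, 4.5, 4.8, 5.1]

ends = hold_up_the_ends(unemployment)
share_same_sign = consistency(unemployment)
recent = weighted_slope(unemployment)
```

On this series, holding up the ends says unemployment is falling, three quarters of the monthly changes agree, and the recency-weighted slope says it is rising.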

## Targeting 101

22 Jun

Say that there is a company that makes more than one product. The company can run an ad in one of its products about the one or more other products it produces that a user doesn’t use. Should it consider targeting—not showing the same ad to all users? There are six things to consider:

1. Opportunity Cost: Could the company make more profit by showing an ad for something else?
2. Cost of Showing an Ad to an Additional User: The cost of serving an ad; it is close to zero in the digital economy.
3. Cost of Worse Product: An ad for an irrelevant product lowers the user’s welfare. (The magnitude of the reduction depends on how disruptive the ad is and how irrelevant it is.) As a result of seeing an irrelevant ad in the product, the user likes the product less.
4. Cost of Not Learning About the Relevant Product Sooner and Investment in Learning About an Irrelevant Product: the cost of not learning about a product they could use sooner. Plus the investment a user makes in learning about a product that is not relevant to them.
5. Poisoning the Well: Showing an irrelevant ad means that people are more likely to skip whatever ad you present next. It reduces your ability to monetize future ads.
6. Profits: On the flip side of the ledger are expected profits. What are the expected profits from showing an ad? If you show a user an ad for a relevant product, they may not just buy and use the other product, but may also become less likely to switch from your stack. Further, they may even proselytize your product, netting you more users.

I formalize the problem here (pdf).