Growth Funnels

24 Sep

You spend a ton of time building a product to solve a particular problem. You launch. Then, either the kinds of people whose problem you are solving never arrive or they arrive and then leave. You want to know why because if you know the reason, you can work to address the underlying causes.

Often, however, businesses only have access to observational data. And since we can’t answer the why with observational data, people have swapped the why with the where. Answering the where can give us a window into the why (though only a window). And that can be useful.

The traditional way of posing ‘where’ is called a funnel. Conventionally, funnels start at when the customer arrives on the website. We will forgo convention and start at the top.

There are only three things you should work to optimally do conditional on the product you have built:

  1. Help people discover your product
  2. Effectively convey the relevant value of the product to those who have discovered your product
  3. Help people effectively use the product

p.s. When the product changes, you have to do all three all over again.

One way funnels can potentially help is triage. How big is the ‘leak’ at each ‘stage’? The funnel on the top of the first two steps is: of the people who discovered the product, how many did we successfully communicate the relevant value of the product to? Posing the problem in such a way makes it seem more powerful than it is. To come up with a number, you generally only have noisy behavioral signals. For instance, the proportion of people who visit the site who sign up. But low proportions could be of large denominators—lots of people are visiting the site but the product is not relevant for most of them. (If bringing people to the site costs nothing, there is nothing to do.) Or it could be because you have a super kludgy sign-up process. You could drill down to try to get at such clues, but the number of potential locations for drilling remains large.

That brings us to the next point. Macro-triaging is useful but only for resource allocation kinds of decisions based on some assumptions. It doesn’t provide a way to get concrete answers to real issues. For concrete answers, you need funnels around concrete workflows. For instance, for a referral ‘product’ for AirBnB, one way to define steps (based on Gustaf Alstromer) is as follows:

  1. how many people saw the link asking people to refer
  2. of the people who saw the link, how many clicked on it
  3. of the people who clicked on it, how many invited people
  4. of the people who were invited, how many signed up (as a user, guest, host)
  5. of the people who signed up, how many made the first booking

Such a funnel allows you to optimize the workflow by experimenting with the user interface at each stage. It allows you to analyze what the users were trying to do but failed to do, or took a long time doing. It also allows you to analyze how optimally (from a company’s perspective) are users taking an action. For instance, people may want to invite all their friends but the UI doesn’t have a convenient way to import contacts from email.

Sometimes just writing out the steps in a workflow can be useful. It allows people to think if some steps are actually needed. For instance, is signing-in needed before we allow a user to understand the value of the product?

Conscious Uncoupling: Separating Compliance from Treatment

18 Sep

Let’s say that we want to measure the effect of a phone call encouraging people to register to vote on voting. Let’s define compliance as a person taking the call. And let’s assume that the compliance rate is low. The traditional way to estimate the effect of the phone call is via an RCT: randomly split the sample into Treatment and Control, call everyone in the Treatment Group, wait till after the election, and calculate the difference in the proportion who voted. Assuming that the treatment doesn’t affect non-compliers, etc., we can also estimate the Complier Average Treatment Effect.

But one way to think about non-compliance in the example above is as follows: “Buddy, you need to reach these people using another way.” That is a super useful thing to know, but it is an observational point. You can fit a predictive model for who picks up phone calls and who doesn’t. The experiment is useful in answering how much can you persuade the people you reach on the phone. And you can learn that by randomizing conditional on compliance.

For such cases, here’s what we can do:

  1. Call a reasonably large random sample of people. Learn a model for who complies.
  2. Use it to target people who are likelier to comply and randomize post a person picking up.

More generally, Average Treatment Effect is useful for global rollouts of one policy. But when is that a good counterfactual to learn? Tautologically, when that is all you can do or when it is the optimal thing to do. If we are not in that world, why not learn about—and I am using the example to be concrete—a) what is a good way to reach me, b) what message do you want to show me. For instance, for political campaigns, the optimal strategy is to estimate the cost of reaching people by phone, mail, f2f, etc., estimate the probability of reaching each using each of the media, estimate the payoff for different messages for different kinds of people, and then target using the medium and the message that delivers the greatest benefit. (For a discussion about targeting, see here.)

But technically, a message could have the greatest payoff for the person who is least likely to comply. And the optimal strategy could still be to call everyone. To learn treatment effects among people who are unlikely to comply (using a particular method), you will need to build experiments to increase compliance. More generally, if you are thinking about multi-arm bandits or some such dynamic learning system, the insight is to have treatment arms around both compliance and message. The other general point, implicit in the essay, is that rather than be fixated on calculating ATE, we should be fixated on an optimization objective, e.g., the additional number of people persuaded to turn out to vote per dollar.

Prediction Errors: Using ML For Measurement

1 Sep

Say you want to measure the how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10\% and the False Negative rate (FN) is 7\%. Here below, we discuss some of the concerns with using scores from such a model and discuss ways to address the issues.

Let’s get some notation out of the way. Let’s say that we have n users and that we can iterate over them using i. Let’s denote the total number of unique domains—domains visited by any of the n users at least once during the observation window—by k. And let’s use j to iterate over the domains. Let’s denote the number of visits to domain j by user i by c_{ij} = {0, 1, 2, ....}. And let’s denote the total number of unique domains a person visits (\sum (c_{ij} == 1)) using t_i. Lastly, let’s denote predicted labels about whether or not each domain hosts pornography by p, so we have p_1, ..., p_j, ... , p_k.

Let’s start with a simple point. Say there are 5 domains with p: {1_1, 1_2, 1_3, 1_4, 1_5}. Let’s say user one visits the first three sites once and let’s say that user two visits all five sites once. Given 10\% of the predictions are false positives, the total measurement error in user one’s score = 3 * .10 and the total measurement error in user two’s score = 5 * .10. The general point is that total false positives increase as a function of predicted 1s. And the total number of false negative increase as the number of predicted 0s.

Read more here.

Seven Rules of Technical Management

31 Aug
  1. Assume that everything is being done sub-optimally. Probe everything.
  2. Push people to explain things as simply as possible.
  3. For anything complex, ask people to write things down in plain English.
  4. Get perspective from people with different technical competencies.
  5. Frontload your thinking on a project.
  6. Define the API before moving too far out.
  7. Do less. Do well.

The unspoken golden rule: hire competent people. Other virtues matter but competence generally matters the most in technical roles.

Comparing Ad Targeting Regimes

30 Aug

Ad targeting is often useful when you have multiple things to sell (opportunity cost) or when the cost of running an ad is non-trivial or when an irrelevant ad reduces your ability to reach the user later or any combination of the above. (For a more formal treatment, see here.)

But say that you want proof—you want to estimate the benefit of targeting. How would you do it?

When there is one product to sell, some people have gone about it as follows: randomize to treatment and control, show the ad to a random subset of respondents in the control group and an equal number of respondents picked by a model in the treatment group, and compare the outcomes of the two groups (it reduces to comparing subsets unless there are spillovers). This experiment can be thought off as a way to estimate how to spend a fixed budget optimally. (In this case, the budget is the number of ads you can run.) But if you were interested in finding out whether a budget allocated by a model would be more optimal than say random allocation, you don’t need an experiment (unless there are spillovers). All you need to do is show the ad to a random set of users. For each user, you know whether or not they would have been selected to see an ad by the model. And you can use this information to calculate payoffs for the respondents chosen by the model, and for the randomly selected group.

Let me expand for clarity. Say that you can measure profit from ads using CTR. Say that we have built two different models for selecting people to whom we should show ads—Model A and Model B. Now say that we want to compare which model yields a higher CTR. We can have four potential scenarios for selection of respondents by the model:

model_a, model_b
0, 0
1, 0
0, 1
1, 1

For CTR, 0-0 doesn’t add any information. It is the conditional probability. To measure which of the models is better, draw a fixed size random sample of users picked by model_a and another random sample of the same size from users picked by model_b and compare CTR. (The same user can be picked twice. It doesn’t matter.)

Now that we know what to do, let’s understand why experiments are wasteful. The heuristic account is as follows: experiments are there to compare ‘similar people.’ When estimating allocative efficiency of picking different sets of people, we are tautologically comparing different people. That is the point of the comparison.

All this still leaves the question of how would we measure the benefit of targeting. If you had only one ad to run and wanted to choose between showing an advertisement to everyone versus fewer people, then show the ad to everyone and estimate profits based on the rows selected in the model and profits from showing the ad to everyone. Generally, showing an ad to everyone will win.

If you had multiple ads, you would need to randomize. Assign each person in the treatment group to a targeted ad. In the control group, you could show an ad for a random product. Or you could show an advertisement for any one product that yields the maximum revenue. Pick whichever number is higher as the one to compare against.

What’s Relevant? Learning from Organic Growth

26 Aug

Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you didn’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.

Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.

There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use the product. Many of the people who try it find that it is not useful or find the price too high and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it not those who never tried the product. Generating the labels in such a manner also allows you to mitigate one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one found it useful and another did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.

Sometimes Scientists Spread Misinformation

24 Aug

To err is human. Good scientists are aware of that, painfully so. The model scientist obsessively checks everything twice over and still keeps eyes peeled for loose ends. So it is a shock to learn that some of us are culpable for spreading misinformation.

Ken and I find that articles with serious errors, even articles based on fraudulent data, continue to be approvingly cited—cited without any mention of any concern—long after the problems have been publicized. Using a novel database of over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31% of the citations to retracted articles happen a year after the publication of the retraction notice. And that over 90% of these citations are approving.

What gives our findings particular teeth is the role citations play in science. Many, if not most, claims in a scientific article rely on work done by others. And scientists use citations to back such claims. The readers rely on scientists to note any concerns that impinge on the underlying evidence for the claim. And when scientists cite problematic articles without noting any concerns they very plausibly misinform their readers.

Though 74,000 is a large enough number to be deeply concerning, retractions are relatively infrequent. And that may lead some people to discount these results. Retractions may be infrequent but citations to retracted articles post-retraction are extremely revealing. Retractions are a low-low bar. Retractions are often a result of convincing evidence of serious malpractice, generally fraud or serious error. Anything else, for example, a serious error in data analysis, is usually allowed to self-correct. And if scientists are approvingly citing retracted articles after they have been retracted, it means that they have failed to hurdle the low-low bar. Such failure suggests a broader malaise.

To investigate the broader malaise, Ken and I exploited data from an article published in Nature that notes a statistical error in a series of articles published in prominent journals. Once again, we find that approving citations to erroneous articles persist after the error has been publicized. After the error has been publicized, the rate of citation to erroneous articles is, if anything, higher, and 98% of the citations are approving.

In all, it seems, we are failing.

The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made—how the data were collected, transformed, and analyzed—and ideally, why you chose to make the sausage that particular way. Papers are ok places for describing all this, but we now have better tools: version controlled repositories with notebooks and readme files.

The barrier to understanding is not just lack of information, but also poorly organized information. There are three different arcs of information: cross-sectional (where everything is and how it relates to each other), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro organized (where is the data, where are the scripts, what do each of the scripts do, how do I know what the data mean, etc.), and micro organized (have logic and organization to each script; this also means following good coding style). Temporal organization in version control simply requires you to have meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.

The obvious benefits of this new way are known. But what is less discussed is that this new way allows you to critique specific pull requests and decisions made in certain commits. This provides an entirely new way to make progress in science. The new unit of science also means that we just don’t dole out credits in crude currency like journal articles but we can also provide lower denominations. We can credit each edit, each suggestion. And why not. The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how were each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How Frequently is it updated
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both, the frequency with which data are updated, and the cost per unit of data may change over time. Provenance may change as well: a new team (person) may start managing data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data were collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, and where in a sequence of questions does this lie, was it alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
    6. The number of duplicates in data and if they are allowed and a reason for why you would see them. 
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store correlation of y with key x_vars
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can use this information to validate against. Produce the JSON for each update. You can flag when data are some s.d. above below last ingest.

Store all this metadata with the data. For e.g., you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with what the method I highlight above, you should be.

Pride and Prejudice

14 Jul

It is ‘so obvious’ that policy A >> policy B that only who don’t want to know or who want inferior things would support policy B. Does this conversation remind you of any that you have had? We don’t just have such conversations about policies. We also have them about people. Way too often, we are being too harsh.

We overestimate how much we know. We ‘know know’ that we are right, we ‘know’ that there isn’t enough information in the world that will make us switch to policy B. Often, the arrogance of this belief is lost on us. As Kahneman puts it, we are ‘ignorant of our own ignorance.’ How could it be anything else? Remember the aphorism, “the more you know, the more you know you don’t know”? The aphorism may not be true but it gets the broad point right. The ignorant are overconfident. And we are ignorant. The human condition is such that it doesn’t leave much room for being anything else (see the top of this page).

Here’s one way to judge your ignorance (see here for some other ideas). Start by recounting what you know. Sit in front of a computer and type it up. Go for it. And then add a sentence about how do you know. Do you recall reading any detailed information about this person or issue? From where? Would you have bought a car if you had that much information about a car?

We not just overestimate what we know, we also underestimate what other people know. Anybody with different opinions must know less than I. It couldn’t be that they know more, could it?

Both, being overconfident about what we know and underestimating what other people know leads to the same thing: being too confident about the rightness of our cause and mistaking our prejudices for obvious truths.

George Carlin got it right. “Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?” It seems the way we judge drivers is how we judge everything else. Anyone who knows less than you is either being willfully obtuse or an idiot. And those who know more than you just look like ‘maniacs.’