Say that you are in the search engine business. And say that you have built a model that estimates how relevant an ad is based on the ‘context’: search query, previous few queries, kind of device, location, and such. Now let’s assume that for context X, the rank-ordered list of ads based on expected profit is: product A, product B, and product C. Now say that you want to estimate how effective an ad for product A is in driving the sales of product A. One conventional way to estimate this is to randomly assign during serve time: for context X, serve half the people an ad for product A and serve half the people no ad. But if it is true (and you can verify this) that an ad for product B doesn’t cause people to buy product A, then you can switch the ‘no ad’ control where you are not making any money with an ad for product B. With this, you can estimate the effectiveness of ad for product A while sacrificing the least amount of revenue. Better yet, if it is true that ad for product A doesn’t cause people to buy product B, you can also at the same time get an estimate of the efficacy of ad for product B.
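To make the arithmetic concrete, here is a minimal sketch of the comparison, assuming the two no-cross-effect conditions above hold. The function name and the counts are made up for illustration.

```python
def purchase_rate_lift(buyers_treated, n_treated, buyers_control, n_control):
    """Difference in purchase rates between the arm that saw the ad and the other arm."""
    return buyers_treated / n_treated - buyers_control / n_control

# Hypothetical counts for context X, with users split at random between ad A and ad B.
# Effect of ad A on purchases of product A: the ad-B arm serves as the control.
effect_of_ad_a = purchase_rate_lift(120, 10_000, 80, 10_000)
# Effect of ad B on purchases of product B: the ad-A arm serves as the control.
effect_of_ad_b = purchase_rate_lift(90, 10_000, 60, 10_000)
```

Both estimates come from the same two arms, which is why no impressions need to be spent on a 'no ad' control.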
Broadly, Google Ads works as follows: 1. Advertisers create an ad, choose keywords, and make a bid (on cost-per-click or CPC) (You can bid on cost-per-view and cost-per-impression also, but we limit our discussion to CPC.), 2. the Google Ads account team vets whether the keywords are related to the product being advertised, and 3. people see the ad from the winning bid when they search for a term that includes the keyword or when they browse content that is related to the keyword (some Google Ads are shown on sites that use Google AdSense).
There is a further nuance to the last step. Generally, on popular keywords, Google has thousands of candidate ads to choose from. And Google doesn’t simply choose the ad from the winning bid. Instead, it uses data to choose an ad (or a few ads) that yield the most profit (Click Through Rate (CTR)*bid). (Google probably has a more complex user utility function and doesn’t show ads below a low predicted CTR*bid.) In all, who Google shows ads to depends on the predicted CTR and the money it will make per click.
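A rough sketch of that selection rule, under the simplifying assumption that the score is exactly predicted CTR times bid with a minimum floor. The `Ad` type, the `select_ads` function, and the numbers are hypothetical; the actual system is surely more involved.

```python
from typing import NamedTuple

class Ad(NamedTuple):
    name: str
    bid: float            # advertiser's CPC bid, in dollars
    predicted_ctr: float  # output of a hypothetical CTR model

def select_ads(ads, n_slots=1, min_value=0.0):
    """Return names of the top ads by predicted CTR * bid, dropping ads below the floor."""
    scored = [(ad.predicted_ctr * ad.bid, ad) for ad in ads]
    eligible = [(value, ad) for value, ad in scored if value >= min_value]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [ad.name for _, ad in eligible[:n_slots]]

ads = [
    Ad("high_bid", bid=5.0, predicted_ctr=0.01),   # expected value 0.05
    Ad("relevant", bid=2.0, predicted_ctr=0.04),   # expected value 0.08
    Ad("weak", bid=0.5, predicted_ctr=0.01),       # expected value 0.005
]
winner = select_ads(ads, n_slots=1, min_value=0.01)
```

In this toy example, the ad with the highest bid loses to the ad with the higher expected value per impression.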
Given this setup, we can reason about the audience for an ad. First, the higher the bid, the broader the audience. Second, it is not clear how well Google can predict CTR per ad conditional on keyword bid, especially when the ad run is small. And if that is so, we expect Google to show the ad with the highest bid to a random subset of people searching for the keyword or browsing content related to the keyword. Under such conditions, you can use the total number of impressions per demographic group as an indicator of interest in the keyword. For instance, say you make the highest bid on the keyword ‘election’ and you find that your ad gets 10x as many impressions among people 65+ as among people between ages 18-24. Under some assumptions, e.g., similar use of ad blockers, similar rates of clicking ads conditional on relevance (which would become the same as predicted relevance), similar utility functions (that is, younger people are not more sensitive to irritation from irrelevant ads than older people), etc., you can infer the relative interest of 18-24 versus 65+ in elections.
The other case where you can infer relative interest in a keyword (topic) from impressions is when ad markets are thin. For common keywords like ‘elections,’ Google generally has thousands of candidate ads for national campaigns. But if you only want to show your ad in a small geographic area or an infrequently searched term, the candidate set can be pretty small. If your ad is the only one, then your ad will be shown wherever it exceeds some minimum threshold of predicted CTR*bid. Assuming a high enough bid, you can take the total number of impressions of an ad as a proxy for total searches for the term and how often people browsed related content.
With all of this in mind, I discuss results from a Google Ads campaign. More here.
If the canonical insight of computer science is automating repetition, the canonical insight of data science is optimization. It isn’t that computer scientists haven’t thought about optimization. They have. But computer scientists weren’t the first to think about automation, just like economists weren’t the first to think that incentives matter. Automation is just the canonical, foundational, purpose of computer science.
Similarly, optimization is the canonical, foundational purpose of data science. Data science aims to provide the “optimal” action at time t conditional on what you know. And it aims to do that by learning from data optimally. For instance, if the aim is to separate apples from oranges, the aim of supervised learning is to give the best estimate of whether the fruit is an apple or an orange given data.
For certain kinds of problems, the optimal way to learn from data is not to exploit found data but to learn from new data collected in an optimal way. For instance, randomized inference allows us to compare two arbitrary regimes. And say if you want to optimize persuasiveness, you need to continuously experiment with different pitches (the number of dimensions on which pitches can be generated can be a lot), some of which exploit human frailties (which vary by people) and some that will exploit the fact that people need to be pitched the relevant value and that relevant value differs across people.
Once you know the canonical insight of a discipline, it opens up all the problems that can be “solved” by it. It also tells you what kind of platform you need to build to make optimal decisions for that problem. For some tasks, the “platform” may be supervised learning. For other tasks, like ad persuasiveness, it may be a platform that combines supervised learning (for targeting) and experimentation (for optimizing the pitch).
Probabilities from classification models can have two problems:
- Miscalibration: A p of .9 often doesn’t mean a 90% chance of 1 (assuming a dichotomous y). (You can calibrate it using isotonic regression.)
- Optimal cut-offs: For multi-class classifiers, we do not know what probability value will maximize the accuracy or F1 score. Or any metric for which you need to trade-off between FP and FN.
One way to solve #2 is to run the true labels (out of sample, otherwise there is concern about bias) and probabilities through a brute-force optimizer that gives you the optimal cut-off for the metric. Here’s the script for doing the same along with an illustration.
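The linked script is not reproduced here. As a stand-in, here is a minimal sketch of a brute-force search for a dichotomous y, assuming F1 as the metric and using every observed probability as a candidate cut-off. The function names and the toy data are mine.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, computed from counts of TP, FP, and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_cutoff(y_true, probs, metric=f1_score):
    """Try each observed probability as a cut-off; keep the one with the best metric."""
    best_c, best_score = None, -1.0
    for c in sorted(set(probs)):
        preds = [1 if p >= c else 0 for p in probs]
        score = metric(y_true, preds)
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Toy out-of-sample labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
cutoff, score = best_cutoff(y_true, probs)
```

Any metric that takes true and predicted labels can be passed in place of `f1_score`. For multi-class problems, one option is to run the same search one class versus the rest.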
Say that you train a model to predict who will click on an ad. Say that you deploy the model to only show ads to people who are likely to click on them. (For a discussion about the optimal strategy for who to show ads to, see here.) And say you use the clicks from the people who see the ad to continue to tune the parameters. (This is a close approximation of a standard implementation of online learning in online advertising.)
In effect, once you launch the model, you only get data from a biased set of users. Such a sampling bias can be a problem when the data generating process (how the 1s and the 0s are generated) changes in a way such that changes above the threshold (among the kinds of people who we get data from) are uncorrelated with how it changes below the threshold (among the people who we do not get data from). The concerning aspect is that if this happens, the model continues to “work,” in that the accuracy can continue to be high even as recall (the proportion of people for whom the ad is relevant who are shown the ad) becomes lower over time. There is only one surefire way to diagnose the issue and address it: continue to collect some data from people below the threshold and learn if the data generating process is changing.
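One simple way to keep collecting data below the threshold is to show the ad to a small random share of people the model would otherwise skip. The sketch below assumes a fixed threshold and a fixed exploration rate; both numbers, and the function name, are placeholders.

```python
import random

def should_show_ad(score, threshold=0.5, explore_rate=0.05, rng=random):
    """Decide whether to show the ad and record why.

    Users at or above the threshold always see the ad. Users below it see
    the ad with probability explore_rate, so we keep getting labels from them.
    """
    if score >= threshold:
        return True, "exploit"
    if rng.random() < explore_rate:
        return True, "explore"
    return False, "skip"

rng = random.Random(0)
decision = should_show_ad(0.9, rng=rng)
```

Logging the reason alongside the click lets you compare the click rate in the "explore" group over time and check whether the relationship below the threshold is drifting.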
Let’s say that we want to measure the effect of a phone call encouraging people to register to vote on voting. Let’s define compliance as a person taking the call. And let’s assume that the compliance rate is low. The traditional way to estimate the effect of the phone call is via an RCT: randomly split the sample into Treatment and Control, call everyone in the Treatment Group, wait till after the election, and calculate the difference in the proportion who voted. Assuming that the treatment doesn’t affect non-compliers, etc., we can also estimate the Complier Average Treatment Effect.
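For concreteness, here is a small sketch of the two estimates, assuming that nobody in the control group gets the call and that the call has no effect on people who don't pick up. The counts are invented.

```python
def itt_and_cace(voted_treat, n_treat, voted_control, n_control, n_complied):
    """Intent-to-treat effect and the complier average effect.

    ITT is the raw difference in turnout between the two groups.
    The complier effect is ITT divided by the share of the treatment
    group that took the call.
    """
    if n_complied == 0:
        raise ValueError("No compliers: the complier effect is undefined.")
    itt = voted_treat / n_treat - voted_control / n_control
    compliance_rate = n_complied / n_treat
    return itt, itt / compliance_rate

# Hypothetical experiment: 1,000 per group, 200 people picked up the phone
itt, cace = itt_and_cace(
    voted_treat=520, n_treat=1000,
    voted_control=500, n_control=1000,
    n_complied=200,
)
```

With a 20% compliance rate, a 2 percentage point difference in turnout implies a 10 percentage point effect among the people who took the call.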
But one way to think about non-compliance in the example above is as follows: “Buddy, you need to reach these people using another way.” That is a super useful thing to know, but it is an observational point. You can fit a predictive model for who picks up phone calls and who doesn’t. The experiment is useful in answering how much can you persuade the people you reach on the phone. And you can learn that by randomizing conditional on compliance.
For such cases, here’s what we can do:
- Call a reasonably large random sample of people. Learn a model for who complies.
- Use it to target people who are likelier to comply and randomize post a person picking up.
More generally, Average Treatment Effect is useful for global rollouts of one policy. But when is that a good counterfactual to learn? Tautologically, when that is all you can do or when it is the optimal thing to do. If we are not in that world, why not learn about—and I am using the example to be concrete—a) what is a good way to reach me, b) what message do you want to show me. For instance, for political campaigns, the optimal strategy is to estimate the cost of reaching people by phone, mail, f2f, etc., estimate the probability of reaching each using each of the media, estimate the payoff for different messages for different kinds of people, and then target using the medium and the message that delivers the greatest benefit. (For a discussion about targeting, see here.)
But technically, a message could have the greatest payoff for the person who is least likely to comply. And the optimal strategy could still be to call everyone. To learn treatment effects among people who are unlikely to comply (using a particular method), you will need to build experiments to increase compliance. More generally, if you are thinking about multi-arm bandits or some such dynamic learning system, the insight is to have treatment arms around both compliance and message. The other general point, implicit in the essay, is that rather than be fixated on calculating ATE, we should be fixated on an optimization objective, e.g., the additional number of people persuaded to turn out to vote per dollar.
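As a sketch of what optimizing that objective looks like, the code below picks the medium with the highest expected number of people persuaded per dollar. All the costs, reach probabilities, and effects are made-up inputs that you would need to estimate.

```python
def persuaded_per_dollar(option):
    """Expected additional voters per dollar: P(reach) * effect if reached / cost."""
    return option["p_reach"] * option["effect_if_reached"] / option["cost"]

def best_option(options):
    """Return the option with the highest expected payoff per dollar."""
    return max(options, key=persuaded_per_dollar)

# Hypothetical estimates for one kind of voter
options = [
    {"medium": "phone", "cost": 1.0, "p_reach": 0.2, "effect_if_reached": 0.05},
    {"medium": "mail", "cost": 0.5, "p_reach": 0.9, "effect_if_reached": 0.005},
    {"medium": "f2f", "cost": 10.0, "p_reach": 0.3, "effect_if_reached": 0.08},
]
best = best_option(options)
```

The same comparison can be run separately for each kind of person and each message, which is where estimates of who complies come in.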
Say you want to measure how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10% and the False Negative rate (FN) is 7%. Below, we discuss some of the concerns with using scores from such a model and discuss ways to address the issues.
Let’s get some notation out of the way. Let’s say that we have $n$ users and that we can iterate over them using $i$. Let’s denote the total number of unique domains (domains visited by any of the users at least once during the observation window) by $k$. And let’s use $j$ to iterate over the domains. Let’s denote the number of visits to domain $j$ by user $i$ by $c_{ij}$. And let’s denote the total number of unique domains a person visits ($\sum_{j} \mathbb{1}[c_{ij} > 0]$) using $t_i$. Lastly, let’s denote predicted labels about whether or not each domain hosts pornography by $p$, so we have $p_1, \ldots, p_j, \ldots, p_k$.
Let’s start with a simple point. Say there are 5 domains with $p$: $\{1, 1, 1, 1, 1\}$. Let’s say user one visits the first three sites once and let’s say that user two visits all five sites once. Given 10% of the predictions are false positives, the total measurement error in user one’s score $= 3 \times 0.10 = 0.3$ and the total measurement error in user two’s score $= 5 \times 0.10 = 0.5$. The general point is that total false positives increase as a function of predicted $1$s. And the total number of false negatives increases with the number of predicted $0$s.
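The same arithmetic as a short sketch, assuming all five domains are predicted to host pornography and each visit to a predicted-positive domain carries a 10% chance of being a false positive. The function name and the inputs are illustrative.

```python
def expected_false_positives(visits, predicted, fp_share=0.10):
    """Expected false positives in a user's score.

    visits: number of visits by the user to each domain
    predicted: predicted label (1 or 0) for each domain
    fp_share: assumed share of positive predictions that are wrong
    """
    positive_visits = sum(v for v, p in zip(visits, predicted) if p == 1)
    return fp_share * positive_visits

predicted = [1, 1, 1, 1, 1]        # assumed labels for the five domains
user_one_visits = [1, 1, 1, 0, 0]  # visits the first three sites once
user_two_visits = [1, 1, 1, 1, 1]  # visits all five sites once

error_one = expected_false_positives(user_one_visits, predicted)
error_two = expected_false_positives(user_two_visits, predicted)
```

The expected error scales with the number of visits to predicted-positive domains, so heavier users carry more measurement error in absolute terms.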
Read more here.
Ad targeting is often useful when you have multiple things to sell (opportunity cost) or when the cost of running an ad is non-trivial or when an irrelevant ad reduces your ability to reach the user later or any combination of the above. (For a more formal treatment, see here.)
But say that you want proof—you want to estimate the benefit of targeting. How would you do it?
When there is one product to sell, some people have gone about it as follows: randomize to treatment and control, show the ad to a random subset of respondents in the control group and an equal number of respondents picked by a model in the treatment group, and compare the outcomes of the two groups (it reduces to comparing subsets unless there are spillovers). This experiment can be thought of as a way to estimate how to spend a fixed budget optimally. (In this case, the budget is the number of ads you can run.) But if you were interested in finding out whether a budget allocated by a model would be more optimal than say random allocation, you don’t need an experiment (unless there are spillovers). All you need to do is show the ad to a random set of users. For each user, you know whether or not they would have been selected to see an ad by the model. And you can use this information to calculate payoffs for the respondents chosen by the model, and for the randomly selected group.
Let me expand for clarity. Say that you can measure profit from ads using CTR. Say that we have built two different models for selecting people to whom we should show ads: Model A and Model B. Now say that we want to compare which model yields a higher CTR. Writing whether model_a and model_b pick a user (1) or not (0), we can have four potential scenarios for selection of respondents by the models: 0-0, 0-1, 1-0, and 1-1.
For CTR, 0-0 doesn’t add any information. It is the conditional probability. To measure which of the models is better, draw a fixed size random sample of users picked by model_a and another random sample of the same size from users picked by model_b and compare CTR. (The same user can be picked twice. It doesn’t matter.)
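Here is a minimal sketch of that comparison. It assumes the ad was shown to a random set of users and that, for each user, we recorded whether each model would have picked them and whether they clicked. The field names and the toy data are made up.

```python
import random

def ctr_of_sample(users, model_key, sample_size, rng):
    """CTR in a random sample of the users that a given model would have picked."""
    picked = [u for u in users if u[model_key] == 1]
    sample = rng.sample(picked, sample_size)
    return sum(u["clicked"] for u in sample) / sample_size

# Toy log from showing the ad to a random set of users
users = [
    {"model_a": 1, "model_b": 0, "clicked": 1},
    {"model_a": 1, "model_b": 1, "clicked": 1},
    {"model_a": 1, "model_b": 0, "clicked": 0},
    {"model_a": 0, "model_b": 1, "clicked": 0},
    {"model_a": 0, "model_b": 1, "clicked": 0},
    {"model_a": 0, "model_b": 0, "clicked": 0},
]

rng = random.Random(1)
ctr_a = ctr_of_sample(users, "model_a", sample_size=3, rng=rng)
ctr_b = ctr_of_sample(users, "model_b", sample_size=3, rng=rng)
```

Users picked by neither model never enter either sample, and a user picked by both can show up in both.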
Now that we know what to do, let’s understand why experiments are wasteful. The heuristic account is as follows: experiments are there to compare ‘similar people.’ When estimating allocative efficiency of picking different sets of people, we are tautologically comparing different people. That is the point of the comparison.
All this still leaves the question of how would we measure the benefit of targeting. If you had only one ad to run and wanted to choose between showing an advertisement to everyone versus fewer people, then show the ad to everyone and estimate profits based on the rows selected in the model and profits from showing the ad to everyone. Generally, showing an ad to everyone will win.
If you had multiple ads, you would need to randomize. Assign each person in the treatment group to a targeted ad. In the control group, you could show an ad for a random product. Or you could show an advertisement for any one product that yields the maximum revenue. Pick whichever number is higher as the one to compare against.
Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you didn’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.
Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.
There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use the product. Many of the people who try it find that it is not useful or find the price too high and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it, not those who never tried the product. Generating the labels this way (persistent users as 1s, people who tried the product and dropped it as 0s) also allows you to mitigate one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one found it useful and another did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.
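A sketch of the alternative labeling rule. The 90-day and 30-day windows are arbitrary placeholders, and excluding very recent adopters (because we can't yet tell if they will persist) is my assumption.

```python
def make_label(first_use_days_ago, last_use_days_ago,
               min_tenure_days=90, active_within_days=30):
    """Label a user based on whether they persisted with the product.

    Returns 1 for people who adopted a while ago and are still using it,
    0 for people who tried it and stopped, and None for people to drop
    (never tried it, or adopted too recently to judge).
    """
    if first_use_days_ago is None:
        return None  # never tried the product: not a useful 0
    if first_use_days_ago < min_tenure_days:
        return None  # adopted too recently to know if they will persist
    if last_use_days_ago <= active_within_days:
        return 1     # adopted some time ago and still using it
    return 0         # tried it but did not persist

labels = [
    make_label(200, 5),       # long-time, active user
    make_label(200, 150),     # tried and abandoned
    make_label(None, None),   # never tried
    make_label(20, 2),        # adopted last month
]
```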
What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):
- From Where, How Much, and Such
  - Provenance: how was each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
  - How frequently are the data updated?
  - Cost per unit of data, e.g., a cell in rectangular data.
- What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
  - Information about each of the columns in plain language.
  - How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, and where in a sequence of questions does this lie, was it alone on the screen, etc.
  - Data type
  - How (if at all) are missing values generated?
  - For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
  - The number of duplicates in data and if they are allowed and a reason for why you would see them.
  - Number of rows and columns
  - For supervised models, store correlation of y with key x_vars
- What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?
Store these data in JSON so that you can use this information to validate against. Produce the JSON for each update. You can flag when data are some s.d. above or below the last ingest.
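A minimal sketch of what that could look like for one integer column, using only the standard library. The column name, the two-s.d. rule, and the function names are placeholders.

```python
import json
import statistics

def summarize(column):
    """Summary statistics for an integer column where None marks a missing value."""
    present = [x for x in column if x is not None]
    return {
        "n": len(column),
        "n_missing": len(column) - len(present),
        "n_0s": sum(1 for x in present if x == 0),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),
    }

def drifted(previous, current, n_sd=2.0):
    """Flag if the new mean is more than n_sd previous s.d.s from the previous mean."""
    return abs(current["mean"] - previous["mean"]) > n_sd * previous["sd"]

last_ingest = summarize([1, 2, 3, 4, 5, None, 0])
new_ingest = summarize([11, 12, 13, 14, 15, None, 10])

# Store the summary for this ingest as JSON, keyed by a hypothetical column name
report = json.dumps({"age": new_ingest}, indent=2)
flag = drifted(last_ingest, new_ingest)
```

Keeping one such JSON per ingest also gives you something to diff across updates.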
Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.
Auto-generate reports in markdown with each ingest.
In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data).
For any derived data, you need the scripts and the logic, ideally in
Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.
Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be.