Say that you are in the search engine business. And say that you have built a model that estimates how relevant an ad is based on the ‘context’: search query, previous few queries, kind of device, location, and such. Now let’s assume that for context X, the rank-ordered list of ads based on expected profit is: product A, product B, and product C. Now say that you want to estimate how effective an ad for product A is in driving the sales of product A. One conventional way to estimate this is to randomly assign during serve time: for context X, serve half the people an ad for product A and serve half the people no ad. But if it is true (and you can verify this) that an ad for product B doesn’t cause people to buy product A, then you can switch the ‘no ad’ control where you are not making any money with an ad for product B. With this, you can estimate the effectiveness of ad for product A while sacrificing the least amount of revenue. Better yet, if it is true that ad for product A doesn’t cause people to buy product B, you can also at the same time get an estimate of the efficacy of ad for product B.
What is the benefit of targeting? Why (and when) do we need experiments to estimate the benefits of targeting? And what is the right baseline to compare against?
I start with a business casual explanation, using examples to illustrate some of the issues at hand. Later in the note, I present a formal explanation to precisely describe the assumptions to clarify under what conditions targeting may be a reasonable thing to do.
Say that you have some TVs to sell. And say that you could show an ad about the TVs to everyone in the city for free. Your goal is to sell as many TVs as possible. Does it make sense for you to build a model to pick out people who would be especially likely to buy the TV and only show an ad to them? No, it doesn’t. Unless ads make people less likely to purchase TVs, you are always better-off reaching out to everyone.
You are wise. You use common sense to sell more TVs than the guy who spent a bunch of money building the model and selling less. You make tons of money. And you use the money to buy Honda and Mercedes dealerships. You still retain the magical power of being able to show ads to everyone for free. Your goal is to maximize profits. And selling Mercedes nets you more profit than Hondas. Should you use a model to show some people ads about Toyota and other people ads about Honda? The answer is still no. Under likely to hold assumptions, the optimal strategy is to show an ad for Mercedes first and then an ad for Toyota. (You can show the Toyota ad first if people who want to buy Mercedes won’t buy a cheaper car if they see an ad for a cheaper car first.)
But what if you are limited to only one ad? What would you do? In that case, a model may make sense. Let’s see how things may look with some fake data. Let’s compare the outcomes of four strategies: two model-based targeting strategies and two target-everyone with one ad strategies. To make things easier, let’s assume that selling Mercedes nets ten units of profits and selling Honda nets five units of profit. Let’s also assume that people will only buy something if they see an ad for their preferred product.
Continue reading here (pdf).
Nearly four years ago, I released autosum. Autosum exploits work by other scientists to harvest key points from (and key concerns with) a paper. The software grabs the sentence before or after the citation to build that knowledge. The output is pretty useful. See for yourself. But you could do one better by using it as a label for supervised text summarization tasks. You could learn the BERT embeddings and then use them to predict key phrases (or more).
Broadly, Google Ads works as follows: 1. Advertisers create an ad, choose keywords, and make a bid (on cost-per-click or CPC) (You can bid on cost-per-view and cost-per-impression also, but we limit our discussion to CPC.), 2. the Google Ads account team vets whether the keywords are related to the product being advertised, and 3. people see the ad from the winning bid when they search for a term that includes the keyword or when they browse content that is related to the keyword (some Google Ads are shown on sites that use Google AdSense).
There is a further nuance to the last step. Generally, on popular keywords, Google has thousands of candidate ads to choose from. And Google doesn’t simply choose the ad from the winning bid. Instead, it uses data to choose an ad (or a few ads) that yield the most profit (Click Through Rate (CTR)*bid). (Google probably has a more complex user utility function and doesn’t show ads below a low predicted
CTR*bid.) In all, who Google shows ads to depends on the predicted CTR and the money it will make per click.
Given this setup, we can reason about the audience for an ad. First, the higher the bid, the broader the audience. Second, it is not clear how well Google can predict CTR per ad conditional on keyword bid especially when the ad run is small. And if that is so, we expect Google to show the ad with the highest bid to a random subset of people searching for the keyword or browsing content related to the keyword. Under such conditions, you can use the total number of impressions per demographic group as an indicator of interest in the keyword. For instance, if you make the highest bid on the keyword ‘election’ and you find that total number of impressions that your ad makes among people 65+ are 10x more than people between ages 18-24, under some assumptions, e.g., similar use of ad blockers, similar rates of clicking ads conditional on relevance (which would become same as predicted relevance), similar utility functions (that is younger people are not more sensitive to irritation from irrelevant ads than older people), etc., you can infer relative interest of 18-24 versus 65+ in elections.
The other case where you can infer relative interest in a keyword (topic) from impressions is when ad markets are thin. For common keywords like ‘elections,’ Google generally has thousands of candidate ads for national campaigns. But if you only want to show your ad in a small geographic area or an infrequently searched term, the candidate set can be pretty small. If your ad is the only one, then your ad will be shown wherever it exceeds some minimum threshold of predicted CTR*bid. Assuming a high enough bid, you can take the total number of impressions of an ad as a proxy for total searches for the term and how often people browsed related content.
With all of this in mind, I discuss results from a Google Ads campaign. More here.
Foreknowledge of bad things is useful because it gives us an opportunity to a. prevent it, and b. plan for it.
Let’s refine our intuitions with a couple of concrete examples.
Many companies work super hard to predict customer ‘churn’—which customer is not going to use a product over a specific period (which can be the entire lifetime). If you know who is going to churn in advance, you can: a. work to prevent it, b. make better investment decisions based on expected cash flow, and c. make better resource allocation decisions.
Users “churn” because they don’t think the product is worth the price, which may be because a) they haven’t figured out a way to use the product optimally, b) a better product has come on the horizon, or c) their circumstances have changed. You can deal with this by sweetening the deal. You can prevent users from abandoning your product by offering them discounts. (It is useful to experiment to learn about the precise demand elasticity at various predicted levels of churn.) You can also give discounts is the form of offering some premium features free. Among people who don’t use the product much, you can run campaigns to help people use the product more effectively.
If you can predict cash-flow, you can optimally trade-off risk so that you always have cash at hand to pay your obligations. Churn can also help you with resource allocation. It can mean that you need to temporarily hire more customer success managers. Or it can mean that you need to lay off some people.
The second example is from patient care. If you could predict reasonably that someone will be seriously sick in a year’s time (and you can in many cases), you can use it to prioritize patient care, and again plan investment (if you were an insurance company) and resources (if you were a health services company).
Lastly, as is obvious, the earlier you can learn, the better you can plan. But generally, you need to trade-off between noise in prediction and headstart—things further away are harder to predict. The noise-headstart trade-off is something that should be done thoughtfully and amended based on data.
Samantha Laine Perfas of the Christian Science Monitor interviewed me about the gap between perceptions and reality for her podcast ‘perception gaps’ over a month ago. You can listen to the episode here (Episode 2).
The Monitor has also made the transcript of the podcast available here. Some excerpts:
“Differences need not be, and we don’t expect them to be, reasons why people dislike each other. We are all different from each other, right. …. Each person is unique, but we somehow seem to make a big fuss about certain differences and make less of a fuss about certain other differences.”
One way to fix it:
If you know so little and assume so much, … the answer is [to] simply stop doing that. Learn a little bit, assume a little less, and see where the conversation goes.
The interview is based on the following research:
- Partisan Composition (pdf) and Measuring Shares of Partisan Composition (pdf)
- Affect Not Ideology (pdf)
- Coming to Dislike (pdf)
- All in the Eye of the Beholder (pdf)
Related blog posts and think pieces:
We all overestimate how much we know. If the aphorism, “the more you know, the more you know that you don’t know” is true, then how else could it be? But knowing more is not the only path to learning about our ignorance. Mistakes are another. When we make mistakes, we get to adjust our parameters (understanding) about how much we know. Overconfident people, however, incur smaller losses when they make mistakes. They don’t learn as much from mistakes because they externalize the source of errors or don’t acknowledge the mistakes, believing it is you who is wrong, not them. So, the most ignorant (the most confident) very likely make the least progress in learning about their ignorance when they make mistakes. (Ignorance is just one source of why people overestimate how much they know. There are many other factors, including personality.) But if you know this, you can fix it.
If the canonical insight of computer science is automating repetition, the canonical insight of data science is optimization. It isn’t that computer scientists haven’t thought about optimization. They have. But computer scientists weren’t the first to think about automation, just like economists weren’t the first to think that incentives matter. Automation is just the canonical, foundational, purpose of computer science.
Similarly, optimization is the canonical, foundational purpose of data science. Data science aims to provide the “optimal” action at time t conditional on what you know. And it aims to do that by learning from data optimally. For instance, if the aim is to separate apples from oranges, the aim of supervised learning is to give the best estimate of whether the fruit is an apple or an orange given data.
For certain kinds of problems, the optimal way to learn from data is not to exploit found data but to learn from new data collected in an optimal way. For instance, randomized inference also us to compare two arbitrary regimes. And say if you want to optimize persuasiveness, you need to continuously experiment with different pitches (the number of dimensions on which pitches can be generated can be a lot), some of which exploit human frailties (which vary by people) and some that will exploit the fact that people need to be pitched the relevant value and that relevant value differs across people.
Once you know the canonical insight of a discipline, it opens up all the problems that can be “solved” by it. It also tells you what kind of platform you need to build to make optimal decisions for that problem. For some tasks, the “platform” may be supervised learning. For other tasks, like ad persuasiveness, it may be a platform that combines supervised learning (for targeting) and experimentation (for optimizing the pitch).
As the options have grown, so have the fears. Are the politically disinterested taking advantage of the nearly limitless options to opt out of news entirely? Are the politically interested siloing themselves into “echo chambers”? In an eponymous Oxford Research Encylopedia article, I discuss what we think we know, and some concerns about how we can know. Some key points:
Is the gap between how much the politically interested and politically disinterested know about politics increasing, as Post-broadcast Democracy posits? Figure 1 suggests not.
Quantity rather than ratio: “If the dependent variable is partisan affect, how ‘selective’ one is may not matter as much as the net imbalance in consumption—the difference between the number of congenial and uncongenial bits consumed…”
To measure how much political information a person is consuming, you must be able to distinguish political information from its complement. But what isn’t political information? “In this chapter, our focus is on consumption of varieties of political information. The genus is political information. And the species of this genus differ in congeniality, among other things. But what is political information? All information that influences people’s political attitudes or behaviors? If so, then limiting ourselves to news is likely too constraining. Popular television shows like The Handmaid’s Tale, Narcos, and Law and Order have clear political themes. … Shows like Will and Grace and The Cosby Show may be less clearly political, but they also have a political subtext.” (see Figure 4) … “Even if we limit ourselves to news, the domain is still not clear. Is news about a bank robbery relevant political information? What about Hillary Clinton’s haircut? To the extent that each of these affect people’s attitudes, they are arguably pertinent. “
One of the challenges with inferring consumption based on domain level data is that domain level data are crude. Going to http://nytimes.com is not the same as reading political news. And measurement error may vary by the kind of person. For instance, say we label http://nytimes.com as political news. For the political junkie, the measurement error may be close to zero. For teetotalers, it may be close to 100% (see more).
Show people a few news headlines along with the news source (you can randomize the source). What can you learn from a few such ‘trials’? You cannot learn what proportion of news they get from a particular source. you can learn the preferences, but not reliably. More from the paper: “Given the problems with self-reports, survey instruments that rely on behavioral measures are plausibly better. … We coded congeniality trichotomously: congenial, neutral, or uncongenial. The correlations between trials are alarmingly low. The polychoric correlation between any two trials range between .06 to .20. And the correlation between choosing political news in any two trials is between -.01 and .05.”
Following up on the previous point: preference for a source which has a mean slant != preference for slanted news. “Current measures of [selective exposure] are beset with five broad problems. First is conceptual errors. For instance, people frequently equate preference for information from partisan sources with a preference for congenial information.”
Probabilities from classification models can have two problems:
- Miscalibration: A p of .9 often doesn’t mean a 90% chance of 1 (assuming a dichotomous y). (You can calibrate it using isotonic regression.)
- Optimal cut-offs: For multi-class classifiers, we do not know what probability value will maximize the accuracy or F1 score. Or any metric for which you need to trade-off between FP and FN.
One way to solve #2 is to run the true labels (out of sample, otherwise there is concern about bias) and probabilities through a brute-force optimizer and gives you the optimal cut-off for the metric. Here’s the script for doing the same along with an illustration.