Citing Working Papers

2 Apr

Public versions of working papers are increasingly the norm. So are citations to them. But there are three concerns with citing working papers:

  1. Peer review: Peer review improves the quality of papers, but often enough it doesn’t catch serious, basic issues. Thus, a lack of peer review is not as serious a problem as is often claimed.
  2. Versioning: Which version did you cite? Often, there is no canonical versioning system. The best we have is tracking which conference the paper was presented at. This is not good enough.
  3. Availability: Can I check the paper, code, and data for a version? Often enough, the answer is no.

The solution to the latter two is to increase transparency through the entire pipeline. For instance, people can check how my paper with Ken has evolved on Github, including any coding errors that have been fixed between versions. (Admittedly, the commit messages can be improved. Better commit messages—plus descriptions—can make it easier to track changes across versions.)

The first point doesn’t quite deserve addressing, in that the current system already draws an overly optimistic line on the quality of published papers. Peer review ought not to end when a paper is published in a journal. If we accept that, then all concerns flagged by peers and non-peers can be addressed in various commits or responses to issues and appropriately credited.

A/B Testing Recommendation Systems

1 Apr

Say that you are building a news recommender that decides which news items to list in each person’s news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news, how many people in total have read the news, the freshness of the news, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.
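To make the two versions concrete, here is a minimal sketch with made-up signals and click labels (not the system described above): version one sums the signals with hand-picked weights; version two learns the weights from clicks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake signals for 1,000 news items: views in the user's network, total views, freshness.
X = np.column_stack([
    rng.poisson(5, 1000),
    rng.poisson(100, 1000),
    rng.uniform(0, 1, 1000),
])
clicked = rng.binomial(1, 0.1 + 0.3 * X[:, 2])  # fake click labels

# Version 1: rules-based ranker with arbitrary, hand-picked weights.
v1_scores = 1.0 * X[:, 0] + 0.01 * X[:, 1] + 2.0 * X[:, 2]

# Version 2: supervised ranker that learns the weights from clicks.
v2_scores = LogisticRegression().fit(X, clicked).predict_proba(X)[:, 1]

# Rank items under each version; the two orderings will generally differ.
v1_rank, v2_rank = np.argsort(-v1_scores), np.argsort(-v2_scores)
```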

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and the s.e. because:

  1. The signals on which your control group ranking system is based are influenced by the kinds of news articles that people in the treatment group see. And vice versa.
  2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve it, show people in the control group news articles ranked on predicted views from historical data, or on the pro-rated views of people assigned to the control group alone. (This adds a bit of noise to the control group estimates.) And keep a separate table of input data for the treatment group and apply the ML model to the pro-rated data from that table.
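Here is a minimal sketch of that kind of per-condition bookkeeping, assuming a single view-count signal (the names and numbers are mine): each condition's counts come only from its own users and are pro-rated up to the full population.

```python
from collections import defaultdict

views = {"control": defaultdict(int), "treatment": defaultdict(int)}
share = {"control": 0.5, "treatment": 0.5}  # share of users assigned to each arm

def record_view(arm, item_id):
    """Log a view against the table of the arm the viewer belongs to."""
    views[arm][item_id] += 1

def prorated_views(arm, item_id):
    """Scale the arm's own counts up to a full-population estimate."""
    return views[arm][item_id] / share[arm]

record_view("control", "story-42")
record_view("treatment", "story-42")
record_view("treatment", "story-7")

# Each ranker consumes only its own pro-rated table, so the treatment ranking
# no longer leaks into the control group's signals, and vice versa.
print(prorated_views("control", "story-42"))   # 2.0
print(prorated_views("treatment", "story-7"))  # 2.0
```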

The consequence of the second issue is that the true s.e. is very plausibly much larger than what we will get from split-world testing (each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because it interacts with social influence. To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect, for instance, by estimating the effect of ranking the least downloaded item first.

p.s. The social influence worlds paper (Salganik/Watts) doesn’t report s.e., but this paper, which builds on it, reports incorrect ones, as it implicitly assumes that the sequence of arrival doesn’t matter.

Advice that works

31 Mar

Writing habits of some writers:

“Early in the morning. A good writing day starts at 4 AM. By 11 AM the rest of the world is fully awake and so the day goes downhill from there.”

Daniel Gilbert

“Usually, by the time my kids get off to school and I get the dogs walked, I finally sit down at my desk around 9:00. I try to check my email, take care of business-related things, and then turn it off by 10:30—I have to turn off my email to get any writing done.”

Juli Berwald

“When it comes to writing, my production function is to write every day. Sundays, absolutely. Christmas, too. Whatever. A few days a year I am tied up in meetings all day and that is a kind of torture. Write even when you have nothing to say, because that is every day.”

Tyler Cowen

“I don’t write everyday. Probably 1-2 times per week.”

Benjamin Hardy

“I’ve taught myself to write anywhere. Sometimes I find myself juggling two things at a time and I can’t be too precious with a routine. I wrote Name of the Devil sitting on a bed in a rented out room in Hollywood while I was working on a television series for A&E. My latest book, Murder Theory, was written while I was in production for a shark documentary and doing rebreather training in Catalina. I’ve written in casinos, waiting in line at Disneyland, basically wherever I have to.”

Andrew Mayne

Should we wake up at 4 am and be done by 11 am as Dan Gilbert does or should we get started at 10:30 am like Juli, near the time Dan is getting done for the day? Should we write every day like Tyler or should we do it once or twice a week like Benjamin? Or like Andrew, should we just work on teaching ourselves to “write anywhere”?

There is a certain tautological aspect to good advice. It is advice that works for you. Do what works for you. But don’t assume that you have been given advice that is right for you or that it is the only piece of advice on that topic. Advice givers rarely point out that the complete set of reasonable things that could work for you is often pretty large and contradictory and that the evidence behind the advice they are giving you is no more than anecdotal evidence with a dash of motivated reasoning.

None of this is to say that you should not try hard to follow advice that you think is good. But once you see the larger point, you won’t fret as much when you can’t follow a piece of advice or when the advice doesn’t work for you. As long as you keep trying to get to where you want to be (and, of course, even the merit of some wished-for end states is debatable), it is ok to abandon some paths, safe in the knowledge that there are generally more paths to get there.

Stemming Link Rot

23 Mar

The Internet gives many things. But none that are permanent. That is about to change. Librarians got together and recently launched https://perma.cc/ which provides a permanent link to stuff.

Why is link rot important?

Here’s an excerpt from a paper by Gertler and Bullock:

“more than one-fourth of links published in the APSR in 2013 were broken by the end of 2014”

If what you are citing evaporates, there is no way to check the veracity of the claim. Journal editors: pay attention!

countpy: Incentivizing more and better software

22 Mar

Developers of Python packages sometimes envy R developers for the simple perks they enjoy. For example, a reliable web service that gives a reasonable indication of the total number of times an R package has been downloaded (albeit only from one of the mirrors). To get the same number, Python developers need to run a Google BigQuery query (which costs money) and wait 30 or so seconds.

Then there are sore spots that are shared by all developers. Downloads are a shallow metric. Developers often want to know how often people use their packages. Without such a number, it is hard to defend against accusations like, “the total number of downloads is unreliable because it can be padded by numerous small releases,” “the total number of downloads doesn’t reflect how often people use the software,” etc. We partly solve this problem for Python developers by providing a website that tallies how often a package is used in repositories on GitHub, the largest open-source software hosting platform. http://countpy.com (Defunct. Code.) provides the total number of times a package appears in requirements files and in import statements in Python-language repositories.
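For a flavor of the kind of tally involved (this is a rough sketch, not countpy’s actual code), here is how you might count a package’s appearances in one repository’s requirements files and import statements:

```python
import re
from pathlib import Path

def count_usage(repo_dir: str, package: str) -> dict:
    """Count appearances of `package` in requirements files and import statements."""
    repo = Path(repo_dir)
    counts = {"requirements": 0, "imports": 0}

    # Requirement lines like "requests" or "requests>=2.0".
    req_pattern = re.compile(rf"^\s*{re.escape(package)}\s*([<>=!~;\[].*)?$", re.I)
    for req in repo.rglob("requirements*.txt"):
        for line in req.read_text(errors="ignore").splitlines():
            if req_pattern.match(line):
                counts["requirements"] += 1

    # "import package ..." or "from package import ..." statements.
    imp_pattern = re.compile(rf"^\s*(import|from)\s+{re.escape(package)}\b")
    for py in repo.rglob("*.py"):
        for line in py.read_text(errors="ignore").splitlines():
            if imp_pattern.match(line):
                counts["imports"] += 1

    return counts

print(count_usage(".", "requests"))
```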

The net benefit (loss) of a piece of software is, of course, greater than what is tallied by counts of how many people use it directly in the software they build. We don’t yet count indirect use: software that uses software that uses the software of interest. Ideally, we would like to tally the total time saved, the increase in the number of projects started, projects that wouldn’t have started had the software not been there, the impact on the style in which other code is written, and such. We also want to tally the cost of errors in the original software. To the extent that people don’t produce software because they can’t be credited reasonably for it, better metrics about the impact of software can increase the production of software and increase the quality of the software that is being provided.

Searching for Great Conversations

21 Mar

“When was the last time you had a great conversation? A conversation that wasn’t just two intersecting monologues, but when you overheard yourself saying things you never knew you knew, that you heard yourself receiving from somebody words that found places within you that you thought you had lost, and the sense of an eventive conversation that brought the two of you into a different plane and then fourthly, a conversation that continued to sing afterward for weeks in your mind? Conversations like that are food and drink for the soul.”


John O’Donohue, h/t David Perell

Siamese Networks for Record Linkage

20 Mar

For the uninitiated:

A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest level feature representation on each side. The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function.

One Shot

Replace the word images with two representations of the same record across any two tables and you have an algorithm for producing good distance functions for efficient record linkage. Triplet loss is a natural extension. I am looking forward to seeing some bottom-line results comparing it to generic supervised approaches, which reminds me that I am unaware of any large benchmark datasets for the fundamental problem of statistical record linkage.
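A minimal sketch of the idea in PyTorch (all names and numbers are hypothetical): twin towers share weights, and a contrastive loss pulls matching records together and pushes non-matching ones apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecordEncoder(nn.Module):
    """Maps a fixed-length feature vector (e.g., character n-gram counts of a
    name/address string) to an embedding."""
    def __init__(self, n_features: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z1, z2, match, margin: float = 1.0):
    """match = 1 if the two records refer to the same entity, else 0."""
    d = F.pairwise_distance(z1, z2)
    return (match * d.pow(2) + (1 - match) * F.relu(margin - d).pow(2)).mean()

# Fake training step on random "records" from two tables.
encoder = RecordEncoder(n_features=300)          # one shared (weight-tied) network
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

a, b = torch.rand(32, 300), torch.rand(32, 300)  # paired rows from table A and B
match = torch.randint(0, 2, (32,)).float()       # 1 = same entity

loss = contrastive_loss(encoder(a), encoder(b), match)
opt.zero_grad(); loss.backward(); opt.step()

# At linkage time, embed both tables once and link rows whose embeddings fall
# within a learned distance threshold (or swap in the triplet loss mentioned above).
```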

The Risk of Misunderstanding Risk

20 Mar

Women who participate in breast cancer screening from ages 50 to 69 live, on average, 12 more days. This is the best-case scenario. Gerd has more such compelling numbers in his book, Calculated Risks. Gerd shares such numbers to launch a frontal assault on the misunderstanding of risk. His key point is:

“Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking.”

Gerd’s key contributions are on the third point. Gerd identifies three problems with risk communication:

  1. using relative risk rather than Numbers Needed to Treat (NNT) or absolute risk,
  2. using single-event probabilities, and
  3. using conditional probabilities rather than ‘natural frequencies.’

Gerd doesn’t explain what he means by natural frequencies in the book but some of his other work does. Here’s a clarifying example that illustrates how the same information can be given in two different ways, the second of which is in the form of natural frequencies:

“The probability that a woman of age 40 has breast cancer is about 1 percent. If she has breast cancer, the probability that she tests positive on a screening mammogram is 90 percent. If she does not have breast cancer, the probability that she nevertheless tests positive is 9 percent. What are the chances that a woman who tests positive actually has breast cancer?”

vs.

“Think of 100 women. One has breast cancer, and she will probably test positive. Of the 99 who do not have breast cancer, 9 will also test positive. Thus, a total of 10 women will test positive. How many of those who test positive actually have breast cancer?”
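The same calculation both ways, using the numbers from the two quotes:

```python
# Conditional-probability version (Bayes' rule).
prior, sensitivity, false_pos = 0.01, 0.90, 0.09
p_positive = prior * sensitivity + (1 - prior) * false_pos
p_cancer_given_positive = prior * sensitivity / p_positive
print(round(p_cancer_given_positive, 3))  # ~0.092

# Natural-frequency version: think of 100 women.
with_cancer_and_positive = 1      # 1 has cancer and (probably) tests positive
without_cancer_but_positive = 9   # 9 of the 99 without cancer also test positive
print(with_cancer_and_positive / (with_cancer_and_positive + without_cancer_but_positive))  # 0.1
```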

For those in a hurry, here are my notes on the book.

What’s Best? Comparing Model Outputs

10 Mar

Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are several models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank-order the messages by relevance, understood as the probability that a person will click on the relevant substance of the message?

Isn’t the answer to use the max operator as a service? Just using the max operator can be a problem because of:

a) Miscalibrated probabilities: the probabilities output by non-linear models are not always calibrated. A probability of .9 doesn’t mean that there is a 90% chance that people will click on the message.

b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and of the distance from the bulk of the points we have observed. If you were to randomly draw 1,000 samples each from the estimated distributions of p, a different ordering may dominate than the one we get when we compare the means.
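A small, made-up illustration of that point: two messages whose mean predicted click probabilities order one way, but whose draws, once uncertainty is accounted for, order the other way in a sizable share of samples. (Beta distributions stand in for the estimated distribution of p.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Message A: mean ~0.30, tightly estimated. Message B: mean ~0.28, very uncertain.
p_a = rng.beta(300, 700, size=1000)
p_b = rng.beta(2.8, 7.2, size=1000)

print(p_a.mean() > p_b.mean())  # True: A wins on the means...
print((p_b > p_a).mean())       # ...but B wins in a sizable share of draws
```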

This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.

Comparing messages based on propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not only come from clicks. People don’t always need to click on a message to be influenced by it. They may choose to follow up at a later time. And the message may influence more than the person clicking on it. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff from showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it from an experiment. And to compare the payoffs of different messages, e.g., one that encourages people to use a product more, another that encourages people to share the product with someone else, etc., you need to distill the payoffs into the same currency—ideally, cash.

Expertise as a Service

3 Mar

The best thing you can say about Prediction Machines, a new book by a trio of economists, is that it is not barren. Most of the green patches you see are about the obvious: the big gain from ML is our ability to predict better, and better predictions will change some businesses. For instance, Amazon will be able to move from shopping-and-then-shipping to shipping-and-then-shopping—you return what you don’t want—if it can forecast what its customers want well enough. Or, airport lounges will see reduced business if we can more accurately predict the time it takes to reach the airport.

Aside from the obvious, the book has some untended shrubs. The most promising of them is that supervised algorithms can have human judgment as a label. We have long known about the point. For instance, self-driving cars use human decisions as labels—we learn braking, steering, speed as a function of road conditions. But what if we could use expert human judgment as a label for other complex cognitive tasks? There is already software that exploits that point. Grammarly, for instance, uses editorial judgments to give advice about grammar and style. But there are so many other places where we could exploit this. You could use it to build educational tools that give guidance on better ways of doing something in real-time. You could also use it to reduce the need for experts.

p.s. The point about exploiting the intellectual property of experts deserves more attention.

5 is smaller than 1.9!

10 Feb

“In the late 1990s, the leading methods caught about 80 percent of fraudulent transactions. These rates improved to 90–95 percent in 2000 and to 98–99.9 percent today. That last jump is a result of machine learning; the change from 98 percent to 99.9 percent has been transformational.

An improvement from 85 percent to 90 percent accuracy means that mistakes fall by one-third. An improvement from 98 percent to 99.9 percent means mistakes fall by a factor of twenty. An improvement of twenty no longer seems incremental.”


From Prediction Machines by Agrawal, Gans, and Goldfarb.

One way to compare the improvements is to compare differences in percentage points: 5 and 1.9. That is what I would have done. The reason is that, conditional on the same difference in percentage points, the lower the base, the greater the multiplicative factor, which makes multiplicative comparisons a cheap way of making small improvements look better. Even then, for consistency, the comparison would have been between percentage increases in accuracy, (90 – 85)/85 versus (99.9 – 98)/98. But AGG had to flip the estimand to percentage errors to make the latter relative change look better.
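Spelled out in a few lines:

```python
pairs = [(0.85, 0.90), (0.98, 0.999)]
for old, new in pairs:
    pct_point_gain = (new - old) * 100           # 5.0 vs. 1.9 percentage points
    rel_accuracy_gain = (new - old) / old * 100  # ~5.9% vs. ~1.9% more accurate
    error_reduction = (1 - old) / (1 - new)      # errors fall 1.5x (by a third) vs. 20x
    print(round(pct_point_gain, 1), round(rel_accuracy_gain, 1), round(error_reduction, 1))
```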

Disgusting

7 Feb

Vegetarians recoil at the thought of eating the meat of a cow that has died from a heart attack. The disgust that vegetarians experience is not principled. Nor is the greater opposition to homosexuality that people espouse when they are exposed to a foul smell. Haidt uses provocative examples like these to expose chinks in how we think about what is moral and what is not.

Knowing that what we find disgusting may not always be “disgusting,” that our moral reasoning can be flawed, is a superpower. Because thinking that you are in the right makes you self-righteous. It makes you think that you know all the facts, that you are somehow better. Often, we are not. If we stop conflating disgust with being in the right or indeed, with being right, we shall all get along a lot better.

The Best We Can Do is Responsibly Answer the Questions that Life Asks of Us

5 Feb

Faced with mass murder, it is hard to escape the conclusion that life has no meaning. For how could it be that life has meaning when lives matter so little? As an Austrian Jew in a concentration camp, Viktor Frankl had to confront that question.

In Man’s Search for Meaning, Frankl gives two answers to the question. His first answer is a reflexive rejection of the meaninglessness of life. Frankl claims that life is “unconditional[ly] meaningful.” There is something to that, but not enough to hang on to for too long. It is also not his big point.

Instead, Frankl has a more nuanced point: “If there is … meaning in life …, then there must be … meaning in suffering.” (Because suffering is an inescapable part of life.) The meaning of suffering, according to him, lies in how we respond to it. Do we suffer with dignity? Or do we let suffering degrade us? The broader, deeper point that underpins the claim is that we cannot always choose our conditions, but we can choose the “stand [we take] toward the conditions.” And life’s meaning is stored in the stand we take, in how we respond to the questions that “life asks of us.”

Not only that, the extent of human achievement is: responsibly answering the questions that life asks of us. This means two things. First, that questions about human achievement can only be answered within the context of one’s life. And second, in responsibly answering questions that life asks of us, we attain what humans can ever attain. In a limited life, circumscribed by unavoidable suffering, for instance, the peak of human achievement is keeping dignity. If your life offers you more, then, by all means, do more—derive meaning from action, from beauty, and from love. But also take solace in the fact that we can achieve the greatest heights a human can achieve in how we respond to unavoidable suffering.

Ruined by Google

13 Jan

Information on tap is a boon. But if it means that the only thing we will end up knowing—have in our heads—is where to go to find the information, it may also be a bane.

Accessible stored cognitions are vital. They allow us to verify and contextualize new information. If we need to look things up, then, because of laziness or forgetfulness, we will end up accepting some false statements that we would have easily refuted had we had the relevant information in memory, or we will fail to contextualize some statements appropriately.

Information on tap also produces another malaise. It changes the topography of what we know. As search costs go down, people move from learning about a topic systematically to narrowly searching for whatever they need to know, now. And knowledge built on narrow searches looks like Swiss cheese.

Worse, many a time when people find the narrow thing they are looking for, they think that that is all there is to know. For instance, in Computer Science and Machine Learning, people can increasingly execute sophisticated things without knowing much. (And that is mostly a good thing.) But getting something to work—by copying the code from StackOverflow—gives people the sense that they “know.” And when we think we know, we also assume that there is not much more to know. Thus, information on tap reduces the horizons of our knowledge about our ignorance.

In becoming better at fulfilling our narrower needs, lower search costs may be killing expertise. And that is mostly a bad thing.

See also this paper that suggests that searching on Google causes you to think that you know more.

The Benefit of Targeting

16 Dec

What is the benefit of targeting? Why (and when) do we need experiments to estimate the benefits of targeting? And what is the right baseline to compare against?

I start with a business casual explanation, using examples to illustrate some of the issues at hand. Later in the note, I present a formal explanation that precisely describes the assumptions under which targeting may be a reasonable thing to do.

Business Casual

Say that you have some TVs to sell. And say that you could show an ad about the TVs to everyone in the city for free. Your goal is to sell as many TVs as possible. Does it make sense for you to build a model to pick out people who would be especially likely to buy a TV and only show the ad to them? No, it doesn’t. Unless ads make people less likely to purchase TVs, you are always better off reaching out to everyone.

You are wise. You use common sense to sell more TVs than the guy who spent a bunch of money building the model and sold fewer. You make tons of money. And you use the money to buy Honda and Mercedes dealerships. You still retain the magical power of being able to show ads to everyone for free. Your goal is to maximize profits. And selling a Mercedes nets you more profit than selling a Honda. Should you use a model to show some people ads about Mercedes and other people ads about Honda? The answer is still no. Under assumptions that are likely to hold, the optimal strategy is to show an ad for the Mercedes first and then an ad for the Honda. (You can show the Honda ad first if people who want to buy a Mercedes won’t buy a cheaper car just because they see an ad for it first.)

But what if you are limited to only one ad? What would you do? In that case, a model may make sense. Let’s see how things may look with some fake data. Let’s compare the outcomes of four strategies: two model-based targeting strategies and two target-everyone-with-one-ad strategies. To make things easier, let’s assume that selling a Mercedes nets ten units of profit and selling a Honda nets five units of profit. Let’s also assume that people will only buy something if they see an ad for their preferred product.
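Here is a sketch along those lines. The population, the two models (one sharp, one noisy), and the numbers below are mine, not the note’s; the assumptions are the ones above: ten units of profit per Mercedes, five per Honda, one ad per person, and people buy only if they see an ad for their preferred car.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
prefers_merc = rng.binomial(1, 0.3, n).astype(bool)   # 30% prefer a Mercedes
profit = np.where(prefers_merc, 10, 5)

def total_profit(show_merc_ad: np.ndarray) -> int:
    """Profit when people buy only if shown the ad for their preferred car."""
    bought = (show_merc_ad == prefers_merc)
    return int(profit[bought].sum())

# Two target-everyone-with-one-ad strategies.
all_merc = np.ones(n, dtype=bool)
all_honda = np.zeros(n, dtype=bool)

# Two model-based strategies: a sharp model and a noisy one.
sharp_model = prefers_merc ^ (rng.random(n) < 0.05)   # ~95% accurate
noisy_model = prefers_merc ^ (rng.random(n) < 0.40)   # ~60% accurate

for name, plan in [("all Mercedes ads", all_merc), ("all Honda ads", all_honda),
                   ("sharp model", sharp_model), ("noisy model", noisy_model)]:
    print(name, total_profit(plan))
```

With these made-up numbers and the one-ad constraint, both model-based strategies beat both blanket strategies; relax the constraint and, as argued above, showing everyone everything dominates again.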

Continue reading here (pdf).

AutoSum Plus

23 Nov

Nearly four years ago, I released autosum. Autosum exploits work by other scientists to harvest key points from (and key concerns with) a paper. The software grabs the sentence before or after the citation to build that knowledge. The output is pretty useful. See for yourself. But you could do one better by using it as a label for supervised text summarization tasks. You could learn the BERT embeddings and then use them to predict key phrases (or more).
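A toy version of the idea (not autosum’s code): pull the sentence in which a paper is cited, plus its neighbors, as a crude summary of what others take from (or object to in) the paper.

```python
import re

def citation_context(text: str, cite_key: str, window: int = 1) -> list[str]:
    """Return the sentences around each appearance of cite_key."""
    # Naive sentence split; a real pipeline would use a proper tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for i, s in enumerate(sentences):
        if cite_key in s:
            lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
            hits.append(" ".join(sentences[lo:hi]))
    return hits

sample = ("Prior work measures the effect with surveys. "
          "Doe et al. (2020) show that the effect is small. "
          "We build on their design.")
print(citation_context(sample, "Doe et al. (2020)"))
```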

Making an Impression: Learning from Google Ads

31 Oct

Broadly, Google Ads works as follows:

  1. Advertisers create an ad, choose keywords, and make a bid on cost-per-click (CPC). (You can also bid on cost-per-view and cost-per-impression, but we limit our discussion to CPC.)
  2. The Google Ads account team vets whether the keywords are related to the product being advertised.
  3. People see the ad from the winning bid when they search for a term that includes the keyword or when they browse content related to the keyword (some Google Ads are shown on sites that use Google AdSense).

There is a further nuance to the last step. Generally, on popular keywords, Google has thousands of candidate ads to choose from. And Google doesn’t simply choose the ad from the winning bid. Instead, it uses data to choose an ad (or a few ads) that yield the most profit (Click Through Rate (CTR)*bid). (Google probably has a more complex user utility function and doesn’t show ads below a low predicted CTR*bid.) In all, who Google shows ads to depends on the predicted CTR and the money it will make per click.
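A stylized version of that selection rule (the real auction has many more moving parts, and the floor and numbers here are made up): rank candidate ads by predicted CTR * bid and drop ads below a minimum.

```python
candidates = [
    {"ad": "A", "bid": 2.00, "pred_ctr": 0.010},
    {"ad": "B", "bid": 0.50, "pred_ctr": 0.060},
    {"ad": "C", "bid": 1.00, "pred_ctr": 0.002},
]
FLOOR = 0.005  # hypothetical minimum expected revenue per impression

scored = [(c["pred_ctr"] * c["bid"], c["ad"]) for c in candidates]
eligible = sorted((s for s in scored if s[0] >= FLOOR), reverse=True)
print(eligible)  # [(0.03, 'B'), (0.02, 'A')]; C falls below the floor
```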

Given this setup, we can reason about the audience for an ad. First, the higher the bid, the broader the audience. Second, it is not clear how well Google can predict CTR per ad conditional on the keyword bid, especially when the ad run is small. And if that is so, we expect Google to show the ad with the highest bid to a random subset of people searching for the keyword or browsing content related to the keyword. Under such conditions, you can use the total number of impressions per demographic group as an indicator of interest in the keyword. For instance, say you make the highest bid on the keyword ‘election’ and find that your ad makes 10x as many impressions among people 65+ as among people between the ages of 18 and 24. Under some assumptions, e.g., similar use of ad blockers, similar rates of clicking on ads conditional on relevance (which here is the same as predicted relevance), similar utility functions (that is, younger people are not more sensitive to irritation from irrelevant ads than older people), etc., you can infer the relative interest in elections of 18-24-year-olds versus people 65+.
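Under those assumptions, the inference itself is just a ratio (the impression counts below are made up):

```python
impressions = {"65+": 50_000, "18-24": 5_000}
relative_interest = impressions["65+"] / impressions["18-24"]
print(relative_interest)  # 10.0: people 65+ searched/browsed the keyword ~10x as often
```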

The other case where you can infer relative interest in a keyword (topic) from impressions is when ad markets are thin. For common keywords like ‘elections,’ Google generally has thousands of candidate ads for national campaigns. But if you only want to show your ad in a small geographic area or on an infrequently searched term, the candidate set can be pretty small. If your ad is the only one, then it will be shown wherever it exceeds some minimum threshold of predicted CTR*bid. Assuming a high enough bid, you can take the total number of impressions of the ad as a proxy for the total number of searches for the term and how often people browsed related content.

With all of this in mind, I discuss results from a Google Ads campaign. More here.

The Value of Predicting Bad Things

30 Oct

Foreknowledge of bad things is useful because it gives us an opportunity to a. prevent them, and b. plan for them.

Let’s refine our intuitions with a couple of concrete examples.

Many companies work super hard to predict customer ‘churn’—which customer is not going to use a product over a specific period (which can be the entire lifetime). If you know who is going to churn in advance, you can: a. work to prevent it, b. make better investment decisions based on expected cash flow, and c. make better resource allocation decisions.

Users “churn” because they don’t think the product is worth the price, which may be because a) they haven’t figured out a way to use the product optimally, b) a better product has appeared on the horizon, or c) their circumstances have changed. You can deal with this by sweetening the deal. You can prevent users from abandoning your product by offering them discounts. (It is useful to experiment to learn about the precise demand elasticity at various predicted levels of churn.) You can also give discounts in the form of offering some premium features for free. Among people who don’t use the product much, you can run campaigns to help people use the product more effectively.
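A back-of-the-envelope sketch of the discount decision (all numbers are hypothetical): offer the discount when the expected retained value exceeds its cost, using the experimentally estimated drop in churn as the elasticity piece.

```python
def offer_discount(p_churn: float, customer_value: float,
                   discount_cost: float, uplift: float) -> bool:
    """uplift: experimentally estimated fraction of the predicted churn
    probability that the discount eliminates (the demand-elasticity piece
    mentioned above)."""
    expected_gain = p_churn * uplift * customer_value
    return expected_gain > discount_cost

print(offer_discount(p_churn=0.6, customer_value=500, discount_cost=50, uplift=0.25))  # True
print(offer_discount(p_churn=0.1, customer_value=500, discount_cost=50, uplift=0.25))  # False
```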

If you can predict cash flow, you can optimally trade off risk so that you always have cash at hand to pay your obligations. Churn predictions can also help you with resource allocation. They can mean that you need to temporarily hire more customer success managers. Or they can mean that you need to lay off some people.

The second example is from patient care. If you could predict reasonably that someone will be seriously sick in a year’s time (and you can in many cases), you can use it to prioritize patient care, and again plan investment (if you were an insurance company) and resources (if you were a health services company).

Lastly, as is obvious, the earlier you can learn, the better you can plan. But generally, you need to trade off between noise in prediction and head start—things further away are harder to predict. The noise-head-start trade-off is something that should be made thoughtfully and amended based on data.

The Other Side

23 Oct

Samantha Laine Perfas of the Christian Science Monitor interviewed me about the gap between perceptions and reality for her podcast ‘perception gaps’ over a month ago. You can listen to the episode here (Episode 2).

The Monitor has also made the transcript of the podcast available here. Some excerpts:

“Differences need not be, and we don’t expect them to be, reasons why people dislike each other. We are all different from each other, right. …. Each person is unique, but we somehow seem to make a big fuss about certain differences and make less of a fuss about certain other differences.”

One way to fix it:

“If you know so little and assume so much, … the answer is [to] simply stop doing that. Learn a little bit, assume a little less, and see where the conversation goes.”

The interview is based on the following research:

  1. Partisan Composition (pdf) and Measuring Shares of Partisan Composition (pdf)
  2. Affect Not Ideology (pdf)
  3. Coming to Dislike (pdf)
  4. All in the Eye of the Beholder (pdf)

Related blog posts and think pieces:

  1. Party Time
  2. Pride and Prejudice
  3. Loss of Confidence
  4. How to read Ahler and Sood

Loss of Confidence

21 Oct

We all overestimate how much we know. If the aphorism, “the more you know, the more you know that you don’t know,” is true, then how else could it be? But knowing more is not the only path to learning about our ignorance. Mistakes are another. When we make mistakes, we get to adjust our parameters (understanding) about how much we know. Overconfident people, however, incur smaller losses to their confidence when they make mistakes. They don’t learn as much from mistakes because they externalize the source of errors or don’t acknowledge the mistakes, believing it is you who is wrong, not them. So, the most ignorant (the most confident) very likely make the least progress in learning about their ignorance when they make mistakes. (Ignorance is just one source of why people overestimate how much they know. There are many other factors, including personality.) But if you know this, you can fix it.