Pastations: Is it Time to Move Beyond Presentations?

26 Dec

In an influential essay, The Cognitive Style of PowerPoints, Tufte argues that (PowerPoint) presentations are unsuitable for serious problems. The essay is largely polemical, with Tufte freely mixing points about affordances of the medium with criticisms of bad presentations and lazy broadsides.

Hilarious stuff first:

  1. “All 3 reports have standard PP format problems: elaborate bullet outlines; segregation of words and numbers (12 of 14 slides with quantitative data have no accompanying analysis); atrocious typography; data imprisoned in tables by thick nets of spreadsheet grids; only 10 to 20 short lines of text per slide.”
  2. “On this single Columbia slide, in a PowerPoint festival of bureaucratic hyper-rationalism, 6 different levels of hierarchy are used to classify, prioritize, and display 11 simple sentences”
  3. “In 28 books on PP presentations, the 217 data graphics depict an average of 12 numbers each. Compared to the worldwide publications shown in the table at right, the statistical graphics based on PP templates are the thinnest of all, except for those in Pravda back in 1982, when that newspaper operated as the major propaganda instrument of the Soviet communist party and a totalitarian government.”

In the essay, I could only rescue two points about affordances (that I buy):

  1. “When information is stacked in time, it is difficult to understand context and evaluate relationships.”
  2. Inefficiency: “A talk, which proceeds at a pace of 100 to 160 spoken words per minute, is not an especially high resolution method of data transmission. Rates of transmitting visual evidence can be far higher. … People read 300 to 1,000 printed words a minute, and find their way around a printed map or a 35mm slide displaying 5 to 40 MB in the visual field. Yet, in a strange reversal, nearly all PowerPoint slides that accompany talks have much lower rates of information transmission than the talk itself. As shown in this table, the PowerPoint slide typically shows 40 words, which is about 8 seconds worth of silent reading material. The slides in PP textbooks are particularly disturbing: in 28 textbooks, which should use only first-rate examples, the median number of words per slide is 15, worthy of billboards, about 3 or 4 seconds of silent reading material. This poverty of content has several sources. First, the PP design style, which typically uses only about 30% to 40% of the space available on a slide to show unique content, with all remaining space devoted to Phluff, bullets, frames, and branding. Second, the slide projection of text, which requires very large type so the audience can read the words.”

From Working Backwards, which cites the article as the reason Amazon pivoted from presentations to 6-pagers for its S-team meetings, there is one more reasonable point about presentations more generally:

“…the public speaking skills of the presenter, and the graphics arts expertise behind their slide deck, have an undue—and highly variable—effect on how well their ideas are understood.”

Working Backwards

The points about graphics arts expertise, etc., apply to all documents but are likely less true for reports than presentations. (It would be great to test the prettiness of graphics on their persuasiveness.)

Reading the essay made me think harder about why we use presentations in meetings about complex topics more generally. For instance, academics frequently present to other academics. Replacing presentations with 6-pagers that people quietly read and comment on at the start of the meeting and then discuss may yield higher quality comments and better discussion and better evaluation of the scholar (and the scholarship).

p.s. If you haven’t seen Norvig’s Gettysburg Address in PowerPoint, you must.

p.p.s. Ed Haertel forwarded me this piece by Sam Wineburger on why asking students to create powerpoints is worse than asking them to write an essay.

p.p.p.s. Here’s how Amazon runs its S-team meetings (via Working Backwards):

1. 6-pager (can have appendices) distributed at the start of the discussion.

2. People read in silence and comment for the first 20 min.

3. Rest 40 min. devoted to discussion, which is organized by 1. big issues/small issues, 2. people going around the room, etc.

4. One dedicated person to take notes.

Homing in on the Home Advantage

19 Dec

A recent piece on ESPNCricinfo analyses the DRS data and argues that cricket should do away with neutral umpires. I reanalyzed the data.

If a game is officiated by a home umpire, we expect the following:

  1. Hosts will appeal less often as they are likely to be happier with the decision in the first place
  2. When visitors appeal a decision, their success rate should be higher than the hosts. Visitors are appealing against an unfavorable call—a visiting player was unfairly given out or they felt the host player was unfairly given not out. And we expect the visitors to get more bad calls.

When analyzing success rate, I think it is best to ignore appeals that are struck down because they defer to the umpire’s call. Umpire’s call generally applies to LBW decisions, and especially two aspects of the LBW decision: 1. whether the ball was pitching in line, 2. whether it was hitting the wickets. To take a recent example, in the second test of the 2021 Ashes series, Lyon got a wicket when the impact was ‘umpire’s call’ and Stuard Broad was denied a wicket for the same reason.

Ollie Robinson Unsuccessfully Challenging the LBW Decision

Stuart Broad Unsuccessfully Challenging the Not-LBW Decision

With the preliminaries over, let’s get to the data covered in the article. Table 1 provides some summary statistics of the outcomes of DRS. As is clear, the visiting team appealed the umpire’s decision far more often than the home team: 303 vs. 264. Put another way, the visiting team lodged nearly one more appeal per test than the home team. So how often did the appeals succeed? In line with our hypothesis, the home team appeals were upheld less often (24%) than visiting team’s appeals (29%).

Table 1. Review Outcomes Under Home Umpires. 41 Tests. July 2020–Nov. 2021.

HOME BATTING9639 (40%)25 (26%)32 (34%)
HOME BOWLING168108 (64%)29 (18%)31 (18%)
VISITOR BATTING14758 (39%)25 (17%)64 (44%)
VISITOR BOWLING15697 (62%)34 (22%)25 (16%)
Data From ESPNCricinfo

It could be the case that these results are a consequence of something to do with host vs. visitor than home umpires. For instance, hosts win a lot, and that generally means that they will bowl for shorter periods of time and bat for longer periods of time. We account for this by comparing outcomes under neutral umpires. The article has data on the same. There, you see that the visiting team makes fewer appeals (198) than the home team (214). And the visiting team’s success rate in appeals is slightly lower (29%) than the home team’s rate (30%).


At the bottom of the article is another table that breaks down reviews by host country:

PAKISTAN2AHSAN RAZA, ALEEM DAR*270/11 (0%)6/16 (38%)
WEST INDIES6GO BRATHWAITE, JS WILSON*9413/50 (26%)13/44 (29%)
Data from ESPNCricinfo

But the data doesn’t match the one in the table above. For one, the number of tests considered is 42 than 41. For two, and perhaps relatedly, the total number of reviews is 599 than 567. To be comprehensive, let’s do the same calculations as above. The visiting team appeals more (314) than the host team (285). The host team success rate is 22% (63/285), and the visiting team success rate is 28% (89/314). If you were to do a statistical test for success rates:

 prop.test(x = c(63, 89), n = c(285, 314))

        2-sample test for equality of proportions with continuity correction

data:  c(63, 89) out of c(285, 314)
X-squared = 2.7501, df = 1, p-value = 0.09725
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.13505623  0.01028251
sample estimates:
   prop 1    prop 2 
0.2210526 0.2834395 


28 Nov

The KNN classifier is one of the most intuitive ML algorithms. It predicts class by polling k nearest neighbors. Because it seems so simple, it is easy to miss a couple of the finer points:

  1. Sample Splitting: Traditionally, when we split the sample, there is no peeking across samples. For instance, when we split the sample between a train and test set, we cannot look at the data in the training set when predicting the label for a point in the test set. In knn, this segregation is not observed. Say we partition the training data to learn the optimal k. When predicting a point in the validation set, we must pass the entire training set. Passing the points in the validation set would be bad because then the optimal k will always be 0. (If you ignore k = 0, you can pass the rest of the dataset.)
  2. Implementation Differences: “Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.” (see here; emphasis mine.)

    This matters when the distance metric is discrete, e.g., if you use an edit-distance metric to compare strings. Worse, scikit-learn doesn’t warn users during analysis.

    In R, one popular implementation of KNN is in a package called class. (Overloading the word class seems like a bad idea but that’s for a separate thread.) In class, how the function deals with this scenario is decided by an explicit option: “If [the option is] true, all distances equal to the kth largest are included. If [the option is] false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.”

    For the underlying problem, there isn’t one clear winning solution. One way to solve the problem is to move from knn to adaptive knn: include all points that are as far away as the kth point. This is what class in R does when the option all.equal is set to True. Another solution is to never change the order in which the data are accessed and to make the order as part of how the model is exported.

Fooled by Randomness

28 Sep

Permutation-based methods for calculating variable importance and interpretation are increasingly common. Here are a few common places where they are used:

Feature Importance (FI)

The algorithm for calculating permutation-based FI is as follows:

  1. Estimate a model
  2. Permute a feature
  3. Predict again
  4. Estimate decline in predictive accuracy and call the decline FI

Permutation-based FI bakes in a particular notion of FI. It is best explained with an example: Say you are calculating FI for X (1 to k) in a regression model. Say you want to estimate FI of X_k. Say X_k has a large beta. Permutation-based FI will take the large beta into account when calculating the FI. So, the notion of importance is one that is conditional on the model.

Often we want to get at a different counterfactual: If we drop X_k, what happens. You can get to that by dropping and re-estimating, letting other correlated variables get large betas. I can see a use case in checking if we can knock out say an ‘expensive’ variable. There may be other uses.

Aside: To my dismay, I kludged the two together here. In my defense, I thought it was a private email. But still, I was wrong.

Permutation-based methods are used elsewhere. For instance:

Creating Knockoffs

We construct our knockoff matrix X˜ by randomly swapping the n rows of the design matrix X. This way, the correlations between the knockoffs remain the same as the original variables but the knockoffs are not linked to the response Y. Note that this construction of the knockoffs matrix also makes the procedure random.


Local Interpretable Model-Agnostic Explanations

The recipe for training local surrogate models:

Select your instance of interest for which you want to have an explanation of its black box prediction.

Perturb your dataset and get the black box predictions for these new points.

Weight the new samples according to their proximity to the instance of interest.

Train a weighted, interpretable model on the dataset with the variations.

Explain the prediction by interpreting the local model.


Common Issue With Permutation Based Methods

“Another really big problem is the instability of the explanations. In an article 47 the authors showed that the explanations of two very close points varied greatly in a simulated setting. Also, in my experience, if you repeat the sampling process, then the explantions that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.”



One way to solve instability is to average over multiple rounds of permutations. It is expensive but the payoff is stability.

Monetizing Bad Models: Pay Per Correct Prediction

26 Sep

In many ML applications, especially ones where you need to train a model on customer data to get high levels of accuracy, the only models that ML SaaS companies can offer to a client out-of-the-box are bad. But many ML SaaS businesses hesitate to go to a client with a bad model. Part of the reason is that companies don’t understand that they can deliver value with a bad model. In many places, you can deliver value with a bad model by deploying a high-precision version, only offering predictions where you are highly confident. Another reason why ML SaaS companies likely hesitate is a lack of a reasonable pricing model. There, charging per correct response with some penalty for an incorrect answer may prove a good option. (If you are the sole bidder, setting the price just below the marginal cost of getting a human to label a response plus any additional business value from getting the job done more quickly may be one fine place to start.) Having such a pricing model is likely to reassure the client that they won’t be charged for the glamour of having an ML model and instead will only be charged for the results. (There is, of course, an upfront cost of switching to an ML model, which can be reasonably high and that cost needs to be assessed in terms of potential payoff over a long term.)

Interpreting Data

26 Sep

It is a myth that data speaks for itself. The analyst speaks for the data. The analyst chooses what questions to ask, what analyses to run, how the analyses are interpreted, and how they are summarized. I use excerpts from a paper by Gilliam et al. on media portrayal of crime as a way to highlight one set of choices by a group of analysts. (The excerpts also highlight the need for reading a paper fully than relying on the abstract alone.)


From Gilliam et al.; Abstract.

White Violent Criminals Are Overrepresented

From Gilliam et al.; Bottom of page 10.

White Nonviolent Criminals Are Overrepresented

From Gilliam et al.; first paragraph on page 12

Relative Underrepresentation Between Violent and Nonviolent Crime is a Problem

From Gilliam et al.; Last paragraph on page 12
From Gilliam et al.; First paragraph on page 13

The Nonscience of Machine Learning

29 Aug

In 2013, Girshick et al. released a paper that described a technique to solve an impossible-sounding problem—classifying each pixel of an image (or semantic segmentation). The technique that they proposed, R-CNN, combines deep learning, selective search, and SVM. It also has all sorts of ad hoc choices, from the size of the feature vector to the number of regions, that are justified by how well they work in practice. R-CNN is not unusual. Many machine learning papers are recipes that ‘work.’ There is a reason for that. Machine learning is an engineering discipline. It isn’t a scientific one. 

You may think that engineering must follow science, but often it is the other way round. For instance, we learned how to build things before we learned the science behind it—we trialed-and-errored and overengineered our way to many still standing buildings while the scientific understanding slowly accumulated. Similarly, we were able to predict the seasons and the phases of the moon before learning how our solar system worked. Our ability to solve problems with machine learning is similarly ahead of our ability to put it on a firm scientific basis.

Often, we build something based on some vague intuition, find that it ‘works,’ and only over time, deepen our intuition about why (and when) it works. Take, for instance, Dropout. The original paper (released in 2012, published in 2014) had the following as motivation:

A motivation for Dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Sexual reproduction involves taking half the genes of one parent and half of the other, adding a very small amount of random mutation, and combining them to produce an offspring. The asexual alternative is to create an offspring with a slightly mutated copy of the parent’s genes. It seems plausible that asexual reproduction should be a better way to optimize individual fitness because a good set of genes that have come to work well together can be passed on directly to the offspring. On the other hand, sexual reproduction is likely to break up these co-adapted sets of genes, especially if these sets are large and, intuitively, this should decrease the fitness of organisms that have already evolved complicated coadaptations. However, sexual reproduction is the way most advanced organisms evolved. …

Srivastava et al. 2014, JMLR

Moreover, the paper provided no proof and only some empirical results. It took until Gal and Ghahramani’s 2016 paper (released in 2015) to put the method on a firmer scientific footing.

Then there are cases where we have made ad hoc choices that ‘work’ and where no one will ever come up with a convincing theory. Instead, progress will mean replacing bad advice with good. Take, for instance, the recommended step of ‘normalizing’ variables before doing k-means clustering or before doing regularized regression. The idea of normalization is simple enough: put each variable on the same scale. But it is also completely weird. Why should we put each variable on the same scale? Some variables are plausibly more substantively important than others and we ideally want to prorate by that.

What Can We Learn?

The first point is about teaching machine learning. Bricklaying is thought to be best taught via apprenticeship. And core scientific principles are thought to be best taught via books and lecturing. Machine learning is closer to the bricklaying end of the spectrum. First, there is a lot in machine learning that is ad hoc and beyond scientific or even good intuitive explanation and hence taught as something you do. Second, there is plausibly much to be learned in seeing how others trial-and-error and come up with kludges to fix the issues for which there is no guidance.

The second point is about the maturity of machine learning. Over the last few decades, we have been able to accomplish really cool things with machine learning. And these accomplishments detract us from how early we are. The fact is that we have been able to achieve cool things with very crude tools. For instance, OOS validation is a crude but very commonly used tool for preventing overfitting—we stop optimization when the OOS error starts increasing. As our scientific understanding deepens, we will likely invent better tools. The best of machine learning is a long way off. And that is exciting.

Explore Exclusive Exploitation: The Billboard Top 100 Method of Learning

14 Jul

The optimism in Internet browsing is palpable. Browse long enough, and you will have a ‘hit.’ Like gambling, which it mimics, lay browsing is a losing proposition. A better way to spend your time is to focus on known knowns—excellent teachers, communicators, etc.—and core ideas, insights, and big hits of a discipline (along with learning how disciplines solve problems). The rationale for the first is obvious. The rationale for the second point is three folds:

  1. Because we often scavenge information, many of us are not well versed in the core principles of the discipline (and adjacent disciplines) we purport to specialize in or want to learn about.
  2. The core ideas, the big hits, etc., by their very nature, are important and illuminating.
  3. Many of these big ideas are accessible, partly because people have spent time thinking about ways to communicate the points. So you will find excellent distillations of the points, and you will find that many of these ideas are on your knowledge frontier (things you can learn immediately). 

Or, if you are disciplined enough, focus relentlessly on finding new things in a narrow niche. Going from gambling to anything else is not easy. The highs won’t be as high. But the average high and ROI is a lot greater.

Fairly Certain: Using Uncertainty in Predictions to Diagnose Roots of Unfairness

8 Jul

One conventional definition of group fairness is that the ML algorithms produce predictions where the FPR (or FNR or both) is the same across groups. Fixating on equating FPR etc. can harm the very groups we are trying to help. So it may be useful to rethink how to solve the problem of reducing unfairness.

One big reason why the FPR may vary across groups is that, given the data, some groups’ outcomes are less predictable than others. This may be because of the limitations of the data itself or because of the limitations of algorithms. For instance, Kearns and Roth in their book bring up the example of college admissions. The training data for college admissions is the decisions made by college counselors. College counselors may well be worse at predicting the success of minority students because they are less familiar with their schools, groups, etc., and this, in turn, may lead to algorithms performing worse on minority students. (Assume the algorithm to be human decision-makers and the point becomes immediately clear.)

One way to address worse performance may be to estimate the uncertainty of the prediction. This allows us to deal with people with wider confidence bounds separately from people with narrower confidence bounds. The optimal strategy for people with wider confidence bounds people may be to collect additional data to become more confident in those predictions. For instance, Komiyama and Noda propose something similar (pdf) to help overcome a lack of information during hiring. Or we may need to figure out a way to compensate people based on their uncertainty interval. 

The average width of the uncertainty interval across groups may also serve as a reasonable way to diagnose this particular problem.

Optimal Data Collection When Strata and Strata Variances Are Known

8 Jul

With Ken Cor.

What’s the least amount of data you need to collect to estimate the population mean with a particular standard error? For the simplest case—estimating the mean of a binomial variable using simple random sampling, a conservative estimate of the variance (p=.5), and a ±3 confidence interval—the answer (n∼1,000) is well known. The simplest case, however, assumes little to no information. Often, we know more. In opinion polling, we generally know sociodemographic strata in the population. And we have historical data on the variability in strata. Take, for instance, measuring support for Mr. Obama. A polling company like YouGov will usually have a long time series, including information about respondent characteristics. Using this data, the company could derive how variable the support for Mr. Obama is among different sociodemographic groups. With information about strata and strata variances, we can often poll fewer people (vis-a-vis random sampling) to estimate the population mean with a particular s.e. In a note (pdf), we show how.

Why bother?

In a realistic example, we find the benefit of using optimal allocation over simple random sampling is 6.5% (see the code block below).

Assuming two groups a and b, and using the notation in the note (see the pdf)—wa denotes the proportion of group a in the population, vara and varb denote the variances of group a and b respectively, and letting p denote sample mean, we find that if you use the simple random sampling formula, you will estimate that you need to sample 1095 people. If you optimally exploit the information about strata and strata variances, you will need to just sample 1024 people.

## The Benefit of Using Optimal Allocation Rules
## wa = .8
## vara = .25; pa = .5
## varb = .16; pb = .8
## SRS: pop_mean of .8*.5 + .2*.8 = .56
# sqrt(p(1 -p)/n) = .015
# n = p*(1- p)/.015^2 = 1095

# optimal_n_plus_allocation(.8, .25, .16, .015)
#   n   na   nb 
#1024  853  171

Github Repo.:

Beyond yhat: Developing ML Products

7 Jul

Making useful products is hard. Making useful ML products is harder still, in part because there are a larger number of moving parts in an ML system. To understand the issues at stake, let’s go over the **basics** of developing an ML product.

Often, product development starts with a business problem. And your first job is to understand the business problem as well as you can, familiarizing yourself with as much detail as possible.

Let’s say the problem is as follows: A company gets a lot of customer emails. All the emails go to a common inbox from which specialist customer agents fish out emails that are relevant to them. For instance, finance specialists fish out billing emails. And technical specialists fish out emails about technical errors. Fishing is time-consuming and chaotic.

Once you understand the precise problem—time taken to discover and assign emails—work on developing solutions for the problem. When developing solutions, the bias should be toward solving the problem the best way possible than injecting custom ML into whatever solution you propose. For instance, you could propose a solution that makes it easier to search (using no ML or off-the-shelf ML) and bulk assign to a new queue. But let’s say that after careful consideration of costs and benefits, a particularly appealing solution is a system that uses machine learning to automatically direct relevant emails to specialist inboxes, obviating the need to fish. That’s a start to a solution, not the end. You need to spend enough time thinking about the solution so that you have thought about how to handle edge cases, e.g., when there is a technical issue about billing, a misclassified email, etc., and any spillover issues, like the latency of such a system, how implementing such a system may break existing data pipelines that measure the total number of emails, etc.

Next, you need to define the KPIs. How much time will be saved? What is the total cost of the saved time? How many mistakes is the system making? What is the cost of handling mistakes?

Next, you need to turn the business problem into a precise machine learning problem. What labels would you predict? How would you collect the initial labels?

Once the outline of the solution has been agreed upon, you need to don your architect’s hat and outline a system diagram. Wearing the data engineer’s hat, figure out where the data needed for training and for live classification is stored, and how you would build a pipeline for training and serving the model. This is also the time to understand what guarantees, if any, exist on the data, and how you can test those guarantees.

Right next to the data engineer’s hat is the modeler’s hat. Wearing that, you must decide what algorithms you want to run, etc. how to version control your models, etc. The ML modeler’s hat also directs your attention to your plan for how to improve the model. Machine learning is an elaborate system to learn from your losses and you must design a system to continuously learn from your errors. More precisely, you must answer what is your system in place to improve your model? There is a pipeline for that: 1. Learn about your losses: from feedback, errors, etc. 2. Understand your losses: error analysis, etc., 3. Reducing your losses: new data collection, fixing old data, diff. models, objective functions, and 4. Testing: A/B testing, offline testing, etc.

Last, you must wear an operator’s hat. Wearing that you answer the operational nitty-gritty of how to introduce a new product. This is the time when you work with stakeholders to stand up dashboards to monitor the system, develop a rollout strategy, and a rollback strategy, a dashboard for monitoring A/B tests, etc.

The key to wearing an architect’s hat is to not only designing a system but also to make sure that enough logging is in place for different parts of the system for you to triage failures. So part of the dashboard would display logs from different parts of the system.

Equilibrium Fairness: How “Fair” Algorithms Can Hurt Those They Purport to Help

7 Jul

One definition of a fair algorithm is an algorithm that yields the same FPR across groups (an example of classification parity). To achieve that, we often have to trade in some accuracy. The final model is thus less accurate but fair. There are two concerns with such models:

  1. Net Harm Over Relative Harm: Because of lower accuracy, the number of people from a minority group that are unfairly rejected (say for a loan application) may be a lot higher. (This is ignoring the harm done to other groups.) 
  2. Mismeasuring Harm? Consider an algorithm used to approve or deny loans. Say we get the same FPR across groups but lower accuracy for loans with a fair algorithm. Using this algorithm, however, means that credit is more expensive for everyone. This, in turn, may cause fewer people of the vulnerable group to get loans as the bank factors in the cost of mistakes. Another way to think about the point is that using such an algorithm causes net interest paid per borrowed dollar to increase by some number. It seems this common scenario is not discussed in many of the papers on fair ML. One reason for that may be that people are fixated on who gets approved and not the interest rate or total approvals.

No Stopping: Impact of the Stopping Rule on the Sex Ratio

20 Jun

For social scientists brought up to worry about bias stemming from stopping data collection when results look significant, the fact that a gender based stopping rule has no impact on the sex ratio seems suspect. So let’s dig deeper.

Let there be n families and let the stopping rule be that after the birth of a male child, the family stops procreating. Let p be the probability a male child is born and q=1−p

After 1 round: 

\[\frac{pn}{n} = p\]

After 2 rounds: 

\[ \frac{(pn + qpn)}{(n + qn)} = \frac{(p + pq)}{(1 + q)} = \frac{p(1 + q)}{(1 + q)} \]

After 3 rounds: 

\[\frac{(pn + qpn + q^2pn)}{(n + qn + q^2n)}\\ = \frac{(p + pq + q^2p)}{(1 + q + q^2)}\]

After k rounds: 

\[\frac{(pn + qpn + q^2pn + … + q^kpn)}{(n + qn + q^2n + \ldots q^kn)} \]

After infinite rounds:

Total male children: 

\[= pn + qpn + q^2pn + \ldots\\ = pn (1 + q + q^2 + \ldots)\\ = \frac{np}{(1 – q)}\]

Total children:

\[= n + qn + q^2n + \ldots\\ = n (1 + q + q^2 + \ldots)\\ = \frac{n}{(1 – q)}\]

Prop. Male:

\[= \frac{np}{(1 – q)} * \frac{(1 – q)}{n}\\ = p\]

If it still seems like a counterintuitive result, here’s one way to think: In each round, we get pq^k successes, and the total number of kids increases by q^k. Yet another way to think is that for any child that is born, the data generating process is unchanged.

The male-child stopping rule may not affect the aggregate sex ratio. But it does cause changes in families. For instance, it causes a negative correlation between family size and the proportion of male children. For instance, if your first child is male, you stop. (For more results in this vein, see here.)

But why does this differ from our intuition that comes from early stopping in experiments? Easy. We define early stopping as when we stop data collection as soon as the results are significant. This causes a positive bias in the number of false-positive results (w.r.t. the canonical sample-fixed-in-advance rule). But early stopping leads to both kinds of false positives—mistakenly thinking that the proportion of females is greater than .5 and mistakenly thinking that the proportion of males is greater than .5. The rule is unbiased w.r.t. to the expected value of the proportion.

ML (O)Ops: What Data To Collect? (part 3)

16 Jun

The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. The second part, “Keeping Track of Changes,” is posted here.

With Atul Dhingra

For a broad class of machine learning problems, nitpicking over the neural net architecture is over (see, for instance, here). Instead, the focus has shifted to data. In the note below, we articulate some ways of thinking about what data to collect. In our discussion, we focus on supervised learning. 

The answer to “What data to collect?” varies by where you are in the product life cycle. If you are building a new ML product and the aim is to deploy something (basic) that delivers value and then iterate on it, one answer to the question is to label easy-to-predict cases—cases that allow you to build models where the precision is high but the recall is low. The bar is whether the model can do as well as business as usual for a small set of cases. The good thing is that you can hurdle that bar another way—by coding a random sample, building a model, and choosing a threshold where the precision is greater than business as usual (read more here). For producing POCs, models built on cheap data, e.g., open-source data, which plausibly do not produce value, can also “work” though they need to be managed against the threat of poor performance reducing faith in the system. 

The more conventional case is where you have a deployed model, and you want to improve its performance. There the answer to what data to collect is data that yields the highest ROI. (The answer to what data provides the highest ROI will vary over time, so we need a system that continuously answers it.) If we assume that the labeling costs for points are the same, the prioritization function reduces to ranking data by returns. To begin with, let’s assume that returns are measured by the function specified by the cost function. So, for instance, if we are looking for a model that lowers the RMSE, we would like to rank by how much reduction in RMSE we get from labeling an additional point. And naturally, we care about the test set RMSE. (You can generalize this intuition to any loss function.) So far, so good. The rub comes from the fact that there is no trivial answer to the problem. 

One way to answer the question is to run experiments, sampling across Xs, or plausibly use bandits and navigate the explore-exploit tradeoff smartly. Rather than do experiments, you can also exploit the data you have to figure out the kinds of points that make the most impact on RMSE. One way to get at that is using influence functions. There are, however, a couple of challenges in using these methods. The first is that the covariate space is large and the marginal impact is small, and that means inference is noisy. The second is a more general problem. Say you find that X_1, X_2, X_3, … are the points that lead to the largest reduction in RMSE. But how do you use that knowledge to convert it into a data collection problem? Is it that we should collect replicas of X_1? Probably not. We need to generalize from these examples and come up with a statement about the “type of data” that needs to be collected, e.g., more images where the traffic sign is covered by trees. To come up with the ‘type’, we need to specify what the example is not—how does it differ from the rest of the data we have? There are a couple of ways to answer the question. The first is to use clustering (using embeddings) and then assigning someone to label the clusters. Another is to use supervised learning to classify the X_1, X_2, X_3 from the rest of the data and figure out the “important predictors.” 

There are other answers to the question, “What data to collect?” For instance, we could look to label points where we are least certain or where we make the largest error. The intuition in the classification setting is that these points are closest to the hyperplane that separates the classes, and if you can learn to classify near the boundary, you are set. In using this method, you can also sometimes discover mislabeling. (The RMSE method we talk about above doesn’t interrogate the Y, taking the labels as given.) 

Another way to answer the question is to use model interpretation tools to figure out “why” the models are making errors. For instance, you could find that the reason why the model is making errors is because of confounding. Famously, for instance, a cat vs. dog classifier can merely be an outdoor vs. indoor classifier. And if we see the model using confounding features like the background in consideration, we could a) better label the data to segment out dogs and cats from the background, b) introduce paired examples such that the only thing different between any two images is strictly presence or absence of a dog/cat.

Partisan Morality

11 Jun

Sinn Féin and Fianna Fáil have said that activists posed as members of a polling company and went door-to-door to canvass the opinions of voters.

The rationale is simple. If you pose as an SF worker, you are likely to be met with shut doors or opinions in favor of SF got under slight duress. Is it a bridge too far or is it a harmless lie? More generally, do we use the same moral reasoning paradigm for violations by co-partisans and opposing partisans? My hunch is that for such kinds of violations we use a deontological framework for opposing partisans and a consequential one for co-partisans. The framework we use may switch depending on the circumstance. One way to test it would be to do a survey experiment with the above news article, switching parties. To get a better baseline, it may be useful to do three conditions: party_a, party_b, consumer_brand, e.g., Coke, etc.

Market Welfare: Why Are Covid-19 Vaccines Still Underfunded?

11 Jun

“To get roughly 70% of the planet’s population inoculated by April, the IMF calculates, would cost just $50bn. The cumulative economic benefit by 2025, in terms of increased global output, would be $9trn, to say nothing of the many lives that would be saved.”

The Economist frames this as an opportunity for G7. And it is. But it is also an opportunity for third-world countries, which plausibly can borrow $50bn given the return on investment. The fact that money hasn’t already been allocated poses a puzzle. Is it because governments think about borrowing decisions based on whether or not a policy is tax revenue positive (which a 180x return ought to be even with low tax collection and assessment rates)? Or is it because we don’t have a marketplace where we can transact on this information? If so, it seems like an important hole.

Here’s another way to look at this point. Among countries where the profits mostly go to a few, why do the people at the top not come to invest together so that they can harvest profits later? Brunei is probably an ok example.

The Story of Science: Storytelling Bias in Science

7 Jun

Often enough, scientists are left with the unenviable task of conducting an orchestra with out-of-tune instruments. They are charged with telling a coherent story about noisy results. Scientists defer to the demand partly because there is a widespread belief that a journal article is the appropriate grouping variable at which results should ‘make sense.’

To tell coherent stories with noisy data, scientists resort to a variety of underhanded methods. The first is simply squashing the inconvenient results—never reporting them or leaving them to the appendix or couching the results in the language of the trade, e.g., “the result is only marginally significant” or “the result is marginally significant” or “tight confidence bounds” (without ever talking about the expected effect size). Secondly, if good statistics show uncongenial results, drown the data in bad statistics, e.g., report the difference between a significant and an insignificant effect as significant. The third trick is overfitting. A sin in machine learning is a virtue in scientific storytelling. Come up with fanciful theories that could explain the result and make that the explanation. The fourth is to practice the “have your cake and eat it too” method of writing. Proclaim big results at the top and offer a thick word soup in the main text. The fifth is to practice abstinence—abstain from interpreting ‘inconsistent’ results as coming from a lack of power, bad theorizing, or heterogeneous effects.

The worst outcome of all of this malaise is that many (expectedly) become better at what they practice—bad science and nimble storytelling.

The Hateful ATE: The Effect of Affective Polarization

7 Jun

In a new paper, Broockman et al. use a clever manipulation to induce “three decades of change in affective polarization”:

In typical trust games, there are two players. Player 1 receives a cash allocation and is instructed to give “some, all, or none” of the money to Player 2. The player is also told that the researchers will triple any amount Player [1] gives to Player 2 and that Player 2 can return some, all, or none of the money back to Player 1. Therefore, the more Player 1 expects reciprocity from Player 2, the more money they should allocate to Player 2 in anticipation they will receive a larger sum in return, and the better off Player 2 will be. For example, if Player 1 gives all her money to Player 2, this sum would be tripled, and Player 2 could return half of the tripled amount to Player 1—leaving both players with 50% more than Player 1’s initial allocation. But if Player 1 gives no money to Player 2, Player 1 leaves with only her initial allocation and Player 2 leaves with nothing.

First, we always make participants take the role of Player 2. This means they always first observe an allocation another player makes to them. Second, across three consecutive rounds of game play, participants are told they are interacting with three other respondents of the opposite political party who have each been allocated $10. However, they are in fact are interacting with computerized opponents who offer allocations based on a pre-determined script. Participants randomized to the Positive Experience condition receive allocations from Player 1 of $8, $7 and $8 (tripled to $24, $21 and $24) respectively across the three rounds of the game. However, those in the Negative Experience condition receive $0 allocations in all three rounds.

Broockman et al. 2021

Next, comes the punchline. “Player 1’s reason for their allocation to you: your partisanship (all rounds), your income (Round 2)”. See Page 65.

Being told that a co- or opposing- partisan gave $0 versus being told that they gave $8, $7, and $8 because of your partisanship across three rounds has a dramatic effect on partisans’ feelings: partisans’ feelings toward opposing partisans become ‘cooler,’ it doesn’t affect their feelings towards co-partisans (impressive), and (strangely) polarizes their feelings toward elites (see the figure below).

Three comments are in order.

First, the manipulation is unrealistic given previous effect sizes (see here).“The average amount allocated to copartisans in the trust game was $4.58 (95% confidence interval [4.33, 4.83]), representing a “bonus” of some 10% over the average allocation of $4.17.”

Second, the manipulation principally ought to change perceptions of how trusting people are and not how trustworthy they are. We don’t manipulate how deceitful the other person is but how fearful they are of not having their actions reciprocated. Disliking less trusting people is slightly weird and plausibly points to how the underlying antipathy can be exacerbated by treatments that do not present a clear reason for judging another person more harshly. Or it could be that not being seen as being trustworthy and losing out on money as a result of it is insulting and aggravating.

Whatever the reason, generalizing from a bad personal interaction to all other members of a group is disturbing. (The fact that treatment cools people’s feelings toward opposing partisans suggests people expect better from them, which is interesting.) Ascribing feelings from a bad personal experience to elites seems odder (and more disturbing) still.

The absence of commensurate co- and opposing- partisan feeling panels for elites feels odd.

The paper finds that having a “bad” personal experience (vis-a-vis a better one) with an opposing partisan increases interpersonal animus (plus polarization of feelings toward partisan elites) but doesn’t cause partisans to like opposing partisan MCs less or co-partisan MCs more (though see above. Note that the pooled estimate for the opposing party is 1.5% or so—which is about what I would expect; it likely deserves another run at the bank). (I didn’t understand the change from co-partisan and opposing-partisan MCs to “own MCs” in the next analysis, so I am omitting that.) The paper discusses other DVs: 

  1. Interest in expressing party-consistent issue preferences (no effect)
  2. Support for bi-partisan legislation (~ more in favor)
  3. Opposition to democratic norms (pooled index seems to move by d = .09 and is nearly sig. at conventional levels). (I make a special reference to the index because presumably it has the least measurement error and is least likely to show an idiosyncratic pattern given sample size. There is also a small point about how multiple comparison adjustments are made—plausibly they should account for measurement error.)
  4.  Endorsement of partisan-congenial claims (Ds yes; Rs no)

The theorized path from bad personal experience with a co- (or opposing) partisan to opposition to democratic norms, etc., seems convoluted to me. So let’s unpack the theoretical underpinnings of the expectations. Interpersonal animus among partisans is an indicator of affective polarization. And the experiment successfully manipulates interpersonal animus. So what’s the issue? One escape hatch is that the concept is not uni-dimensional. Another is that any increase in interpersonal affect manifests in political consequences only over long periods as it causes people to watch different media, trust different things, etc.

The True Ones: Best Guess of True Proportion of 1s

30 May

ML models are generally used to make predictions about individual observations. Sometimes, however, the business decision is based on aggregate data. For example, say a company sells pants and wants to know how many will be returned over a certain period. Say the company has an ML model that predicts the chance a customer will return a pant. A natural thing to do would be to use the individual returns to get an expected return count.

One way to get an expected return count, if the model produces calibrated probabilities, is to simply take the mean. But say that you built an ML model to predict a dichotomous variable and you only have access to categorized outputs (1s and 0s). Say for model X, for cat == 1, the OOS recall is r and precision = p. Let’s say we use the model to predict labels for another dataset. Let’s say we observe 100 1s and 200 0s. What is the best estimate of the true proportion of 1s in the new dataset?

The quantity of interest = TP + FN

TP + FN = TP/r

TP = (TP + FP)*p

TP + FN = ((TP + FP)*p)/r = 100*p/r

(TP + FN)/n = 100p/300r = p/3r

ML (O)Ops! Keeping Track of Changes (Part 2)

22 Mar

The first part of the series, “Improving and Deploying On-Device Models With Confidence”, is posted here.

With Atul Dhingra

One way to automate classification is to compare new instances to a known list and plug in the majority class of the exact match. For such instance-based learning, you often don’t need to version data; you just need a hash table. When you are not relying on an exact match—most machine learning—you often need to version data to reproduce the behavior.

Reproducibility is the bedrock of mature software engineering. It is fundamental because it allows you to diagnose issues. You can reproduce the behavior of a ‘version.’ With that power, you can correlate changes in inputs with changes in outputs. Systems that enable reproducibility, like version control, have another vital purpose—reducing risk stemming from changes and allow regression testing in systems that depend on data, such as ML. They reduce it by allowing for changes to be rolled back. 

To reproduce outputs from machine learning models, we need to do more than store data. We also need to store hyper-parameters, details about the OS, programming language, and packages, among other things. But given the primary value of reproducibility is instrumental—diagnosis—we not just want the ability to reproduce but also the ability to understand changes and correlate them. Current solutions miss the mark.

Current Solutions and Problems

One way to version data is to treat it as a binary blob. Store all the data you learned a model on to a server and store a reference to the data in your repository. If the data changes, store the new version and create a new pointer. One downside of using a <code>git lfs</code> like mechanism is that your storage blows up. Another is that build times can be large if the local cache is small or more generally if access costs are large. Yet another problem is the lack of a neat interface that allows you to track more than source data. 

DVC purports to solve all three problems. It solves the first by providing a way to not treat the data as a blob. For instance, in a computer vision workflow, the source data is image files with some elementary tags—labels, assignments to train and test, etc. The differences between data versions are 1) changes in images (additions mostly) and 2) changes in mapping to labels and assignments. DVC allows you to store the differences in corpora of images as a list of additional hashes to files. DVC is silent on the second point—efficient storage of changes in mappings. We come to it later. DVC purports to solve the second problem by allowing you to save to local cloud storage. But it can still be time-consuming to download data from the cloud storage buckets. The reason is as follows. Each time you want to work on an experiment, you need to clone the entire cache to check out the appropriate files. And if not handled properly, the cloning time often significantly exceeds typical training times. Worse, it locks you into a cloud provider for any optimizations you may want to alleviate these time-bound cache downloads. DVC purports to solve the last problem by using yaml, tags, etc. But anarchy prevails. 

Future Solutions

Interpretable Changes

One of the big problems with data versioning is that the diffs are not human-readable, much less comprehensible. The diffs are usually very long, and the changes in the diff are hashes, which means that to review an MR/PR/Diff, the reviewer has to check out the change and pull the data with the updated hashes. The process can be easily improved by adding an extra layer that auto-summarizes the changes into a human-readable form. We can, of course, easily do more. We can provide ways to understand how changes to inputs correlate with changes in outputs.

Diff. Tables

The standard method of understanding data as a blob seems uniquely bad. For conventional rectangular databases, changes can be understood as changes in functional transformations of core data lake tables. For instance, say we store the label assignments of images in a table. And say we revise the labels of 100 images. (The core data lake tables are immutable, so the changes are executed in the downstream tables.) One conventional way of storing the changes is to use a separate table for recording changes. Another is to write an update statement that is run whenever “the v2” table is generated. This means the differences across data are now tied to a data transformation computation graph. When data transformation is inexpensive, we can delay running the transformations till the table is requested. In other cases, we can cache the tables.