No Chance: The Problem with Re-Randomization

8 Aug

Randomization in social and clinical experiments is generally accepted as the “gold standard” for causal conclusions, because it balances baseline covariates across treatment groups on average, yielding unbiased causal effects. However, although randomization balances baseline covariates on average, it is possible that covariates could still be imbalanced just by random chance, compromising the validity of results. Although chance imbalance is often thought of as a rare, unlucky occurrence, it actually is quite common. For example, with only 10 independent covariates, there is a 40 percent chance that at least one will be significantly (using α = 0.05) different at baseline, just by random chance! Why subject your RCT to this kind of risk? If baseline covariates are thought to matter, balance should be checked at the time of randomization, before the experiment is conducted, and allocations yielding unacceptable balance should be eliminated.

Rerandomization (Morgan and Rubin, 2012) provides a way to avoid this chance imbalance for baseline covariates available at the time of randomization. Rerandomization works by checking balance at the time of randomization and rerandomizing if balance is unacceptable according to pre-specified criteria for acceptable balance. This process continues until an allocation with acceptable balance is achieved, and only then is treatment actually administered. When the criteria for acceptable balance is objective and specified in advance, and when treatment groups are equally sized, rerandomization maintains overall unbiasedness while also guarding against conditional bias due to chance imbalance. Thus we preserve the “gold standard” benefits of randomization, while avoiding detrimental chance imbalances; an idea Tukey (1993) called the “platinum standard.”

From https://healthpolicy.usc.edu/evidence-base/rerandomization-what-is-it-and-why-should-you-use-it-for-random-assignment/

The problem with re-randomization is the same as the problem with matching: you measure and increase the balance on observables, which could decrease the balance on unobserved variables that are ~ uncorrelated with the observed variables. (See here for a simple simulation.) If everyone does re-randomization to increase the balance on the same observed variables, the nice aspect of experimental inference—unbiasedness in expectation—may not hold. Caveat emptor!

Out of Network: The Tradeoffs in Using Network Based Targeting

1 Aug

In particular, in 521 villages in Haryana, we provided information on monthly immunization camps to either randomly selected individuals (in some villages) or to individuals nominated by villagers as people who would be good at transmitting information (in other villages). We find that the number of children vaccinated every month is 22% higher in villages in which nominees received the information.

From Banerjee et al. 2019

The buildings, which are social units, were randomized to (1) targeting 20% of the women at random, (2) targeting friends of such randomly chosen women, (3) targeting pairs of people composed of randomly chosen women and a friend, or (4) no targeting. Both targeting algorithms, friendship nomination and pair targeting, enhanced adoption of a public health intervention related to the use of iron-fortified salt for anemia.

Coupon redemption reports showed that unadjusted adoption rates were 13.6% (SE = 1.5%) in the friend-targeted clusters, 11.2% (SE = 1.4%) in pair-targeted clusters, 9.1% (SE = 1.3%) in the randomly targeted clusters, and 0% in the control clusters receiving no intervention.

From Alexander et al. 2022

Here’s a Twitter thread on the topic by Nicholas Christakis.

Targeting “structurally influential individuals,” e.g., people with lots of friends, people who are well regarded, etc., can lead to larger returns per ‘contact.’ This can be a useful thing. And as the studies demonstrate, finding these influential people is not hard—just ask a few people. There are, however, a few concerns:

  1. One of the concerns with any targeting strategy is that it can change who is treated. When you use network-based targeting, it biases the treated sample toward those who are more connected. That could be a good thing, especially if returns are the highest on those with the most friends, like in the case of curbing contagious diseases, or it could be a bad thing if the returns are the greatest on the least connected people. The more general point here is that most ROI calculations for network targeting have only accounted for costs of contact and assumed the benefits to be either constant or increasing in network size. One can easily rectify this by specifying the ROI function more fully or adding “fairness” or some kind of balance as a constraint.
  2. There is some stochasticity that stems from which person is targeted, and their idiosyncratic impact needs to be baked into standard error calculations for the ‘treatment,’ which is the joint of whatever the experimenters are doing and what the individual chooses to do with the experimenter’s directions (compliance needs a more careful definition). Interventions with targeting are liable to have thus more variable effects than without targeting and plausibly need to be reproduced more often before they used as policy.

Back to the Future: Engineering Without Data (Models)

18 Jul

Software engineering has changed dramatically in the last few decades. The rise of AWS, high-level languages, powerful libraries, and frameworks increasingly allow engineers to focus on business logic. Today, software engineers spend much of their time writing code that reasons over data to show something or do something. But how engineering is done has not caught up in some crucial ways:

  1.  Software Development Tools. Most data scientists today work in a notebook on a server where they heavily interact with the data as they refine the code (algorithm). Most engineers still work locally without access to production data. Part of the reason engineers don’t have access to the data is because they work locally—for security and compliance reasons, access to production data from the local machine is banned in most places. Plausibly, a bigger reason is that engineers are stuck in a paradigm where they don’t think access to production data is foundational to faster, higher-quality software development. This belief is reflected in the ad-hoc solutions to the problem that are being tried across the industry, e.g., synthetic data (which is hard to create, maintain, and scale).
  2. Data Modeling. The focus on data modeling has sharply decreased over time in many companies. There are at least four underlying forces behind this trend. First, the combination of the volume of the data being generated and the rise of cheap blob storage (combined with the fact that computing power is comparatively vastly more expensive today) incentivizes the storage of unstructured data. Second, agile development, which prioritizes customer-facing progress over short time units, may cause underinvestment in costly, foundational work (see here). Third, the engineering organizations are changing in that the producers of the data are no longer seen as owners of the data. The fourth and last point is perhaps the most crucial—the surfeit of data has led to some magical thinking about the ease with which data can be used to power insights. Our ability to derive business insights from unstructured and dirty data, except for a small minority of cases, e.g., search, doesn’t exist. The only thing the surfeit of data has done is that it has widened and deepened the pool of insights that can be delivered. It hasn’t made it any easier to derive those insights, which continue to rely on good old-fashioned manual work to understand the use case and curate and structure the data appropriately. (It also then becomes an opportunity for building software.)

    Engineers pay the price of not investing in data modeling by making the code more complex (and hence, more unmaintainable) and by allocating time to fix “bugs.” (The reason I put the word bugs in air quotes is because obvious consequences of a bad system should not be called bugs.)
  3. Data Drift. Machine Learning Engineers (MLEs) obsess about it. Most other engineers haven’t ever heard of the term. Everyone should worry. Technically, the only difference between using ML and engineering for rule creation is that ML auto-creates rules while conventional engineering relies on handcrafting the rules. Both systems test the efficacy of their rules on the current data. Both systems assume that the data will not drift. Only MLEs monitor the data, thinking hard about what data the rules work for and how to monitor data drift. Other engineers need to sign up.

The solutions are as simple as the problems are immense: invest in data quality, data monitoring, and data models. To achieve that, we need to change how organizations are structured, how they are run, and what engineers think the hard problems are.

Noise: A Flaw in Book Writing

10 Jul

This is a review of Noise, A Flaw in Human Judgment by Kahneman, Sibony, and Sunstein.

The phrase “noise in decision making” brings to mind “random” error. Scientists, however, shy away from random error. Science is mostly about systematic error, except, perhaps, quantum physics. So Kahneman et al. conceive of noise as seemingly random error that is a result of unmeasured biases. For instance, research suggests that heat causes bad mood. And bad mood may, in turn, cause people to judge more harshly. If this were to hold, the variability in judging stemming from the weather can end up being interpreted as noise. But, as is clear, there is no “random” error, merely bias. Kahneman et al. make a hash of this point. Early on, they give the conventional formula of total expected error as the sum of bias and variance (they don’t further decompose variance into irreducible error and ‘random’ error) with the aim of talking about the two separately, and naturally, never succeed in doing that.

The conceptual issues ought not detract us from the important point of the book. It is useful to think about human judgment systems as mathematical functions. We should expect the same inputs to map to the same output. It turns out that it isn’t even remotely true in most human decision-making systems. Take insurance underwriting, for instance. Given the same data (realistic but made-up information about cases), the median percentage difference between quotes between any pair of underwriters is an eye-watering 55% (which means that for half of the cases, it is worse than 55%), about five times as large as expected by the executives. There are a few interesting points that flow from this data. First, if you are a customer, your optimal strategy is to get multiple quotes. Second, what explains ignorance about the disagreement? There could be a few reasons. First, when people come across a quote from another underwriter, they may ‘anchor’ their estimate on the number they see, reducing the gap between the number and the counterfactual. Second, colleagues plausibly read to agree—less effort and optimizing for collegiality, asking, “Could this make sense?”, than read to evaluate, “Does this make sense?” (see my notes for a fuller set of potential explanations.)

Data from asylum reviews is yet starker. “A study of cases that were randomly allotted to different judges found that one judge admitted 5% of applicants, while another admitted 88%.” (Paper.)

Variability can stem from only two things. It could be that the data doesn’t allow for a unique judgment (irreducible error). (But even here, the final judgment should reflect the uncertainty in the data.) Or that at least one person is ‘wrong’ (has a different answer than others). Among other things, this can stem from:

  1. variation in skill, e.g., how to assess patent applications
  2. variation in effort, e.g., some people put more effort than others
  3. agency and preferences, e.g., I am a conservative judge, and I can deny an asylum application because I have the power to do so
  4. biases like using irrelevant information, e.g., weather, hypoglycemia, etc.

(Note: a lack of variability doesn’t mean we are on to the right answer.)

The list of proposed solutions is extensive—from selecting better judges to the wisdom of the crowds to using models to training people better to more elaborate schemes like dividing the decision task and asking people to make relative than absolute judgments. The evidence backing the solutions is not always hefty, which meshes with the ideolog-like approach to evidence present everywhere in the book. When I did a small audit of the citations, three things stood out (the overarching theme is adherence to the “No Congenial Result Scrutinized or Left Uncited Act”):

  1. Extremely small n studies cited without qualification. Software engineers.
    Quote from the book: “when the same software developers were asked on two separate days to estimate the completion time for the same task, the hours they projected differed by 71%, on average.”
    The underlying paper: “In this paper, we report from an experiment where seven experienced software professionals estimated the same sixty software development tasks over a period of three months. Six of the sixty tasks were estimated twice.
  2. Extremely small n studies cited without qualification. Israeli Judges.
    Hypoglycemia and judgment: “Our data consist of 1,112 judicial rulings, collected over 50 d in a 10-mo period, by eight Jewish-Israeli judges (two females) who preside over two different parole boards that serve four major prisons in Israel.”
  3. Surprising but likely unreplicable results. “When calories are on the left, consumers receive that information first and evidently think “a lot of calories!” or “not so many calories!” before they see the item. Their initial positive or negative reaction greatly affects their choices. By contrast, when people see the food item first, they apparently think “delicious!” or “not so great!” before they see the calorie label. Here again, their initial reaction greatly affects their choices. This hypothesis is supported by the authors’ finding that for Hebrew speakers, who read right to left, the calorie label has a significantly larger impact..” (Paper.)
    “We show that if the effect sizes in Dallas et al. (2019) are representative of the populations, a replication of the six studies (with the same sample sizes) has a probability of only 0.014 of producing uniformly significant outcomes.” (Paper.)
  4. Citations to HBR. Citations to think pieces in Harvard Business Review (10 citations in total based on a keyword search) and books like ‘Work Rules!’ for a fair many claims.

Here are my notes for the book.

Building Code: Making Government Code Publicly Available

16 May

Very little of the code that the government pays for is open-sourced. One of the reasons is that private companies would rather the code remain under wraps so that the errors never come to light, the price for producing software is never debated, and they get to continue to charge for similar work elsewhere.

Open-sourcing code is liable to produce the following benefits:

  1. It will help us discover bugs.
  2. It will reduce the cost of building similar software. In a federal system, many local agencies produce (or buy) similar software to help administer similar services. Having the code open-sourced is likely to reduce the barrier to entry for firms bidding to build such software and will likely lead to lower costs over time.
  3. Freely available software under a generous license, e.g., queue management software, optimal staffing software, etc., benefits the economy as firms do not have to invest as much in building such systems.
  4. It will likely increase trust in the government. For instance, where software is used to estimate benefits, the auditability of the software is likely to lead to a modest increase in confidence in the correctness of how the law has been translated into code.

There are at least three ways to open-sourcing government code. First, firms like OpenGov that produce open-source software for the government are already helping bring some of the code online. But given that the space for government software is large, it will likely take many decades for a tangible proportion of software to be open-sourced. Second, we can lobby the government to change the law so that companies (and agencies) are mandated to open source certain software they build for the government. But the prognosis is bleak, given that the government contractors are likely lobbying hard against it. The third option is to use FOIA to request code and make it available on Github. I sense that this is a tenable option.

Sampling Domain Knowledge

15 May

Say that we want to measure how often people go to risky websites. Let’s assume that the measurement of risk is expensive. We have data on how often people visit each domain on the web from a large sample. The number of unique domains in the data is large, making measuring the population of domains impossible. Say there is a sharp skew in the visitation of domains. What is the fewest number of domains we need to measure to get s.e. of no greater than X per row?

Here are some ideas:

  1. The base solution is simple: sample domains in each row (with replacement) in proportion to views/time to get to the desired s.e. Then, collate the selected domains and get labels for those.
  1. Exploit the skew in the distribution. For instance, sample from 99% of the distribution and save yourself from the long tail. Bound each estimate by the unsampled 1% (which could be anything) and enjoy. For greater accuracy, do a smaller, cruder sample of the 1% and get to the +/- 10% with an n = 100. The full version of this point is as follows: we benefit from increasing the probability of including more frequently occurring domains. Taken to the extremum, you could deterministically include the most frequent domains, and then prorate the size of the sample for the rest by the size of the area under the curve. This kind of strategy can help answer: how to optimally sample skewed distributions to get the smallest s.e. with the fewest observations?

Data Police

13 Mar

In a new paper, Chohlas-Wood et al. present three interesting points:

  1. Some of the major policing strategies have scant empirical support:
    • The impact of “pulling over drivers for minor traffic violations” (for the alleged purpose of “[preventing] criminal activity by intercepting individuals driving to and from the scene of a crime”) in Nashville was ~ 0 on serious crimes. (See Figures 1 and 2). To get a sense of the scale of the intervention: “In 2012, the MNPD conducted traffic stops up to ten times more frequently per capita than police departments in similar U.S. cities.”
    • The impact of stop and frisk in NYC on serious crime was also ~ 0. Again, to get a sense of the scale of the policy: “NYPD officers reported conducting nearly 700,000 Terry stops in 2011 alone, nearly 90% of which involved Black or Hispanic pedestrians.”
    • GS: None of this is terribly surprising. All over the world, very few policies are chosen as a result of careful data analysis. Why would policing be any different? My other prior based on looking at a fair bit of US crime data is that to a first approximation, all trends are national. When policing is local and trends are national, it suggests that the way policing is done is perhaps not the most important factor in preventing crime.
  2. Racial bias in who is stopped:
    • “[A]t any given level of risk Black and Hispanic individuals were frisked considerably more often than white individuals.” (NYC, 2011-2012)
    • “[T]he rates at which frisks recover weapons are significantly lower for frisked Black individuals (3.8%) and Hispanic individuals (3.4%) compared to white individuals (5.7%).” (From the Chicago Police Department (CPD) in 2017)
    • Contraband recovery rate for Blacks = 17%, Hispanics = 20%, Whites = 27% (Chicago 2014–2019, traffic stops.)
    • Contraband recovery rate for Blacks = 24%, Hispanics = 23%, Whites = 34% (Philadelphia 2014–2019; traffic stops.)
    • GS: I am impressed by the contraband recovery rates. Either the base rate of ‘contraband’ is super high or the police is very good. My hunch is the former but would love to see data. (See below.)
    • GS: If police select who to stop based on observable characteristics (conditional on location; what else can they rely on?), criminals may be incentivized to game that reducing the value of observables over time.
  3. Whack-a-mole nature of policing policies
    • “The settlement agreement with the ACLU took effect on January 1, 2016.85 For 2016, the CPD reported a total of approxi-mately 100,000 pedestrian stops, a sharp drop from the roughly 600,000 stops reported for 2015 (Figure 9).86 At the same time, the number of traffic stops made by the CPD began to rise. The CPD reported around 100,000 traffic stops in 2014 and a similar amount in 2015, but by 2019, the CPD reported nearly 600,000 traffic stops, with large increases occurring each year from 2016 to 2019. These traffic stops came to closely resemble the pedestrian stops that the CPD was contemporaneously under pressure to curtail. …”
    • Following a consent decree and settlement in 2011, pedestrian stops fell from more than 200,000 reported stops in 2014 (the earliest year for which we have data released publicly by the city) to fewer than 100,000 reported stops in each of 2018 and 2019, while traffic stops almost doubled in the same period”

p.s. Graham sends this:

“Back in the 1990s, it looked like the Supreme Court was going to run drug checkpoints, so Indianapolis started doing one. Drivers were stopped completely at random until the Supreme Court put an end to it.

The city conducted six such roadblocks between August and November that year, stopping 1,161 vehicles and arresting 104 motorists. Fifty-five arrests were for drug-related crimes, while 49 were for offenses unrelated to drugs. The overall “hit rate” of the program was thus approximately nine percent.

If you take this as a baseline, police are twice as good at finding contraband as random selection. If “contraband” just means drugs, then probably four times as good. So the baseline rate of contraband is high (a surprising number of people have warrants, drugs, and weapons) but police are also beating the odds.”

Chicago is not Indianapolis and 2015 is not 2000 but still valuable.

p.p.s. Graham also highlights an issue with Figure 2. Chohlas-Wood et al. plot the murder rate per 1k on the same graphs as vehicle stops per 1k. This naturally squishes the variation in the murder rate. The general rule is that you should avoid plotting variables that vary by orders of magnitude on the same graph. At any rate, doing so gives the appearance that the authors are putting a thumb on the scale.

Pastations: Is it Time to Move Beyond Presentations?

26 Dec

In an influential essay, The Cognitive Style of PowerPoints, Tufte argues that (PowerPoint) presentations are unsuitable for serious problems. The essay is largely polemical, with Tufte freely mixing points about affordances of the medium with criticisms of bad presentations and lazy broadsides.

Hilarious stuff first:

  1. “All 3 reports have standard PP format problems: elaborate bullet outlines; segregation of words and numbers (12 of 14 slides with quantitative data have no accompanying analysis); atrocious typography; data imprisoned in tables by thick nets of spreadsheet grids; only 10 to 20 short lines of text per slide.”
  2. “On this single Columbia slide, in a PowerPoint festival of bureaucratic hyper-rationalism, 6 different levels of hierarchy are used to classify, prioritize, and display 11 simple sentences”
  3. “In 28 books on PP presentations, the 217 data graphics depict an average of 12 numbers each. Compared to the worldwide publications shown in the table at right, the statistical graphics based on PP templates are the thinnest of all, except for those in Pravda back in 1982, when that newspaper operated as the major propaganda instrument of the Soviet communist party and a totalitarian government.”

In the essay, I could only rescue two points about affordances (that I buy):

  1. “When information is stacked in time, it is difficult to understand context and evaluate relationships.”
  2. Inefficiency: “A talk, which proceeds at a pace of 100 to 160 spoken words per minute, is not an especially high resolution method of data transmission. Rates of transmitting visual evidence can be far higher. … People read 300 to 1,000 printed words a minute, and find their way around a printed map or a 35mm slide displaying 5 to 40 MB in the visual field. Yet, in a strange reversal, nearly all PowerPoint slides that accompany talks have much lower rates of information transmission than the talk itself. As shown in this table, the PowerPoint slide typically shows 40 words, which is about 8 seconds worth of silent reading material. The slides in PP textbooks are particularly disturbing: in 28 textbooks, which should use only first-rate examples, the median number of words per slide is 15, worthy of billboards, about 3 or 4 seconds of silent reading material. This poverty of content has several sources. First, the PP design style, which typically uses only about 30% to 40% of the space available on a slide to show unique content, with all remaining space devoted to Phluff, bullets, frames, and branding. Second, the slide projection of text, which requires very large type so the audience can read the words.”

From Working Backwards, which cites the article as the reason Amazon pivoted from presentations to 6-pagers for its S-team meetings, there is one more reasonable point about presentations more generally:

“…the public speaking skills of the presenter, and the graphics arts expertise behind their slide deck, have an undue—and highly variable—effect on how well their ideas are understood.”

Working Backwards

The points about graphics arts expertise, etc., apply to all documents but are likely less true for reports than presentations. (It would be great to test the effect of the prettiness of graphics on their persuasiveness.)

Reading the essay made me think harder about why we use presentations in meetings about complex topics more generally. For instance, academics frequently present to other academics. Replacing presentations with 6-pagers that people quietly read and comment on at the start of the meeting and then discuss may yield higher quality comments and better discussion and better evaluation of the scholar (and the scholarship).

p.s. If you haven’t seen Norvig’s Gettysburg Address in PowerPoint, you must.

p.p.s. Ed Haertel forwarded me this piece by Sam Wineburger on why asking students to create powerpoints is worse than asking them to write an essay.

p.p.p.s. Here’s how Amazon runs its S-team meetings (via Working Backwards):

1. 6-pager (can have appendices) distributed at the start of the discussion.

2. People read in silence and comment for the first 20 min.

3. Rest 40 min. devoted to discussion, which is organized by 1. big issues/small issues, 2. people going around the room, etc.

4. One dedicated person to take notes.

Homing in on the Home Advantage

19 Dec

A recent piece on ESPNCricinfo analyses the DRS data and argues that cricket should do away with neutral umpires. I reanalyzed the data.

If a game is officiated by a home umpire, we expect the following:

  1. Hosts will appeal less often as they are likely to be happier with the decision in the first place
  2. When visitors appeal a decision, their success rate should be higher than the hosts. Visitors are appealing against an unfavorable call—a visiting player was unfairly given out or they felt the host player was unfairly given not out. And we expect the visitors to get more bad calls.

When analyzing success rate, I think it is best to ignore appeals that are struck down because they defer to the umpire’s call. Umpire’s call generally applies to LBW decisions, and especially two aspects of the LBW decision: 1. whether the ball was pitching in line, 2. whether it was hitting the wickets. To take a recent example, in the second test of the 2021 Ashes series, Lyon got a wicket when the impact was ‘umpire’s call’ and Stuard Broad was denied a wicket for the same reason.

Ollie Robinson Unsuccessfully Challenging the LBW Decision

Stuart Broad Unsuccessfully Challenging the Not-LBW Decision

With the preliminaries over, let’s get to the data covered in the article. Table 1 provides some summary statistics of the outcomes of DRS. As is clear, the visiting team appealed the umpire’s decision far more often than the home team: 303 vs. 264. Put another way, the visiting team lodged nearly one more appeal per test than the home team. So how often did the appeals succeed? In line with our hypothesis, the home team appeals were upheld less often (24%) than visiting team’s appeals (29%).

Table 1. Review Outcomes Under Home Umpires. 41 Tests. July 2020–Nov. 2021.

REVIEWER TYPETOTAL PLAYER REVIEWSSTRUCK DOWN (%)UMPIRE’S CALL (STRUCK DOWN) (%)UPHELD (%)
HOME BATTING9639 (40%)25 (26%)32 (34%)
HOME BOWLING168108 (64%)29 (18%)31 (18%)
VISITOR BATTING14758 (39%)25 (17%)64 (44%)
VISITOR BOWLING15697 (62%)34 (22%)25 (16%)
Data From ESPNCricinfo

It could be the case that these results are a consequence of something to do with host vs. visitor than home umpires. For instance, hosts win a lot, and that generally means that they will bowl for shorter periods of time and bat for longer periods of time. We account for this by comparing outcomes under neutral umpires. The article has data on the same. There, you see that the visiting team makes fewer appeals (198) than the home team (214). And the visiting team’s success rate in appeals is slightly lower (29%) than the home team’s rate (30%).

p.s.

At the bottom of the article is another table that breaks down reviews by host country:

HOST COUNTRYTESTSUMPIRESREVIEWSHOSTS’ SUCCESS (%)VISITORS’ SUCCESS (%)
ENGLAND13AG WHARF, MA GOUGH*, RA KETTLEBOROUGH*, RK ILLINGWORTH*19022/85 (26%)32/105 (30%)
NEW ZEALAND4CB GAFFANEY*, CM BROWN, WR KNIGHTS413/17 (18%)5/24 (21%)
AUSTRALIA4BNJ OXENFORD, P WILSON, PR REIFFEL*555/30 (17%)6/25 (24%)
SOUTH AFRICA2AT HOLDSTOCK, M ERASMUS*202/10 (20%)3/10 (30%)
SRI LANKA6HDPK DHARMASENA*, RSA PALLIYAGURUGE859/42 (21%)13/43 (30%)
PAKISTAN2AHSAN RAZA, ALEEM DAR*270/11 (0%)6/16 (38%)
INDIA5AK CHAUDHARY, NITIN MENON*, VK SHARMA879/40 (23%)11/47 (23%)
WEST INDIES6GO BRATHWAITE, JS WILSON*9413/50 (26%)13/44 (29%)
Data from ESPNCricinfo

But the data doesn’t match the one in the table above. For one, the number of tests considered is 42 than 41. For two, and perhaps relatedly, the total number of reviews is 599 than 567. To be comprehensive, let’s do the same calculations as above. The visiting team appeals more (314) than the host team (285). The host team success rate is 22% (63/285), and the visiting team success rate is 28% (89/314). If you were to do a statistical test for success rates:

 prop.test(x = c(63, 89), n = c(285, 314))

        2-sample test for equality of proportions with continuity correction

data:  c(63, 89) out of c(285, 314)
X-squared = 2.7501, df = 1, p-value = 0.09725
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.13505623  0.01028251
sample estimates:
   prop 1    prop 2 
0.2210526 0.2834395 

Nextdoor

28 Nov

The KNN classifier is one of the most intuitive ML algorithms. It predicts class by polling k nearest neighbors. Because it seems so simple, it is easy to miss a couple of the finer points:

  1. Sample Splitting: Traditionally, when we split the sample, there is no peeking across samples. For instance, when we split the sample between a train and test set, we cannot look at the data in the training set when predicting the label for a point in the test set. In knn, this segregation is not observed. Say we partition the training data to learn the optimal k. When predicting a point in the validation set, we must pass the entire training set. Passing the points in the validation set would be bad because then the optimal k will always be 0. (If you ignore k = 0, you can pass the rest of the dataset.)
  2. Implementation Differences: “Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.” (see here; emphasis mine.)

    This matters when the distance metric is discrete, e.g., if you use an edit-distance metric to compare strings. Worse, scikit-learn doesn’t warn users during analysis.

    In R, one popular implementation of KNN is in a package called class. (Overloading the word class seems like a bad idea but that’s for a separate thread.) In class, how the function deals with this scenario is decided by an explicit option: “If [the option is] true, all distances equal to the kth largest are included. If [the option is] false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.”

    For the underlying problem, there isn’t one clear winning solution. One way to solve the problem is to move from knn to adaptive knn: include all points that are as far away as the kth point. This is what class in R does when the option all.equal is set to True. Another solution is to never change the order in which the data are accessed and to make the order as part of how the model is exported.

Fooled by Randomness

28 Sep

Permutation-based methods for calculating variable importance and interpretation are increasingly common. Here are a few common places where they are used:

Feature Importance (FI)

The algorithm for calculating permutation-based FI is as follows:

  1. Estimate a model
  2. Permute a feature
  3. Predict again
  4. Estimate decline in predictive accuracy and call the decline FI

Permutation-based FI bakes in a particular notion of FI. It is best explained with an example: Say you are calculating FI for X (1 to k) in a regression model. Say you want to estimate FI of X_k. Say X_k has a large beta. Permutation-based FI will take the large beta into account when calculating the FI. So, the notion of importance is one that is conditional on the model.

Often we want to get at a different counterfactual: If we drop X_k, what happens. You can get to that by dropping and re-estimating, letting other correlated variables get large betas. I can see a use case in checking if we can knock out say an ‘expensive’ variable. There may be other uses.

Aside: To my dismay, I kludged the two together here. In my defense, I thought it was a private email. But still, I was wrong.

Permutation-based methods are used elsewhere. For instance:

Creating Knockoffs

We construct our knockoff matrix X˜ by randomly swapping the n rows of the design matrix X. This way, the correlations between the knockoffs remain the same as the original variables but the knockoffs are not linked to the response Y. Note that this construction of the knockoffs matrix also makes the procedure random.

From https://arxiv.org/pdf/1907.03153.pdf#page=4

Local Interpretable Model-Agnostic Explanations

The recipe for training local surrogate models:

Select your instance of interest for which you want to have an explanation of its black box prediction.

Perturb your dataset and get the black box predictions for these new points.

Weight the new samples according to their proximity to the instance of interest.

Train a weighted, interpretable model on the dataset with the variations.

Explain the prediction by interpreting the local model.

From https://christophm.github.io/interpretable-ml-book/lime.html

Common Issue With Permutation Based Methods

“Another really big problem is the instability of the explanations. In an article 47 the authors showed that the explanations of two very close points varied greatly in a simulated setting. Also, in my experience, if you repeat the sampling process, then the explantions that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.”

From https://christophm.github.io/interpretable-ml-book/lime.html

Solution

One way to solve instability is to average over multiple rounds of permutations. It is expensive but the payoff is stability.

Monetizing Bad Models: Pay Per Correct Prediction

26 Sep

In many ML applications, especially ones where you need to train a model on customer data to get high levels of accuracy, the only models that ML SaaS companies can offer to a client out-of-the-box are bad. But many ML SaaS businesses hesitate to go to a client with a bad model. Part of the reason is that companies don’t understand that they can deliver value with a bad model. In many places, you can deliver value with a bad model by deploying a high-precision version, only offering predictions where you are highly confident. Another reason why ML SaaS companies likely hesitate is a lack of a reasonable pricing model. There, charging per correct response with some penalty for an incorrect answer may prove a good option. (If you are the sole bidder, setting the price just below the marginal cost of getting a human to label a response plus any additional business value from getting the job done more quickly may be one fine place to start.) Having such a pricing model is likely to reassure the client that they won’t be charged for the glamour of having an ML model and instead will only be charged for the results. (There is, of course, an upfront cost of switching to an ML model, which can be reasonably high and that cost needs to be assessed in terms of potential payoff over a long term.)

Interpreting Data

26 Sep

It is a myth that data speaks for itself. The analyst speaks for the data. The analyst chooses what questions to ask, what analyses to run, how the analyses are interpreted, and how they are summarized. I use excerpts from a paper by Gilliam et al. on media portrayal of crime as a way to highlight one set of choices by a group of analysts. (The excerpts also highlight the need for reading a paper fully than relying on the abstract alone.)

Abstract

From Gilliam et al.; Abstract.

White Violent Criminals Are Overrepresented

From Gilliam et al.; Bottom of page 10.

White Nonviolent Criminals Are Overrepresented

From Gilliam et al.; first paragraph on page 12

Relative Underrepresentation Between Violent and Nonviolent Crime is a Problem

From Gilliam et al.; Last paragraph on page 12
From Gilliam et al.; First paragraph on page 13

The Nonscience of Machine Learning

29 Aug

In 2013, Girshick et al. released a paper that described a technique to solve an impossible-sounding problem—classifying each pixel of an image (or semantic segmentation). The technique that they proposed, R-CNN, combines deep learning, selective search, and SVM. It also has all sorts of ad hoc choices, from the size of the feature vector to the number of regions, that are justified by how well they work in practice. R-CNN is not unusual. Many machine learning papers are recipes that ‘work.’ There is a reason for that. Machine learning is an engineering discipline. It isn’t a scientific one. 

You may think that engineering must follow science, but often it is the other way round. For instance, we learned how to build things before we learned the science behind it—we trialed-and-errored and overengineered our way to many still standing buildings while the scientific understanding slowly accumulated. Similarly, we were able to predict the seasons and the phases of the moon before learning how our solar system worked. Our ability to solve problems with machine learning is similarly ahead of our ability to put it on a firm scientific basis.

Often, we build something based on some vague intuition, find that it ‘works,’ and only over time, deepen our intuition about why (and when) it works. Take, for instance, Dropout. The original paper (released in 2012, published in 2014) had the following as motivation:

A motivation for Dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Sexual reproduction involves taking half the genes of one parent and half of the other, adding a very small amount of random mutation, and combining them to produce an offspring. The asexual alternative is to create an offspring with a slightly mutated copy of the parent’s genes. It seems plausible that asexual reproduction should be a better way to optimize individual fitness because a good set of genes that have come to work well together can be passed on directly to the offspring. On the other hand, sexual reproduction is likely to break up these co-adapted sets of genes, especially if these sets are large and, intuitively, this should decrease the fitness of organisms that have already evolved complicated coadaptations. However, sexual reproduction is the way most advanced organisms evolved. …

Srivastava et al. 2014, JMLR

Moreover, the paper provided no proof and only some empirical results. It took until Gal and Ghahramani’s 2016 paper (released in 2015) to put the method on a firmer scientific footing.

Then there are cases where we have made ad hoc choices that ‘work’ and where no one will ever come up with a convincing theory. Instead, progress will mean replacing bad advice with good. Take, for instance, the recommended step of ‘normalizing’ variables before doing k-means clustering or before doing regularized regression. The idea of normalization is simple enough: put each variable on the same scale. But it is also completely weird. Why should we put each variable on the same scale? Some variables are plausibly more substantively important than others and we ideally want to prorate by that.

What Can We Learn?

The first point is about teaching machine learning. Bricklaying is thought to be best taught via apprenticeship. And core scientific principles are thought to be best taught via books and lecturing. Machine learning is closer to the bricklaying end of the spectrum. First, there is a lot in machine learning that is ad hoc and beyond scientific or even good intuitive explanation and hence taught as something you do. Second, there is plausibly much to be learned in seeing how others trial-and-error and come up with kludges to fix the issues for which there is no guidance.

The second point is about the maturity of machine learning. Over the last few decades, we have been able to accomplish really cool things with machine learning. And these accomplishments detract us from how early we are. The fact is that we have been able to achieve cool things with very crude tools. For instance, OOS validation is a crude but very commonly used tool for preventing overfitting—we stop optimization when the OOS error starts increasing. As our scientific understanding deepens, we will likely invent better tools. The best of machine learning is a long way off. And that is exciting.

Explore Exclusive Exploitation: The Billboard Top 100 Method of Learning

14 Jul

The optimism in Internet browsing is palpable. Browse long enough, and you will have a ‘hit.’ Like gambling, which it mimics, lay browsing is a losing proposition. A better way to spend your time is to focus on known knowns—excellent teachers, communicators, etc.—and core ideas, insights, and big hits of a discipline (along with learning how disciplines solve problems). The rationale for the first is obvious. The rationale for the second point is three folds:

  1. Because we often scavenge information, many of us are not well versed in the core principles of the discipline (and adjacent disciplines) we purport to specialize in or want to learn about.
  2. The core ideas, the big hits, etc., by their very nature, are important and illuminating.
  3. Many of these big ideas are accessible, partly because people have spent time thinking about ways to communicate the points. So you will find excellent distillations of the points, and you will find that many of these ideas are on your knowledge frontier (things you can learn immediately). 

Or, if you are disciplined enough, focus relentlessly on finding new things in a narrow niche. Going from gambling to anything else is not easy. The highs won’t be as high. But the average high and ROI will be a lot greater.

Fairly Certain: Using Uncertainty in Predictions to Diagnose Roots of Unfairness

8 Jul

One conventional definition of group fairness is that the ML algorithms produce predictions where the FPR (or FNR or both) is the same across groups. Fixating on equating FPR etc. can harm the very groups we are trying to help. So it may be useful to rethink how to solve the problem of reducing unfairness.

One big reason why the FPR may vary across groups is that, given the data, some groups’ outcomes are less predictable than others. This may be because of the limitations of the data itself or because of the limitations of algorithms. For instance, Kearns and Roth in their book bring up the example of college admissions. The training data for college admissions is the decisions made by college counselors. College counselors may well be worse at predicting the success of minority students because they are less familiar with their schools, groups, etc., and this, in turn, may lead to algorithms performing worse on minority students. (Assume the algorithm to be human decision-makers and the point becomes immediately clear.)

One way to address worse performance may be to estimate the uncertainty of the prediction. This allows us to deal with people with wider confidence bounds separately from people with narrower confidence bounds. The optimal strategy for people with wider confidence bounds people may be to collect additional data to become more confident in those predictions. For instance, Komiyama and Noda propose something similar (pdf) to help overcome a lack of information during hiring. Or we may need to figure out a way to compensate people based on their uncertainty interval. 

The average width of the uncertainty interval across groups may also serve as a reasonable way to diagnose this particular problem.

Optimal Data Collection When Strata and Strata Variances Are Known

8 Jul

With Ken Cor.

What’s the least amount of data you need to collect to estimate the population mean with a particular standard error? For the simplest case—estimating the mean of a binomial variable using simple random sampling, a conservative estimate of the variance (p=.5), and a ±3 confidence interval—the answer (n∼1,000) is well known. The simplest case, however, assumes little to no information. Often, we know more. In opinion polling, we generally know sociodemographic strata in the population. And we have historical data on the variability in strata. Take, for instance, measuring support for Mr. Obama. A polling company like YouGov will usually have a long time series, including information about respondent characteristics. Using this data, the company could derive how variable the support for Mr. Obama is among different sociodemographic groups. With information about strata and strata variances, we can often poll fewer people (vis-a-vis random sampling) to estimate the population mean with a particular s.e. In a note (pdf), we show how.

Why bother?

In a realistic example, we find the benefit of using optimal allocation over simple random sampling is 6.5% (see the code block below).

Assuming two groups a and b, and using the notation in the note (see the pdf)—wa denotes the proportion of group a in the population, vara and varb denote the variances of group a and b respectively and letting p denote sample mean, we find that if you use the simple random sampling formula, you will estimate that you need to sample 1095 people. If you optimally exploit the information about strata and strata variances, you will just need to sample 1024 people.

## The Benefit of Using Optimal Allocation Rules
## wa = .8
## vara = .25; pa = .5
## varb = .16; pb = .8
## SRS: pop_mean of .8*.5 + .2*.8 = .56
   
# sqrt(p(1 -p)/n) = .015
# n = p*(1- p)/.015^2 = 1095

# optimal_n_plus_allocation(.8, .25, .16, .015)
#   n   na   nb 
#1024  853  171

Github Repo.: https://github.com/soodoku/optimal_data_collection/

Beyond yhat: Developing ML Products

7 Jul

Making useful products is hard. Making useful ML products is harder still, in part because there are a larger number of moving parts in an ML system. To understand the issues at stake, let’s go over the **basics** of developing an ML product.

Often, product development starts with a business problem. And your first job is to understand the business problem as well as you can, familiarizing yourself with as much detail as possible.

Let’s say the problem is as follows: A company gets a lot of customer emails. All the emails go to a common inbox from which specialist customer agents fish out emails that are relevant to them. For instance, finance specialists fish out billing emails. And technical specialists fish out emails about technical errors. Fishing is time-consuming and chaotic.

Once you understand the precise problem—time taken to discover and assign emails—work on developing solutions for the problem. When developing solutions, the bias should be toward solving the problem the best way possible than injecting custom ML into whatever solution you propose. For instance, you could propose a solution that makes it easier to search (using no ML or off-the-shelf ML) and bulk assign to a new queue. But let’s say that after careful consideration of costs and benefits, a particularly appealing solution is a system that uses machine learning to automatically direct relevant emails to specialist inboxes, obviating the need to fish. That’s a start to a solution, not the end. You need to spend enough time thinking about the solution so that you have thought about how to handle edge cases, e.g., when there is a technical issue about billing, a misclassified email, etc., and any spillover issues, like the latency of such a system, how implementing such a system may break existing data pipelines that measure the total number of emails, etc.

Next, you need to define the KPIs. How much time will be saved? What is the total cost of the saved time? How many mistakes is the system making? What is the cost of handling mistakes?

Next, you need to turn the business problem into a precise machine learning problem. What labels would you predict? How would you collect the initial labels?

Once the outline of the solution has been agreed upon, you need to don your architect’s hat and outline a system diagram. Wearing the data engineer’s hat, figure out where the data needed for training and for live classification is stored, and how you would build a pipeline for training and serving the model. This is also the time to understand what guarantees, if any, exist on the data, and how you can test those guarantees.

Right next to the data engineer’s hat is the modeler’s hat. Wearing that, you must decide what algorithms you want to run, etc. how to version control your models, etc. The ML modeler’s hat also directs your attention to your plan for how to improve the model. Machine learning is an elaborate system to learn from your losses and you must design a system to continuously learn from your errors. More precisely, you must answer what is your system in place to improve your model? There is a pipeline for that: 1. Learn about your losses: from feedback, errors, etc. 2. Understand your losses: error analysis, etc., 3. Reducing your losses: new data collection, fixing old data, diff. models, objective functions, and 4. Testing: A/B testing, offline testing, etc.

Last, you must wear an operator’s hat. Wearing that you answer the operational nitty-gritty of how to introduce a new product. This is the time when you work with stakeholders to stand up dashboards to monitor the system, develop a rollout strategy, and a rollback strategy, a dashboard for monitoring A/B tests, etc.

The key to wearing an architect’s hat is to not only designing a system but also to make sure that enough logging is in place for different parts of the system for you to triage failures. So part of the dashboard would display logs from different parts of the system.

Equilibrium Fairness: How “Fair” Algorithms Can Hurt Those They Purport to Help

7 Jul

One definition of a fair algorithm is an algorithm that yields the same FPR across groups (an example of classification parity). To achieve that, we often have to trade in some accuracy. The final model is thus less accurate but fair. There are two concerns with such models:

  1. Net Harm Over Relative Harm: Because of lower accuracy, the number of people from a minority group that are unfairly rejected (say for a loan application) may be a lot higher. (This is ignoring the harm done to other groups.) 
  2. Mismeasuring Harm? Consider an algorithm used to approve or deny loans. Say we get the same FPR across groups but lower accuracy for loans with a fair algorithm. Using this algorithm, however, means that credit is more expensive for everyone. This, in turn, may cause fewer people of the vulnerable group to get loans as the bank factors in the cost of mistakes. Another way to think about the point is that using such an algorithm causes net interest paid per borrowed dollar to increase by some number. It seems this common scenario is not discussed in many of the papers on fair ML. One reason for that may be that people are fixated on who gets approved and not the interest rate or total approvals.

No Stopping: Impact of the Stopping Rule on the Sex Ratio

20 Jun

For social scientists brought up to worry about bias stemming from stopping data collection when results look significant, the fact that a gender based stopping rule has no impact on the sex ratio seems suspect. So let’s dig deeper.

Let there be n families and let the stopping rule be that after the birth of a male child, the family stops procreating. Let p be the probability a male child is born and q=1−p

After 1 round: 

\[\frac{pn}{n} = p\]

After 2 rounds: 

\[ \frac{(pn + qpn)}{(n + qn)} = \frac{(p + pq)}{(1 + q)} = \frac{p(1 + q)}{(1 + q)} \]

After 3 rounds: 

\[\frac{(pn + qpn + q^2pn)}{(n + qn + q^2n)}\\ = \frac{(p + pq + q^2p)}{(1 + q + q^2)}\]

After k rounds: 

\[\frac{(pn + qpn + q^2pn + … + q^kpn)}{(n + qn + q^2n + \ldots q^kn)} \]

After infinite rounds:

Total male children: 

\[= pn + qpn + q^2pn + \ldots\\ = pn (1 + q + q^2 + \ldots)\\ = \frac{np}{(1 – q)}\]

Total children:

\[= n + qn + q^2n + \ldots\\ = n (1 + q + q^2 + \ldots)\\ = \frac{n}{(1 – q)}\]

Prop. Male:

\[= \frac{np}{(1 – q)} * \frac{(1 – q)}{n}\\ = p\]

If it still seems like a counterintuitive result, here’s one way to think: In each round, we get pq^k successes, and the total number of kids increases by q^k. Yet another way to think is that for any child that is born, the data generating process is unchanged.

The male-child stopping rule may not affect the aggregate sex ratio. But it does cause changes in families. For instance, it causes a negative correlation between family size and the proportion of male children. For instance, if your first child is male, you stop. (For more results in this vein, see here.)

But why does this differ from our intuition that comes from early stopping in experiments? Easy. We define early stopping as when we stop data collection as soon as the results are significant. This causes a positive bias in the number of false-positive results (w.r.t. the canonical sample-fixed-in-advance rule). But early stopping leads to both kinds of false positives—mistakenly thinking that the proportion of females is greater than .5 and mistakenly thinking that the proportion of males is greater than .5. The rule is unbiased w.r.t. to the expected value of the proportion.