The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made: how the data were collected, transformed, and analyzed, and ideally, why you chose to make the sausage that particular way. Papers are OK places for describing all this, but we now have better tools: version-controlled repositories with notebooks and README files.

The barrier to understanding is not just lack of information but also poorly organized information. There are three arcs of information: cross-sectional (where everything is and how the pieces relate to each other), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro-organized (where is the data, where are the scripts, what does each script do, how do I know what the data mean, etc.) and micro-organized (each script has logic and organization; this also means following good coding style). Temporal organization in version control simply requires meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.

The obvious benefits of this new way are well known. What is less discussed is that this new way allows you to critique specific pull requests and decisions made in particular commits. This provides an entirely new way to make progress in science. The new unit of science also means that we don’t just dole out credit in a crude currency like journal articles; we can also provide lower denominations. We can credit each edit, each suggestion. And why not? The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how was each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How frequently are the data updated?
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both the frequency with which data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (or person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where the question lies in the sequence of questions, whether it was alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, give the range, SD, mean, median, n_0s, and n_missing. For categorical columns, give the number of unique values, what each label means, and a frequency table that includes n_missing (if missing values can be of multiple types, show a row for each).
    6. The number of duplicates in the data, whether duplicates are allowed, and why you would see them.
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store the correlation of y with key x_vars.
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate new data against them. Produce the JSON for each update. You can flag when data are some SD above or below the last ingest.
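A minimal sketch of this idea in Python (the column name, the two-SD threshold, and the data are all illustrative): build a per-column summary, store it as JSON, and flag columns whose mean has moved more than k SDs since the last ingest.

```python
import json
import statistics

def summarize(rows, column):
    """Build a per-column summary for a numeric column in row-oriented data."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "column": column,
        "n": len(rows),
        "n_missing": sum(1 for r in rows if r[column] is None),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def flag_drift(old, new, k=2.0):
    """Flag if the new mean is more than k old-SDs away from the old mean."""
    return abs(new["mean"] - old["mean"]) > k * old["sd"]

# Illustrative 'age' column across two ingests.
last_ingest = summarize([{"age": a} for a in [30, 35, 40, 45, None, 50]], "age")
this_ingest = summarize([{"age": a} for a in [70, 75, 80, 85, 90, None]], "age")

print(json.dumps(last_ingest, indent=2))     # store this JSON alongside the data
print(flag_drift(last_ingest, this_ingest))  # True: the mean jumped sharply
```

In a real pipeline, you would write the JSON next to the data and compare each new ingest’s summary against the stored one.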

Store all this metadata with the data. For example, you can extend the DataFrame class in Scala to make it so.

Auto-generate reports in markdown with each ingest.
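A sketch of what the auto-generated markdown report could look like (the summary fields are illustrative):

```python
def to_markdown(summaries):
    """Render per-column summaries as a markdown table, one row per column."""
    lines = ["| column | n | n_missing | mean |", "|---|---|---|---|"]
    for s in summaries:
        lines.append(f"| {s['column']} | {s['n']} | {s['n_missing']} | {s['mean']:.2f} |")
    return "\n".join(lines)

report = to_markdown([{"column": "age", "n": 6, "n_missing": 1, "mean": 40.0}])
print(report)
```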

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels and assess their quality. You can also assess quality by looking at the classification errors your supervised model makes. Are the errors there because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of the data. And with the method I highlight above, you should be.
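If you store the metadata as a plain dictionary per ingest, ‘diffing’ data collection takes only a few lines; a sketch (the field names are illustrative):

```python
def diff_metadata(old, new):
    """Return {column: {field: (old_value, new_value)}} for every field that changed."""
    changes = {}
    for col in old.keys() & new.keys():
        changed = {
            field: (old[col][field], new[col].get(field))
            for field in old[col]
            if old[col][field] != new[col].get(field)
        }
        if changed:
            changes[col] = changed
    return changes

before = {"age": {"mean": 40.0, "n_missing": 1}}
after = {"age": {"mean": 80.0, "n_missing": 1}}
print(diff_metadata(before, after))  # {'age': {'mean': (40.0, 80.0)}}
```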

Pride and Prejudice

14 Jul

It is ‘so obvious’ that policy A >> policy B that only those who don’t want to know, or who want inferior things, would support policy B. Does this conversation remind you of any that you have had? We don’t just have such conversations about policies. We also have them about people. Way too often, we are being too harsh.

We overestimate how much we know. We ‘know’ that we are right; we ‘know’ that there isn’t enough information in the world that will make us switch to policy B. Often, the arrogance of this belief is lost on us. As Kahneman puts it, we are ‘ignorant of our own ignorance.’ How could it be anything else? Remember the aphorism, “the more you know, the more you know you don’t know”? The aphorism may not be true, but it gets the broad point right. The ignorant are overconfident. And we are ignorant. The human condition is such that it doesn’t leave much room for being anything else (see the top of this page).

Here’s one way to judge your ignorance (see here for some other ideas). Start by recounting what you know. Sit in front of a computer and type it up. Go for it. And then add a sentence about how you know it. Do you recall reading any detailed information about this person or issue? From where? Would you have bought a car if you had that much information about the car?

We don’t just overestimate what we know; we also underestimate what other people know. Anybody with different opinions must know less than I do. It couldn’t be that they know more, could it?

Both being overconfident about what we know and underestimating what other people know lead to the same thing: being too confident about the rightness of our cause and mistaking our prejudices for obvious truths.

George Carlin got it right. “Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?” It seems the way we judge drivers is how we judge everything else. Anyone who knows less than you is either being willfully obtuse or an idiot. And those who know more than you just look like ‘maniacs.’

The Base ML Model

12 Jul

The days of the artisanal ML model are mostly over. The artisanal model builds off domain “knowledge” (it can often be considerably less than that, bordering on misinformation). The artisan has long discussions with domain experts about what variables to include and how to include them in the model, often making idiosyncratic decisions about both. Or the artisan thinks deeply and draws on his own well. And then applies a couple of methods to the final feature set of 10s of variables, and out pops “the” model. This is borderline farcical when the datasets are both long and wide. For supervised problems, the low cost, scalable, common sense thing to do is to implement the following workflow:

1. Get good univariate summaries of each column in the data: mean, median, min, max, SD, and n_missing for numeric columns; the number of unique values, n_missing, frequency counts for categorical columns; etc. Use this to diagnose and understand the data. What stuff is common? On what variables do we have bad data? (see pysum.)

2. Get good bivariate summaries. Correlations for continuous variables and differences in means for categorical variables are reasonable. Use this to understand how the variables are related. Use this to understand the data.

3. Create a dummy vector for missing values for each variable.

4. Subset on non-sparse columns.

5. Regress on all non-sparse columns, ideally using a neural net, so that you are not in the business of creating interactions and such.
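The five steps above can be sketched with pandas (the data, the 50% sparsity cutoff, and the final model are illustrative; plain least squares stands in for the neural net the post suggests):

```python
import numpy as np
import pandas as pd

# Illustrative data: an outcome y and two predictors, one of them sparse.
df = pd.DataFrame({
    "y":  [0, 1, 0, 1, 1, 0],
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})

# 1-2. Univariate and bivariate summaries, to understand the data.
print(df.describe())
print(df.corr())

# 3. A missing-value dummy for each variable.
for col in ["x1", "x2"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

# 4. Subset on non-sparse columns (here: less than 50% missing; the cutoff is a choice).
non_sparse = [c for c in ["x1", "x2"] if df[c].isna().mean() < 0.5]

# 5. Regress y on the non-sparse columns (least squares as a stand-in here).
X = df[non_sparse].fillna(df[non_sparse].mean()).to_numpy()
X = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
print(non_sparse, coefs)
```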

I have elided a lot of detail, so let’s take a more concrete example. Say you are predicting whether someone will be diagnosed with diabetes in year y given the claims they made in years y-1, y-2, y-3, etc. Say the claim for each service and medicine is a unique code. Tokenize all the claim data so that each unique code gets its own column, and filter on the non-sparse codes. How much information about time you want to preserve is up to you. But for a first cut, roll up the data so that code X claimed in any year is treated equally. Voila! You have your baseline model.
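A sketch of that roll-up with pandas (the claim codes are made up): each unique code becomes a column, and a code claimed in any year is treated the same.

```python
import pandas as pd

# Illustrative claims: one row per (person, year, claim code).
claims = pd.DataFrame({
    "person": [1, 1, 1, 2, 2],
    "year":   [2019, 2020, 2020, 2019, 2020],
    "code":   ["A10", "A10", "B20", "B20", "C30"],
})

# Roll up across years: count occurrences of each code per person.
features = pd.crosstab(claims["person"], claims["code"])
print(features)
```

Filtering on non-sparse codes is then one line, e.g., `features.loc[:, features.sum() >= threshold]` for some threshold of your choosing.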

Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent,” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11, and Table OA 2.23.

Key points:

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.

Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations with y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.”

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.
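As a sketch of the plain-English explanation engine, here is the cookie example with made-up numbers:

```python
import numpy as np

# Made-up data: cookies eaten per day vs. minutes of life (toy numbers).
cookies = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
minutes = np.array([100.0, 95.0, 90.0, 85.0, 80.0])

# Best-fitting line: minutes = intercept + slope * cookies.
slope, intercept = np.polyfit(cookies, minutes, 1)

direction = "shorter" if slope < 0 else "longer"
print(f"People who eat 1 more cookie live {abs(slope):.0f} minutes {direction} on average.")
```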

Talking On a Tangent

22 Jun

What is the trend over the last X months? One estimate of the ‘trend’ over the last k time periods is what I call the ‘hold up the ends’ method: look at t_k and t_0, take the difference between the two, and divide by the number of time periods. If t_k > t_0, you say that things are going up. If t_k < t_0, you say things are going down. And if they are the same, you say things are flat. But this method can elide important non-linearity. For instance, say unemployment went down in the first 9 months and then went up over the last 3 but ended with t_k < t_0. What is the trend? If by trend we mean the average slope over the last k time periods, and if there is no measurement error, then the ‘hold up the ends’ method is reasonable. If there is measurement error, we would want to smooth the time series before we hold up the ends. Often people care about ‘consistency’ in the trend. One estimate of consistency is the proportion of times we get a change of the same sign when we compare consecutive time periods pairwise. Often people also care more about later time periods than earlier ones. One could build on that intuition by weighting later changes more.
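The three measures discussed above can be sketched as follows (the function names and the specific reading of ‘consistency’, as agreement with the end-to-end change, are mine):

```python
def hold_up_the_ends(series):
    """Average slope: the difference between the ends over the number of periods."""
    return (series[-1] - series[0]) / (len(series) - 1)

def consistency(series):
    """Share of consecutive changes with the same sign as the end-to-end change."""
    overall = series[-1] - series[0]
    diffs = [b - a for a, b in zip(series, series[1:])]
    return sum(1 for d in diffs if d * overall > 0) / len(diffs)

def weighted_trend(series):
    """Weight later changes more: linear weights 1..n-1 on consecutive changes."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    weights = range(1, len(diffs) + 1)
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)

# Unemployment that falls for 9 months, then rises for 3, but ends lower.
u = [10, 9.5, 9, 8.5, 8, 7.5, 7, 6.5, 6, 5.5, 6, 6.5, 7]
print(hold_up_the_ends(u))  # -0.25: 'the trend is down'
print(consistency(u))       # 0.75: a quarter of the moves went the other way
```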

Optimal Targeting 101

22 Jun

Say you make k products and have a list of potential customers. Assume that you can show each of these potential customers one ad. And that showing them an ad costs nothing. Assume also that you get profits p_1, ..., p_k from each of the k products. What is the optimal targeting strategy?

The goal is to maximize profit. If you don’t know the probability that each customer will buy a product when shown an ad, or you assume it to be the same across products, then the strategy reduces to showing the ad for the most profitable product to everyone.

If we can estimate the probability the customer will buy each product if they are shown an ad for the product well, we can do better. (We assume that customers won’t buy the product if they don’t see the ad for it.)

When k = 1, the optimal strategy is to target everyone. Now let’s solve for when k = 2. A customer can buy k_0, buy k_1, or buy nothing at all.

If everyone is shown the k_0 ad, the profits are the sum over customers of p_0*prob_i_0_true, where prob_i_0_true gives the true probability of the ith person buying k_0 when shown an ad for it. And if everyone is shown the k_1 ad, the profits are the sum of p_1*prob_i_1_true. The net utility calculation for each customer is p_1*prob_i_1 - p_0*prob_i_0. (We generally won’t know the true probabilities but would estimate them from data; hence the ‘true’ subscript is absent.) You can recover two things from this. One is how to target: pick whichever number is bigger for each customer. The other is the order in which to target: sort by the absolute value of the difference.

Calculating the Benefits of Targeting

If you don’t target, the best-case scenario is that you earn the larger of the two sums: p_1*prob_i_1_true summed over all i, or p_0*prob_i_0_true summed over all i.

If you target well (estimate probabilities well), you will recover something that is as good or better than that.

Since we generally won’t have the true probabilities, the best way to estimate the benefit of targeting is via A/B testing. But if you can’t do that, one estimate is:

No targeting estimate = the larger of (sum over all i of p_1*prob_i_1) and (sum over all i of p_0*prob_i_0)
Targeting estimate = sum over all i of max(p_1*prob_i_1, p_0*prob_i_0)

Similar calculations can be done for deriving estimates of the value of better targeting.
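The k = 2 case, as code (the profits and probabilities are illustrative):

```python
# Illustrative profits and estimated purchase probabilities for two products.
p0, p1 = 10.0, 20.0
prob_0 = [0.9, 0.1, 0.5]   # P(customer i buys product 0 | shown its ad)
prob_1 = [0.1, 0.6, 0.4]   # P(customer i buys product 1 | shown its ad)

# No targeting: show everyone the single best ad.
no_targeting = max(sum(p0 * q for q in prob_0), sum(p1 * q for q in prob_1))

# Targeting: show each customer the ad with the higher expected profit.
targeting = sum(max(p0 * q0, p1 * q1) for q0, q1 in zip(prob_0, prob_1))

# Order to target in: sort customers by |net utility|, largest first.
net = [p1 * q1 - p0 * q0 for q0, q1 in zip(prob_0, prob_1)]
order = sorted(range(len(net)), key=lambda i: -abs(net[i]))

print(no_targeting, targeting, order)
```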

Firmly Against Posing Firmly

31 May

“What is crucial for you as the writer is to express your opinion firmly,” writes William Zinsser in “On Writing Well: An Informal Guide to Writing Nonfiction.” To emphasize the point, Bill repeats the point at the end of the paragraph, ending with, “Take your stand with conviction.”

This advice is not for all writers—Bill particularly wants editorial writers to write with a clear point of view.

When Bill was an editorial writer for the New York Herald Tribune, he attended a daily editorial meeting to “discuss what editorials … to write for the next day and what position …[to] take.” Bill recollects,

“Frequently [they] weren’t quite sure, especially the writer who was an expert on Latin America.

“‘What about that coup in Uruguay?’ the editor would ask. ‘It could represent progress for the economy,’ the writer would reply, ‘or then again it might destabilize the whole political situation. I suppose I could mention the possible benefits and then—’”

The editor would admonish such uncertainty with a curt “let’s not go peeing down both legs.”

Bill approves of taking a side. He likes what the editor is saying if not the language. He calls it the best advice he has received on writing columns. I don’t. Certainty should only come from one source: conviction born from thoughtful consideration of facts and arguments. Don’t feign certainty. Don’t discuss concerns in a perfunctory manner. And don’t discuss concerns at the end.

Surprisingly, Bill agrees with the last bit about not discussing concerns in a perfunctory manner at the end. But for a different reason. He thinks that “last-minute evasions and escapes [cancel strength].”

Don’t be a mug. If there are serious concerns, don’t wait until the end to note them. Note them as they come up.


1 May

In 2010, Google estimated that approximately 130M books had been published.

As a species, we still know very little about the world. But what we know already far exceeds what any of us can learn in a lifetime.

Scientists are acutely aware of the point. They must specialize, as the chances of learning all the key facts about anything but the narrowest of domains are slim. They must also resort to a shorthand to communicate what is known and what is new. The shorthand they use is citations. However, this vital building block of science is often rife with problems. The three key problems with how scientists cite are:

1. Cite in an imprecise manner. This broad claim is supported by X. Or: our results are consistent with XYZ. (‘Our results are consistent with’ reflects directional thinking rather than thinking in terms of effect sizes. That means all sorts of effects are ‘consistent,’ even those 10x as large.) For an example of how I think work should be cited, see Table 1 of this paper.

2. Do not carefully read what they cite. This includes misstating key claims and citing retracted articles approvingly (see here). The corollary is that scientists do not closely scrutinize the papers they cite, with the extent of scrutiny explained by how much they agree with the results (see the next point). For a provocative example, see here.

3. Cite in a motivated manner. Scientists ‘up’ the thesis of articles they agree with, for instance, misstating correlation as causation. And they blow up minor methodological points in articles whose results are ‘inconsistent’ with their own. (A brief note on motivated citations: here.)