goji berries

The Puzzle of Price Dispersion on Amazon

29 Mar

March 29, 2020April 2, 2020 editor

Price dispersion is an excellent indicator of transactional frictions. It isn’t that absent price dispersion, we can confidently say that frictions are negligible. Frictions can be substantial even when price dispersion is zero. For instance, if the search costs are high enough that it makes it irrational to search, all the sellers will price the good at the buyer’s Willingness To Pay (WTP). Third world tourist markets, which are full of hawkers selling the same thing at the same price, are good examples of that. But when price dispersion exists, we can be reasonably sure that there are frictions in transacting. This is what makes the existence of substantial price dispersion on Amazon compelling.

Amazon makes price discovery easy, controls some aspects of quality by kicking out sellers who don’t adhere to its policies and provides reasonable indicators of quality of service with its user ratings. But still, on nearly all items that I looked at, there was substantial price dispersion. Take, for instance, the market for a bottle of Nature Made B12 vitamins. Prices go from $8.40 to nearly $30! With taxes, the dispersion is yet greater. If the listing costs are non-zero, it is not immediately clear why sellers selling the product at $30 are in the market. It could be that the expected service quality for the $30 seller is higher except that between the highest price seller and the next highest price seller, the ratings of the highest price seller are lower (take a look at shipping speed as well). And I would imagine that the ratings (and the quality) of Amazon, which comes in with the lowest price, are the highest. More generally, I have a tough time thinking about aspects of service and quality that are worth so much that the range of prices goes from 1x to 4x for a branded bottle of vitamin pills.

One plausible explanation is that the lowest price seller has a non-zero probability of being out of stock. And the more expensive and worse-quality sellers are there to catch these low probability events. They set a price that is profitable for them. One way to think about it is that the marginal cost of additional supply rises in the way the listed prices show. If true, then there seems to be an opportunity to make money. And it is possible that Amazon is leaving money on the table.

p.s. Sales of the boxed set of Harry Potter shows a similar pattern.

harry_potter Download

amazon_nature_made_b12 Download

It Pays to Search

28 Mar

March 28, 2020March 29, 2020 editor

In Reinventing the Bazaar, John McMillan discusses how search costs affect the price the buyer pays. John writes:

“Imagine that all the merchants are quoting $1[5]. Could one of them do better by undercutting this price? There is a downside to price-cutting: a reduction in revenue from any customers who would have bought from this merchant even at the higher price. If information were freely available, the price-cutter would get a compensating boost in sales as additional customers flocked in. When search costs exist, however, such extra sales may be negligible. If you incur a search cost of 10 cents or more for each merchant you sample, and there are fifty sellers offering the urn, then even if you know there is someone out there who is willing to sell it at cost, so you would save $5, it does not pay you to look for him. You would be looking for a needle in a haystack. If you visited one more seller, you would have a chance of one in fifty of that seller being the price-cutter, so the return on average from that extra price quote would be 10 cents (or $5 multiplied by 1/50), which is the same as your cost of getting one more quote. It does not pay to search.”
Reinventing the Bazaar, John McMillan

John got it wrong. It pays to search. The cost and the expected payoff for the first quote is 10 cents. But if the first quote is $15, the expected payoff for the second quote—(1/49)*$50—is greater than 10 cents. And so on.

Another way to solve for it is to come up with the expected number of quotes you need to get to get to the seller selling at $10. It is 25. Given you need to spend on average $2.50 to get a benefit of $2.50, you will gladly search.

Yet another way to think is that the worst case is that you make no money—when the $10 seller is the last one you get a quote from. But in every other case, you make money.

For the equilibrium price, you need to make assumptions. But if the buyer knows that there is a price cutter, they will all buy from him. This means that the price cutter will be the only seller remaining.

There are two related fun points. First, one of the reasons markets are competitive on price when true search costs are high is likely because people price their time remarkably low. Second, when people spend a bunch of time looking for the cheapest deal, you incentivize all the sellers selling at a high rate to lower their rates and make it better for everyone else.

Good News: Principles of Good Journalism

12 Mar

March 12, 2020 editor

If fake news—deliberate disinformation, not uncongenial news—is one end of the spectrum, what is the other end of the spectrum?

To get at the question, we need a theory of what news should provide. A theory of news, in turn, needs a theory of citizenship, which prescribes the information people need to execute their role, and an empirically supported behavioral theory of how people get that information.

What a democracy expects of people varies by the conception of democracy. Some theories of democracy only require citizens to have enough information to pick the better candidate when differences in candidates are material. Others, like deliberative democracy, expect people to be well informed and to have thought through various aspects of policies.

I opt for deliberative democracy to guide expectations about people for two reasons. Not only does the theory best express the highest ideals of democracy, but it also has the virtue of answering a vital question well. If all news was equally profitable to produce and was as widely read, what kind of news would lead to the best political outcomes, as judged by idealized versions of people—people who have all the information and all the time to think through the issues?

There are two virtues of answering such a question. First, it offers a convenient place to start answering what we mean by ‘good’ news; we can bring in profitability and reader preferences later. Second, engaging with it uncovers some obvious aspects of ‘good’ news.

For news to positively affect political outcomes (not in the shallow, instrumental sense), the news has to be about politics. Rather than news about Kim Kardashian or opinions about the hottest boots this season, ‘good’ news is about policymaker, policy-implementor, and policy-relevant news.

News about politics is a necessary but not a sufficient condition. Switching from discussing Kim Kardashian’s dress to Hillary Clinton’s is very plausibly worse. Thus, we also want the news to be substantive, engaging with real issues rather than cosmetic concerns.

Substantively engaging with real issues is still no panacea. If the information is not correct, it will misinform than inform the debate. Thus, the third quality of ‘good’ news is correctness.

The criterion for “good” news is, however, not just correctness, but it is the correctness of interpretation. ‘Good’ news allows people to draw the right conclusions. For instance, reporting murder rates as say ‘a murder per hour’ without reporting the actual number of murders or comparing the probability of being murdered to other common threats to life may instill greater fear in people than ‘optimal.’ (Optimal, as judged by better-informed versions of ourselves who have been given time to think. We can also judge optimal by correctness—did people form unbiased, accurate beliefs after reading the news?)

Not all issues, however, lend themselves to objective appraisals of truth. To produce ‘good’ news, the best you can do is have the right process. The primary tool that journalists have in the production of news is the sources they use to report on stories. (While journalists increasingly use original data to report, the reliance on people is widespread.) Thus, the way to increase correctness is through manipulating aspects of sources. We can increase correctness by increasing the quality of sources, e.g., source more knowledgeable people with low incentives to cook the books, increase the diversity of sources, e.g., not just government officials but also plausibly major NGOs, and the number of sources.

If we emphasize correctness, we may fall short on timeliness. News has to be timely enough to be useful, aside from being correct enough to guide policy and opinion correctly.

News can be narrowly correct but may commit sins of omission. ‘Good’ news provides information on all sides of the issue. ‘Good’ news highlights and engages with all serious claims. It doesn’t give time to discredited claims for “balance.”

Second-to-last, news should be delivered in the right tone. Rather than speculative ad-hominem attacks, “good” news engages with arguments and data.

Lastly, news contributes to the public kitty only if it is original. Thus, ‘good’ news is original. (Plagiarism reduces the incentives for producing quality news because it eats into the profits.)

Feigning Competence: Checklists For Data Science

25 Jan

January 25, 2020August 28, 2022 editor

You may have heard that most published research is false (Ionnadis). But what you probably don’t know is that most corporate data science is also false.
Gaurav Sood

The returns on data science in most companies are likely sharply negative. There are a few reasons for that. First, as with any new ‘hot’ field, the skill level of the average worker is low. Second, the skill level of the people managing these workers is also low—most struggle to pose good questions, and when they stumble on one, they struggle to answer it well. Third, data science often fails silently (or there is enough corporate noise around it that most failures are well-hidden in plain sight), so the opportunity to learn from mistakes is small. And if that was not enough, many companies reward speed over correctness and in doing that, often obtain neither.

How can we improve on the status quo? The obvious remedy for the first two issues is to increase the skill by improving training or creating specializations. And one remedy for the latter two points is to create incentives for doing things correctly.

Increasing training and creating specializations in data science is expensive and slow. Vital, but slow. Creating the right incentives for good data science work is not trivial either. There are at least two large forces lined up against it: incompetent supervisors and the fluid and collaborative nature of work—work usually involves multiple people, and there is a fluid exchange of ideas. Only the first is fixable—the latter is a property of work. And fixing it comes down to making technical competence a much more important criterion for hiring.

Aside from hiring more competent workers or increasing the competence of workers, you can also simulate the effect by using checklists—increase quality by creating a few “pause points”—times during a process where the person (team) pauses and goes through a standard list of questions.

To give body to the boast, let me list some common sources of failures in DS and how checklists at different pause points may reduce failure.

Learn what you will lose in translation. Good data science begins with a good understanding of the problem you are trying to solve. Once you understand the problem, you need to translate it into a suitable statistical analog. During translation, you need to be aware that you will lose something in the translation.
Learn the limitations. Learn what data you would love to have to answer the question if money was no object. And use it to understand how far you fall short of that ideal and then come to a judgment about whether the question can be answered reasonably with the data at hand.
Learn how good the data are. You may think you have the data, but it is best to verify it. For instance, it is good practice to think through the extent to which a variable captures the quantity of interest.
Learn the assumptions behind the formulas you use and test the assumptions to find the right thing to do. Thou shall only use math formulas when you know the limitations of such formulas. Having a good grasp of when formulas don’t work is essential. For instance, say the task is to describe a distribution. Someone may use the mean and standard deviation to describe it. But we know that these sufficient statistics vary by distribution. For binomial, it may just be p. A checklist for “describing” a variable can be:
1. check skew by plotting: averages are useful when distributions are symmetric and lots of observations are close to the mean. If skewed, you may want to describe various percentiles.
2. how many missing values and what explains the missing values?
3. check for unusual values and what explains the ‘unusual’ values.

Ruling Out Explanations

22 Dec

December 22, 2019December 22, 2019 editor

The paper (pdf) makes the case that the primary reason for electoral cycles in dissents is priming. The paper notes three competing explanations: 1) caseload composition, 2) panel composition, and 3) volume of caseloads. And it “rules them out” by regressing case type, panel composition, and caseload on quarters from the election (see Appendix Table D). The coefficients are uniformly small and insignificant. But is that enough to rule out alternate explanations? No. Small coefficients don’t imply that there is no path from proximity to the election via competing mediators to dissent (if you were to use causal language). We can only conclude that the pathway doesn’t exist if there is a sharp null. The best you can do is bound the estimated effect.

Preference for Sons in the US: Evidence from Business Names

24 Nov

November 24, 2019November 24, 2019 editor

I estimate preference for passing on businesses to sons by examining how common words son and sons are compared to daughter and daughters in the names of businesses.

In the US, all businesses have to register with a state. And all states provide a way to search business names, in part so that new companies can pick names that haven’t been used before.

I begin by searching for son(s) and daughter in states’ databases of business names. But the results of searching son are inflated because of three reasons:

son is part of many English words, from names such as Jason and Robinson to ordinary English words like mason (which can also be a name).
son is a Korean name.
some businesses use the wordson playfully. For instance, sonis a homonym of sun and some people use that to create names like son of a beach.

I address the first concern by using a regex that only looks at words that exactly match son or sons. But not all states allow for regex searches or allow people to download a full set of results. Where possible, I try to draw a lower bound. But still some care is needed in interpreting the results.

Data and Scripts: https://github.com/soodoku/sonny_side

In all, I find that a conservative estimate of son to daughter ratio is between 4 to 1 to 26 to 1 across states.

Learning From the Future with Fixed Effects

6 Nov

November 6, 2019November 6, 2019 editor

Say that you want to predict wait times at restaurants using data with four columns: wait times (wait), the restaurant name (restaurant), time and date of observation. Using the time and date of the observation, you create two additional columns: time of the day (tod) and day of the week (dow). And say that you estimate the following model:

$\text{wait} \sim \text{restaurant} + tod + dow + \epsilon$

Assume that the number of rows is about 100 times the number of columns. There is little chance of overfitting. But you still do an 80/20 train/test split and pick the model that works the best OOS.

You have every right to expect the model’s performance to be close to its OOS performance. But when you deploy the model, the model performs much worse than that. What could be going on?

In the model, we estimate a restaurant level intercept. But in estimating the intercept, we use data from all wait times, including those that happened after the date. One fix is to using rolling averages or last X wait times in the regression. Another is to more formally construct the data in such a way that you are always predicting the next wait time.

Rehabilitating Forward Stepwise Regression

6 Nov

November 6, 2019 editor

Forward Stepwise Regression (FSR) is hardly used today. That is mostly because regularization is a better way to think about variable selection. But part of the reason for its disuse is that FSR is a greedy optimization strategy with unstable paths. Jigger the data a little and the search paths, variables in the final set, the performance of the final model, all can change dramatically. The same issues, however, affect another greedy optimization strategy—CART. The insight that rehabilitated CART was bagging—build multiple trees using random subspaces (sometimes on randomly sampled rows) and average the results. What works for CART should principally also work for FSR. If you are using FSR for prediction, you can build multiple FSR models using random subspaces and random samples of rows and then average the results. If you are using it for variable selection, you can pick variables with the highest batting average (n_selected/n_tried). (LASSO will beat it on speed but there is little reason to expect that it will beat it on results.)

Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

July 11, 2019July 22, 2020 editor

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents in a rush to complete the survey and get the payout don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)]. You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE. But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group. Instead, it is centered on some other value that is altogether unresponsive to treatment.

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be just picking up the phone, from attention or sincerity with which the respondent responds to the dependent variables. On a survey experiment, compliance plausibly adequately covers both, but cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

the proportion of people paying attention are the same across Control/Treatment group, and
there is no correlation between who pays attention and assignment to the control group/treatment group, e.g., men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group. (One potential issue with the scheme is that variables used to stratify may have a fair bit of measurement error among inattentive respondents.)

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but because as Ed Haertel points out (in personal communication):

“The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

The Declining Value of Personal Advice

27 Jun

June 27, 2019July 12, 2019 editor

There used to be a time when before buying something, you asked your friends and peers about advice, and it was the optimal thing to do. These days, it is often not a great use of time. It is generally better to go online. Today, the Internet abounds with comprehensive, detailed, and trustworthy information, and picking the best product, judging by its quality, price, appearance, or what have you, in a slew of categories is easy to do.

As goes for advice about products, so goes for much other advice. For instance, if a coding error stumps you, your first move should be to search StackOverflow than Slack a peer. If you don’t understand a technical concept, look for a YouTube video or a helpful blog or a book than “leverage” a peer.

The fundamental point is that it is easier to get high-quality data and expert advice today than it has ever been. If your network includes the expert, bless you! But if it doesn’t, your network no longer damns you to sub-optimal information and advice. And that likely has welcome consequences for equality.

The only cases where advice from people near you may edge ahead of readily available help online is where the advisor has access to private information about your case or where the advisor is willing to expend greater elbow grease to get to the facts and think of advice that aptly takes account of your special circumstances. For instance, you may be able to get good advice on how to deal with alcoholic parents from an expert online but probably not about alcoholic parents with the specific set of deficiencies that your parents have. Short of such cases, the value of advice from people around is lower today than before, and probably lower than what you can get online.

The declining value of interpersonal advice has one significant negative externality. It takes out a big way we have provided value to our loved ones. We need to think harder about how we can fill that gap.

Maximal Persuasion

21 Jun

June 21, 2019April 5, 2021 editor

Say that you want to persuade a group of people to go out and vote. You can reach people by phone, mail, f2f, or email. And the cost of reaching out f2f > phone > mail > email. Your objective is to convert as many people as possible. How would you do it?

Thompson sampling provides one answer. Thompson sampling “randomly allocates subjects to treatment arms according to their probability of returning the highest reward under a Bayesian posterior.”

To exploit it, start by predicting persuasion (or persuasion/$) based on whatever you know about the person, and assignment to treatment or control. Conventionally, this means using a random forest model to estimate heterogeneous treatment effects but really use whatever gets you the best fit after including interactions in the inputs. (Make sure you get calibrated probabilities back.) Use the forecasted probabilities to find the treatment arm with the highest reward and probabilistically assign people to that.

Here’s the fun part: the strategy also accounts for compliance. The kinds of people who don’t ‘comply’ with one method, e.g., don’t pick up the phone, will be likelier to be assigned to another method.

Deliberation as Tautology

18 Jun

June 18, 2019June 21, 2019 editor

We take deliberation to be elevated discussion, meaning at minimum, discussion that is (1) substantive, (2) inclusive, (3) responsive, and (4) open-minded. That is, (1) the participants exchange relevant arguments and information. (2) The arguments and information are wide-ranging in nature and policy implications—not all of one kind, not all on one side. (3) The participants react to each other’s arguments and information. And (4) they seriously (re)consider, in light of the discussion, what their own policy attitudes should be.
Deliberative Distortions?

One way to define deliberation would be: “the extent to which the discussion is substantive, inclusive, responsive, and open-minded.” But here, we state the top-end of each as the minimum criteria. So defined, deliberation runs into two issues:

1. It’s posited beneficient effects become becomes a near tautology. If the discussion meets that high bar, how could it not refine preferences?

2. The bar for what counts as deliberation is high enough that I doubt that most deliberative mini-publics come anywhere close to meeting the ideal.

The Value of Bad Models

18 Jun

June 18, 2019 editor

This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.

Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.

Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?

Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here’s how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say with p-values between .3 and .7, the accuracy of the model on the rest of rows is generally vastly higher. More generally, you can choose the cut-offs for which humans need to code in a way that reduces the error to an acceptable level. And then use a hybrid approach to capitalize on the savings and like Matthew 22:21, render to model the region where the model does well, and to humans the rest.

Snakes on Ladders: Encouraging People to Climb the Engagement Ladder

3 Jun

June 3, 2019June 4, 2019 editor

Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).

At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.

Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.

The ladder metaphor is built on the idea of stable rungs. If the classification into “light”, “medium”, and “heavy” is not durable—for instance, if someone classified as “heavy” can move to “light” next month on their own accord—what we learn by comparing “heavy” users to “medium” users may prove deleterious for the “medium” users.

Thus, it is useful to have stable rungs. To build stable rungs, start by assessing the stability of rungs by building transition matrices over time. If the rungs are not durable over time frames over which you want to see an effect, bolster them by extending the observation time over which usage is measured or using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or a seasonally adjusted number.

Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable but it isn’t. Assume away causality and you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn’t mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works the best is a sum of the action being recommended and how that action is being encouraged.)

There is one minor nuance to the above. It pays to have ‘no action’ as an action if ‘no action’ isn’t your control group. Usage-based sorting merely sorts the users by kinds of people—by people who don’t need to use the product more often than thrice a month versus those who do. Who are we to say that they need to use the product more? Fact is that often enough the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.

Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then use simple algebra to find the best one for each person.

Why do We Fail? And What to do About It?

28 May

May 28, 2019June 4, 2019 editor

I recently read Gawande’s The Checklist Manifesto. (You can read my review of the book here and my notes on the book here.) The book made me think harder about failure and how to prevent it. Here’s a result of that thinking.

We fail because we don’t know or because we don’t execute on what we know (Gorovitz and MacIntyre). Of the things that we don’t know are things that no else knows either—they are beyond humanity’s reach for now. Ignore those for now. This leaves us with things that “we” know but the practitioner doesn’t.

Practitioners do not know because the education system has failed them, because they don’t care to learn, or because the production of new knowledge outpaces their capacity to learn. Given that, you can reduce ignorance by 1) increase the length of training, b) improving the quality of training, c) setting up continued education, d) incentivizing knowledge acquisition, e) reducing the burden of how much to know by creating specializations, etc. On creating specialties, Gawande has a great example: “there are pediatric anesthesiologists, cardiac anesthesiologists, obstetric anesthesiologists, neurosurgical anesthesiologists, …”

Ignorance, however, ought not to damn the practitioner to error. If you know that you don’t know, you can learn. Ignorance, thus, is not a sufficient condition for failure. But ignorance of ignorance is. To fix overconfidence, leading people through provocative, personalized examples may prove useful.

Ignorance and ignorance about ignorance are but two of the three reasons for why we fail. We also fail because we don’t execute on what we know. Practitioners fail to apply what they know because they are distracted, lazy, have limited attention and memory, etc. To solve these issues, we can a) reduce distractions, b) provide memory aids, c) automate tasks, d) train people on the importance of thoroughness, e) incentivize thoroughness, etc.

Checklists are one way to work toward two inter-related aims: educating people about the necessary steps needed to make a decision and aiding memory. But awareness of steps is not enough. To incentivize people to follow the steps, you need to develop processes to hold people accountable. Audits are one way to do that. Meetings set up at appropriate times during which people go through the list is another way.

Wanted: Effects That Support My Hypothesis

8 May

May 8, 2019May 17, 2019 editor

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why—people don’t pay much attention. Figure 3 provides data on manipulation checks—the proportion guessing the hypothesis being tested correctly. The change in proportion between control and treatment ranges from -.05 to .25, with a bulk of changes in Qualtrics between 0 and .1. (In one condition, authors even offer an additional 25 cents to give a result consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that on average, the proportion of people in the control group who guess the hypothesis correctly—without the guessing correction—is between .25–.35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including ‘manipulation checks.’ Experiments tell us ATE of what we manipulate. The role of manipulation checks is to shed light on ‘compliance.’ If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about ‘demand’ matter—and for what it’s worth, I think it is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer.

Interview with InfoQ

26 Apr

April 26, 2019April 26, 2019 editor

I recently gave an interview to InfoQ about my paper (and associated open source software) on predicting the race and ethnicity of a person using the sequence of characters in a name.

Here a relevant excerpt:

InfoQ: Can you discuss how we can learn from names? What ML/DL algorithms can we use?

Gaurav Sood: Learning more about a person from their name is no different from tackling any other supervised ML problem. It all starts with getting (or creating) a large labeled corpus. For instance, one key innovation in ethnicolr is the training data—we use voting registration files to get a large labeled corpus. In another project on learning from names, I scraped Google Image Search results to build the training data for inferring the gender from a name.
Once you have the data, find ways to exploit patterns in the data to learn a model. Some early ventures exploited the fact that names of different kinds of people began/ended differently. For instance, female names in India often end with an ‘a,’ and you can exploit that pattern to infer gender from Indian names. In ethnicolr, we generalize this intuition and use patterns in sequences of characters. (I am also working on exploiting sequences of sounds.) Like Ye et al., you could also rely on the fact that we correspond more frequently with co-ethnics and exploit email networks for building your models.
To exploit the patterns in the data, the full-range of DL/ML tools is available to you. Use what works best.

Estimating the Trend at a Point in a Noisy Time Series

17 Apr

April 17, 2019April 27, 2019 editor

Trends in time series are valuable. If the cost of a product rises suddenly, it likely indicates a sudden shortfall in supply or a sudden rise in demand. If the cost of claims filed by a patient rises sharply, it plausibly suggests rapidly worsening health.

But how do we estimate the trend at a particular time in a noisy time series? The answer is simple: smooth the time series using any one of the many methods, local polynomials or via GAMs or similar such methods, and then estimate the derivative(s) of the function at the chosen point in time. Smoothing out the noise is essential. If you don’t smooth and instead go with a naive estimate of the derivative, it can be heavily negatively correlated with derivatives gotten from smoothed time series. For instance, in an example we present, the correlation is –.47.

Clarification

Sometimes we want to know what the “trend” was over a particular time window. But what that means is not 100% clear. For a synopsis of the issues, see here.

Python Package

incline provides a couple of ways of approximating the underlying function for the time series:

fitting a local higher order polynomial via Savitzky-Golay over a window of choice
fitting a smoothing spline

The package provides a way to estimate the first and second derivative at any given time using either of those methods. Beyond these smarter methods, the package also provides a way a naive estimator of slope—average change when you move one-step forward (step = observed time units) and one-step backward. Users can also calculate the average or maximum slope over a time window (over observed time steps).

Rocks and Scissors for Papers

17 Apr

April 17, 2019April 17, 2019 editor

Zach and Jack* write:

What sort of papers best serve their readers? We can enumerate desirable characteristics: these papers should
(i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence;
(ii) describe empirical investigations that consider and rule out alternative hypotheses [62];
(iii) make clear the relationship between theoretical analysis and intuitive or empirical claims [64]; and
(iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts [56].
Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:
1. Failure to distinguish between explanation and speculation.
2. Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.
3. Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.
4. Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

Funnily Zach and Jack fail to take their own advice, forgetting to distinguish between anecdotal evidence (they claim a ‘troubling trend’ without presenting systematic evidence for it). But the points they make are compelling. The second and third points are especially applicable to economics though they apply to a lot of scientific production.

* It is Zachary and Jacob.

What Clicks With the Users? Maximizing CTR

17 Apr

April 17, 2019April 17, 2019 editor

Given a pool of messages, how can you maximize CTR?

The problem of maximizing CTR reduces to the problem of estimating the probability that a person in a specific context will click on each of the messages. Once you have the probabilities, all you need to do is apply the max operator and show the message with the highest probability. Technically, you don’t need to get the point estimates right—you just need to get the ranking right.

Abstracting out, there are four levers for increasing CTR:

Better models and data: Posed as a supervised problem, we are aiming to learn clicks as a function of a) the kind of content, b) the kind of context, and c) the kinds of people. (And, of course, interactions between all three are included.) To learn preferences well, we need to improve your understanding of the content, context, and kinds of people. For instance, to understanding content more finely, you may need to code font size, font color, etc.
Modeling externalities (user learning): It sounds funny when you say that CTR of a system that shows no messages to some people some of the time can be better than a system that shows at least some message to everyone every time they log in. But it can be true. If you need to increase CTR over longer horizons, you need to be able to model the impact of showing one message on a person opening another message. If you do that, you may realize that the best option is to not even show a message this time. (The other way you could ‘improve’ CTR is by losing people—you may lose people you bombard with irrelevant messages and the only people who ‘survive’ are those who like what you send.)
Experimenting With How to Present a Message: Location on the webpage, the font, etc. all may matter. Experiment to learn.
Portfolio: This let’s go of the fixed portfolio. Increase your portfolio of messages so that you have a reasonable set of things for everyone. It is easy enough to mistake people dismissing a message with disinterest in receiving messages. Don’t make the mistake. If you want to learn where you are failing, find out for which kinds of people you have the lowest (calibrated) probability scores for and think hard about what kinds of messages will appeal to these kinds of people.