Beyond Anomaly Detection: Supervised Learning from 'Bad' Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log details about the request: the connecting IPs, the protocol being used, and so on. (One popular technology for collecting such data is Cisco's NetFlow.) Many companies analyze these data in an attempt to flag 'anomalous' transactions. Given that the data are low quality (IPs do not map uniquely to people, and the information per transaction is vanishingly small), the chances of building a useful anomaly detection system using conventional unsupervised methods are extremely low.

One way to make progress is to re-express the problem as a supervised one. Google, various security firms, security researchers, and others flag IPs every day for nefarious activities: hosting malware (identified via passive DNS), scanning, and the like. Check whether these blacklisted IPs appear in the database, and learn from the transactions that include them. Then use the model to flag transactions that look most similar to transactions involving blacklisted IPs. Finally, validate the flagged transactions with the highest predicted probability of involving a malicious IP by checking whether those IPs end up blacklisted at a future date, or by asking a subject matter expert.
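Here is a minimal sketch of that pipeline in Python, assuming a CSV of flow records and a plain-text blacklist; the file names, column names, and features are placeholders, not a reference to any particular system.

```python
# Sketch: label netflow-style records using a blacklist, train a classifier,
# and rank unlabeled flows by similarity to known-bad traffic.
# File names, column names, and features are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

flows = pd.read_csv("flows.csv")                       # one row per flow record
blacklist = set(open("blacklist.txt").read().split())  # flagged IPs from external feeds

# Label: 1 if either endpoint is on the blacklist, else 0
flows["bad"] = (flows["src_ip"].isin(blacklist) |
                flows["dst_ip"].isin(blacklist)).astype(int)

features = ["bytes", "packets", "duration", "dst_port"]  # whatever is available per flow
X, y = flows[features], flows["bad"]

clf = GradientBoostingClassifier()
print("CV AUC:", cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())
clf.fit(X, y)

# Rank flows that do *not* involve blacklisted IPs by predicted probability of being 'bad';
# the top of this list is what goes to a subject matter expert or a future blacklist check.
unlabeled = flows[flows["bad"] == 0].copy()
unlabeled["score"] = clf.predict_proba(unlabeled[features])[:, 1]
print(unlabeled.sort_values("score", ascending=False).head(20))
```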

Bad Science: A Partial Diagnosis And Some Remedies

3 Sep

Lack of reproducibility is a symptom of science in crisis. An eye-catching symptom, to be sure, but hardly the only one vying for attention. Recent analyses suggest that nearly two-thirds of the (relevant set of) articles published in prominent political science journals condition on post-treatment variables (see here). Another analysis suggests that half of the relevant set of articles published in prominent neuroscience journals treat the difference between a significant and a non-significant result as evidence that the difference between the two is itself significant (see here). What is behind this? My guess: poor understanding of statistics, poor editorial processes, and poor strategic incentives.

  1. Poor understanding of statistics: This is likely the primary reason. It would be too harsh to impute bad faith to those who control for post-treatment variables or who treat a difference between a significant and a non-significant result as itself significant. There is likely a fair bit of ignorance, whether on the part of authors or reviewers. If it is ignorance, the challenge doesn't seem as daunting. Let us devise good course materials and online lectures, and teach. And for more advanced scholars, some outreach. (This may also involve teaching scientists how to write up their results.)

  2. Poor editorial processes: Whatever the failings of authors, they aren't being caught during the review process. (It would also be good to know how often reviewers are themselves the source of bad recommendations.) More constructively, it may be a good idea to have short pre-submission questionnaires that alert authors to common statistical issues.

  3. Poor strategic incentives: If authors think that journals are implicitly biased towards significant findings, we need to communicate effectively that it isn’t so.
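As an aside on the second error above (treating a difference in significance as a significant difference), the remedy is to test the difference directly. A minimal sketch with made-up estimates, assuming the two estimates come from independent samples:

```python
# Sketch: testing whether two estimates differ, instead of comparing their p-values.
# Numbers are made up for illustration; assumes the two estimates are independent.
from math import sqrt
from scipy.stats import norm

b1, se1 = 0.50, 0.20   # 'significant' estimate (z = 2.5)
b2, se2 = 0.20, 0.25   # 'non-significant' estimate (z = 0.8)

z = (b1 - b2) / sqrt(se1**2 + se2**2)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p:.2f}")  # z ~ 0.94, p ~ 0.35: the difference is not significant
```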

Unlisted False Negatives: Are 11% of Americans Unlisted?

21 Aug

A recent study by Simon Jackman and Bradley Spahn claims that 11% of Americans are ‘unlisted.’ (The paper has since been picked up by liberal media outlets such as Think Progress.)

When I first came across the paper, I thought the number was much too high to have any reasonable chance of being right. My suspicions were roused further by the fact that the paper provides no bounds on the number, and no note about measurement error in matching people across imperfect lists. That is a galling omission when the finding hinges on the name-matching procedure, the details of which are left to another paper. What makes it into this paper is the incredibly vague line: "ANES collects …. bolstering our confidence in the matches of respondents to the lists." I take that to mean that the matching procedure was designed to reduce false positives. If so, the estimate is merely an upper bound on the percentage of Americans who could be unlisted. That isn't a very useful number.

But the reality is a bit worse. To my questions about false positive and false negative rates, Bradley Spahn responded on Twitter, "I think all of the contentious cases were decided by me. What are my decision-theoretic properties? Hard to say." That line covers one of the most essential details of the matching procedure, a detail the authors say readers can find "in a companion paper." The primary issue is subjectivity. But not taking adequate account of how these 'decision-theoretic' properties bear on the results in the paper grates.
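To see why the false-negative rate of the matcher matters so much, here is a back-of-the-envelope sketch under the simplifying assumption that every true match the matcher misses gets counted as 'unlisted'; the rates are hypothetical:

```python
# Sketch: how the apparent 'unlisted' share moves with the matcher's false-negative rate.
# Assumes every true match the matcher misses gets counted as unlisted; rates are hypothetical.
observed_unlisted = 0.11
for fn in (0.00, 0.02, 0.05, 0.10):
    implied_true = (observed_unlisted - fn) / (1 - fn)
    print(f"false-negative rate {fn:.0%} -> implied true unlisted rate {implied_true:.1%}")
# With a 5% false-negative rate, the implied true unlisted rate drops to about 6.3%;
# at 10%, to about 1.1%. Without a reported false-negative rate, 11% is only an upper bound.
```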

Optimal Cost Function When Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, lending optimally means minimizing the prediction error rate times the cost of errors, minus the cost of making the predictions (keeping things simple here). The cost of any one particular error, especially the denial of a loan to an eligible applicant, is typically small for the bank but very consequential for the applicant. So the applicant may be willing to pay the bank to increase the accuracy of its decisions, say by compensating the bank for the cost of having a person take a closer look at the file. If customers are willing to pay that cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may even be able to increase profits.) Customers' willingness to pay for increased accuracy is typically not exploited by lending institutions. It may well be worth exploring.
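A stylized sketch of the accounting, with made-up error rates and costs, to show when a customer-funded closer review leaves the bank no worse off:

```python
# Sketch: when does a customer-funded closer review pay for itself?
# All rates and dollar amounts below are made-up assumptions for illustration.
error_rate_model  = 0.10   # misclassification rate of the cheap automated decision
error_rate_review = 0.04   # misclassification rate after a closer human review
cost_to_bank_per_error = 50         # bank's cost of a wrong decision (small)
cost_to_applicant_per_error = 5000  # applicant's cost of a wrong denial (large)
review_cost = 20                    # cost of having a person look at the file
applicant_pays = 20                 # fee the applicant is willing to pay for the review

bank_cost_model  = error_rate_model * cost_to_bank_per_error
bank_cost_review = error_rate_review * cost_to_bank_per_error + review_cost - applicant_pays
print(f"bank's expected cost per applicant: model only = {bank_cost_model:.2f}, "
      f"subsidized review = {bank_cost_review:.2f}")

# The applicant's expected loss falls from 0.10*5000 = 500 to 0.04*5000 = 200,
# so paying the 20-dollar fee is easily worth it; here the bank's expected cost falls too.
```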

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, the software that won the DARPA prize for reconstructing shredded documents three years ago used "human[s] [to] verify what the computer was recommending." The same insight is used in character recognition tasks. I have used it to build software for matching dirty data; the software was used to merge shapefiles with precinct-level electoral returns.

The class of problems for which human input proves useful has one essential attribute: humans produce unbiased, if error-prone, estimates for these problems. So, for instance, it would be unwise to use humans to make the 'last mile' of lending decisions (see also this NYT article). (And unbiasedness is something you may want to verify with training data.)
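For the dirty-data matching case, a minimal sketch of the division of labor between machine and human; the similarity function and thresholds are illustrative assumptions, not the software mentioned above:

```python
# Sketch: route confident matches to the machine, uncertain ones to a human.
# Thresholds and the similarity function are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

AUTO_ACCEPT, AUTO_REJECT = 0.90, 0.60

def match(record, candidates):
    scored = sorted(((similarity(record, c), c) for c in candidates), reverse=True)
    score, best = scored[0]
    if score >= AUTO_ACCEPT:
        return best                       # machine accepts
    if score < AUTO_REJECT:
        return None                       # machine rejects
    answer = input(f"Does '{record}' match '{best}'? [y/n] ")  # human verifies the gray zone
    return best if answer.strip().lower() == "y" else None

print(match("Precinct 12, Wake Cnty", ["Precinct 12, Wake County", "Precinct 21, Wake County"]))
```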

Estimating Hillary’s Missing Emails

11 Apr

Note:

55000/(365*4) ≈ 37.7 a day over a four-year term. That seems a touch low for a Secretary of State.

Caveats:
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other state department employees

Lower bound for missing emails from Clinton (a sketch of the estimator follows the list):

  1. Take a small weighted random sample (weighting seniority more) of top state department employees.
  2. Go through their email accounts on the state dep. server and count # of emails from Clinton to their state dep. addresses.
  3. Compare that count to the # of emails to these employees in the Clinton cache.
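A sketch of the resulting estimator, with made-up counts; the extrapolation step and the sampling fraction are assumptions:

```python
# Sketch: lower bound on missing emails from a sample of state department employees.
# Counts are made up; extrapolation to the full department is an assumption.
sample = [
    # (emails from Clinton found on the state dept. server, emails to this person in the Clinton cache)
    (120, 90),
    (300, 210),
    (45, 45),
]
missing_in_sample = sum(max(on_server - in_cache, 0) for on_server, in_cache in sample)
print("missing emails among sampled employees (lower bound):", missing_in_sample)

# Scale up by the inverse sampling fraction for a department-wide lower bound.
sampling_fraction = 0.05  # assumed share of relevant employees sampled
print("department-wide lower bound:", missing_in_sample / sampling_fraction)
```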

To propose amendments, go to the GitHub gist.

(No) Value Added Models

6 Jul

This note is in response to some of the points raised in the Angoff Lecture by Ed Haertel.

The lecture makes two big points:
1) Teacher effectiveness ratings based on current Value Added Models are ‘unreliable.’ They are actually much worse than just unreliable; see below.
2) Simulated counterfactuals of gains that can be got from ‘firing bad teachers’ are upwardly biased.

Three simple tricks (one discussed; two not) that may solve some of the issues:
1) Estimating teaching effectiveness: Where possible, randomly assign children to classes. I would only do within-school comparisons. Inference will still not be clean (there are SUTVA violations, though they can be dealt with), but it will be cleaner.

2) Experiment with teachers: teach some teachers some skills and estimate the impact. Rather than a teacher-level VAM, do a skill-level VAM, treating a teacher as a sum of skills plus idiosyncratic variation.

3) For current VAMs: To create better student-level counterfactuals, use modern ML techniques (SVMs, neural networks, ...), lots of data (past student outcomes, past classmate outcomes, etc.), and cross-validation to tune. That also gives you a good idea of how good the predictions are. The strategy may be applicable in other settings. A sketch of this trick appears below.
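A minimal sketch of the third trick, assuming a table of student records with prior outcomes, this year's score, and a teacher ID (column names and the model are placeholders): predict each student's counterfactual score out-of-fold, then average the residuals by teacher.

```python
# Sketch: ML-based student counterfactuals for a VAM.
# Column names and the model choice are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

students = pd.read_csv("students.csv")
features = ["prior_score", "prior_score_2yr", "classmate_mean_prior"]  # past student/classmate outcomes

model = RandomForestRegressor(n_estimators=500, random_state=0)
# Out-of-fold predictions double as both the counterfactual and a check on prediction quality.
students["predicted"] = cross_val_predict(model, students[features], students["score"], cv=5)
print("out-of-fold R^2:",
      1 - ((students["score"] - students["predicted"]) ** 2).sum()
        / ((students["score"] - students["score"].mean()) ** 2).sum())

# A teacher's 'value added' is then the average gap between actual and predicted scores.
students["residual"] = students["score"] - students["predicted"]
print(students.groupby("teacher_id")["residual"].mean().sort_values())
```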

Other points:
1) Haertel says, "Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much — at least not much of the prescribed curriculum." A better comparison would perhaps be to self-guided technology. My sense is that as technology evolves, teachers will come up short in a comparison with advanced learning tools. In much of the third world, I think that is already true.

2) It appears that no model for calculating teacher effectiveness scores yields identified estimates, and that we have no clear understanding of the nature of the bias. Pooling biased estimates over multiple years doesn't recommend itself to me as a natural fix. And I don't think calling this situation 'unreliability' of the scores is right; these scores aren't valid. The fact that pooling across years 'works' may suggest the issues are smaller than they seem. But then again, bad things may be happening to some kinds of teachers, especially if people are doing cross-school comparisons.

3) The fade-out concern is important given the earlier 5*5 = 25 analysis. My suspicion is that attenuation of effects varies with the timing of the shock; my hunch is that shocks at an earlier age matter more because they decay more slowly.

Impact of selection bias in experiments where people treat each other

20 Jun

Selection biases in the participant pool generally have limited impact on inference. One way to estimate the population treatment effect from effects estimated on a biased sample is to check whether the treatment effect varies across 'kinds of people' and then weight the stratum-level effects to population marginals. So far so good.
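A minimal sketch of that reweighting step, with hypothetical strata, effects, and shares:

```python
# Sketch: post-stratify stratum-level treatment effects to population proportions.
# Strata, effects, and shares below are hypothetical.
sample = {
    # stratum: (estimated treatment effect in the sample, share of the sample)
    "college":    (0.30, 0.70),
    "no_college": (0.10, 0.30),
}
population_share = {"college": 0.35, "no_college": 0.65}

naive = sum(effect * share for effect, share in sample.values())
weighted = sum(effect * population_share[s] for s, (effect, _) in sample.items())
print(f"sample-weighted effect: {naive:.2f}")        # 0.24
print(f"population-weighted effect: {weighted:.2f}") # 0.17
```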

When people treat each other, however, selection biases in the participant pool change the nature of the treatment itself. For instance, in a Deliberative Poll, a portion of the treatment is the other participants. Naturally, then, the exact treatment depends on the pool of people, and biases in the initial pool mean the treatment is different from what it would be with a representative pool. For inference, one may exploit across-group variation in composition.

Sampling on M-Turk

13 Oct

In many of the studies that use M-Turk, there appears to be little strategy to the sampling. A study is posted (and reposted) on M-Turk until a particular number of respondents have taken it. If the pool of respondents reflects true population proportions, if people arrive in no particular order, and if all kinds of people find the monetary incentive equally attractive, the method should work well. There is reasonable evidence to suggest that at least the first and third conditions are violated. One costly but easy fix for the third is to increase payment rates. We can likely do better.

If we are agnostic about the variables on which we want precision, here's one way to sample: start with a list of strata and their proportions in the population of interest. If the population of interest is US adults, the proportions are easily known. Set up screening questions and recruit, raising the price to fill cells that are running short. Take simple precautions: for one, to prevent gaming, do not change the recruitment prompt to let people know that you want X kinds of people.
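A sketch of the bookkeeping, with hypothetical strata, target proportions, and total N: screen incoming workers into cells and flag cells that are running short so their pay can be raised.

```python
# Sketch: quota tracking for stratified recruitment on M-Turk.
# Strata, target proportions, and the total N are hypothetical.
TOTAL_N = 1000
targets = {"male_under_30": 0.10, "male_30_plus": 0.38,
           "female_under_30": 0.10, "female_30_plus": 0.42}
quotas = {cell: round(share * TOTAL_N) for cell, share in targets.items()}
filled = {cell: 0 for cell in quotas}

def accept(cell):
    """Screen an incoming respondent: accept only if their cell still has room."""
    if filled[cell] < quotas[cell]:
        filled[cell] += 1
        return True
    return False

def cells_running_short(min_fill=0.5):
    """Cells below `min_fill` of their quota are candidates for a pay raise (not a changed prompt)."""
    return [cell for cell in quotas if filled[cell] < min_fill * quotas[cell]]

print(accept("female_30_plus"), cells_running_short())
```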

On balance, let there be imbalance on observables

27 Sep

For whatever reason, some people are concerned with imbalance when analyzing data from randomized experiments. The concern may be more general, but its fixes devolve into reducing imbalance on observables. Such fixes may fix things or break things. More generally, it is important to keep in mind what one experiment can show. If randomization is done properly, and other assumptions hold, the most common estimator of experimental effects, the difference in means, is unbiased. We also have a good idea of how often the true value will fall within the bounds. For tightening those bounds, relying on sample size is the way to go. The general rule applies: larger is better. But the rule admits some refinements. When every unit is the same, for instance neutrons* (in most circumstances), and measurement error isn't a concern, a sample of 1 will do just fine. The point also holds for a case that is potentially easier to obtain than every unit being the same: when the treatment effect is constant across units. When everyone differs in their treatment effect, randomization won't help, though one can always try to quantify the heterogeneity. More generally, the greater the heterogeneity, the larger the sample required for a given level of balance. Stratified random assignment (or blocking) will help. This isn't to say that the raw difference estimator will otherwise be biased. It won't. Just that its variance will be higher.

* Based on discussion in Gerber and Green on why randomization is often not necessary in physics.
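A simulation sketch of the last point (the data-generating process is made up): with a prognostic covariate, both complete randomization and blocked assignment give an (approximately) unbiased difference in means, but blocking visibly shrinks the variance.

```python
# Sketch: blocking doesn't fix bias (there is none to fix here) but reduces variance.
# The data-generating process is made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, true_effect, n_sims = 200, 1.0, 2000

def assign_complete(n):
    """Complete randomization: exactly half the units treated."""
    return rng.permutation(np.repeat([0, 1], n // 2))

def assign_blocked(x):
    """Stratified (blocked) assignment: randomize separately within each level of x."""
    t = np.zeros(len(x), dtype=int)
    for v in (0, 1):
        idx = np.where(x == v)[0]
        t[rng.permutation(idx)[: len(idx) // 2]] = 1
    return t

estimates = {"complete": [], "blocked": []}
for _ in range(n_sims):
    x = rng.integers(0, 2, n)                 # a prognostic covariate
    y0 = 2.0 * x + rng.normal(0, 1, n)        # potential outcome under control
    y1 = y0 + true_effect                     # constant treatment effect keeps things simple
    for name, t in (("complete", assign_complete(n)), ("blocked", assign_blocked(x))):
        y = np.where(t == 1, y1, y0)
        estimates[name].append(y[t == 1].mean() - y[t == 0].mean())

for name, est in estimates.items():
    print(f"{name:9s} mean = {np.mean(est):.3f}, sd = {np.std(est):.3f}")
```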