Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP, dmoz.org), a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kinds of media. Using these lists to estimate consumption raises three obvious concerns: 1) even sites classified as ‘news sites,’ such as the NYT, carry a fair bit of non-news content; 2) at the level of domains, there is the danger of ‘false positives’ (domains on the list that aren’t news sites); and 3) again at the level of domains, there is the danger of ‘false negatives’ (news sites that are missing from the list).

FGR address the first concern by exploiting the URL structure: the URL of a NY Times story contains information about the section. (The classifier is assumed to be perfect but likely isn’t. False positive and false negative rates for this kind of classification can be estimated using raw article data.) This leaves us with the concern about false positives and false negatives at the domain level. Lists like those published by DMOZ appear to be curated well enough to not contain too many false positives. The real question is how to estimate the false negatives. Here’s one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large, comprehensive database like Shallalist. For the domains that aren’t in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against Shallalist is to reduce the amount of querying.) Using the results, estimate the proportion of news domains that are missing from the list (the total number of missing domain names is likely much, much larger). Also estimate the missed visitation time, page views, etc.
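The procedure above can be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: the function name, the `classify` callable (standing in for whatever URL classification service is queried), the category label "news", and the example domains are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Set


def estimate_false_negatives(
    page_views: Dict[str, int],      # sampled domain -> page views in the sample
    curated_list: Set[str],          # the ad hoc list of news domains being audited
    reference_db: Dict[str, str],    # domain -> category, e.g. built from a local database
    classify: Callable[[str], str],  # hypothetical wrapper around a URL classification service
    news_label: str = "news",
) -> Dict[str, float]:
    """Estimate how much news the curated list misses, by domains and by page views."""
    news_domains = set()
    for domain in page_views:
        # Look the domain up locally first; query the service only on a miss.
        category = reference_db.get(domain) or classify(domain)
        if category == news_label:
            news_domains.add(domain)

    missed = news_domains - curated_list
    news_views = sum(page_views[d] for d in news_domains)
    missed_views = sum(page_views[d] for d in missed)
    return {
        "share_of_news_domains_missed": len(missed) / len(news_domains) if news_domains else 0.0,
        "share_of_news_page_views_missed": missed_views / news_views if news_views else 0.0,
    }


# Made-up sample: four domains with their page views.
sample = {
    "nytimes.com": 900,
    "smalltownpaper.example": 60,
    "localblog.example": 40,
    "shop.example": 500,
}
result = estimate_false_negatives(
    page_views=sample,
    curated_list={"nytimes.com"},
    reference_db={
        "nytimes.com": "news",
        "smalltownpaper.example": "news",
        "shop.example": "shopping",
    },
    classify=lambda domain: "news" if domain == "localblog.example" else "other",
)
print(result)
```

In this toy sample, two of the three news domains are missing from the curated list, but they account for only a tenth of news page views. Of the two numbers, the page-view share is the one that matters more for estimates of consumption.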

Towards a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (AirBnB, for instance, allows this). This leads to a simpler interface, but it produces some inefficiency: some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.

Rape in India

16 Jun

According to crime reports, rape is about 15 times less common in India than in the US. A variety of concerns apply. For one, the definition of rape varies considerably. But differences aren’t always in the expected direction. For instance, the Indian Penal Code considers sex under the following circumstance as rape: “With her consent, when the man knows that he is not her husband, and that her consent is given because she believes that he is another man to whom she is or believes herself to be lawfully married.” In 2013, however, the definition of rape under the Indian Penal Code was updated to what is generally on par with international definitions, with two major exceptions: the above clause, and more materially, the continued exclusion of marital rape. For two, there are genuine fears about rape being yet more severely underreported in India.

Evidence from Surveys
Given that rape is underreported, anonymous surveys of people are better indicators of the prevalence of rape. In the US, comparing CDC National Intimate Partner and Sexual Violence Survey data to FBI crime reports suggests that only about 6.6% of rapes are reported. (Though see also the comparison to the National Crime Victimization Survey (NCVS), which suggests that one of every three rapes is reported.) In India, according to Lancet (citing numbers from American Journal of Epidemiology (Web Table 4)), “1% of victims of sexual violence report the crime to the police.” Another article based on data from the 2005 National Family Health Survey (NFHS) finds that the corresponding figure for India is .7%. (One odd thing about the article caught my eye: the proportion of marital rape reported ought to be zero given the Indian Penal Code doesn’t recognize marital rape. But perhaps reports can still be made.) Combined with the roughly 15-fold gap in reported rapes, this pegs the rate of rape in India at either about 60% of the rate in the US (if CDC numbers are more commensurable to NFHS numbers) or at about three times as much (if NCVS numbers are more commensurable to NFHS numbers).
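The arithmetic behind the two estimates can be checked with a few lines. The sketch uses only the figures quoted above and assumes that the survey-based reporting shares can be applied directly to the ratio of reported rates. The variable and function names are mine.

```python
# Figures quoted in the post.
reported_ratio_india_to_us = 1 / 15  # reported rapes, India relative to the US
us_reporting_cdc = 0.066             # share of rapes reported in the US (CDC vs. FBI)
us_reporting_ncvs = 1 / 3            # share of rapes reported in the US (NCVS)
india_reporting_nfhs = 0.007         # share of rapes reported in India (NFHS-based article)


def true_rate_ratio(reported_ratio, us_reporting, india_reporting):
    """India's underlying rate relative to the US after undoing underreporting."""
    # True rate = reported rate / share reported, so the ratio of true rates is the
    # ratio of reported rates times (US share reported / India share reported).
    return reported_ratio * us_reporting / india_reporting


ratio_cdc = true_rate_ratio(reported_ratio_india_to_us, us_reporting_cdc, india_reporting_nfhs)
ratio_ncvs = true_rate_ratio(reported_ratio_india_to_us, us_reporting_ncvs, india_reporting_nfhs)

print(round(ratio_cdc, 2))   # about 0.63, i.e. roughly 60% of the US rate
print(round(ratio_ncvs, 2))  # about 3.17, i.e. roughly three times the US rate
```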

The article based on NFHS data also estimates that nearly 98% of rapes are committed by husbands. This compares to 26% in the US according to NCVS. So one startling finding is that the risk of rape for unmarried women is very low in India. Or that the chances of being raped as a married woman are astonishingly high. Data also show that rapes by husbands in India are especially unlikely to be reported. The low risk of rape for unmarried women may be a consequence of something equally abhorrent: fearful of sexual harassment, or as a consequence of patriarchy at home, Indian women may be much less likely to be outdoors than men (it would be good to quantify this).

Partisan Retrospection?: Partisan Gaps in Retrospection are Highly Variable

11 Jun

The difference between partisans’ responses on retrospection items is highly variable, ranging from over 40% to nearly 0. For instance, in 1988 nearly 30% fewer Democrats than Republicans reported that the inflation rate between 1980 and 1988 had declined. (It had.) However, similar proportions of Republicans and Democrats got questions about changes in the size of the budget deficit and defense spending between 1980 and 1988 right. The mean partisan gap across 20 items asked in the NES over 5 years (1988, 1992, 2000, 2004, and 2008) was about 15 points (the median was about 12 points), and the standard deviation was about 13 points. (See the tables.) This much variation suggests that observed bias in partisans’ perceptions depends on a variety of conditioning variables. For one, there is some evidence to suggest that during severe recessions, partisans do not differ much in their assessment of economic conditions (see here). Even when there are partisan gaps, however, they may not be real (see paper (pdf)).

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act (FOIA) is vital for holding the government to account. But little effort has gone into building tools that enable fulfilment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies, and long delays for the requesters. Here, I discuss a couple of alternatives. One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer: a tool that detects proper nouns, addresses, telephone numbers, etc. and blurs them. This is easily achieved using modern machine learning methods. To push accuracy toward 100%, humans can quickly vet the algorithm’s suggestions along with ‘suspect words.’ Or a CAPTCHA-like system that asks people in developing countries to label suspect words as nouns, etc. can be created to further reduce costs. This covers one conventional solution.
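To make the anonymizer idea concrete, here is a deliberately crude sketch. It is not the machine learning approach described above: simple regular expressions (assuming US-style phone numbers) and a capitalization heuristic stand in for a trained model. The function name, patterns, and example text are illustrative assumptions. The sketch redacts what it can match with confidence and returns a list of ‘suspect words’ for a human to vet.

```python
import re

# Patterns for identifiers with a regular shape (illustrative only).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def redact(text):
    """Blur emails and phone numbers; return the text plus suspect words to vet."""
    redacted = EMAIL.sub("[EMAIL]", text)
    redacted = PHONE.sub("[PHONE]", redacted)

    suspects = []
    words = redacted.split()
    for i, word in enumerate(words):
        token = word.strip(".,;:!?()\"'")
        starts_sentence = i == 0 or words[i - 1].endswith((".", "!", "?"))
        # A capitalized word in the middle of a sentence may be a proper noun.
        if token[:1].isupper() and token[1:].islower() and not starts_sentence:
            suspects.append(token)
    return redacted, suspects


text = "Please call John Smith at 202-555-0147 or write to jsmith@agency.example today."
redacted, suspects = redact(text)
print(redacted)
print(suspects)
```

The split into "redact automatically" and "flag for review" mirrors the division of labor in the post: the algorithm handles the clear cases and people vet the rest.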

Another way to solve the problem would be to create a sandbox architecture. Rather than give requesters stacks of redacted documents, which is often unnecessary, one can give people the right to run certain queries on the data. Results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.

Optimal Cost Function When Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, making lending decisions optimally means maximizing the reduction in error costs (prediction errors × cost of errors) net of the cost of making predictions. (Keeping things simple here.) The cost of any one particular error, especially the denial of a loan to an eligible applicant, is typically small for the bank but very consequential for the applicant. So the applicant may be willing to pay the bank to increase the accuracy of its decisions, say, by compensating the bank for the cost of getting a person to take a closer look at the file. If customers are willing to pay the cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may well be able to increase profits.) Customers’ willingness to pay for increased accuracy is typically not exploited by lending institutions. It may be well worth exploring.
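A toy calculation shows the gap between the bank’s incentives and the applicant’s. All the numbers and names below are hypothetical, and the decision rule (pay for a manual review only if the expected reduction in error cost exceeds the review cost) is a simplification made for this sketch.

```python
def worth_reviewing(error_prob, error_prob_after_review, cost_of_error, review_cost):
    """True if the expected reduction in error cost exceeds the cost of the review."""
    expected_saving = (error_prob - error_prob_after_review) * cost_of_error
    return expected_saving > review_cost


# Hypothetical numbers for a single application.
error_prob = 0.10                 # chance the automated decision wrongly denies the loan
error_prob_after_review = 0.02    # chance of a wrong denial after a person reviews the file
review_cost = 50.0                # cost of the manual review
bank_cost_of_error = 300.0        # profit the bank forgoes on a wrongly denied loan
applicant_cost_of_error = 5000.0  # what a wrongful denial costs the applicant

bank_reviews_on_its_own = worth_reviewing(
    error_prob, error_prob_after_review, bank_cost_of_error, review_cost
)
applicant_would_pay = worth_reviewing(
    error_prob, error_prob_after_review, applicant_cost_of_error, review_cost
)

print(bank_reviews_on_its_own)  # the bank's expected saving (about 24) is below the review cost
print(applicant_would_pay)      # the applicant's expected saving (about 400) is well above it
```

Under these made-up numbers, the review never happens if the bank bears the cost, but it is worth it to the applicant several times over.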

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, three years ago, the software for reconstructing shredded documents that won the DARPA award used “human[s] [to] verify what the computer was recommending.” The same insight is used in character recognition tasks. I have used it to create software for matching dirty data; the software was used to merge shape files with electoral returns at the precinct level.

The class of problems for which human input proves useful has one essential attribute: humans produce unbiased, if error-prone, estimates for these problems. (And that is something you may want to verify with training data.) So, for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions (see also this NYT article).

Big Data Algorithms: Too Complicated to Communicate?

11 Apr

“A decision is made about you, and you have no idea why it was done,” said Rajeev Date, an investor in data-science lenders and a former deputy director of the Consumer Financial Protection Bureau.

From NYT: If Algorithms Know All, How Much Should Humans Help?

The assertion that there is no intuition behind decisions made by algorithms strikes me as silly. So does the related assertion that such intuition cannot be communicated effectively. We can back out the logic for most algorithms. Heuristic accounts of the logic, e.g., which variables were important, can be given yet more easily. For instance, for seemingly complicated-to-interpret methods such as ensemble methods, intuition for which variables are important can be obtained in the same way as it is for methods like bagging. And even when specific points are hard to convey, the meta-logic of the system can be explained to the end user.
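One common way to get at "which variables were important" without opening up the model is permutation importance: shuffle one variable at a time and see how much predictive accuracy drops. The sketch below is one illustration of that idea, not a claim about what any particular lender or vendor does. The toy model, the data, and all names are made up.

```python
import random


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature's column is shuffled across rows."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only feature j replaced by its shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(shuffled))
    return importances


# Toy data: the outcome depends only on the first of two features.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]


def black_box(row):
    """Stand-in for an opaque model; we only ever call it, never inspect it."""
    return int(row[0] > 0.5)


importances = permutation_importance(black_box, rows, labels, n_features=2)
print(importances)
```

Shuffling the first feature hurts accuracy a lot, while shuffling the second changes nothing. That gives a plain-language account a user can follow: the decision turned on the first variable, not the second.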

What is true, however, is that it isn’t being done. For instance, WSJ covering Orion routing system at UPS reports:

“For example, some drivers don’t understand why it makes sense to deliver a package in one neighborhood in the morning, and come back to the same area later in the day for another delivery. …One driver, who declined to speak for attribution, said he has been on Orion since mid-2014 and dislikes it, because it strikes him as illogical.”

WSJ: At UPS, the Algorithm Is the Driver

Communication architecture is an essential part of all human focused systems. And what to communicate when are important questions that deserve careful thought. The default cannot be no communication.

The lack of systems that communicate intuition behind algorithms strikes me as a great opportunity. HCI people — make some money.

Estimating Hillary’s Missing Emails

11 Apr


55000/(365*4) ≈ 37.7 per day. That seems a touch low for a Secretary of State.

1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other State Department employees

Lower bound for missing emails from Clinton:

  1. Take a small weighted random sample (weighting seniority more) of top State Department employees.
  2. Go through their email accounts on the State Department server and count the number of emails from Clinton to their State Department addresses.
  3. Compare it to the number of emails to these employees in the Clinton cache.
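The comparison in steps 2 and 3 can be sketched as below. The function name, the employee labels, and the counts are made up for illustration. The sketch only totals the shortfall within the sample, which is itself a lower bound on the number of missing emails. It does not try to extrapolate to all employees.

```python
def missing_email_lower_bound(server_counts, cache_counts):
    """Emails from Clinton found on the official server but absent from the cache.

    server_counts: employee -> emails from Clinton found in the employee's official account
    cache_counts:  employee -> emails to that employee found in the Clinton cache
    """
    per_employee = {
        employee: max(0, on_server - cache_counts.get(employee, 0))
        for employee, on_server in server_counts.items()
    }
    return sum(per_employee.values()), per_employee


# Made-up counts for three sampled employees.
server_counts = {"employee_a": 120, "employee_b": 45, "employee_c": 10}
cache_counts = {"employee_a": 100, "employee_b": 45, "employee_c": 12}

total_missing, detail = missing_email_lower_bound(server_counts, cache_counts)
print(total_missing)
print(detail)
```

Flooring each employee's difference at zero keeps an employee with more emails in the cache than on the server from offsetting a shortfall elsewhere.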

To propose amendments, go to the GitHub gist.

Some Hard Feelings: Feelings Towards Some Racial and Ethnic Groups in 4 Countries

8 Aug

According to YouGov surveys in Switzerland, the Netherlands, and Canada, and the 2008 ANES in the US, Whites in each of the four countries feel, on average, fairly coldly toward Muslims and people from Muslim-majority regions, giving an average thermometer rating of less than 50 on a 0 to 100 scale (Feelings towards different ethnic, racial, and religious groups). However, in Europe, Whites’ feelings toward Romanians, Poles, and Serbs and Kosovars are scarcely any warmer, and sometimes cooler. Meanwhile, Whites feel relatively warmly towards East Asians.