About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

26 Apr

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar data), however, it may not be the right thing to do. There are two big reasons for that. First, a low false positive rate can be achieved much more easily if we do not care about the false negative rate, and there are good reasons to worry a lot about false negative rates in cyber data. Second, and perhaps more importantly, incorporating user input on complex tasks (or where data is insufficiently rich) reduces to the following: given a complex task and inadequate time, the users use cheap heuristics to label the data, and the supervised aspect of the algorithm reduces to learning the cheap heuristics that humans use.
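The trade-off between the two error rates can be made concrete with a small sketch. The scores and labels below are made up for illustration, and `rates` is a hypothetical helper, not part of the MIT system. Raising the flagging threshold drives the false positive rate down while the false negative rate climbs.

```python
def rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a score threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp / labels.count(0), fn / labels.count(1)

# Hypothetical anomaly scores; label 1 marks a true attack.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.9, 0.5, 0.15):
    fpr, fnr = rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

At the strictest threshold nothing benign is flagged, but three of the four attacks are missed.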

Clarifai(ng) the Future of Clarifai

31 Dec

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of the Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai would have to split these two things over time and allow people to input what abstract dimensions are ‘appropriate’ for them. As the idiom goes, a picture is worth a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of the day, etc. And Clarifai should allow people to pick dimensions appropriate for the task. Defining dimensions would be hard, but that shouldn’t stymie the effort. And ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension, etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning whether a photo features a man or a woman. And the user wants to build a Clarifai-based classifier. (That person is me. The task is inferring gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all other information in the images, including jewelry, color, perspective, etc. and provide an out-of-sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting the users manage it at their end. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go in. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow users, for instance, to count the number of women in a photo. Another direction would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem like violating the spirit of what I noted above, but it needn’t: a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines that return the list of tags that best distinguishes one image or set of images from another.

Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say, tags in ‘UK English’ etc.

I Recommend It: A Recommender System For Scholarship Discovery

1 Oct

The rate of production of scholarship has never been higher. And while our ability to discover relevant scholarship per unit of time has never kept pace with the production of knowledge, it has also risen sharply, most recently due to Google Scholar.

The efficiency of discovery of relevant scholarship, however, has plateaued over the last few years, even as the rate of production of scholarship has kept its steady upward climb. Part of the reason the growth has plateaued is that current ways of doing things cannot be made considerably more efficient very quickly. New growth in the rate of discovery will need knowledge discovery systems to get more data on the user’s needs, and access to structured databases of academic production.

The easiest next step would perhaps be to build a collaborative recommendation system. In particular, think of a system that takes your .bib file, trawls a large database of citation lists, for example, JSTOR or PLoS, and produces recommendations for scholarship you may have missed. The logic of a collaborative recommender system is pretty simple: learn from articles that cite scholarship similar to what you cite. If we have metadata on scholarship, for instance, sub-field, the actual text of an article, or even the abstract, we could recommend based on the extent to which two articles cite the same kind of scholarly article. Direct elicitations (search terms are but one form) from the user can also be added to guide the recommendations. And meta-characteristics, for instance, the PageRank of a piece of scholarship, could be used to order recommendations.
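A minimal sketch of the citation-overlap logic, under stated assumptions: the user's .bib file has already been parsed into a set of citation keys, the corpus is a dictionary mapping each article to the set of keys it cites, and similarity is measured with the Jaccard index (one choice among many). All keys and the `recommend` function are hypothetical.

```python
from collections import Counter

def recommend(my_refs, corpus, top_n=3):
    """Rank works the user has not cited by how often similar articles cite them.

    my_refs: set of citation keys from the user's .bib file.
    corpus:  dict mapping article id -> set of citation keys it cites.
    """
    scores = Counter()
    for refs in corpus.values():
        overlap = len(my_refs & refs)
        if overlap == 0:
            continue  # article shares nothing with the user's references
        similarity = overlap / len(my_refs | refs)  # Jaccard similarity
        for ref in refs - my_refs:
            scores[ref] += similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda r: (-scores[r], r))[:top_n]

# Hypothetical citation keys.
my_refs = {"ref_a", "ref_b", "ref_c"}
corpus = {
    "article_1": {"ref_a", "ref_b", "ref_x"},
    "article_2": {"ref_a", "ref_c", "ref_x", "ref_y"},
    "article_3": {"ref_p", "ref_q"},
}
print(recommend(my_refs, corpus))
```

Metadata such as sub-field or abstract text could enter as extra terms in the similarity, and a popularity measure could be used as a tie-breaker in the final sort.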

Beyond Anomaly Detection: Supervised Learning from ‘Bad’ Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log a bunch of details about the request. For instance, details about the connecting IPs, the protocol being used, etc. (One popular piece of software for collecting such data is Cisco’s NetFlow.) Lots of companies analyze this data in an attempt to flag ‘anomalous’ transactions. Given that the data are low quality (IPs do not uniquely map to people, and information per transaction is vanishingly small), the chances of building a useful anomaly detection algorithm using conventional unsupervised methods are extremely low.

One way to solve the problem is to re-express it as a supervised problem. Google, various security firms, security researchers, etc. flag a bunch of IPs every day for various nefarious activities, including hosting malware (passive DNS), scanning, or actively . Check to see if these IPs are in the database, and learn from the transactions that include the blacklisted IPs. Using the model, flag transactions that look most similar to the transactions with blacklisted IPs. Then validate the flagged transactions with the highest predicted probability of involving a malicious IP, either by checking whether those IPs are blacklisted at a future date or by consulting a subject matter expert.
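The supervised framing can be sketched in a few lines. This is an illustration under heavy assumptions, not a production detector: the two features per transaction are invented, the IPs come from documentation-only address ranges, and a small hand-rolled logistic regression stands in for whatever classifier one would actually use.

```python
import math

def predict(w, x):
    """Probability that a transaction involves a malicious IP (logistic model)."""
    z = w[0] + sum(wj * v for wj, v in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; w[0] is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical features per transaction: [share of uncommon ports, scaled bytes sent].
history = [
    ("192.0.2.1", [0.9, 0.8]),
    ("192.0.2.2", [0.8, 0.9]),
    ("198.51.100.1", [0.1, 0.2]),
    ("198.51.100.2", [0.2, 0.1]),
]
blacklist = {"192.0.2.1", "192.0.2.2"}  # IPs flagged by third parties

# Label each logged transaction by whether its IP is blacklisted.
X = [features for _, features in history]
y = [1 if ip in blacklist else 0 for ip, _ in history]
w = train(X, y)

# Score new transactions and rank them, most suspicious first.
new = [("203.0.113.5", [0.85, 0.75]), ("203.0.113.6", [0.15, 0.10])]
ranked = sorted(((ip, predict(w, f)) for ip, f in new), key=lambda t: -t[1])
for ip, p in ranked:
    print(ip, round(p, 3))
```

The top of the ranked list is what would be checked against later blacklists or handed to a subject matter expert.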

Some Aspects of Online Learning Re-imagined

3 Sep

Trevor Hastie is a superb teacher. He excels at making complex things simple. Till recently, his lectures were only available to those fortunate enough to be at Stanford. But now he teaches a free introductory course on statistical learning via the Stanford online learning platform. (I browsed through the online lectures — they are superb, the quality of his teaching, undiminished.)

Online learning, pregnant with rich possibilities, however, has met its cynics and its first failures. The proportion of students who drop off after signing up for these courses is astonishing (but then so are the sign-up rates). And some very preliminary data suggest that, compared to some face-to-face teaching, students enrolled in online learning don’t perform as well. But these are early days and much can be done to refine the systems and to also promote the interests of thousands upon thousands of teachers who have much to contribute to the world. In that spirit, some suggestions:

Currently, Coursera, edX and similar such ventures mostly attempt to replicate an off-line course online. The model is reasonable but doesn’t do justice to the unique opportunities of teaching online:

The Online Learning Platform: For Coursera etc. to be true learning platforms, they need to provide rich APIs for allowing development of both learning and teaching tools (and easy ways for developers to monetize these applications). For instance, professors could get access to visualization applications, and students could get access to applications that put them in touch with peers interested in peer-to-peer teaching. The possibilities are endless. Integration with other social networks and other communication tools, at the discretion of students, is an obvious future step.

Learning from Peers: The single teacher model is quaint. It locks out contributions from other teachers. And it locks out teachers from potential revenues. We could turn it into a win-win. Scout ideas and materials from teachers and pay them for their contributions, ideally as a share of the revenue — so that they have a steady income.

The Online Experimentation Platform: So many researchers in education and psychology are curious and willing to contribute to this broad agenda of online learning. But no protocols and platforms have been established to exploit this willingness. The online learning platforms could release more data at one end. And at the other end, they could create a platform for researchers to option ideas that the community and company can vet. Winning ideas can form the basis of research that is implemented on the platform.

Beyond these ideas, there are ideas for courses that enhance students’ ability to succeed in online courses. Providing students ‘life skills’ either face-to-face or online, and teaching them ways to become more independent students, can allow them to better adapt to the opportunities and challenges of learning online.

Toward a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for a quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (Airbnb, for instance, allows this). It leads to a simpler interface but it produces some inefficiency—some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.
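One way to picture bidding on both sides is a simple double auction. The sketch below is only that: it ignores location, wait time, and everything else a real marketplace would need, and the function name, the greedy matching rule, and the midpoint pricing are all assumptions made for illustration.

```python
def match_rides(bids, asks):
    """Greedy double-auction matching.

    bids: list of (rider, max_price_willing_to_pay)
    asks: list of (driver, min_price_willing_to_accept)
    Pairs the highest bids with the lowest asks while bid >= ask,
    and prices each ride at the midpoint of the two.
    """
    bids = sorted(bids, key=lambda b: -b[1])
    asks = sorted(asks, key=lambda a: a[1])
    matches = []
    for (rider, bid), (driver, ask) in zip(bids, asks):
        if bid < ask:
            break  # no remaining pair can trade
        matches.append((rider, driver, (bid + ask) / 2))
    return matches

# Hypothetical riders and drivers.
bids = [("rider_1", 20), ("rider_2", 12), ("rider_3", 8)]
asks = [("driver_1", 10), ("driver_2", 11), ("driver_3", 15)]
for rider, driver, price in match_rides(bids, asks):
    print(rider, driver, price)
```

In this toy market, the rider in a hurry (the highest bidder) is served first, and the most expensive driver goes unmatched.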

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act (FOIA) is vital for holding the government to account. But little effort has gone into building tools that enable fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for the requesters. Here, I discuss a couple of alternatives. One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer: a tool that detects proper nouns, addresses, telephone numbers, etc. and blurs them. This is easily achieved using modern machine learning methods. To ensure 100% accuracy, humans can quickly vet the algorithm’s suggestions along with ‘suspect words.’ Or a CAPTCHA-like system that asks people in developing countries to label suspect words as nouns etc. can be created to further reduce costs. This covers one conventional solution. Another way to solve the problem would be to create a sandbox architecture. Rather than give requesters stacks of redacted documents, often unnecessary, one can give people the right to run certain queries on the data. Results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.
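To make the anonymizer idea concrete, here is a deliberately crude sketch. It swaps in regular expressions and a capitalization heuristic for the machine learning methods mentioned above, so treat it as a stand-in: the patterns cover only simple US-style phone numbers and email addresses, and `suspect_words` is a hypothetical helper that produces the list a human reviewer would vet.

```python
import re

# Simple patterns; real documents would need far broader coverage.
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def redact(text):
    """Replace phone numbers and email addresses with placeholders."""
    text = PHONE.sub("[PHONE]", text)
    return EMAIL.sub("[EMAIL]", text)

def suspect_words(text):
    """Capitalized words that do not start a sentence: candidates for human review."""
    suspects = []
    words = text.split()
    for prev, word in zip(words, words[1:]):
        token = word.strip(".,;:!?\"'()")
        starts_sentence = prev.endswith((".", "!", "?"))
        if token[:1].isupper() and token[1:].islower() and not starts_sentence:
            suspects.append(token)
    return suspects

# Made-up sample text with a fictional name, number, and address.
sample = "Please call Jane Doe at 555-010-0199 or write to jane@example.org. Thanks for the help."
print(redact(sample))
print(suspect_words(sample))
```

The split mirrors the workflow described above: the machine handles the unambiguous patterns, and the ambiguous tokens go to people.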

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, the software for reconstructing shredded documents that won the DARPA award three years ago used “human[s] [to] verify what the computer was recommending.” The same insight is used in character recognition tasks. I have used it to create software for matching dirty data; the software was used to merge shapefiles with electoral returns at the precinct level.
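A minimal sketch of how such a semi-automated matcher might be organized, assuming string similarity from Python's `difflib` and two thresholds picked arbitrarily for illustration. This is not the software described above. Confident matches are accepted automatically, borderline ones are queued for a person, and the rest are dropped.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(name, candidates, accept=0.9, review=0.6):
    """Return ('auto' | 'human' | 'no_match', best_candidate_or_None)."""
    best = max(candidates, key=lambda c: similarity(name, c))
    score = similarity(name, best)
    if score >= accept:
        return ("auto", best)      # machine is confident; merge directly
    if score >= review:
        return ("human", best)     # plausible; ask a person to verify
    return ("no_match", None)

# Hypothetical precinct names from two sources.
shapefile_names = ["Saint Marys Pct 4", "Oak Hill Precinct 12", "Riverside 3"]
for returns_name in ["Oak Hill Precinct 12", "St. Mary's Precinct 4", "Lakeview 9"]:
    print(returns_name, "->", triage(returns_name, shapefile_names))
```

The thresholds control how much work reaches the human; they would need to be tuned against a hand-checked sample.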

The class of problems for which human input proves useful has one essential attribute — humans produce unbiased, if error-prone, estimates for these problems. So for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)

Big Data Algorithms: Too Complicated to Communicate?

11 Apr

“A decision is made about you, and you have no idea why it was done,” said Rajeev Date, an investor in data-science lenders and a former deputy director of the Consumer Financial Protection Bureau.

From NYT: If Algorithms Know All, How Much Should Humans Help?

The assertion that there is no intuition behind decisions made by algorithms strikes me as silly. So does the related assertion that such intuition cannot be communicated effectively. We can back out the logic for most algorithms. Heuristic accounts of the logic, e.g., which variables were important, can be given yet more easily. For instance, for inference from seemingly complicated-to-interpret methods such as ensemble methods, intuition for what variables are important can be obtained in the same way as it is obtained for methods like bagging. However, even when specific points are hard to convey, the meta-logic of the system can be explained to the end user.

What is true, however, is that it isn’t being done. For instance, the WSJ, covering the Orion routing system at UPS, reports:
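One common way to get at "which variables were important" for an opaque model is permutation importance: shuffle one input column at a time and see how much accuracy drops. The sketch below assumes a made-up black-box lending rule and toy data; the function names are hypothetical, and this is one of several importance measures rather than the definitive one.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical black box: approves on feature 0 (say, income) and ignores feature 1.
def black_box(row):
    return 1 if row[0] > 50 else 0

X = [[20, 1], [30, 0], [40, 1], [45, 0], [60, 1], [70, 0], [80, 1], [90, 0]]
y = [black_box(row) for row in X]

importances = permutation_importance(black_box, X, y)
print(importances)
```

A statement like "the decision depended on the first variable and not at all on the second" is the kind of heuristic account that can be handed to an end user.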

“For example, some drivers don’t understand why it makes sense to deliver a package in one neighborhood in the morning, and come back to the same area later in the day for another delivery. …One driver, who declined to speak for attribution, said he has been on Orion since mid-2014 and dislikes it, because it strikes him as illogical.”

WSJ: At UPS, the Algorithm Is the Driver

Communication architecture is an essential part of all human-focused systems. What to communicate, and when, are important questions that deserve careful thought. The default cannot be no communication.

The lack of systems that communicate intuition behind algorithms strikes me as a great opportunity. HCI people — make some money.

Idealog: Monetizing Websites using Mechanical Turk

14 Jul

Mechanical Turk, a system for crowd-sourcing complex Human Intelligence Tasks (HITs) by parceling a large task and putting the parcels on a marketplace, has seen a recent resurgence, partly on the back of the success of easier-to-use clones like crowdflower.com.

Given that on the Internet people find it easier to part with labor than with money, one way to monetize websites may be to ask users to help pay for the site by doing a task on Mechanical Turk, made easily available as part of the website. Websites can ask users to specify, as part of their profiles, the kinds of tasks they would like to work on.

For example, a news website may give its users a choice between watching ads and doing a small number of tasks for every X articles that they read. Similarly, a news website may ask its users to proofread a sentence or two of its own articles, thereby reducing its production costs.