Beyond Anomaly Detection: Supervised Learning from ‘Bad’ Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log details about the request, such as the connecting IPs and the protocol being used. (One popular tool for collecting such data is Cisco’s NetFlow.) Lots of companies analyze this data in an attempt to flag ‘anomalous’ transactions. Because the data are low quality (IPs do not map uniquely to people, and each transaction carries very little information), the chances of building a useful anomaly detection algorithm using conventional unsupervised methods are extremely low.

One way to solve the problem is to re-express it as a supervised problem. Every day, Google, various security firms, security researchers, and others flag IPs for various nefarious activities, including hosting malware (passive DNS), scanning, or actively . Check whether these IPs appear in the database, and learn from the transactions that involve the blacklisted IPs. Using the model, flag the transactions that look most similar to those involving blacklisted IPs. Then validate the flagged transactions with the highest predicted probability of involving a malicious IP, either by checking whether those IPs are blacklisted at a future date or by consulting a subject matter expert.
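
As a rough illustration of that workflow, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the flow records, the IP addresses (taken from documentation-only ranges), the three features, and the choice of a small hand-rolled logistic regression. A real pipeline would use far more data, richer features, and probably an off-the-shelf classifier.

```python
import math

# Hypothetical flow records: (src_ip, dst_ip, dst_port, packets, bytes).
FLOWS = [
    ("10.0.0.1", "203.0.113.5", 6667, 40, 2000),
    ("10.0.0.2", "203.0.113.5", 6667, 35, 1800),
    ("10.0.0.3", "198.51.100.7", 6667, 50, 2500),
    ("10.0.0.1", "192.0.2.10", 443, 300, 400000),
    ("10.0.0.2", "192.0.2.11", 443, 250, 350000),
    ("10.0.0.4", "192.0.2.12", 80, 200, 300000),
    ("10.0.0.5", "192.0.2.99", 6667, 45, 2250),  # not blacklisted, but similar
]

# Hypothetical blacklist assembled from third-party feeds.
BLACKLIST = {"203.0.113.5", "198.51.100.7"}


def label(flow):
    """1 if either endpoint is blacklisted, else 0."""
    return int(flow[0] in BLACKLIST or flow[1] in BLACKLIST)


def features(flow):
    """Bias term plus three illustrative features (assumed, not canonical)."""
    _, _, port, packets, nbytes = flow
    return [
        1.0,
        float(port not in (80, 443)),   # uncommon destination port
        math.log1p(packets),            # flow length
        math.log1p(nbytes / packets),   # average packet size
    ]


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train(X, y, lr=0.05, epochs=5000):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


X = [features(f) for f in FLOWS]
y = [label(f) for f in FLOWS]
weights = train(X, y)

# Score only the flows that do not involve a blacklisted IP, and rank them
# by how much they resemble the flows that do.
ranked = sorted(
    ((predict(weights, features(f)), f) for f in FLOWS if label(f) == 0),
    reverse=True,
)

for score, flow in ranked:
    print(f"{score:.3f}", flow)
```

The flows at the top of `ranked` are the candidates to validate, either by rechecking their IPs against later blacklists or by handing them to an analyst. Note that flows not involving a blacklisted IP are treated as negatives during training even though some of them may be malicious, so the scores are better read as a ranking than as calibrated probabilities.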