American PII: Lapses in Securing Confidential Data

23 Sep

At least 83% of Americans have had confidential data they shared with a company exposed in a breach (see here and here). The list of companies most frequently implicated in the loss of confidential data makes for sobering reading. Reputable companies like LinkedIn (Microsoft), Adobe, Dropbox, etc., are among the top 20 worst offenders.

Source: Pwned: The Risk of Exposure From Data Breaches

There are two other, seemingly contradictory, facts. First, many of the companies that have failed to safeguard confidential data hold some kind of highly regarded security certification, like SOC 2 (see, e.g., here). Second, many data breaches are caused by elementary errors, e.g., “the password cryptography was poorly done and many were quickly resolved back to plain text” (here).

The explanation for why companies with highly regarded security certifications fail to protect data is probably mundane. Supporters of these certifications may rightly claim that the certifications dramatically reduce the chances of a breach without eliminating them. And even a 1% failure rate can easily produce the observation we started with.

So, how do we secure data? Before discussing solutions, let me describe the current state. In many companies, PII is spread across multiple databases. Data protection is based on processes set up to control access to the data. The data may also be encrypted, but generally it isn’t. Many of these processes are auditable, and certifications are granted based on audits.

Rather than relying on adherence to processes, a better bet might be to not let PII percolate across the system. The primary options for prevention are customer-side PII removal and ingestion-time PII removal. (Methods like differential privacy can be used at either end and in how automated data collection services are set up.) Beyond these systems, you need a system for cases where PII is shown in the product. One way to handle such cases is to hash the PII during ingest and look it up right before serving from a system that is yet more tightly access controlled (see the sketch below). All of these things are well known. Their lack of adoption is partly due to the fact that these services have yet to be abstracted out enough that adding them is as easy as editing a YAML file. And therein lies an opportunity.
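A minimal sketch of the hash-at-ingest, look-up-at-serve pattern. Everything here is illustrative: `secure_store` stands in for the tightly access-controlled lookup service, `SECRET_KEY` would come from a key management system, and the PII field list is made up.

```python
# Sketch of ingest-time PII tokenization and serve-time lookup (illustrative).
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"   # assumption: fetched from a KMS in practice
PII_FIELDS = {"email", "ssn", "phone"}          # assumption: fields flagged as PII

secure_store = {}  # stand-in for the tightly access-controlled token -> value service


def tokenize(value: str) -> str:
    """Replace a PII value with a keyed hash (token)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def ingest(record: dict) -> dict:
    """Strip PII before the record percolates into downstream databases."""
    clean = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            token = tokenize(value)
            secure_store[token] = value   # only this store ever sees raw PII
            clean[field] = token
        else:
            clean[field] = value
    return clean


def serve(record: dict) -> dict:
    """Re-hydrate PII right before display, from the access-controlled store."""
    return {
        field: secure_store.get(value, value) if field in PII_FIELDS else value
        for field, value in record.items()
    }
```

Downstream analytics only ever see tokens; the raw values live in a single, separately audited service.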

Estimating the Trend at a Point in a Noisy Time Series

17 Apr

Trends in time series are valuable. If the cost of a product rises suddenly, it likely indicates a sudden shortfall in supply or a sudden rise in demand. If the cost of claims filed by a patient rises sharply, it plausibly suggests rapidly worsening health.

But how do we estimate the trend at a particular time in a noisy time series? The answer is simple: smooth the time series using any one of many methods (local polynomials, GAMs, or similar), and then estimate the derivative(s) of the smoothed function at the chosen point in time. Smoothing out the noise is essential. If you skip the smoothing and go with a naive estimate of the derivative, that estimate can be heavily negatively correlated with the derivative obtained from the smoothed time series. For instance, in an example we present, the correlation is –.47.

Clarification

Sometimes we want to know what the “trend” was over a particular time window. But what that means is not 100% clear. For a synopsis of the issues, see here.

Python Package

incline provides a couple of ways of approximating the underlying function for the time series:

  • fitting a local higher order polynomial via Savitzky-Golay over a window of choice
  • fitting a smoothing spline

The package provides a way to estimate the first and second derivative at any given time using either of those methods. Beyond these smarter methods, the package also provides a naive estimator of slope: the average change when you move one step forward (step = observed time units) and one step backward. Users can also calculate the average or maximum slope over a time window (over observed time steps).
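Not the incline API itself, but a minimal sketch of the underlying idea using scipy’s Savitzky-Golay filter, assuming numpy and scipy are installed and the series is regularly spaced.

```python
# Sketch: smoothed vs. naive derivative estimates at a point in a noisy series.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.arange(100)
y = np.sin(t / 10) + rng.normal(scale=0.3, size=t.size)   # noisy series

# Smoothed first derivative: fit a local cubic over a 15-point window.
dy_smooth = savgol_filter(y, window_length=15, polyorder=3, deriv=1, delta=1.0)

# Naive derivative: average of one-step-forward and one-step-backward change.
dy_naive = np.gradient(y)

print("trend estimate at t=50 (smoothed):", dy_smooth[50])
print("trend estimate at t=50 (naive):   ", dy_naive[50])
```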

Stemming Link Rot

23 Mar

The Internet gives us many things, but none that are permanent. That is about to change. Librarians got together and recently launched https://perma.cc/, which provides permanent links to web content.

Why is link rot important?

Here’s an excerpt from a paper by Gertler and Bullock:

“more than one-fourth of links published in the APSR in 2013 were broken by the end of 2014”

If what you are citing evaporates, there is no way to check the veracity of the claim. Journal editors: pay attention!

countpy: Incentivizing more and better software

22 Mar

Developers of Python packages sometimes envy R developers for the simple perks they enjoy, for example, a reliable web service that gives a reasonable indication of the total number of times an R package has been downloaded (albeit only from one of the mirrors). To get the same number, Python developers need to run a Google BigQuery query (which costs money) and wait 30 or so seconds.

Then there are sore spots shared by all developers. Downloads are a shallow metric. Developers often want to know how often people use their packages. Without such a number, it is hard to defend against accusations like, “the total number of downloads is unreliable because it can be padded by numerous small releases,” or “the total number of downloads doesn’t reflect how often people use the software.” We partly solve this problem for Python developers by providing a website that tallies how often a package is used in repositories on GitHub, the largest open-source software hosting platform. http://countpy.com (Defunct. Code.) provides the total number of times a package appears in requirements files and in import statements in Python-language repositories (a rough sketch of this kind of tally is below).
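This is not how countpy itself works (it crawled GitHub), but a rough local approximation of the tally it computes, run over repositories you have already cloned. The package name and path are illustrative.

```python
# Sketch: count mentions of a package in requirements files and import statements.
import re
from pathlib import Path

PACKAGE = "requests"   # illustrative: package whose usage we want to tally
IMPORT_RE = re.compile(rf"^\s*(import|from)\s+{re.escape(PACKAGE)}\b", re.MULTILINE)

def count_usage(repo_root: str) -> dict:
    counts = {"requirements": 0, "imports": 0}
    root = Path(repo_root)
    for req in root.rglob("requirements*.txt"):
        text = req.read_text(errors="ignore")
        counts["requirements"] += sum(
            1 for line in text.splitlines() if line.strip().lower().startswith(PACKAGE)
        )
    for py in root.rglob("*.py"):
        counts["imports"] += len(IMPORT_RE.findall(py.read_text(errors="ignore")))
    return counts

print(count_usage("path/to/cloned/repos"))   # illustrative path
```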

The net benefit (or loss) of a piece of software is, of course, greater than what is captured by counts of how many people use it directly in the software they build. We don’t yet count indirect use: software that uses software that uses the software of interest. Ideally, we would like to tally the total time saved, the increase in the number of projects started, projects that wouldn’t have started had the software not been there, the impact on the style in which other code is written, and so on. We also want to tally the cost of errors in the original software. To the extent that people don’t produce software because they can’t be credited reasonably for it, better metrics about the impact of software can increase both the production of software and the quality of the software that is being produced.

Canonical Insights

20 Oct

If the canonical insight of computer science is automating repetition, the canonical insight of data science is optimization. It isn’t that computer scientists haven’t thought about optimization. They have. But computer scientists weren’t the first to think about automation, just like economists weren’t the first to think that incentives matter. Automation is just the canonical, foundational purpose of computer science.

Similarly, optimization is the canonical, foundational purpose of data science. Data science aims to provide the “optimal” action at time t conditional on what you know. And it aims to do that by learning from data optimally. For instance, if the aim is to separate apples from oranges, the aim of supervised learning is to give the best estimate of whether the fruit is an apple or an orange given data.

For certain kinds of problems, the optimal way to learn from data is not to exploit found data but to learn from new data collected in an optimal way. For instance, randomized inference allows us to compare two arbitrary regimes. And if you want to optimize persuasiveness, you need to continuously experiment with different pitches (the number of dimensions along which pitches can be generated can be large), some of which exploit human frailties (which vary across people) and some of which exploit the fact that people need to be pitched the relevant value, and that relevant value differs across people. (A toy sketch of such continuous experimentation is below.)
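A toy sketch of what continuous experimentation over pitches can look like, using Thompson sampling. The three pitches and their response rates are simulated, not real data.

```python
# Sketch: Thompson sampling to keep experimenting with pitches while exploiting the best one.
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.05, 0.08, 0.12]          # unknown persuasion rates of 3 pitches (simulated)
successes = np.zeros(3)
failures = np.zeros(3)

for _ in range(10_000):
    # Sample a plausible rate for each pitch from its Beta posterior, pick the best draw.
    draws = rng.beta(successes + 1, failures + 1)
    pitch = int(np.argmax(draws))
    converted = rng.random() < true_rates[pitch]
    successes[pitch] += converted
    failures[pitch] += 1 - converted

print("times each pitch was shown:", successes + failures)
```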

Once you know the canonical insight of a discipline, it opens up all the problems that can be “solved” by it. It also tells you what kind of platform you need to build to make optimal decisions for that problem. For some tasks, the “platform” may be supervised learning. For other tasks, like ad persuasiveness, it may be a platform that combines supervised learning (for targeting) and experimentation (for optimizing the pitch).

Town Level Data on Cable Operators and Cable Channels

12 Sep

I am pleased to announce the release of TV and Cable Factbook Data (1997–2002; 1998 coverage is modest). Use of the data is restricted to research purposes.

Background

In 2007, Stefano DellaVigna and Ethan Kaplan published a paper that used data from Warren’s Factbook to identify the effect of the introduction of Fox News Channel on Republican vote share (link to paper). Since then, a variety of papers exploiting the same data and identification scheme have been published (see, for instance, Hopkins and Ladd, Clinton and Enamorado, etc.).

In 2012, I embarked on a similar project—trying to use the data to study the impact of the introduction of Fox News Channel on attitudes and behaviors related to climate change. However, I found the original data to be limited—DellaVigna and Kaplan had used a team of research assistants to manually code a small number of variables for a few years. So I worked on extending the data in two ways: adding more years, and adding ‘all’ the data for each year. To that end, I developed custom software. The collection and parsing of a few thousand densely packed, inconsistently formatted pages (see the sample page below) into a usable CSV (see the snapshot below) finished sometime early in 2014. (To make it easier to create a crosswalk with other geographical units, I merged the data with town lat./long. (centroid) and elevation data from http://www.fallingrain.com/world/US/.)

Sample Page (image)
Snapshot of the Final CSV (image)

Soon after I finished the data collection, however, I became aware of a paper by Martin and Yurukoglu. They found some inconsistencies between the Nielsen data and the Factbook data (see Appendix C1 of paper), tracing the inconsistencies to delays in updating the Factbook data—“Updating is especially poor around [DellaVigna and Kaplan] sample year. Between 1999 and 2000, only 22% of observations were updated. Between 1998 and 1999, only 37% of observations were updated.” Based on their paper, I abandoned the plan to use the data, though I still believe the data can be used for a variety of important research projects, including estimating the impact of the introduction of Fox News. Based on that belief, I am releasing the data.

Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP, dmoz.org), a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kinds of media. Using these lists to estimate consumption raises three obvious concerns: 1) even sites classified as ‘news sites,’ such as the NYT, carry a fair bit of non-news; 2) (speaking categorically) there is the danger of ‘false positives’; and 3) (speaking categorically again) there is the danger of ‘false negatives.’

FGR address the first concern by exploiting the URL structure. They exploit the fact that the URL of a NY Times story contains information about the section. (The classifier is assumed to be perfect. It likely isn’t. False positive and negative rates for this kind of classification can be estimated using raw article data.) This leaves us with concerns about false positives and negatives at the domain level. Lists like those published by DMOZ appear to be curated well enough not to contain too many false positives. The real question is how to calibrate false negatives. Here’s one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large, comprehensive database like the Shallalist. For the domains that aren’t in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against the Shallalist is to reduce the amount of querying.) Using the results, estimate the proportion of missing domain names (the net number of missing domain names is likely much, much larger). Also estimate missed visitation time, page views, etc. A sketch of this procedure is below.
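A sketch of that calibration procedure under stated assumptions: `known_news_domains` stands in for the Shallalist-style database, and `classify_domain` is a hypothetical wrapper around a URL classification service such as Trusted Source (its API is not shown here).

```python
# Sketch: estimate the share of news domains missed by a curated list.
import random

def estimate_missed_news_share(browsing_domains, known_news_domains, classify_domain,
                               sample_size=10_000, seed=0):
    random.seed(seed)
    domains = sorted(set(browsing_domains))
    sample = random.sample(domains, min(sample_size, len(domains)))
    # Only query the (rate-limited, possibly paid) classification service for
    # domains that the curated database does not already cover.
    unknown = [d for d in sample if d not in known_news_domains]
    newly_found_news = [d for d in unknown if classify_domain(d) == "news"]
    # Share of sampled domains that are news sites missing from the curated list.
    return len(newly_found_news) / len(sample)
```

The same loop can be extended to tally missed visitation time and page views by carrying those fields along with each sampled domain.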

A Quick Scan: From Paper to Digital

28 May

Additional Note: See this presentation on paper to digital (pdf).

There are two ways of converting paper to digital data: ‘human OCR and input,’ and machine-assisted. Here are some useful pointers about the latter.

Scanning

Since the success of so much of what comes after depends on the quality of the scanned document, invest a fair bit of effort in obtaining high-quality scans. If the paper input is in the form of a book, and if the book is bound, and especially if it is thick, it is hard to copy text close to the spine without destroying the spine. If the book can be bought at a reasonable price, do exactly that – destroy the spine – cut the book and then scan. Many automated scanning services will cut the spine by default. Other things to keep in mind: scan at a reasonably high resolution (since storage is cheap, go for at least 600 dpi), and if choosing PDF as an output option, see if the scan can be saved as a “searchable PDF.” (So scanning + OCR in one go.)

An average person working without too much distraction can scan 60-100 images per hour. If two pages of the book can be scanned at once, this means 120-200 pages can be scanned in an hour. Assuming you can hire someone to scan for $10/hr, that comes to 12-20 pages per dollar, or 5 to 8 cents per page. However, scanning manually is boring, and people who are punctilious about quality page after page are few and far between. Relying on automated scanning companies may be better. And cheaper. 1dollarscan.com charges 2 cents per page with OCR. But there is no free lunch. Most automated scanning services cut the spines of books, and many don’t send back the paper copy for reasons to do with copyright, so you may have to factor in the cost of the book. And typically, the scanning services do all or nothing; you can’t ask them to scan the first 10 pages, followed by the middle 120, etc. Thus per-relevant-page costs may exceed those of manual scanning.

Scan to Text

If the text is clear and laid out simply, most commonly available OCR software will do just fine. Acrobat Professional’s own facility for recognizing text in images, found under ‘tools,’ is reasonable. Acrobat also flags words that it is unsure about – it calls them ‘suspects’ – and you can click through the lineup of ‘suspects,’ correcting them as needed. The interface for making corrections is suboptimal, but it is likely to prove satisfactory for small documents with few errors.

Those without ready access to Adobe Professional can extract text from a ‘searchable PDF’ using xpdf or any of the popular programming languages. In Python, pyPdf (see script) and pdfminer (among other libraries) are popular. If the document is a set of images, one can use libraries based on Tesseract (see script). PDF documents need to be converted to images using Ghostscript or similar rasterization software before being fed to Tesseract. (A sketch of both routes is below.)
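A minimal sketch of both routes, assuming pdfminer.six, pdf2image (which wraps Poppler rather than Ghostscript), and pytesseract are installed, along with the Tesseract binary. The file name is illustrative.

```python
# Sketch: text extraction from a searchable PDF vs. OCR on an image-only PDF.
from pdfminer.high_level import extract_text
from pdf2image import convert_from_path
import pytesseract

# Route 1: the PDF already has a text layer ("searchable PDF") -- pull it out directly.
text = extract_text("scan.pdf")

# Route 2: the PDF is just images -- rasterize each page, then OCR it.
pages = convert_from_path("scan.pdf", dpi=300)
ocr_text = "\n".join(pytesseract.image_to_string(page) for page in pages)
```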

But what if the quality of the scans is poor or the page layout complicated? First, try enhancing the images – fixing orientation, using filters to enhance readability, etc. This process can be automated if the images are distorted in similar ways. Second, try extracting boundary boxes for columns/paragraphs and words/characters (position/size) and font styles (name/size) by choosing XML/HTML as the output format; this information can later be exploited for alignment, etc. However, how much you gain from extracting style and boundary-box information depends heavily on the quality of the original PDF. For low-quality PDFs, mislabeling of font size and style can be common, which means the information cannot be used reliably. Third, explore training the OCR engine. Tesseract can be trained to improve OCR, though training it isn’t straightforward. Fourth, explore good professional OCR engines such as Abbyy FineReader (see the R package connecting to the Abbyy FineReader API). OCR in Abbyy FineReader can be easily improved by adding training data and by tuning the options for identifying the proper ‘area order’ (which text area follows which, which portion of the page isn’t part of a text area, etc.).
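A sketch of the open-source route to boundary boxes, using pytesseract’s word-level output; the image file name is illustrative.

```python
# Sketch: word-level bounding boxes and confidences from Tesseract via pytesseract.
import pytesseract
from PIL import Image

img = Image.open("page_001.png")   # illustrative file name
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for word, left, top, width, height, conf in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"], data["conf"]
):
    if word.strip():
        print(f"{word!r} at ({left}, {top}, {width}x{height}), confidence {conf}")
```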

Post-processing: Correcting Errors

The OCR process makes certain kinds of errors. For instance, ‘i’ may be confused for an ‘l’ or for a ‘pipe.’ And these errors are often repeated. One consequence of these errors is that some words are misspelled systematically. One way to deal with such errors is to simply search and replace particular strings (see script). When using search and replace, it pays to be mindful of problems that may ensue from searching and replacing short strings. For instance, replacing ‘lt’ with ‘it’ may mean converting ‘salt’ to ‘sait.’

Note: For a more general account on matching dirty data, read this article.

It is typically useful to pair a search-and-replace script with a database of search terms and the terms you propose to replace them with. For one, it allows you to apply the same set of corrections to many documents. For two, it can be easily amended and re-run. While writing these scripts (see script), it is important to keep issues to do with text encoding in mind; OCR yields some ligatures (e.g., fi) and some other Unicode characters.

Searching and replacing particular strings can prove time-consuming, as the same word can often be misspelled in tens of different ways. For instance, in a recent project, the word “programming” was misspelled in no fewer than 28 ways. The easiest extension to this approach is to search for patterns, for instance, some number of sequential errors at any point in the string. One can extend it further with various ‘edit distances,’ e.g., the Levenshtein distance, the number of single-character edits needed to convert one word into another, which allows the user to handle non-sequential errors (see script). Again, the trick is to keep the length of the string in mind, as false positives may abound. One can do that by choosing a metric that factors in the size of the string and the number of errors, for instance, (Levenshtein distance)/(length of the original string). Lastly, rather than use edit distance, one can apply ‘better’ pattern-matching algorithms, such as Ratcliff/Obershelp (see the sketch below).
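A sketch of length-aware fuzzy matching: a plain Levenshtein distance, its length-normalized version, and difflib’s SequenceMatcher, which implements a Ratcliff/Obershelp-style similarity ratio. The misspelling used here is made up for illustration.

```python
# Sketch: edit distance, normalized edit distance, and Ratcliff/Obershelp-style matching.
import difflib

def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled by the length of the original string."""
    return levenshtein(a, b) / max(len(a), 1)

print(levenshtein("programming", "progrnmming"))                       # 1
print(normalized_distance("programming", "progrnmming"))               # ~0.09
print(difflib.SequenceMatcher(None, "programming", "progrnmming").ratio())
```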

Spell checks, which combine pattern-matching libraries and databases (dictionaries), are another common way of searching for and replacing strings. There are various freely available spell-check databases (to be used with your own pattern-matching algorithm) and libraries, including pyEnchant for Python. One can even use VB to call MS Word’s fairly reasonable spell-checking functions. However, if the document contains lots of unique proper nouns, a spell check is liable to create more problems than it solves. One way to reduce these errors is to (manually) amend the database. Another is to limit corrections to words of a certain size, or to cases where the suggested word and the word in the text differ only by certain kinds of letters (or non-letters, e.g., a ‘pipe’ for the letter l). One can also leverage ‘Google Suggest’ to fix spelling errors. Spell checks built for particular OCR errors, such as ocrspell, can also be used. If these methods still yield too many false corrections, one can go for a semi-automated approach: use spell checks to harvest problematic words and recommended replacements, and then let people pick the right version. A few tips for creating human-assisted spell-check versions: eliminate duplicates, and provide the user with neighboring words (two words before and after can work for some projects). Lastly, one can use M-Turk to iteratively proofread the document (see TurkIt).
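A sketch of a guarded spell-check pass with pyEnchant (imported as `enchant`), assuming the en_US dictionary is installed. The length and similarity thresholds are arbitrary choices meant to keep false corrections down; suggestions can vary by dictionary.

```python
# Sketch: only accept a spell-check suggestion when the word is long enough
# and the suggested fix is a small edit away from the original.
import difflib
import enchant

d = enchant.Dict("en_US")

def correct(word: str, min_len: int = 5, min_ratio: float = 0.8) -> str:
    if len(word) < min_len or d.check(word):
        return word
    suggestions = d.suggest(word)
    if suggestions and difflib.SequenceMatcher(None, word, suggestions[0]).ratio() >= min_ratio:
        return suggestions[0]
    return word            # leave uncertain cases for human review

print(correct("prohramming"))   # likely corrected to "programming"
```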

Bad Weather: Getting Data on Weather in a ZIP Code on a Particular Date

27 Jun

High-quality weather data are public. But they aren’t easy to make use of.

Some thoughts and some software for finding out the weather in a particular ZIP Code on a particular day (or a set of dates).

Some brief ground clearing before we begin. Weather data come from weather stations, which can belong to any of five or more “networks,” each of which collects somewhat different data, sometimes labels the same data differently, and has different reporting protocols. The only geographic information that typically comes with weather stations is their latitude and longitude. By “weather,” we may mean temperature, rain, wind, snow, etc., and we may want data on these for every second, minute, hour, day, month, etc. It is good to keep in mind that not all weather stations report data for all units of time, and there can be a fair bit of missing data. Getting data at coarse time units like days or months typically involves deciding which statistic is most useful. For instance, you may want the minimum and maximum for daily temperature and totals for rainfall and snow. With that primer, let’s begin.

We begin with what not to do. Do not use the NOAA web service. The API provides a straightforward way to get “weather” data for a particular ZIP Code for a particular month. Except the requests often return nothing, and it isn’t clear why. The documentation doesn’t say whether the search for the closest weather station is limited to X kilometers; without such a limit, one should get data for all ZIP Codes and all dates. Nor does the API return how far away the weather station from which it got the data is, though one can estimate that post hoc using the Google Geocoding API. However, given the possibility that the backend for the API will improve over time, here’s a script for getting the daily weather data, and another for hourly precipitation data.

On to what can be done. The “web service” that you can use is the Farmer’s Almanac’s. Sleuthing using scripts that we discuss later reveals that the Almanac reports data from the NWS-USAF-NAVY stations (ftp link to the data file). And it appears to have data for most periods, though no information is provided about the weather station from which it got the data or that station’s distance from the ZIP Code.

If you intend to look for data from GHCND, COOP, or ASOS, there are two kinds of crosswalks that you can create: 1) from ZIP codes to weather stations, and 2) from weather stations to ZIP Codes. I assume that we don’t have access to shapefiles (for census ZIP Codes) and that postal ZIP Codes encompass a geographic region. To create a weather station to ZIP Code crosswalk, web service such as Geonames or Google Geocoding API can be used. If the station lat,./long. is in the zip code, the distance comes up as zero. Otherwise the distance is calculated as distance from the centroid of the ZIP Code (see geonames script that finds 5 nearest ZIPs for each weather station). For creating a ZIP code to weather station crosswalk, we get centroids of each ZIP using a web service such as Google (or use already provided centroids from free ZIP databases). And then find the “nearest” weather stations by calculating distances to each of the weather stations. For a given set of ZIP Codes, you can get a list of closest weather stations (you can choose to get n closest stations, or say all weather stations within x kilometers radius, and/or choose to get stations from particular network(s)) using the following script. The output lists for each ZIP Code weather stations arranged by proximity. The task of getting weather data from the closest station is simple thereon—get data (on a particular set of columns of your choice) from the closest weather station from which the data are available. You can do that for a particular ZIP Code and date (and date range) combination using the following script.