Interpeting Clusters and ‘Outliers’ from Clustering Algorithms

19 Feb

Assume that the data from the dominant data generating process are structured so that they occupy a few small portions of a high-dimensional space. Say we use a hard partition clustering algorithm to learn the structure of the data. And say that it does—learn the structure. Anything that lies outside the few narrow pockets of high-dimensional space is an ‘outlier,’ improbable (even impossible) given the dominant data generating process. (These ‘outliers’ may be generated by a small malicious data generating processes.) Even points on the fringes of the narrow pockets are suspicious. If so, one reasonable measure of suspiciousness of a point is its distance from the centroid of the cluster to which it is assigned; the further the point from the centroid, the more suspicious it is. (Distance can be some multivariate distance metric, or proportion of points assigned to the cluster that are further away from the cluster centroid than the point whose score we are tallying.)

How can we interpret an outlier (score)? Tautological explanations—it is improbable given the dominant data generating process—aside.

Simply providing distance to the centroid doesn’t give enough context. And for obvious reasons, for high-dimensional vectors, providing distance on each feature isn’t reasonable either. A better approach involves some feature selection. This can be done in various ways, all of which take the same general form. Find distance to the centroid on features on which the points assigned to the cluster have the least variation. Or, on features that discriminate the cluster from other clusters the best. Or, on features that predict distance from the cluster centroid the best. Limit the features arbitrarily to a small set. On this limited feature set, calculate cluster means and standard deviations, and give standardized distance (for categorical variable, just provide ) to the centroid.

Read More (pdf with pseudo code)