Previously, I wrote about an autoencoder based price outlier detection algorithm. In this post, I will introduce another useful algorithm, cluster analysis based detector.

Clustering is a non-supervised algorithm used in machine learning for categorization. Being non-supervised, you don’t have to label each price as normal or outlier. You just provide a hyperparameter, in this case usually the number of clusters (k), and clustering software tries to classify price data so that similar prices flock together to form up to k clusters. Roughly speaking, there are cluster centers (for example, the average of the prices within each cluster). When a new price data arrives, distances from the price to the centers of clusters are calculated, and the data is assigned to the nearest cluster having minimal center-to-data distance. Then the center location of the cluster is updated accordingly using the data. Thus, if there are outliers that lie apart from other regular prices, we observe multiple clusters where regular prices form a cluster with the largest number of elements ( let’s call it the main cluster ), whereas outliers will form separate clusters. By examining cluster label, one can easily sort out outliers. Continue Reading in LinkedIn