Clustering

Updated at 2017-11-13 14:37

Clustering is the task of gathering samples into groups of similar samples according to some similarity measure.

Common applications of clustering:

  • Building customer profiles for market analysis.
  • Grouping related web pages.
  • Grouping related stock quotes for investment portfolio management.
  • Used as a preprocessing step for recommender systems.
  • Building prototype samples for unsupervised feature extraction for supervised learning algorithms.

Terms:

  • Distance Function aka. Metric: Defines how we calculate the distance between two elements of a set. A set with a distance function is called a metric space (see the sketch after this list).
  • Clustering: a collection of clusters or the act of creating clusters.
  • Outlier: an element not belonging to any cluster.
  • Hard Clustering: each element either belongs to a cluster or does not.
  • Soft/Fuzzy Clustering: each element belongs to each cluster to a certain degree, e.g. a percentage.
  • Strict Partitioning Clustering: each element belongs to exactly one cluster.
  • Strict Partitioning Clustering with Outliers: each element belongs to exactly one cluster or no cluster. Elements not belonging to a cluster are called outliers.
  • Overlapping Clustering: each element may belong to more than one cluster.
  • Hierarchical Clustering: elements that belong to a child cluster also belong to the parent cluster.
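
As a concrete example of a distance function, a minimal sketch of the Euclidean metric in plain Python (the function name and the sample points are illustrative):

```python
import math

# A metric must satisfy: d(a, b) >= 0, d(a, a) == 0, d(a, b) == d(b, a),
# and the triangle inequality d(a, c) <= d(a, b) + d(b, c).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```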

A cluster is always a group of data items, but clusters can have vastly different additional properties depending on the algorithm that produced them. These sets of additional properties are called cluster models.

Cluster models:

  • Connectivity Models: grouped based on distance connectivity.
  • Centroid Models: each cluster is defined by a centroid in metric space and each element belongs to the cluster with the nearest centroid.
  • Distribution Models: grouped based on the statistical distribution of elements (see the sketch after this list).
  • Density Models: connected dense regions in metric space.
  • Subspace Models: groups have relevant attributes in addition to elements.
  • Group Models: provide only the grouping information itself, without a refined model for each group.
  • Graph-based Models: two nodes in a graph are in the same cluster if they share an edge.
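
As a sketch of a distribution model, assuming scikit-learn is available: a Gaussian mixture gives every sample a membership degree per cluster, which also makes it an example of soft clustering (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two well-separated blobs of synthetic 2D points.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# Fit a mixture of two Gaussians; each component models one cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft clustering: a (200, 2) array of per-cluster membership degrees.
probabilities = gmm.predict_proba(X)
print(probabilities[0])  # e.g. [0.999..., 0.000...]
```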

Clustering methods that work when you have samples and features (see the sketch after this list):

  • k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
  • Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but does not scale to a large number of samples.
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
  • OPTICS: Ordering Points to Identify the Clustering Structure. A DBSCAN variant that handles clusters of varying density much better.

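
A minimal sketch comparing k-means and DBSCAN, assuming scikit-learn is available; the synthetic dataset and the parameters n_clusters, eps and min_samples are illustrative:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three blobs of synthetic 2D points.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# k-means: the number of clusters k must be chosen up front;
# each sample is assigned to the nearest centroid.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: no k needed, only the density parameters eps and min_samples;
# samples that end up in no cluster get the outlier label -1.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(set(kmeans_labels))  # {0, 1, 2}
print(set(dbscan_labels))  # e.g. {0, 1, 2} plus -1 if outliers are found
```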
TODO:

  • Locally Linear Embedding: Neighbor-based
  • Isomap: Neighbor-based

Clustering methods that work with a pairwise affinity matrix between samples (see the sketch after this list):

  • Affinity Propagation: Based on message passing between samples.
  • Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
  • Ward: Agglomerative hierarchical clustering that repeatedly merges the pair of clusters that minimizes the within-cluster variance (Ward's method).
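
A minimal sketch of clustering from an affinity matrix, assuming scikit-learn is available: pairwise RBF similarities between samples form the affinity matrix, which Spectral Clustering then treats as the adjacency matrix of a weighted graph (the dataset and gamma are illustrative):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Two blobs of synthetic 2D points.
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)

# Pairwise similarities between all samples: a (100, 100) affinity matrix.
affinity = rbf_kernel(X, gamma=1.0)

# Interpret the affinity matrix as a graph and find a normalized cut.
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(set(labels))  # {0, 1}
```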
