ruk·si

🗄️ Datasets
Clustering

Updated at 2024-12-21 01:55

Clustering is the task of grouping samples into sets of similar samples based on some similarity measure.

Common applications of clustering:

  • Building customer profiles for market analysis
  • Grouping related web pages
  • Grouping related stock quotes for investment portfolio management
  • Preprocessing step for recommender systems
  • Prototyping for feature extraction in supervised learning

Terminology

Distance Function aka. Metric
how we calculate a distance between two elements of a set
Metric Space
a set with a distance function
Clustering
a collection of clusters or the act of creating clusters
Outlier
an element not belonging to any cluster
Hard Clustering
each element belongs to a cluster or not
Soft/Fuzzy Clustering
each element belongs to a cluster to a certain degree e.g. percentage
Strict Partitioning Clustering
each element belongs to exactly one cluster
Strict Partitioning Clustering with Outliers
each element belongs to exactly one cluster or no cluster
Overlapping Clustering
each element may belong to more than one cluster
Hierarchical Clustering
elements that belong to a child cluster also belong to the parent cluster

Models

Cluster is always a group of data items. But clusters can have vastly different additional properties, depending on the algorithm that produced them. These sets of additional properties are called cluster models.

Cluster models:

  • Connectivity Models: grouped based on distance connectivity
  • Centroid Models: define clustering centroids in metric space and each element belongs to the cluster with the shortest centroid distance
  • Distribution Models: grouped based on statistical distribution of elements
  • Density Models: connected dense regions in metric space
  • Subspace Models: groups have relevant attributes in addition to elements
  • Group Models: instead of forming groups, provides grouping information
  • Graph-based Models: nodes are in the same cluster if they share an edge

Clustering methods that work when you have samples and features:

  • k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
  • Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but not scalable to high number of samples.
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
  • OPTICS: Ordering Points to Identify the Clustering Structure. DBSCAN variant that handles clusters with different densities much better.
Locally Linear Embedding
manifold learning technique that is used in the preprocessing step for clustering to reduce dimensionality before e.g., k-means; it's based on local linear approximations of the manifold
Isomap (Isometric Mapping)
manifold learning technique that is used in the preprocessing step for clustering to reduce dimensionality before e.g., k-means; it's based on geodesic distances in the manifold

Clustering methods that work with affinity matrix of two sample sets:

  • Affinity Propagation: Based on message passing between samples.
  • Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
  • Ward: Hierarchical clustering based on the Ward algorithm.

Sources