🗄️ Datasets -
Clustering

Updated at 2024-12-21 01:55

Clustering is the task of grouping samples into sets of similar samples based on some similarity measure.

Common applications of clustering:

Building customer profiles for market analysis
Grouping related web pages
Grouping related stock quotes for investment portfolio management
Preprocessing step for recommender systems
Prototyping for feature extraction in supervised learning

Terminology

Distance Function aka. Metric: how we calculate a distance between two elements of a set
Metric Space: a set with a distance function
Clustering: a collection of clusters or the act of creating clusters
Outlier: an element not belonging to any cluster
Hard Clustering: each element belongs to a cluster or not
Soft/Fuzzy Clustering: each element belongs to a cluster to a certain degree e.g. percentage
Strict Partitioning Clustering: each element belongs to exactly one cluster
Strict Partitioning Clustering with Outliers: each element belongs to exactly one cluster or no cluster
Overlapping Clustering: each element may belong to more than one cluster
Hierarchical Clustering: elements that belong to a child cluster also belong to the parent cluster

Models

Cluster is always a group of data items. But clusters can have vastly different additional properties, depending on the algorithm that produced them. These sets of additional properties are called cluster models.

Cluster models:

Connectivity Models: grouped based on distance connectivity
Centroid Models: define clustering centroids in metric space and each element belongs to the cluster with the shortest centroid distance
Distribution Models: grouped based on statistical distribution of elements
Density Models: connected dense regions in metric space
Subspace Models: groups have relevant attributes in addition to elements
Group Models: instead of forming groups, provides grouping information
Graph-based Models: nodes are in the same cluster if they share an edge

Clustering methods that work when you have samples and features:

k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but not scalable to high number of samples.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
OPTICS: Ordering Points to Identify the Clustering Structure. DBSCAN variant that handles clusters with different densities much better.

Locally Linear Embedding: manifold learning technique that is used in the preprocessing step for clustering to reduce dimensionality before e.g., k-means; it's based on local linear approximations of the manifold
Isomap (Isometric Mapping): manifold learning technique that is used in the preprocessing step for clustering to reduce dimensionality before e.g., k-means; it's based on geodesic distances in the manifold

Clustering methods that work with affinity matrix of two sample sets:

Affinity Propagation: Based on message passing between samples.
Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
Ward: Hierarchical clustering based on the Ward algorithm.

🗄️ Datasets - Clustering

Terminology

Models

Sources

🗄️ Datasets -
Clustering