Clustering
Updated at 2024-12-21 01:55
Clustering is the task of grouping samples into sets of similar samples based on some similarity measure.
Common applications of clustering:
- Building customer profiles for market analysis
- Grouping related web pages
- Grouping related stock quotes for investment portfolio management
- Preprocessing step for recommender systems
- Prototyping for feature extraction in supervised learning
Terminology
- Distance Function (aka Metric)
  - how we calculate a distance between two elements of a set
- Metric Space
  - a set with a distance function
- Clustering
  - a collection of clusters or the act of creating clusters
- Outlier
  - an element not belonging to any cluster
- Hard Clustering
  - each element belongs to a cluster or not
- Soft/Fuzzy Clustering
  - each element belongs to a cluster to a certain degree, e.g. a percentage
- Strict Partitioning Clustering
  - each element belongs to exactly one cluster
- Strict Partitioning Clustering with Outliers
  - each element belongs to exactly one cluster or no cluster
- Overlapping Clustering
  - each element may belong to more than one cluster
- Hierarchical Clustering
  - elements that belong to a child cluster also belong to the parent cluster
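To make the difference between a distance function, hard clustering and soft clustering concrete, here is a minimal sketch in plain Python. The function names, the fixed cluster centers and the inverse-distance weighting are illustrative assumptions, not part of any particular algorithm.

```python
import math

def euclidean(a, b):
    """Euclidean distance, one common choice of distance function (metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_assignment(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def soft_assignment(point, centers):
    """Soft clustering: return membership degrees that sum to 1.

    Inverse-distance weights are an assumption made for illustration only.
    """
    dists = [euclidean(point, c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point coincides with one or more centers: split membership among them.
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    inv = [1.0 / d for d in dists]
    total = sum(inv)
    return [w / total for w in inv]

centers = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical cluster centers
point = (1.0, 0.0)
print(hard_assignment(point, centers))  # index of the nearest center
print(soft_assignment(point, centers))  # one membership degree per cluster
```

The hard assignment returns a single cluster index, while the soft assignment returns one degree of membership per cluster.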
Models
A cluster is always a group of data items, but clusters can have vastly different additional properties depending on the algorithm that produced them. These sets of additional properties are called cluster models.
Cluster models:
- Connectivity Models: elements are grouped based on distance connectivity
- Centroid Models: each cluster is represented by a centroid in the metric space, and each element belongs to the cluster whose centroid is nearest
- Distribution Models: elements are grouped based on their statistical distribution
- Density Models: clusters are connected dense regions in the metric space
- Subspace Models: clusters are defined by both their elements and the attributes relevant to them
- Group Models: the algorithm provides only the grouping information, without a refined model of the result
- Graph-based Models: nodes belong to the same cluster when they are densely connected by edges, e.g. a clique in which every pair of nodes shares an edge
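As an illustration of a connectivity model, the sketch below puts two points in the same cluster whenever a chain of points connects them with every step no longer than a distance threshold. This is equivalent to cutting a single-linkage hierarchy at that threshold. The function name `connectivity_clusters` and the sample points are made up for the example.

```python
import math

def connectivity_clusters(points, threshold):
    """Group point indices linked by chains of pairwise distances <= threshold."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Merge the groups of every pair of points that are close enough.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

points = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11)]
print(connectivity_clusters(points, threshold=1.5))  # [[0, 1, 2], [3, 4]]
```

Points 0 and 2 are farther apart than the threshold, yet they end up in the same cluster because point 1 connects them.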
Clustering methods that work when you have samples and features:
- k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
- Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but does not scale well to a large number of samples.
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
- OPTICS: Ordering Points to Identify the Clustering Structure. DBSCAN variant that handles clusters with different densities much better.
- Locally Linear Embedding: Manifold learning technique used as a preprocessing step to reduce dimensionality before clustering with e.g. k-means. Based on local linear approximations of the manifold.
- Isomap (Isometric Mapping): Manifold learning technique used as a preprocessing step to reduce dimensionality before clustering with e.g. k-means. Based on geodesic distances in the manifold.
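The k-means entry above can be sketched with the classic two-step iteration (often called Lloyd's algorithm): assign each observation to the nearest mean, then recompute each mean. This is a minimal plain-Python sketch, not a production implementation. It assumes the initial centroids are supplied by the caller, and the sample data is invented.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the initial centroids, which is why practical implementations usually try several initializations.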
Clustering methods that work with an affinity matrix, i.e. pairwise similarities between samples:
- Affinity Propagation: Based on message passing between samples.
- Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
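- Ward: Hierarchical clustering based on the Ward algorithm, which repeatedly merges the pair of clusters that least increases the total within-cluster variance.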
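For reference, an affinity matrix can be built from samples and features in many ways. The sketch below uses a Gaussian (RBF) kernel on Euclidean distances, which is one common choice. The function name `rbf_affinity`, the `gamma` value and the sample points are assumptions made for illustration.

```python
import math

def rbf_affinity(points, gamma=1.0):
    """Return the n x n matrix A with A[i][j] = exp(-gamma * ||p_i - p_j||^2)."""
    n = len(points)
    return [
        [math.exp(-gamma * math.dist(points[i], points[j]) ** 2) for j in range(n)]
        for i in range(n)
    ]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
affinity = rbf_affinity(points, gamma=0.5)
for row in affinity:
    print([round(a, 4) for a in row])
```

The matrix is symmetric with ones on the diagonal. Nearby samples have affinities close to 1 and distant samples have affinities close to 0.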