Clustering
Updated at 2017-11-13 14:37
Clustering is the task of gathering samples into groups of similar samples according to some similarity or distance measure.
Common applications of clustering:
- Building customer profiles for market analysis.
- Grouping related web pages.
- Grouping related stock quotes for investment portfolio management.
- Used as a preprocessing step for recommender systems.
- Building prototype samples for unsupervised feature extraction, whose output feeds supervised learning algorithms.
Terms:
- Distance Function aka. Metric: Defines how we calculate a distance between two elements of a set. A set with a distance function is called a metric space (see the sketch after this list).
- Clustering: a collection of clusters or the act of creating clusters.
- Outlier: an element not belonging to any cluster.
- Hard Clustering: each element belongs to a cluster or not.
- Soft/Fuzzy Clustering: each element belongs to a cluster to a certain degree, e.g. a membership percentage.
- Strict Partitioning Clustering: each element belongs to exactly one cluster.
- Strict Partitioning Clustering with Outliers: each element belongs to exactly one cluster or no cluster. Elements not belonging to a cluster are called outliers.
- Overlapping Clustering: each element may belong to more than one cluster.
- Hierarchical Clustering: elements that belong to a child cluster also belong to the parent cluster.
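To make the metric term concrete, here is a minimal sketch (assuming NumPy) of the Euclidean distance, the most common choice of metric:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points of a metric space."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# A metric must be non-negative, symmetric, zero only for identical
# points, and satisfy the triangle inequality.
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```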
A cluster is always a group of data items, but clusters can have vastly different additional properties depending on the algorithm that produced them. These sets of additional properties are called cluster models.
Cluster models:
- Connectivity Models: grouped based on distance connectivity.
- Centroid Models: clusters are represented by centroids in the metric space, and each element belongs to the cluster with the nearest centroid (see the sketch after this list).
- Distribution Models: grouped based on statistical distribution of elements.
- Density Models: connected dense regions in metric space.
- Subspace Models: clusters are defined by both a group of elements and the subset of attributes relevant to that group.
- Group Models: no refined cluster model is produced; only the grouping information itself is provided.
- Graph-based Models: two nodes in a graph are in the same cluster if they share an edge.
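A minimal sketch of the centroid-model idea, assuming toy points and hand-picked centroids (real algorithms such as k-means also iteratively update the centroids):

```python
import numpy as np

# Toy centroid model: each point is assigned to the cluster whose
# centroid is nearest in Euclidean distance.
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Pairwise distances: shape (n_points, n_centroids).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 0 1 1]
```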
Clustering methods that work when you have samples and features (a usage sketch follows the list):
- k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
- Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but does not scale to large numbers of samples.
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
- OPTICS: Ordering Points to Identify the Clustering Structure. A DBSCAN variant that handles clusters of varying density much better.
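A minimal sketch of two of these methods, assuming scikit-learn and synthetic data (the parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: 3 Gaussian blobs in a 2-D feature space.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: k must be chosen up front; each sample joins the nearest mean.
km = KMeans(n_clusters=3, random_state=0).fit(X)
print(np.unique(km.labels_))  # [0 1 2]

# DBSCAN: no k needed; the label -1 marks outliers (noise points).
db = DBSCAN(eps=0.7, min_samples=5).fit(X)
print(np.unique(db.labels_))  # e.g. [-1  0  1  2]
```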
TODO:
- Locally Linear Embedding: Neighbor-based
- Isomap: Neighbor-based
Clustering methods that work on a precomputed affinity (similarity) matrix between samples (a usage sketch follows the list):
- Affinity Propagation: Based on message passing between samples.
- Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
- Ward: Agglomerative hierarchical clustering that repeatedly merges the pair of clusters with the smallest increase in within-cluster variance (Ward linkage).
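A minimal sketch, assuming scikit-learn, of spectral clustering on a precomputed RBF affinity matrix (gamma and n_clusters here are illustrative):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Build a pairwise affinity matrix: RBF similarity between all samples.
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)
affinity = rbf_kernel(X, gamma=1.0)

# Cluster directly on the precomputed affinity matrix.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(affinity)
print(np.unique(labels))  # [0 1]
```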