ruk·si

# Clustering

Updated at 2017-11-13 16:37

Clustering the task of gathering samples into groups of similar samples according to some similarity or measure.

Common applications of clustering:

• Building customer profiles for market analysis.
• Grouping related web pages.
• Grouping related stock quotes for investment portfolio management.
• Used as a preprocessing step for recommender systems.
• Building prototype samples for unsupervised feature extraction for supervised learning algorithms.

Terms:

• Distance Function aka. Metric: Defines how we calculate a distance between two elements of a set. A set with a distance function is called a metric space.
• Clustering: a collection of clusters or the act of creating clusters.
• Outlier: an element not belonging to any cluster.
• Hard Clustering: each element belongs to a cluster or not.
• Soft/Fuzzy Clustering: each element belongs to a cluster to a certain degree e.g. percentage.
• Strict Partitioning Clustering: each element belongs to exactly one cluster.
• Strict Partitioning Clustering with Outliers: each element belongs to exactly one cluster or no cluster. Elements not belonging to a cluster are called outliers.
• Overlapping Clustering: each element may belong to more than one cluster.
• Hierarchical Clustering: elements that belong to a child cluster also belong to the parent cluster.

Cluster is always a group of data items. But clusters can have vastly different additional properties, depending on the algorithm that produced them. These sets of additional properties are called cluster models.

Cluster models:

• Connectivity Models: grouped based on distance connectivity.
• Centroid Models: define clustering centroids in metric space and each element belongs to the cluster with the shortest centroid distance.
• Distribution Models: grouped based on statistical distribution of elements.
• Density Models: connected dense regions in metric space.
• Subspace Models: groups have relevant attributes in addition to elements.
• Group Models: instead of forming element groups, provides grouping information.
• Graph-based Models: two nodes in a graph are in the same cluster if the share an edge.

Clustering methods that work when you have samples and features:

• k-means: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
• Mean Shift: Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but not scalable to high number of samples.
• DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.
• OPTICS: Ordering Points to Identify the Clustering Structure. DBSCAN variant that handles clusters with different densities much better.

TODO:

• Locally Linear Embedding: Neighbor-based
• Isomap: Neighbor-based

Clustering methods that work with affinity matrix of two sample sets:

• Affinity Propagation: Based on message passing between samples.
• Spectral Clustering: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
• Ward: Hierarchical clustering based on the Ward algorithm.