*Clustering*

*Clustering*

Updated at 2017-11-13 14:37

Clustering the task of gathering samples into groups of similar samples according to some similarity or measure.

Common applications of clustering:

- Building customer profiles for market analysis.
- Grouping related web pages.
- Grouping related stock quotes for investment portfolio management.
- Used as a preprocessing step for recommender systems.
- Building prototype samples for unsupervised feature extraction for supervised learning algorithms.

Terms:

**Distance Function aka. Metric:**Defines how we calculate a distance between two elements of a set. A set with a distance function is called a metric space.**Clustering:**a collection of clusters or the act of creating clusters.**Outlier:**an element not belonging to any cluster.**Hard Clustering:**each element belongs to a cluster or not.**Soft/Fuzzy Clustering:**each element belongs to a cluster to a certain degree e.g. percentage.**Strict Partitioning Clustering:**each element belongs to exactly one cluster.**Strict Partitioning Clustering with Outliers:**each element belongs to exactly one cluster or no cluster. Elements not belonging to a cluster are called outliers.**Overlapping Clustering:**each element may belong to more than one cluster.**Hierarchical Clustering:**elements that belong to a child cluster also belong to the parent cluster.

**Cluster is always a group of data items.** But clusters can have vastly different additional properties, depending on the algorithm that produced them. These sets of additional properties are called cluster models.

Cluster models:

**Connectivity Models:**grouped based on distance connectivity.**Centroid Models:**define clustering centroids in metric space and each element belongs to the cluster with the shortest centroid distance.**Distribution Models:**grouped based on statistical distribution of elements.**Density Models:**connected dense regions in metric space.**Subspace Models:**groups have relevant attributes in addition to elements.**Group Models:**instead of forming element groups, provides grouping information.**Graph-based Models:**two nodes in a graph are in the same cluster if the share an edge.

Clustering methods that work when you have samples and features:

**k-means:**Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.**Mean Shift:**Locate the maxima of a density function given discrete data sampled from that function. Can work better than k-means but not scalable to high number of samples.**DBSCAN:**Density-Based Spatial Clustering of Applications with Noise. Points are classified as core points, reachable points and outliers. Can detect irregularly shaped clusters and outliers based on density.**OPTICS:**Ordering Points to Identify the Clustering Structure. DBSCAN variant that handles clusters with different densities much better.

TODO:

- Locally Linear Embedding: Neighbor-based
- Isomap: Neighbor-based

Clustering methods that work with affinity matrix of two sample sets:

**Affinity Propagation**: Based on message passing between samples.**Spectral Clustering**: Finds normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.**Ward**: Hierarchical clustering based on the Ward algorithm.