Nearest Neighbor Methods

Updated at 2017-06-16 01:23

Nearest neighbor methods (NN) find a predefined number of training samples closest in distance to a new query point and predict its label from them. This is a form of instance-based (non-parametric) learning, not clustering: the training labels are used directly at prediction time.
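
The idea can be sketched in a few lines of plain Python: rank all training points by distance to the query and take a majority vote over the k closest labels. This is a minimal brute-force illustration, not a production implementation; the function name `knn_predict` is made up for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train_X, train_y, query, k=3):
    # Brute force: sort all training indices by distance to the query,
    # then majority-vote over the labels of the k closest points.
    neighbors = sorted(range(len(train_X)),
                       key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]
```

For example, with training points `[(0, 0), (0, 1), (5, 5), (6, 5)]` labeled `['a', 'a', 'b', 'b']`, a query at `(0.5, 0.5)` is predicted `'a'` because two of its three nearest neighbors carry that label.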

Neighbors-based methods are known as non-generalizing machine learning methods since they simply "remember" all of the training data.

Nearest neighbors have been successful in a large number of both classification and regression problems, and many other learning routines use a nearest neighbor search at their core.

Most common NN methods:

  • Brute Force: computes all distances between the query and the dataset. Query time grows long with a large number of training samples.
  • K-D Tree: uses cached aggregate information to reduce the number of distance calculations, e.g. if we know that A is distant from B and B is close to C, then we know that A is distant from C. Query time grows long with a large number of distance dimensions.
  • Ball Tree: uses nested hyper-spheres to partition the data instead of splitting along Cartesian axes, which makes it efficient in high dimensions. It is slower to build and provides less benefit with few dimensions.
  • Nearest Centroid: represents each class by the centroid of its members.
  • Nearest Shrunken Centroid: the value of each feature for each centroid is divided by the within-class variance of that feature.
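
The centroid idea above is simple enough to sketch directly: average the members of each class, then assign a query to the class with the closest centroid. This is a minimal plain-Python illustration (without the shrinkage step); the names `fit_centroids` and `predict_centroid` are made up for this example.

```python
from math import dist

def fit_centroids(X, y):
    # One centroid per class: the mean of that class's members.
    sums, counts = {}, {}
    for point, label in zip(X, y):
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(point))
        for j, v in enumerate(point):
            acc[j] += v
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict_centroid(centroids, query):
    # The label of the closest class centroid wins.
    return min(centroids, key=lambda label: dist(centroids[label], query))
```

Note that unlike k-NN, only one point per class is kept after fitting, so queries cost O(number of classes) instead of O(number of training samples).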

Sparse data with a small intrinsic dimensionality is fast to query with tree algorithms. Intrinsic dimensionality means how many variables are actually needed to represent each data point, which can be far fewer than the number of stored features.

Tree algorithm query time will increase with the number of neighbors requested per query point. This is to be expected, as more of the parameter space must be searched. Brute force will of course stay unaffected, as all distances need to be calculated anyway.

Brute force works better when the model is queried rarely, a large number of neighbors is requested, and the training set is small.

Nearest neighbor algorithms start to perform badly when faced with more than about 50 dimensions. This is where approximate nearest neighbor methods, with their "good guesses", come into play.

  • Locality Sensitive Hashing Forest: hashes each data point with multiple hash functions to form a digest. Digests of nearby points tend to collide, and the accuracy can be controlled via algorithm parameters. Query times stay fast even with training sets of 100 000+ samples, but accuracy slowly decreases.
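
The hashing-and-collision idea can be sketched with one common LSH family for Euclidean-like data: random hyperplanes, where the sign of the dot product with each plane's normal contributes one bit to the digest. This is a toy single-table sketch (a real LSH Forest maintains many trees of variable-length digests); all function names here are made up for the example.

```python
import random

def make_planes(dim, n_planes, seed=0):
    # Random hyperplanes through the origin; each normal vector
    # contributes one bit to every point's digest.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def digest(point, planes):
    # One bit per plane: which side of the hyperplane the point falls on.
    return ''.join('1' if sum(p * n for p, n in zip(point, normal)) >= 0
                   else '0'
                   for normal in planes)

def build_index(points, planes):
    # Bucket points by digest; nearby points tend to share a bucket.
    buckets = {}
    for i, p in enumerate(points):
        buckets.setdefault(digest(p, planes), []).append(i)
    return buckets

def query(point, planes, buckets):
    # Approximate: only candidates from the colliding bucket are returned,
    # so distant points are never even compared against the query.
    return buckets.get(digest(point, planes), [])
```

Query time depends on the bucket size rather than the full training set size, which is why it stays fast for large datasets; the price is that a true neighbor hashed into a different bucket is simply missed, which is the accuracy loss mentioned above.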