ruk·si

Overfitting

Updated at 2017-06-16 16:08

Never train a model on the same data you will later test it against. The model will learn to "predict" the right labels for that data but won't generalize well enough for any real-world use. This is called overfitting.
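A minimal sketch of the symptom, assuming scikit-learn and an unconstrained decision tree as a stand-in model; it scores near-perfectly on its own training data but noticeably worse on data it hasn't seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data split into seen (training) and unseen (test) parts.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize its training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on training data:", model.score(X_train, y_train))  # ~1.0
print("accuracy on unseen data:  ", model.score(X_test, y_test))    # clearly lower
```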

You should have at least two sets of data: a "training set" and a "test set". Sometimes using three sets is advisable; the third, a "holdout set", is used as the final step to validate the model and is never touched during the actual training and evaluation.
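A rough sketch of such a split, assuming scikit-learn and synthetic toy data in place of a real labeled dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data; in practice X and y come from your labeled dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split off 20% as the holdout set, used only for the final validation.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remaining data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```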

If you have only a small amount of labeled data, you can use k-fold cross-validation. Divide the dataset into k equal-sized parts, for example k = 10; for each part, train the model on the other nine parts and test it on the remaining part, then average the results over all ten rounds.
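A minimal sketch of 10-fold cross-validation, assuming scikit-learn and a logistic regression classifier as a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small toy dataset; in practice X and y are your labeled data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each round trains on 9 folds and tests on the remaining fold;
# the average score estimates how well the model generalizes.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10)
print("mean accuracy:", scores.mean())
```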
