ruk·si

Overfitting

Updated at 2017-06-16 16:08

Never train a model on the same data you will later test it against. The model will learn to "predict" the right labels for that data but won't generalize well enough for any real-world use. This is called overfitting.
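A minimal sketch of the symptom, assuming scikit-learn and an unconstrained decision tree as a stand-in model; it scores near-perfectly on its own training data but noticeably worse on data it hasn't seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data split into seen (training) and unseen (test) parts.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize its training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on training data:", model.score(X_train, y_train))  # ~1.0
print("accuracy on unseen data:  ", model.score(X_test, y_test))    # clearly lower
```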

You should have at least two sets of data: a "training set" and a "test set". Sometimes using three sets is advisable; the third, a "holdout set", is used as the final step to validate the model and is never touched during the actual training and evaluation.
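A rough sketch of such a split, assuming scikit-learn and synthetic toy data in place of a real labeled dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data; in practice X and y come from your labeled dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split off 20% as the holdout set, used only for the final validation.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remaining data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```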

If you have only a small amount of labeled data, you can use k-fold cross-validation. Divide the dataset into k equal-sized parts, for example k = 10; for each part, train the model on the other nine parts and test it on the remaining part, then average the results over all ten rounds.
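A minimal sketch of 10-fold cross-validation, assuming scikit-learn and a logistic regression classifier as a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small toy dataset; in practice X and y are your labeled data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each round trains on 9 folds and tests on the remaining fold;
# the average score estimates how well the model generalizes.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10)
print("mean accuracy:", scores.mean())
```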
