🗄️ Datasets

Updated at 2023-12-30 16:13

Dataset terminology:

  • Train Dataset: model learns by seeing only this dataset
  • Track Dataset: used to track progress during training
  • Test Dataset: used at the end of the training to evaluate the performance
  • Frequently there is no separate track dataset; the train dataset is used for tracking via one of the approaches described below.

Small Dataset:
- 60% train
- 20% track
- 20% test

Big Dataset (over million samples):
- 98% train
- 1% track
- 1% test
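
A minimal sketch of producing such a split; the `split_dataset` helper and the fixed seed are made up for illustration:

```python
import numpy as np

def split_dataset(samples, train=0.6, track=0.2, seed=42):
    """Shuffle and split samples into train/track/test sets.

    Whatever remains after `train` and `track` becomes the test set.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(samples))
    n_train = int(train * len(samples))
    n_track = int(track * len(samples))
    train_set = [samples[i] for i in indices[:n_train]]
    track_set = [samples[i] for i in indices[n_train:n_train + n_track]]
    test_set = [samples[i] for i in indices[n_train + n_track:]]
    return train_set, track_set, test_set

# Small dataset: 60/20/20.
train, track, test = split_dataset(list(range(1_000)))

# Big dataset: 98/1/1.
# train, track, test = split_dataset(big_samples, train=0.98, track=0.01)
```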

The common approaches to evaluate machine learning model performance:

  1. Hold-out Validation: Split your dataset into train and track sets once.
  2. Cross-validation: Split the dataset into folds and rotate which fold serves as the track set while training on the rest, then average the results (see the sketch after this list).
  3. Progressive Validation: Run the candidate model on live data next to the production model and compare their predictions side-by-side.
  4. Visualization: A high average prediction accuracy can still hide a significant loss in a single use-case of a big machine learning model. Such problems are hard to spot from aggregate metrics, so human-readable visualizations are a better way to start debugging. The best approach is to visualize slices by use-case like "country" or "time of day" (e.g. how GridViz at Google does it).
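
A k-fold cross-validation run could look like the sketch below; it assumes scikit-learn, and the iris data and logistic regression model are arbitrary stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: each fold acts as the track set once while the rest train.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy per fold: {scores}")
print(f"mean accuracy: {scores.mean():.3f}")
```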

You should prefer cross-validation to hold-out validation; every sample is eventually used for both training and tracking, so the performance estimate is less noisy.

Make sure that all datasets have the same distribution. You will get a bad model if you train only on cat images but run the tests on dog images.
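
One common safeguard is a stratified split that preserves the label distribution in every subset; a sketch assuming scikit-learn, with the imbalanced cat/dog data invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 80 "cat" and 20 "dog" samples, imbalanced on purpose.
features = np.random.rand(100, 8)
labels = np.array(["cat"] * 80 + ["dog"] * 20)

# `stratify=labels` keeps the 80/20 cat-dog ratio in both subsets,
# so the model is never trained on cats and tested on dogs.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
print(np.unique(train_y, return_counts=True))  # ('cat', 'dog'), (64, 16)
print(np.unique(test_y, return_counts=True))   # ('cat', 'dog'), (16, 4)
```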

Dataset segmentation can be used to improve fairness. This is especially important if the predictions are based on personal information. Your best bet is to remove anything morally or legally questionable from the datasets, like names, which can indirectly reveal a person's ethnicity or cultural background. But it is even better to keep this information and use it only for segmentation, as then you can track and prove model fairness.

Even home address can be discriminatory.
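
A sketch of such segment-level tracking, with the segments and prediction records invented for illustration; computing the same metric per segment makes gaps between groups visible:

```python
from collections import defaultdict

# Invented example records: (segment, true_label, predicted_label).
predictions = [
    ("region_a", 1, 1), ("region_a", 0, 0), ("region_a", 1, 0),
    ("region_b", 1, 1), ("region_b", 0, 1), ("region_b", 0, 0),
]

# Accuracy per segment; a large gap between segments hints at unfairness.
hits = defaultdict(int)
totals = defaultdict(int)
for segment, truth, predicted in predictions:
    totals[segment] += 1
    hits[segment] += int(truth == predicted)

for segment in totals:
    print(f"{segment}: accuracy {hits[segment] / totals[segment]:.2f}")
```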