Data Preparation

Updated at 2024-11-18 06:11

In machine learning, data preparation is the process of cleaning and transforming raw data before it is fed into feature extraction or a machine learning algorithm.

Research whether previous work exists on similar data, models, or use cases.

Learn from previous work, but don't let it constrain your approach.

Explore and get to know the data:

Document how the data was collected
Explain the data columns, both for others and your future self:
- What the value represents
- Highlight and explain missing values
- Highlight and explain obvious mistakes
- Based on your previous research, are there important columns missing?
- Find strange outliers, or explain why there are none
Take a close look at the distribution of the data
- Take a guess if some subpopulation should be handled differently
- Does the distribution align with your previous research?
Clean, fill, reshape, filter, and otherwise manipulate the data as needed
Inspect for any obvious correlations between columns
Use intuition to guide statistical analysis toward less obvious correlations
- Be extra careful to look for correlations inside subpopulations
Check whether dimensionality reduction is possible, e.g. with principal component analysis (PCA)
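
As a rough illustration of the column checks in the list above, the sketch below counts missing values and flags outlier candidates in a single numeric column. It assumes missing values are represented as None and uses the common 1.5 × IQR rule, which is only one of several possible outlier heuristics. The function name profile_column and the sample ages are made up for this example.

```python
import statistics


def profile_column(values):
    """Summarise one numeric column: missing count and IQR-based outlier candidates."""
    # Assumption: missing values are encoded as None.
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Quartiles of the observed values (three cut points for n=4).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # 1.5 * IQR fences: values outside them are candidates, not proven errors.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {"missing": missing, "outliers": outliers, "bounds": (low, high)}


# Hypothetical sample: two missing entries and one implausible age.
ages = [34, 29, None, 41, 38, 27, 230, 33, None, 36]
report = profile_column(ages)
print(report)
```

Flagged values still need a human explanation: an age of 230 is probably a data entry mistake, but an unusually large value in another column may be perfectly valid.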

Pearson and Spearman correlation coefficients are two widely used statistical measures of the relationship between variables. Pearson captures linear association, while Spearman works on ranks and captures any monotonic association.
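
To make the difference between the two coefficients concrete, here is a minimal sketch that computes both using only the standard library. Spearman is implemented as Pearson applied to average ranks. The helper names and the sample data are illustrative; in practice a data analysis library would normally provide these measures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: strength of linear association."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation coefficient: Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotonic but non-linear relationship (y = x cubed).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson:", pearson(x, y))    # below 1, the relationship is not linear
print("Spearman:", spearman(x, y))  # approximately 1, the ordering is preserved
```

A large gap between the two values is a hint that the relationship is monotonic but not linear, which may call for a transformation of one of the columns.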

Much of this is related to feature engineering.

Start working on automating the data preprocessing.

After all of this, you should have a good idea of what the data looks like and how preprocessing should be handled.
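
Automation can start small: one function that applies the cleaning decisions in a repeatable way. The sketch below is only an assumption about what such a step might contain. It drops exact duplicate rows and fills missing numeric values (encoded as None) with the column median. The function name preprocess, the column names, and the sample rows are hypothetical, and the right steps depend on what the exploration revealed.

```python
import statistics


def preprocess(rows, numeric_columns):
    """Return cleaned copies of the rows; the input rows are left unchanged."""
    # Step 1: drop exact duplicate rows while keeping the original order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the caller's data is not mutated

    # Step 2: fill missing numeric values with the median of the observed ones.
    for column in numeric_columns:
        observed = [r[column] for r in unique if r[column] is not None]
        median = statistics.median(observed)
        for r in unique:
            if r[column] is None:
                r[column] = median

    return unique


# Hypothetical raw data: one duplicate row and two missing values.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 34, "income": 52000},
    {"age": 41, "income": None},
]
clean = preprocess(raw, numeric_columns=["age", "income"])
print(clean)
```

Keeping each decision as a separate, commented step makes it easier to revisit a choice later, for example replacing median filling with a per-subpopulation value.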