Data Anonymization
Updated at 2017-09-10 10:35
Even a trusted partner should not be given the raw data. You can still get into legal trouble, and the partner might accidentally leak it further.
Sensitive data is difficult to share while keeping it meaningful. How can you share data outside your organization, for monetization or as open data?
Data anonymization techniques:
- k-anonymity, usually achieved with two operations:
  - suppression: remove a column, which destroys part of your data
  - generalization: group values into buckets, such as ages 20-30
- l-diversity
- t-closeness
- differential privacy
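As a rough illustration of the first two items, the sketch below suppresses a name column, generalizes ages into ten-year buckets, and then measures k (the size of the smallest group sharing the same quasi-identifiers) and l (the smallest number of distinct sensitive values in any group). The records, the column names, and the helper functions `generalize_age`, `k_anonymity` and `l_diversity` are all hypothetical and chosen only for this example. Note that the buckets here are written as 20-29 rather than 20-30 so that they do not overlap.

```python
from collections import Counter, defaultdict

def generalize_age(age, width=10):
    """Generalization: map an exact age to a bucket label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any group."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical toy records, not real data.
records = [
    {"name": "A", "age": 23, "zip": "00100", "diagnosis": "flu"},
    {"name": "B", "age": 27, "zip": "00100", "diagnosis": "asthma"},
    {"name": "C", "age": 34, "zip": "00200", "diagnosis": "flu"},
    {"name": "D", "age": 38, "zip": "00200", "diagnosis": "diabetes"},
]

# Suppression: drop the "name" column. Generalization: bucket the "age" column.
anonymized = [
    {"age": generalize_age(r["age"]), "zip": r["zip"], "diagnosis": r["diagnosis"]}
    for r in records
]

k_raw = k_anonymity(records, ["age", "zip"])      # every raw row is unique
k_anon = k_anonymity(anonymized, ["age", "zip"])  # rows now share buckets
l_anon = l_diversity(anonymized, ["age", "zip"], "diagnosis")
print(k_raw, k_anon, l_anon)
```

In this toy data the raw records are all unique on age and zip code, while the generalized records fall into groups of two. Which columns count as quasi-identifiers is a judgment call that depends on what an attacker could already know.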
The problem is that there is no measurable criterion for the anonymity of a data set. How do you guarantee that the data is still meaningful after anonymization? You can only apply information theory.
It is trivial to identify outliers in an anonymized data set. Removing outliers removes information, but it frequently must be done.
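One simple way to see why outliers stand out: after generalization, an unusual record often ends up alone in its group. The sketch below, using hypothetical rows and a hypothetical helper `split_by_group_size`, separates the rows whose group has fewer than k members from the rest. Dropping them is one possible treatment, not the only one.

```python
from collections import Counter

def split_by_group_size(rows, quasi_identifiers, k):
    """Split rows into those in groups of at least k members and those in smaller groups."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    kept = [row for row in rows if sizes[key(row)] >= k]
    dropped = [row for row in rows if sizes[key(row)] < k]
    return kept, dropped

# Hypothetical, already generalized rows.
rows = [
    {"age": "20-29", "zip": "00100"},
    {"age": "20-29", "zip": "00100"},
    {"age": "30-39", "zip": "00200"},
    {"age": "30-39", "zip": "00200"},
    {"age": "90-99", "zip": "00100"},  # the only person in this bucket
]

kept, dropped = split_by_group_size(rows, ["age", "zip"], k=2)
print(len(kept), dropped)
```

The single row in the 90-99 bucket is the one that gets flagged, which is exactly the record an attacker would find easiest to re-identify.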
You can also use data synthesis:
- Randomize your existing data.
- Add noise to your data.
- Learn probability distributions of your data and generate new data.
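The three approaches above can be sketched on a single numeric column using only the Python standard library. The values, the noise scale, and the choice of a normal distribution are assumptions made for this example. Real data rarely fits a single Gaussian, and none of these steps gives a formal privacy guarantee on its own.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical ages, not real data.
ages = [23, 27, 34, 38, 41, 45, 52, 58]

# 1. Randomize: shuffle the column so values no longer line up with their rows.
shuffled = ages[:]
rng.shuffle(shuffled)

# 2. Add noise: perturb each value with zero-mean Gaussian noise
#    (standard deviation of 2 years is an arbitrary choice).
noisy = [age + rng.gauss(0, 2.0) for age in ages]

# 3. Learn a distribution and generate new data: fit a normal distribution
#    by its sample mean and standard deviation, then sample fresh values.
mu = statistics.mean(ages)
sigma = statistics.stdev(ages)
synthetic = [rng.gauss(mu, sigma) for _ in range(len(ages))]

print(shuffled)
print([round(x, 1) for x in noisy])
print([round(x, 1) for x in synthetic])
```

Shuffling preserves the exact column distribution but breaks correlations with other columns, while the fitted model preserves only what the chosen distribution can express. That trade-off is the same tension between anonymity and meaningfulness described earlier.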
Sources
- Yoan Miche, Nokia Bell Labs, Shift 2017