How to Use Synthetic Data to Remove Bias in Your Training Data
Unconscious Bias creeps into Machine Learning in several insidious ways, and Synthetic Data can help you address it. When you create Synthetic Data from your Real Data, you can make the training set as free of bias as possible. Bias has several sources. Here are some of them:
- Underrepresentation – Let’s say you are creating a training set for arriving at salary ranges for a job. Your training set may contain a disproportionate number of males compared to females, or vice versa, depending on the profession (nursing data, for example, may be mostly female). A Machine Learning model trained on this data will be skewed as well.
- Sample Sizes – You may have a training data set that contains very few Millennials compared to Baby Boomers. The ML model you create is then based mostly on Baby Boomers and may lead to bad conclusions about Millennials.
- Skewed Sample – You may have a training data set that is skewed in some features. Let’s say your training data set has a disproportionate number of renters compared to those who own houses. The features you are analyzing may be skewed as a result – say ownership of lawn mowers. Homeowners are more likely to have a lawn mower compared to renters.
- Tainted Examples – This comes up often where approval decisions are made: granting loans, deciding whether to provide insurance coverage, setting insurance premiums, etc. If your past decisions were biased, consciously or unconsciously, your training set may also be biased. Training your ML model on this data creates a biased ML model.
- Limited Features – Your training data set may not have enough distinguishing features to compensate for bias. Let’s say that in the past your loan decisions considered only a person’s college degrees and not the certifications they earned. The certifications column may have been added to your software in a recent version, so in older data that field may be NULL or empty. ML models built using this data are likely to be biased as well.
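Before generating any Synthetic Data, it helps to measure how each group is represented in the training set. The following is a minimal Python sketch of such a check, not a definitive method. The function names, the toy rows, and the 30% threshold are all assumptions made for illustration.

```python
from collections import Counter


def group_shares(rows, column):
    """Return each category's share of the rows for the given column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(rows, column, min_share=0.3):
    """List categories whose share is below min_share (an assumed threshold)."""
    shares = group_shares(rows, column)
    return sorted(value for value, share in shares.items() if share < min_share)


# Hypothetical toy data: 8 Baby Boomers and 2 Millennials.
rows = [{"generation": "boomer"} for _ in range(8)]
rows += [{"generation": "millennial"} for _ in range(2)]

shares = group_shares(rows, "generation")
flagged = underrepresented(rows, "generation")
print(shares)
print(flagged)
```

A check like this only flags imbalance in a single column. Deciding what share counts as "representative" is a judgment call that depends on the population the model is meant to serve.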
Synthetic data sets have the potential to reduce this bias by taking existing real production data and extrapolating from it to fill in the gaps. For example, they can use real production data to create additional records for underrepresented males or females in your data set, and drop the data that caused the bias in the first place. Synthetic Data can address the Sample Size issue by using the data that exists about Millennials to create additional Millennial records.
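As a rough illustration of creating additional records for an underrepresented group, the sketch below tops up each smaller group by interpolating between pairs of its real rows. This is a simplified, SMOTE-style stand-in chosen for brevity, not how any particular Synthetic Data tool works (such tools typically fit a generative model to the data). The function names, column names, and toy values are hypothetical.

```python
import random


def synthesize_rows(real_rows, numeric_cols, n_new, seed=0):
    """Create n_new rows by interpolating between random pairs of real rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(real_rows, 2)
        t = rng.random()  # position between row a and row b
        new_row = dict(a)  # non-numeric fields are copied from row a
        for col in numeric_cols:
            new_row[col] = a[col] + t * (b[col] - a[col])
        synthetic.append(new_row)
    return synthetic


def rebalance(rows, group_col, numeric_cols, seed=0):
    """Top up every group to the size of the largest group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(row)
    target = max(len(members) for members in groups.values())
    balanced = list(rows)
    for members in groups.values():
        if len(members) >= 2:  # interpolation needs at least two real rows
            missing = target - len(members)
            balanced += synthesize_rows(members, numeric_cols, missing, seed)
    return balanced


# Hypothetical toy data: 6 Baby Boomers and 2 Millennials.
rows = [{"generation": "boomer", "income": 60000.0 + 1000 * i} for i in range(6)]
rows += [
    {"generation": "millennial", "income": 40000.0},
    {"generation": "millennial", "income": 50000.0},
]

balanced = rebalance(rows, "generation", ["income"])
print(len(balanced))
```

Interpolated rows always fall between existing real rows, so this approach cannot invent Millennials who look unlike the ones already in the data. If the few real rows are themselves unrepresentative, the synthetic ones will be too.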
Synthetic data can address the missing columns issue by using the values that are present in those columns to backfill the rows that have NULL or empty values. Synthetic data tools can help you encode the characteristics you desire in your data set and plug the data holes that cause bias in the first place.
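One simple way to picture backfilling is to fill each missing value with a value observed in a similar row. The sketch below uses this idea (sometimes called hot-deck imputation) as an assumed stand-in for what a Synthetic Data tool would do. The `backfill` function, the `degree` and `certifications` columns, and the toy rows are hypothetical.

```python
import random


def backfill(rows, target_col, group_col, seed=0):
    """Fill missing target_col values with values observed in the same group."""
    rng = random.Random(seed)
    observed = {}
    for row in rows:
        if row.get(target_col) is not None:
            observed.setdefault(row[group_col], []).append(row[target_col])
    all_observed = [v for values in observed.values() for v in values]

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the original rows untouched
        if new_row.get(target_col) is None:
            # Prefer values from the same group; fall back to all observed values.
            pool = observed.get(row[group_col]) or all_observed
            if pool:
                new_row[target_col] = rng.choice(pool)
        filled.append(new_row)
    return filled


# Hypothetical toy data: older records have no certifications value.
rows = [
    {"degree": "BS", "certifications": 2},
    {"degree": "BS", "certifications": None},
    {"degree": "MS", "certifications": 1},
    {"degree": "MS", "certifications": None},
    {"degree": "PhD", "certifications": None},
]

filled = backfill(rows, "certifications", "degree")
print(filled)
```

Backfilled values inherit the distribution of the observed ones, so this step only helps when the rows that do have values are reasonably representative of the rows that do not.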
All data is man-made. Somebody, at some point, decided what data to collect, how to organize it, how to present it, and how to infer meaning from it—and it embeds all kinds of false rigor into the process. ~ Clayton M. Christensen, Competing Against Luck