Creating Synthetic Data out of Production Data

Published on March 3, 2022

Real Production Data is often less useful for testing than we assume. Whether you are testing software functionality or performing market analyses, Production Data tends to fall short for the following reasons:

  • Different Demographics – Say your customer base is predominantly Baby Boomers, but for marketing purposes you need more Millennials in your customer data mix. You have some Millennial data, just not enough. Or you may need all demographics represented equally, while your current data is skewed.
  • Sparse Columns to Fill – New software versions may have added tables and columns, or rearranged existing ones through normalization. Rows created by an earlier software version may therefore have NULL or empty values in those columns.
  • Not Enough Rows for Performance Testing – You don’t have enough realistic data to performance-test your software. This is common when you are gearing up to scale and want to test with data like your current Production Data: you simply need more of it than you have.
  • Edge Case Representation – Real Data rarely contains enough edge cases to test your software fully. You may need to fill those gaps so that your tests cover all the edge cases thoroughly.
  • Analyzing Markets with New Demographics – Say you are considering pursuing new markets more seriously. You have seen some interest from those markets in the past, so your Production Data includes some records from those segments. You may need to extrapolate from those records in a realistic way, using the Production Data you already have.

These are only some of the use cases where creating Synthetic Data out of your Real Production Data comes in handy. The process involves the following steps:

  • Build a Model of Your Current Production Data – Using machine learning, you build a Model of your current Production Data. This represents your current demographics.
  • Build a Model of the Desired Demographics, Including the Edge Cases – You use a tool to express the demographics you want and to describe the edge cases you need to see.
  • Generate Synthetic Data – You generate Synthetic Data using the two models above.
  • Assess the Fit of the Synthetic Data – You assess whether the Synthetic Data has all the characteristics you need.
  • Iterate Until the Fit is Achieved – Say you still have too many Baby Boomers in your Synthetic Data set and not enough Millennials. You drop some of the Baby Boomer data, generate more Synthetic Data, and repeat until the fit is achieved.
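
To make the loop concrete, here is a minimal Python sketch of the five steps, using only the standard library. It is an illustration under stated assumptions, not the method of any particular tool: grouping real rows by segment stands in for a learned model, the desired demographics are a plain dictionary of target shares, and every name in it (fit_model, generate, fit_error, synthesize, the "generation" and "spend" fields) is hypothetical. Edge cases are not modelled separately here; under the same assumptions they could be added as extra template rows.

```python
import random
from collections import Counter


def fit_model(rows, key):
    """Step 1: group real rows by segment (a crude stand-in for an ML model)."""
    model = {}
    for row in rows:
        model.setdefault(row[key], []).append(row)
    return model


def generate(model, target_mix, n, rng):
    """Step 3: draw n synthetic rows whose segments are sampled from target_mix."""
    segments = list(target_mix)
    weights = [target_mix[s] for s in segments]
    out = []
    for segment in rng.choices(segments, weights=weights, k=n):
        template = rng.choice(model[segment])
        row = dict(template)  # copy, so the real row is never modified
        # Perturb the numeric field so synthetic rows are not exact copies.
        row["spend"] = round(max(0.0, rng.gauss(template["spend"], 5.0)), 2)
        out.append(row)
    return out


def fit_error(rows, key, target_mix):
    """Step 4: largest gap between an observed segment share and its target."""
    counts = Counter(row[key] for row in rows)
    return max(abs(counts[s] / len(rows) - share) for s, share in target_mix.items())


def synthesize(real_rows, key, target_mix, n, tolerance=0.02, seed=0, max_rounds=20):
    """Steps 1 to 5: model, generate, assess, then trim and top up until it fits."""
    rng = random.Random(seed)
    model = fit_model(real_rows, key)
    synthetic = generate(model, target_mix, n, rng)
    for _ in range(max_rounds):
        if fit_error(synthetic, key, target_mix) <= tolerance:
            break
        # Step 5: drop surplus rows from over-represented segments...
        kept, counts = [], Counter()
        for row in synthetic:
            if counts[row[key]] < round(target_mix[row[key]] * n):
                kept.append(row)
                counts[row[key]] += 1
        # ...and generate replacements for the rows that were dropped.
        synthetic = kept + generate(model, target_mix, n - len(kept), rng)
    return synthetic


# Skewed "production" data: 8 Baby Boomer rows for every 2 Millennial rows.
real_rows = (
    [{"generation": "boomer", "spend": 120.0 + i} for i in range(8)]
    + [{"generation": "millennial", "spend": 60.0 + i} for i in range(2)]
)
# Step 2: the desired demographics, expressed as target shares.
target_mix = {"boomer": 0.5, "millennial": 0.5}

synthetic = synthesize(real_rows, "generation", target_mix, n=1000)
print(Counter(row["generation"] for row in synthetic))
```

A real pipeline would replace fit_model and generate with a proper generative model and would assess fit on more than one column, but the shape of the loop (generate, measure, trim, top up) stays the same.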

Fortunately, with the right software tools, the steps above can be automated. Real Production Data rarely has the exact characteristics you need. With the right modelling and an iterative approach, however, Synthetic Data generated from your Real Production Data can have exactly those characteristics.

No data is clean, but most is useful. ~ Dean Abbott

Start your project with brewdata

Try out our tools for free by signing up!
