Data Libre! Use Generative AI to Liberate the Data you already have!
According to a survey done by DB-Engines, Relational Databases account for 71.9% of databases in the world, Document Stores for 10.4% and Key-Value Stores for 5.7%. The rest are other semi-structured databases such as Object Oriented, Graph, Time-Series, Wide Column and Spatial ones.
Generative AI is increasingly being used to create Synthetic images and Synthetic videos, and rightfully so! Not enough real-life images and videos exist for the variety and scale of Machine Learning training instances that applications like Image Recognition and Autonomous Driving demand.
However, Relational and Semi-Structured data represent a gold mine that’s right under your nose. So, what’s the problem in using them? Here are the reasons:
- Privacy Issues Leading to Possible Identity Thefts – Personally Identifiable Information (PII) and Protected Health Information (PHI) held by organizations are sensitive and need to be protected diligently to guard against Identity Theft.
- Compliance Requirements and Financial Implications of Breaches – Compliance regimes like GDPR and the California Consumer Privacy Act (CCPA) impose heavy fines for data breaches, so much so that many organizations are reluctant to share their production data even internally.
- Consumer Financial Data – Organizations, especially online or brick and mortar stores, store Consumer Financial Data like Names, Social Security Numbers, Credit Card Information, Addresses, etc. In the wrong hands, they can lead to some serious financial fraud and theft.
- Exchange of Data within an Organization – Marketing departments within organizations need their own consumer and sales data for market segmentation, product performance analysis, analysis of consumer behavior, etc. Financial departments may need the sales and consumer data for their own profitability analysis of segments, products and consumers. None of these internal uses really needs individuals’ names, addresses, credit card numbers, etc. But today there is no easy way of sharing all of this data internally without substantial custom processing and masking.
- Exchange of Data with Partners and Suppliers – These days, organizations have complex commercial ecosystems made up of suppliers, subcontractors, contract manufacturers, logistics and distribution providers. Lots of optimizations, beneficial to all parties, can be achieved with freer sharing of data. But today, because of privacy concerns, that may not be possible.
Almost all of the above data is already in Relational or Semi-Structured formats, and it adds up to Zettabytes around the globe! Generative AI has the potential to liberate all this Relational and Semi-Structured data for precisely the uses listed above. How does it do this?
- Pseudonymizing, Masking or Redacting Private Data – Many organizations today write custom scripts for each use case and each database. Generative AI has the potential to automate this process with Generative Adversarial Network (GAN) algorithms in Machine Learning.
- Enrichment – Generative AI can add synthetic data in market segments where you do not have enough data. For example, if you do not have enough Baby Boomers in your production data but have data about Gen Y and Gen Z, Generative AI can use the data that exists to create additional synthetic data that has the same characteristics as your current data.
- Adjustable Privacy Assurance Levels – Generative AI allows different levels of Privacy Assurance. Synthetic Data Utility (how close is the generated data to the real data in its characteristics?) and Privacy Assurance (can you reidentify people from their pseudonymous data?) are two ends of a continuum. If you need more Utility, you give up a certain level of Privacy Assurance, and vice versa. Increasing the level of Privacy Assurance simply requires more computing. Since this is adjustable, you can strike the right balance between Utility and Privacy Assurance. If the data is used for internal testing, you can use a higher level of Utility and a lower level of Privacy Assurance. If you are sharing the data externally, you can use a lower level of Utility and a higher level of Privacy Assurance.
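To make the ideas above a little more concrete, here is a minimal Python sketch of the kind of custom script many organizations write by hand today. To be clear, it is not a GAN and not Generative AI: it is a simple rule-based stand-in that shows pseudonymizing, masking, and a single adjustable knob for the Utility versus Privacy Assurance trade-off. The field names, the `SECRET_KEY` value, and the `bucket_width` parameter are all assumptions made up for illustration.

```python
import hashlib
import hmac

# Hypothetical secret for illustration only; a real one would live in a key vault.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a stable keyed-hash token (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:12]


def mask_card(card_number: str) -> str:
    """Mask every digit of a card number except the last four."""
    digits = [c for c in card_number if c.isdigit()]
    return "*" * max(len(digits) - 4, 0) + "".join(digits[-4:])


def generalize_age(age: int, bucket_width: int) -> str:
    """Coarsen an age into a range; wider buckets trade Utility for Privacy."""
    low = (age // bucket_width) * bucket_width
    return f"{low}-{low + bucket_width - 1}"


def protect(record: dict, bucket_width: int) -> dict:
    """Return a copy of the record that is safer to share (assumed field names)."""
    return {
        "customer": pseudonymize(record["name"]),
        "card": mask_card(record["card"]),
        "age_range": generalize_age(record["age"], bucket_width),
        "purchase_total": record["purchase_total"],
    }


record = {
    "name": "Jane Doe",
    "card": "4111 1111 1111 1111",
    "age": 37,
    "purchase_total": 129.50,
}

# Internal testing: narrow age buckets (more Utility, less Privacy Assurance).
internal = protect(record, bucket_width=5)
# External sharing: wide age buckets (less Utility, more Privacy Assurance).
external = protect(record, bucket_width=20)

print(internal)
print(external)
```

The same customer always maps to the same token, so Marketing or Finance can still segment and join on it without ever seeing a name. Every new table or use case needs another script like this one, which is exactly the manual effort that Generative AI has the potential to automate.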
Although Generative AI has visual appeal when it is used for generating images and videos, the real Data Gold in organizations is in relational and semi-structured databases made up mainly of alpha-numeric values. Generative AI has the potential to free all this data for sharing and use internally and externally! Data Libre!
"Freedom is something that dies unless it’s used." ― Hunter S. Thompson