Data, Data, everywhere! Not a Drop to Use!

Published on
March 10, 2022

In a report dated Sep 8, 2022, the Statista Research Department estimated that the total amount of data created, captured, copied, and consumed globally in 2020 was 64.2 zettabytes, and projected global data creation to grow to more than 180 zettabytes by 2025.

However, the reality is that the bulk of this data cannot be shared even internally within organizations without exposing them to several expensive risks, including:

  • Personally Identifiable Information (PII) Risks – PII such as names, addresses, and Social Security numbers, if leaked, is quickly sold on the Dark Web; in the hands of unscrupulous actors it can lead to credit card fraud or identity theft. For businesses handling the data of European citizens, severe violations of GDPR can draw fines of up to 20 million euros, or up to 4% of total global turnover for the preceding fiscal year, whichever is higher.
  • Personal Health Information (PHI) Risks – In the US, Personal Health Information (PHI) is protected under HIPAA (the Health Insurance Portability and Accountability Act). HIPAA fines can be issued up to a maximum of $25,000 per violation category, per calendar year, with a minimum of $100 per violation. A covered entity suffering a data breach affecting residents of multiple states may be ordered to pay HIPAA violation fines to the attorneys general of each of those states.
  • CCPA Fines – The California Consumer Privacy Act (CCPA) was drafted to protect an individual's personal data and to make organizations responsible custodians of the data they hold. If an organization fails to protect this data, it can face serious penalties: civil penalties of up to $2,500 for each violation, or $7,500 for each repeated intentional violation after due notice and a 30-day cure period.
  • Data Use Agreements – Financial, marketing, personal, health, or clinical trial data can, in theory, be used for purposes beyond those for which it was collected, depending on the jurisdiction. Strict data use agreements, however, typically restrict data to its original purpose. Publicly funded research could, in principle, release its data as Open Data, yet data use agreements usually permit only very narrow usage. Moreover, Open Data today contains no PII or PHI and consists of aggregated tabular data rather than raw records.
  • Internal Fraud and Abuse Fears – In theory, production data, whether financial, marketing, or healthcare related, can be used internally within an organization. In practice, the risk of leakage or misuse prevents many organizations from sharing this data even with their own marketers, financial analysts, software developers, or testers. The larger the organization, the larger the risk, so data is shared sparingly, with extensive safeguards, among a small number of employees.

In short, data is everywhere, but not much of it can be used. At least not without converting it into Synthetic Data!

Synthetic data is artificial data generated for purposes such as preserving privacy, testing systems, or creating training data for machine learning algorithms. It can be generated from real data using machine learning techniques such as Generative Adversarial Networks (GANs).

Synthetic data generation has been applied to images and videos for machine learning. Structured data such as relational databases, as well as semi-structured data such as JSON and XML, can also serve as the basis for generating synthetic data.
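To make the idea concrete, here is a deliberately simplified sketch of synthetic tabular data generation: fit independent per-column distributions to a few real rows and sample new rows from them. This is a naive baseline, not a GAN; real generators (including GAN-based ones) also capture correlations between columns. The sample data and function names are illustrative assumptions, not from any particular tool.

```python
import random
import statistics

# Toy "real" table; in practice this would be a production dataset.
real_rows = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": 61000},
    {"age": 29, "income": 48000},
    {"age": 55, "income": 83000},
]

def fit_marginals(rows):
    """Estimate a (mean, stdev) pair per column, ignoring correlations."""
    cols = rows[0].keys()
    return {
        c: (statistics.mean(r[c] for r in rows),
            statistics.stdev(r[c] for r in rows))
        for c in cols
    }

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted per-column Gaussians."""
    rng = random.Random(seed)
    return [
        {c: round(rng.gauss(mu, sigma)) for c, (mu, sigma) in params.items()}
        for _ in range(n)
    ]

params = fit_marginals(real_rows)
synthetic = sample_synthetic(params, 3)
print(synthetic)
```

Because the synthetic rows are drawn from fitted distributions rather than copied, no single real record appears in the output, though a baseline this simple offers no formal privacy guarantee on its own.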

Sensitive fields such as names, addresses, and Social Security numbers, along with PHI such as disease condition descriptions, tests and procedures, treatments, and drugs, can all be rendered safe by first identifying the PII associated with them and then pseudonymizing, masking, or redacting it as appropriate.
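A minimal sketch of two of those operations, pseudonymization and masking, might look like the following. The secret key, field names, and SSN pattern are illustrative assumptions; a production pipeline would use a managed secret and a full PII-detection step.

```python
import hashlib
import re

# Hypothetical secret for keyed pseudonymization; in practice this would
# come from a secrets manager, never from source code.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible token."""
    digest = hashlib.sha256(SECRET_KEY + value.encode("utf-8")).hexdigest()
    return "PSEUDO_" + digest[:12]

def mask_ssn(text: str) -> str:
    """Redact anything matching the US Social Security number pattern."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "XXX-XX-XXXX", text)

record = {"name": "Jane Doe", "ssn": "123-45-6789",
          "note": "SSN 123-45-6789 on file"}
safe = {
    "name": pseudonymize(record["name"]),
    "ssn": mask_ssn(record["ssn"]),
    "note": mask_ssn(record["note"]),
}
print(safe)
```

Because the pseudonym is derived deterministically, the same name always maps to the same token, which preserves joins across tables while hiding the underlying value.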

Of course, a natural question is whether someone could re-identify the individuals involved from the synthetic data. That is where privacy assurance algorithms come in: they estimate the probability of re-identification for a given synthetic dataset. The closer this probability is to 0, the better (1 represents near certainty that re-identification is possible). In practice, a small but nonzero value is usually acceptable, with the threshold depending on the intended use.
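One simple proxy for that probability, assumed here purely for illustration, is the share of records that are unique on their quasi-identifiers (fields an attacker might plausibly know, such as an age band or ZIP code prefix). Commercial privacy assurance tools use far more sophisticated measures; the field names and sample rows below are hypothetical.

```python
from collections import Counter

# Toy synthetic dataset with generalized quasi-identifiers.
synthetic_rows = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "A"},
    {"age_band": "50-59", "zip3": "606", "diagnosis": "C"},
]

def reidentification_risk(rows, quasi_identifiers):
    """Fraction of rows that are unique on the quasi-identifier columns.

    0 means no row stands out on these fields; 1 means every row is
    unique and therefore maximally exposed under this crude measure.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(rows)

risk = reidentification_risk(synthetic_rows, ["age_band", "zip3"])
print(risk)  # 0.5: two of the four rows are unique on (age_band, zip3)
```

If the computed risk exceeds the threshold for the intended use, the usual remedy is to generalize the quasi-identifiers further (wider age bands, shorter ZIP prefixes) and recompute.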

Data may be everywhere, yet it cannot be used freely because of various financial, reputational, and other risks. Generating synthetic data from real data offers a chance to liberate all that data and make it available for use, without violating the privacy of individuals or organizations.

Information wants to be free. ― Stewart Brand

Start your project with brewdata

Try out our tools for free by signing up!
