Unlock Open Data’s Potential with Generative AI!
Here’s a brief history of Open Data, as described on the Open Data site Data.Gov: “In December 2007, in Sebastopol, California, a group of thought leaders gathered to discuss and define the concept of open public data. Those who met in Sebastopol understood the Internet’s potential and the value of making data, particularly government data, understood and available as a public resource, just as our natural resources are shared for the common good.”
The Monitoring, Evaluation, Research, and Learning (MERL) Center is a community creating resources about the intersection of MERL and open source, data science, and human-centered design, for people of any technology capacity. The MERL Center has published a discussion of the pros and cons of Open Data. The Berkman Klein Center for Internet and Society at Harvard University has published an interesting paper exploring privacy issues in Open Data.
Three themes come up again and again when looking at what prevents Open Data from realizing its full potential: Privacy, Consent, and Identity Theft!
Data.Gov already contains about a quarter million datasets. Most of the current datasets are aggregated data, not raw data! Open Data’s potential could be realized even more fully if the problems of Personally Identifiable Information (PII) and Protected Health Information (PHI) could be addressed effectively. Privacy regimes like GDPR and CCPA, and the penalties associated with leaks of PII and PHI, prevent a lot of organizations from putting the data they have on Open Data sites. Around the globe, publicly funded research spends billions of dollars on diseases and possible cures for them. Only a handful of researchers get access to this data because of PII and PHI concerns. The promise of Open Data, and the possibility of combining all this data in a “study of studies”, dies on the vine!
Generative AI has the potential to replace real identity data about people, such as names, addresses, and Social Security Numbers (SSNs), with synthetic data. Differential Privacy approaches add carefully calibrated random noise to results, so that what is published reveals very little about any single individual and the privacy loss is mathematically bounded. Various Internet databases and APIs, like the US Postal Service’s address verification system, can be used to check that generated addresses do not match real ones! Even in records that contain PHI, the PII can be replaced with synthetic names, addresses, and SSNs, making those records safer to share publicly!
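To make the replacement idea concrete, here is a minimal Python sketch. It is not a generative model: small hypothetical name pools and a random number generator stand in for one, and the field names ("name", "ssn"), function names, and sample record are all assumptions made for illustration. Synthetic SSNs are placed in the 000 area, which is never issued, so they cannot collide with a real person's number.

```python
import random

# Hypothetical placeholder pools; a generative model could supply richer values.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Casey"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak", "Haddad"]


def synthetic_ssn(rng):
    """Return an SSN-shaped string in the 000 area, which is never issued."""
    return f"000-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def synthesize_records(records, pii_fields=("name", "ssn"), seed=0):
    """Return copies of the records with PII fields replaced by synthetic values.

    The same real value always maps to the same synthetic value within one
    call, so rows about the same person stay linkable. Two different people
    may still receive the same synthetic name because the pools are tiny.
    """
    rng = random.Random(seed)
    generators = {
        "name": lambda: f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "ssn": lambda: synthetic_ssn(rng),
    }
    mapping = {}  # (field, real value) -> synthetic value
    cleaned = []
    for record in records:
        clean = dict(record)  # copy, so the caller's data is not modified
        for field in pii_fields:
            if field in clean:
                key = (field, clean[field])
                if key not in mapping:
                    # Raises KeyError for a PII field with no generator,
                    # rather than silently leaving real data in place.
                    mapping[key] = generators[field]()
                clean[field] = mapping[key]
        cleaned.append(clean)
    return cleaned


if __name__ == "__main__":
    sample = [{"name": "Jane Roe", "ssn": "123-45-6789", "diagnosis": "asthma"}]
    print(synthesize_records(sample))
```

Non-identifying fields such as the diagnosis pass through unchanged, which is what keeps the data useful for research. Swapping out direct identifiers alone does not rule out re-identification from the remaining fields, which is where Differential Privacy comes in.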
Open Data’s potential as a Public Good can be realized once PII and PHI issues are addressed effectively. Generative AI finally makes this possible by turning PII and PHI into synthetic data. Differential Privacy algorithms can provide a measurable privacy guarantee, helping turn data that contains PII and PHI into Open Data that can be shared safely and effectively.
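As a rough illustration of what such a guarantee looks like, the sketch below applies the Laplace mechanism, one standard Differential Privacy technique, to a simple counting query. The function names, the record layout, and the choice of epsilon are assumptions made for this example, not part of any particular Open Data platform.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.SystemRandom()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical patient records with a single non-identifying field.
    patients = [{"age": age} for age in range(100)]
    noisy = private_count(patients, lambda r: r["age"] >= 50, epsilon=0.5)
    print(f"Noisy count of patients aged 50+: {noisy:.1f}")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means more accurate answers and weaker privacy. Each additional query spends more of the privacy budget, and a real release would more safely rely on a vetted Differential Privacy library than on hand-rolled floating-point noise like this.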
Data are just summaries of thousands of stories—tell a few of those stories to help make the data meaningful - Dan Heath