Before government data can be published as open data, agencies need to be sure they have masked or removed personally identifiable information (PII) effectively. There is continuing concern about the mosaic effect, through which disparate datasets, particularly those released by different agencies, can be combined to identify individuals. Potential negative impacts include breaches of individual privacy and the chance that data will be used in a discriminatory way.
Data scientists are also realizing the limits of de-identification technology, which is a useful approach but not a complete solution. While technologies exist to remove identifying information from datasets, they are not fully effective. The technology is difficult to apply to the range of data now available, including geospatial, medical and genomic, body-camera, and other data. Finally, even if it is possible to de-identify data today, it’s impossible to predict whether it will become possible to re-identify individuals as technology evolves in the future.
In this context, the best approaches seem to acknowledge the risk that individuals may be linked with their data, minimize that risk as much as possible, and look at the risk in the context of the potential both for individual harm and public good. Although there are risks to opening data, policymakers can create programs and assessment tools that reduce these risks by taking the degree of risk into account. The potential damage from someone breaking the code and learning where an individual went to college, for example, is much less than the potential harm from revealing that same person’s medical history. For that reason, each agency should assess the risk for every dataset that contains PII and choose strategies for managing those datasets accordingly.
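One way to make a per-dataset risk assessment concrete is to measure how small the smallest group of records sharing the same indirect identifiers is. The sketch below is illustrative only: this measure (often called k-anonymity) is not named in the text above, and the function name, column names, and sample rows are all hypothetical. A result of 1 means at least one person is unique on those columns and could be singled out if the data were combined with another source.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on the given quasi-identifier columns (0 if no records)."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0

# Hypothetical sample rows; real datasets would have many more columns.
rows = [
    {"zip": "20***", "birth_year": 1980, "sex": "F", "college": "State U"},
    {"zip": "20***", "birth_year": 1980, "sex": "F", "college": "City College"},
    {"zip": "20***", "birth_year": 1975, "sex": "M", "college": "State U"},
]

k = smallest_group_size(rows, ["zip", "birth_year", "sex"])
```

A measure like this says how easy linkage might be, not how damaging it would be. An agency would still need to weigh the sensitivity of the remaining fields (college attended versus medical history, for instance) when choosing a strategy.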
When truly sensitive data is at stake, agencies or cross-agency programs will need to develop thorough, coordinated plans for privacy protection. For example, the Precision Medicine Initiative, which is intended to help patients personalize their health care, has developed a framework for protecting privacy without inhibiting this scientific work.
Whatever kinds of data are involved, agencies should consider a number of potentially complementary strategies for privacy protection and consider using them in combination. These include:
Restricted access. It may be necessary to consider gradations of openness under different circumstances. For example, some kinds of data could be made “open” only for sharing between federal agencies under certain conditions, with strict security measures. Or a federated model using a cloud repository with security features can limit data access only to trusted users; for example, sensitive medical data might be shared only with qualified and vetted researchers. Tiered access data-sharing programs can also allow levels of access to multiple types of users.
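As a rough illustration of tiered access, the sketch below assigns each field a minimum access tier and returns only the fields a given user tier is allowed to see. The tier names, their ordering, and the field names are assumptions made for this example, not part of any existing program. Fields that have no assigned tier are withheld by default.

```python
from enum import IntEnum

class Tier(IntEnum):
    """Hypothetical access tiers, ordered from least to most trusted."""
    PUBLIC = 0
    VETTED_RESEARCHER = 1
    AGENCY = 2

# Hypothetical policy: the minimum tier required to see each field.
FIELD_MIN_TIER = {
    "region": Tier.PUBLIC,
    "diagnosis": Tier.VETTED_RESEARCHER,
    "patient_name": Tier.AGENCY,
}

def view_for(record, tier):
    """Return only the fields of `record` that `tier` may access.
    Fields missing from the policy are never released."""
    return {
        field: value
        for field, value in record.items()
        if field in FIELD_MIN_TIER and tier >= FIELD_MIN_TIER[field]
    }

record = {
    "region": "Northeast",
    "diagnosis": "asthma",
    "patient_name": "Jane Doe",
    "internal_note": "not covered by the policy",
}

public_view = view_for(record, Tier.PUBLIC)
researcher_view = view_for(record, Tier.VETTED_RESEARCHER)
```

Defaulting to "withhold" for unlisted fields is a deliberate choice in this sketch: a new column added to the dataset stays private until someone explicitly decides who may see it.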
De-identification. It may be technically impossible to create a method of de-identification (that is, of removing PII from public datasets) that both retains the full value of the data and anonymizes it completely. However, there are many situations where a high level of de-identification is sufficient, even if it does not provide absolute, 100% privacy protection. Conversely, it may be possible to completely de-identify data if researchers can accept less-than-perfect accuracy in the result.
There are many technical approaches to de-identification that can often provide a sufficient degree of protection. For example, agencies can assign individuals unique ID numbers that allow records about them to be linked across datasets without revealing their identities. Dropping non-critical information, such as the last three digits of a person’s zip code, can make re-identification more difficult.
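The two techniques just described, linkable ID numbers and dropping zip code digits, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a complete de-identification method: the field names, the key, and the helper names are hypothetical. The ID is derived with a keyed hash (HMAC) rather than a plain hash, because a plain hash of a short identifier such as a Social Security number can be reversed by trying every possible value.

```python
import hashlib
import hmac

# Hypothetical secret held by the releasing agency and never published.
SECRET_KEY = b"replace-with-a-secret-held-by-the-agency"

def pseudonymize(identifier, key=SECRET_KEY):
    """Map an identifier to a stable pseudonymous ID. The same input and
    key always give the same ID, so records can still be linked."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def generalize_zip(zip_code, keep=2):
    """Keep the first `keep` digits of a zip code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def deidentify(record):
    """Build a release record: direct identifiers are dropped or replaced,
    and the zip code is coarsened."""
    return {
        "person_id": pseudonymize(record["ssn"]),
        "zip": generalize_zip(record["zip"]),
        "college": record["college"],
    }

# Hypothetical input record.
raw = {"name": "Jane Doe", "ssn": "078-05-1120", "zip": "20500",
       "college": "State U"}
released = deidentify(raw)
```

Because the pseudonymous ID depends on the key, two agencies can link their releases only if they share that key, which is itself a policy decision about how much linkage to allow.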
Newer, more sophisticated technical approaches. Several new approaches to data privacy are showing promise. Differential privacy is a technical approach that adds carefully calibrated random noise to the results of queries on statistical databases, minimizing the chance of identifying individual records. Synthetic data involves creating a dataset that contains no information on real individuals, but has the same characteristics as real data for purposes of analysis; the Department of Veterans Affairs has recently used synthetic data to analyze veterans’ risk of suicide. While these and other approaches require more technical sophistication than some agencies may have, they should become more widely applicable over time.
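To give a sense of how differential privacy works in practice, the sketch below applies the Laplace mechanism, one standard way to achieve it, to a simple counting query. It is a teaching example under stated assumptions, not a production implementation: the function names and sample data are hypothetical, and real deployments must also track the total privacy budget spent across queries. A smaller `epsilon` means more noise and stronger protection.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0.
    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.
    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon."""
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: 1,000 records, 100 of which are flagged.
records = [{"flagged": i % 10 == 0} for i in range(1000)]

noisy = private_count(records, lambda r: r["flagged"], epsilon=0.5)
```

Each call returns a different value near the true count. An analyst still gets a useful aggregate, while the presence or absence of any single person has only a limited effect on what is published.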
Coordinating Disclosure Review Boards, Chief Privacy Officers, and other governance structures. New data governance structures can help manage privacy concerns. Many agencies now handle privacy issues through a Chief Privacy Officer, a Disclosure Review Board, or other offices and organizational structures. To make these as effective as possible, their work needs to be integrated and aligned with the agency’s goals for data release. For example, the office of the Chief Data Officer can centralize an agency’s management of open government data and address privacy concerns. The role of Disclosure Review Boards within agencies and the way they operate can also be strengthened, including through participation from the General Counsel’s office and subject matter experts.
Building trust with the community around data use. Individual privacy should be treated in the context of public good. Many datasets that include PII also include information that can have great public benefit. In these cases, it will be essential to craft approaches to privacy protection that respect individuals’ rights while also making data available to the public, or to selected researchers, in a way that supports social and scientific goals.
It is also essential to communicate the goals of open data, and the privacy safeguards for the data, to the communities and individuals that have provided it. Individuals are understandably concerned that data about their health, education, employment, financial status, or other sensitive matters could be exposed or misused. Federal agencies and others that plan to use the data with appropriate privacy protections will need to be sure that the communities involved understand and are satisfied with their approach.