Data Quality

Improving data quality yields social, scientific, economic, and operational benefits. High-quality data can lead to better public services, advances in scientific research, and more business opportunities. Operational benefits of improved data quality include greater usability and reusability, better organizational efficiency, increased collaboration with and within organizations, and higher organizational value from increased trust.

Data quality comprises many factors that can be addressed individually or in combination. These include data accuracy, metadata, machine-readability, timeliness, and interoperability, among others. Federal agencies, private-sector companies, and researchers have developed solutions that can improve each of these factors throughout the data lifecycle. Some best practices:

Develop user feedback systems to identify quality problems for agencies. User feedback is critical to improving data quality. Data users can identify quality problems and can often help improve quality by providing data of their own. Community management and feedback can help eliminate data quality problems in the same way that they help address bugs in open-source software. In addition to providing channels for feedback, data stewards could proactively invite the developer community to evaluate data against basic quality requirements.

Government agencies could follow the example of the DATA Act Broker, which helps maintain quality control for federal spending data collected under the Digital Accountability and Transparency Act (DATA Act). Government agencies could also adapt the model that the U.S. Department of Health and Human Services has developed in its Demand-Driven Open Data (DDOD) project. The HHS IDEA Lab created DDOD to give stakeholders from industry, academia, nonprofits, and other government organizations a feedback pathway to share concerns about HHS data. This systematic approach to gathering, tracking, and acting on user feedback can be a scalable model for data quality improvement. One benefit is that agencies can direct resources toward improving the quality of the most widely used and valuable datasets.

Use challenges and competitions to improve data quality. Data quality problems often have technical solutions. Algorithms can be developed to find and resolve instances where data is ambiguous, for example when an individual or organization is not identified consistently. In one recent project, the U.S. Patent and Trademark Office put out a technical challenge for “disambiguation” of patent data to improve its data stores. The result has been PatentsView.org, a data visualization and analysis platform that allows users to interact with 40 years of patent data that has been improved to be more informative and of higher quality.
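To make the idea of disambiguation concrete, here is a minimal sketch of one simple approach: normalizing organization names and grouping the variants that collapse to the same key. It is not the method used by the Patent and Trademark Office or PatentsView.org. The suffix list, the function names, and the sample records are all hypothetical, and real disambiguation systems typically draw on much richer signals than the name string alone.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "corp", "corporation",
            "co", "company", "llc", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Group raw name strings that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


# Made-up sample records, for illustration only.
records = [
    "Acme Widgets, Inc.",
    "ACME WIDGETS INC",
    "Acme Widgets Corporation",
    "Apex Tools LLC",
]
groups = group_variants(records)
for key, variants in groups.items():
    print(key, "<-", variants)
```

Exact matching on a normalized key is deliberately conservative: it merges only spellings that differ in case, punctuation, or legal suffix, and it leaves misspellings and abbreviations for a human reviewer or a fuzzier matching step.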

Use crowdsourcing to improve data quality. Many citizen science projects now use armies of volunteers to gather or categorize data on the environment, astronomical phenomena, medical images, and more. The White House Office of Science and Technology Policy has recognized crowdsourcing as a legitimate and valuable way to help build scientific data resources. In a similar way, agencies can use crowdsourcing to help improve data quality by inviting citizens to review their data and contribute corrections. This approach can be helpful, for example, in improving geospatial data so that it reflects real-world locations.
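One way an agency might decide when to accept a crowdsourced correction is to require agreement among several independent contributors. The sketch below illustrates that idea for a geospatial coordinate. The function name, the vote threshold, and the rounding precision are assumptions chosen for illustration, not a description of any agency's actual process.

```python
from collections import Counter


def consensus_correction(proposals, min_votes=3, precision=4):
    """Return the (lat, lon) that at least min_votes proposals agree on.

    Proposals are rounded to `precision` decimal places before counting,
    so near-identical submissions count as the same vote. Returns None
    when no value reaches the threshold.
    """
    if not proposals:
        return None
    rounded = [(round(lat, precision), round(lon, precision))
               for lat, lon in proposals]
    value, votes = Counter(rounded).most_common(1)[0]
    return value if votes >= min_votes else None


# Made-up submissions from four hypothetical contributors.
proposals = [
    (38.8977, -77.0365),
    (38.89771, -77.03652),
    (38.89768, -77.03649),
    (40.0, -75.0),
]
accepted = consensus_correction(proposals)
print(accepted)
```

Rounding is a crude way to treat nearby points as equal: two submissions that sit on opposite sides of a rounding boundary will be counted separately, so a production system would more likely cluster proposals by distance.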

Featured Examples

DATA Act Broker

HHS Demand-Driven Open Data

CODE, April 14, 2018