Project Name:

Utilizing Privacy Preserving Record Linkage with Parent Agency Data and Statistical Agency to Inform Programs and Policies (NCSES/NSF)

Contractor: NORC at the University of Chicago

Lessons Learned

During this second quarter, NORC and NSF/NCSES received feedback from leadership on the data sharing agreement concerning linkage assessment and quality assurances. Comments identified the need for clearly specified mechanisms for assessing the quality of linked records. Both the parent and statistical agencies providing data for linkage would benefit from metrics and methods for ensuring records in each respective file were linked correctly. NORC and NSF/NCSES learned that identifying these methods and detailing them as quality assurances within the data sharing agreement is necessary. Making this a standard part of our data sharing agreement and linkage strategy for the demonstration will support its adoption as a standard part of future PPRL efforts.
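
The agreement does not prescribe specific metrics; the sketch below is only a minimal illustration, in Python, of the kind of quality summary both agencies could receive, assuming a clerically reviewed sample of candidate pairs is available. The function name, the review workflow, and the example values are assumptions for illustration, not project specifications.

def linkage_quality_metrics(reviewed_pairs):
    """reviewed_pairs: iterable of (predicted_match, true_match) booleans."""
    tp = sum(1 for pred, true in reviewed_pairs if pred and true)
    fp = sum(1 for pred, true in reviewed_pairs if pred and not true)
    fn = sum(1 for pred, true in reviewed_pairs if not pred and true)
    precision = tp / (tp + fp) if (tp + fp) else None  # share of declared links that are correct
    recall = tp / (tp + fn) if (tp + fn) else None     # share of true matches that were found
    return {"correct_links": tp, "false_links": fp, "missed_links": fn,
            "precision": precision, "recall": recall}

# Example: three reviewed pairs, two linked correctly, one true match missed
print(linkage_quality_metrics([(True, True), (True, True), (False, True)]))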

Additionally, another lesson learned this quarter was the importance of visualizations to aid the description of complex processes and strategies such as record linkage. In developing and revising the Data Sharing Agreement, we found that creating a process flow diagram greatly assisted the description of the entire record linkage process. The text describes each step of the linkage process as well as ownership during that step, and the visualization provided an added level of communication to promote a fuller comprehension by the reader. During review, we found that adding visuals aided the decision process as well. Going forward, we plan to incorporate visuals when appropriate to make all descriptions clear and easy to understand, and we encourage their inclusion in future PPRL efforts.

As we worked through the process of developing the software selection recommendation memo this quarter, we continued to see the importance of understanding available PII when selecting a PPRL tool, which builds on a lesson learned from the prior quarter. PPRL tools are not one-size-fits-all, and it is important not only to look at the capabilities of a tool but also to consider the types and quality of PII available within the source data. We recognize that through this demonstration project, we will provide insight for future PPRL efforts into selecting the appropriate tool to best complement the PII data available for linking.

Lastly, through various discussions and progress this quarter, we acknowledge the importance of a multidisciplinary team. Utilizing the diverse skills of a team containing experts in record linkage, data sharing, project management, and communication, along with an understanding of the federal statistical system, has improved the efficiency and quality of all aspects of this linkage project.

During this quarter, four concepts were identified as lessons learned to assist future linkage efforts. They are detailed below.

When selecting an open-source PPRL tool, it is important to consider the programming language (e.g., Anonlink is Python based) and ensure project team members have capabilities in that language. In addition to the tool selection, it is important to consider whether the programming language is available within a shared services environment.
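
To make the language consideration concrete, the sketch below illustrates in Python the style of keyed Bloom-filter encoding and Dice-coefficient comparison that underlies tools such as Anonlink. The key, filter size, and hash count are illustrative assumptions rather than project settings, and a real deployment would use the selected tool's own encoding scheme.

import hashlib
import hmac

SECRET_KEY = b"shared-secret-agreed-offline"  # assumption: key exchanged out of band by data providers
FILTER_BITS = 1024   # illustrative Bloom filter length
NUM_HASHES = 20      # illustrative number of hash functions per bigram

def bigrams(value):
    padded = f"_{value.strip().lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def encode_field(value):
    """Map a PII field to a set of Bloom-filter bit positions using keyed hashes."""
    positions = set()
    for gram in bigrams(value):
        for i in range(NUM_HASHES):
            digest = hmac.new(SECRET_KEY, f"{gram}|{i}".encode(), hashlib.sha256)
            positions.add(int(digest.hexdigest(), 16) % FILTER_BITS)
    return positions

def dice_similarity(a, b):
    """Dice coefficient on set bits, a common comparison measure in PPRL."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# Similar names score highly without either party exchanging cleartext PII
print(dice_similarity(encode_field("Jonathan"), encode_field("Jonathon")))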

Another lesson learned is the importance of selecting linkage variables that have low levels of missingness across the sources to be linked. Source data exploration is very important prior to encryption. This includes data checks to understand aspects of the data that would influence the linkage approach or submission file preparation. It would also include assessing interoperability and creating variable code crosswalks to ensure variable standardization.
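
A minimal sketch of these pre-encryption checks, assuming pandas, illustrative column names, a toy code crosswalk, and an assumed 10 percent missingness threshold, might look like the following:

import pandas as pd

# Illustrative source extract; real files would be profiled the same way before encoding
source = pd.DataFrame({
    "first_name": ["Ana", None, "Lee"],
    "dob": ["1990-01-02", "1985-07-30", None],
    "sex": ["F", "2", "M"],
})

# 1. Missingness by candidate linkage variable
missing_share = source.isna().mean()
usable = missing_share[missing_share <= 0.10].index.tolist()
print("Missingness by variable:\n", missing_share)
print("Candidate linkage variables under the assumed threshold:", usable)

# 2. Variable code crosswalk to a common standard across sources
sex_crosswalk = {"F": "female", "M": "male", "1": "male", "2": "female"}
source["sex_std"] = source["sex"].map(sex_crosswalk)
print(source[["sex", "sex_std"]])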

Related to the prior lesson, assessing data quality prior to linkage through clear and concise data quality checks provides benefits and efficiencies. It is important to understand how data are ingested by a trusted third party who does not have access to the unencrypted PII. Because the trusted third party is blind to the source data, established templates of data checks should be developed and implemented in a standardized manner.
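
As one hedged illustration of such a template, the sketch below shows structural checks a trusted third party could run on an encoded submission file without seeing unencrypted PII. The expected column names, file layout, and example path are assumptions made for the example, not the project's actual submission specification.

import csv

EXPECTED_COLUMNS = {"record_id", "clk_encoding"}  # assumed submission file layout

def check_submission(path, expected_rows):
    """Run standardized structural checks on an encoded submission file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = set(reader.fieldnames or [])
    columns_ok = EXPECTED_COLUMNS <= fieldnames
    ids = [r["record_id"] for r in rows] if columns_ok else []
    return {
        "columns_ok": columns_ok,
        "row_count_ok": len(rows) == expected_rows,
        "record_ids_unique": len(ids) == len(set(ids)),
        "no_empty_encodings": columns_ok and all(r["clk_encoding"] for r in rows),
    }

# Example usage (hypothetical file name and expected count supplied by the data provider):
# print(check_submission("agency_a_submission.csv", expected_rows=25000))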

Lastly, for data sharing agreements it is important to ensure transparency with all parties involved. Having clearly defined roles and responsibilities in the data sharing agreement helps mitigate risk.

We have learned the importance of establishing clear communication about data interoperability and determining a method for troubleshooting open-source PPRL code. As part of our lessons learned, we have explored and implemented hosting tutorial meetings with programmers to explain the code in advance, and hosting virtual calls to screenshare and discuss specific aspects of the code for execution. Open-source code requires skills and knowledge that benefit from collaboration, and having supportive methods of communication can improve processing and efficiency.

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.