Project Name:
Utilizing Privacy Preserving Record Linkage with Parent Agency Data and Statistical Agency to Inform Programs and Policies (NCSES/NSF)
Contractor: NORC at the University of Chicago
Lessons Learned
During this second quarter, NORC and NSF/NCSES received feedback from leadership on the data sharing agreement concerning linkage assessment and quality assurance. Comments identified the need for specified mechanisms for assessing the quality of linked records. Both the parent and statistical agencies providing data for linkage would benefit from metrics and methods for verifying that records in each respective file were linked correctly. NORC and NSF/NCSES learned that these methods must be identified and described in detail as quality assurances within the data sharing agreement. Making this a standard part of our data sharing agreement and linkage strategy for the demonstration will help make it a standard part of future PPRL efforts.
Another lesson learned this quarter was the importance of visualizations in describing complex processes and strategies such as record linkage. In developing and revising the Data Sharing Agreement, we found that creating a process flow diagram greatly assisted in describing the entire record linkage process. The text describes each step of the linkage and who is responsible for it; the visualization adds another layer of communication that promotes fuller comprehension by the reader. During review, we found that adding visuals also aided the decision process. Going forward, we plan to incorporate visuals where appropriate to make all descriptions clear and easy to understand, and we encourage their inclusion in future PPRL efforts.
As we developed the software selection recommendation memo this quarter, we continued to see the importance of understanding the available PII when selecting a PPRL tool, building on a lesson learned from the prior quarter. PPRL tools are not one-size-fits-all; it is important to look not only at a tool's capabilities but also at the types and quality of PII available within the source data. We recognize that through this demonstration project, we will provide insight for future PPRL efforts into selecting the tool that best complements the PII available for linking.
Lastly, through the discussions and progress made this quarter, we recognized the importance of a multidisciplinary team. Drawing on the diverse skills of a team with expertise in record linkage, data sharing, project management, communication, and the federal statistical system has improved the efficiency and quality of all aspects of this linkage project.
During this quarter, four concepts were identified as lessons learned to assist future linkage efforts. They are detailed below.
When selecting an open-source PPRL tool, it is important to consider its programming language (e.g., Anonlink is Python-based) and to ensure project team members have capabilities in that language. Beyond tool selection, it is important to consider whether the programming language is available within a shared services environment.
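To illustrate why capabilities in the tool's language matter, the sketch below shows a deliberately simplified version of the Bloom-filter-style encoding that Python-based PPRL tools such as Anonlink perform. It is not the library's actual API; the filter size, number of hashes, bigram tokenization, and secret are illustrative assumptions.

```python
import hashlib
import hmac

# Simplified illustration of Bloom-filter encoding; not Anonlink's actual API.
# FILTER_SIZE, NUM_HASHES, and the example secret are illustrative choices.
FILTER_SIZE = 1024
NUM_HASHES = 20

def bigrams(value):
    """Split a cleaned string into overlapping two-character tokens."""
    v = value.strip().lower()
    return [v[i:i + 2] for i in range(len(v) - 1)]

def encode_record(fields, secret):
    """Encode a record's PII fields into a single Bloom filter, returned as a Python int."""
    bloom = 0
    for field in fields:
        for gram in bigrams(field):
            for k in range(NUM_HASHES):
                digest = hmac.new(secret, f"{k}|{gram}".encode(), hashlib.sha256).digest()
                position = int.from_bytes(digest[:4], "big") % FILTER_SIZE
                bloom |= 1 << position
    return bloom

# Only encoded filters like this (never the raw PII) are exchanged for linkage.
# clk = encode_record(["jane", "doe", "1980-01-01"], secret=b"shared-secret")
```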
Another lesson learned is the importance of selecting linkage variables with low levels of missingness across the sources to be linked. Exploring the source data prior to encryption is essential. This includes data checks to understand aspects of the data that would influence the linkage approach or submission file preparation, as well as assessing interoperability and creating variable code crosswalks to ensure standardization.
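As a concrete example of this kind of pre-encryption exploration, the sketch below (assuming pandas and hypothetical field names) summarizes missingness for candidate linkage variables in one source file.

```python
import pandas as pd

# Hypothetical linkage fields; actual names depend on the source files being linked.
LINKAGE_FIELDS = ["first_name", "last_name", "dob", "zip"]

def missingness_report(df):
    """Percent of records with a missing or blank value for each candidate linkage field."""
    rows = []
    for field in LINKAGE_FIELDS:
        if field not in df.columns:
            rows.append({"field": field, "pct_missing": 100.0})
            continue
        blank = df[field].isna() | (df[field].astype(str).str.strip() == "")
        rows.append({"field": field, "pct_missing": round(100 * blank.mean(), 1)})
    return pd.DataFrame(rows)

# Example usage: flag fields that may be too sparse to support reliable linkage.
# source_a = pd.read_csv("source_a.csv", dtype=str)
# print(missingness_report(source_a).query("pct_missing > 10"))
```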
Related to the prior lesson, assessing data quality prior to linkage through clear and concise data quality checks provides benefits and efficiencies. It is important to understand how data are ingested by a trusted third party that does not have access to the unencrypted PII. Because the trusted third party is blind to the source data, established templates of data checks should be developed and implemented in a standardized manner.
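A minimal sketch of such a standardized check template is shown below; it reports only aggregate results (counts and percentages), so the reviewing party never needs to inspect record-level PII. The expected column names and the record_id identifier are hypothetical placeholders.

```python
import pandas as pd

# Standardized pre-linkage checks that return only aggregate summaries; the
# "record_id" column and expected column names are hypothetical placeholders.
def standard_data_checks(df, expected_columns):
    """Return an aggregate data-quality report that can be shared for provider sign-off."""
    return {
        "row_count": len(df),
        "missing_expected_columns": [c for c in expected_columns if c not in df.columns],
        "duplicate_record_ids": (
            int(df.duplicated(subset=["record_id"]).sum()) if "record_id" in df.columns else None
        ),
        "pct_blank_by_column": {
            c: round(100 * (df[c].isna() | (df[c].astype(str).str.strip() == "")).mean(), 1)
            for c in expected_columns if c in df.columns
        },
    }

# Example usage (hypothetical file and columns):
# report = standard_data_checks(pd.read_csv("submission_file.csv", dtype=str),
#                               ["record_id", "first_name", "last_name", "dob", "zip"])
```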
Lastly, data sharing agreements should ensure transparency among all parties involved. Clearly defined roles and responsibilities in the agreement help mitigate risk.
We have learned the importance of establishing clear communication about data interoperability and a method for troubleshooting open-source PPRL code. As part of our lessons learned, we have hosted tutorial meetings with programmers to explain the code in advance, as well as virtual calls with screensharing to discuss specific aspects of the code before execution. Working with open-source code requires skills and knowledge that benefit from collaboration, and supportive methods of communication can improve processing and efficiency.
During this quarter, five concepts were identified as lessons learned to assist future linkage efforts. They are detailed below.
For the data preprocessing and encoding, it is important to determine which computing environments will be available to implement the code while ensuring sensitive data remains protected. This includes making sure that permissions are granted to the programmers running the code and ensuring that the relevant software is available or can be installed in the environment in a timely manner.
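One way to operationalize this is a small readiness check run inside the target environment before the encoding work is scheduled; the package list below is an assumption standing in for whatever the selected PPRL tool actually requires.

```python
import importlib.util
import shutil
import sys

# Illustrative readiness check for a restricted computing environment; the package
# list is an assumption and should reflect what the selected PPRL tool actually needs.
REQUIRED_PACKAGES = ["anonlink", "clkhash", "pandas"]

def environment_readiness():
    """Report which prerequisites are already available before scheduling the encoding run."""
    return {
        "python_version": sys.version.split()[0],
        "missing_packages": [p for p in REQUIRED_PACKAGES
                             if importlib.util.find_spec(p) is None],
        "git_available": shutil.which("git") is not None,
    }

# Anything missing becomes an installation request to the IT or security team, with
# the review lead time built into the project schedule.
# print(environment_readiness())
```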
Related to the prior lesson learned, it is also important to ensure that the necessary linkage software is available or can be installed in the shared computing environment where the linkage will be carried out (e.g., the NCSES Secure Data Access Facility (SDAF) in this case). If software installation is required, the project schedule should account for the time required by the IT or security teams to perform the necessary checks. For open-source software, these considerations include the programming language and relevant packages, as well as the integrated development environment (IDE) or platform where the code can be run.
Additionally, for open-source privacy-preserving record linkage (PPRL) software, it is important to consider how to construct encoding values such as the salt and secret so that they work in the encoding code and avoid potential errors. The open-source Anonlink documentation provides minimal guidance or recommendations on this point; the team learned that considerations include the length, complexity, and characters used to construct the salt and secret, which add additional protection to the cryptographic long-term keys.
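As one hedged illustration, the Python standard library's secrets module can generate salt and secret values of a chosen length restricted to a plain alphanumeric alphabet; the 32-byte length used here is an assumption, not an Anonlink requirement.

```python
import secrets

# A minimal sketch of one way to construct the salt and secret used to key the encoding.
# The 32-byte length and hexadecimal alphabet are assumptions chosen to avoid
# whitespace and special-character pitfalls; they are not Anonlink requirements.
def generate_keying_material(n_bytes=32):
    """Generate cryptographically strong, plain-alphanumeric salt and secret strings."""
    return {
        "salt": secrets.token_hex(n_bytes),    # 64 hexadecimal characters
        "secret": secrets.token_hex(n_bytes),  # store securely; never commit to version control
    }
```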
Another lesson learned concerned the designated processes for deleting encrypted data within the shared computing environment after linkage and prior to analysis of the final linked data. Approaches, environmental structure, and protocols can vary across computing environments. As a result, the project team should coordinate with the relevant computing environment team members to determine an appropriate deletion method and establish a clear timeline for its implementation.
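A minimal sketch of one possible deletion step is shown below; the directory name and file extension are hypothetical, and the actual protocol (for example, secure wiping rather than simple file removal) should follow the computing environment's own requirements and timeline.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)

# Illustrative deletion step for encoded inputs once linkage is complete. The
# "encoded_inputs" directory and ".clk" extension are hypothetical placeholders.
def delete_encoded_inputs(directory="encoded_inputs"):
    """Remove encoded input files and log each deletion for the project record."""
    target = Path(directory)
    if not target.exists():
        logging.warning("Directory not found: %s", target)
        return
    for path in sorted(target.glob("*.clk")):
        path.unlink()
        logging.info("Deleted encoded file: %s", path)
```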
Finally, it is key to recognize that open-source linkage software may not offer comprehensive guidance for configuring each linkage step. For example, when using Anonlink, the team had to develop its own methodology to determine the number of bits allocated per feature during the encoding process and to establish cutoff thresholds for distinguishing between linked and non-linked records. While this approach allows for greater customization, it also requires specialized expertise.
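For example, a similarity cutoff of the kind described above might be applied as in the sketch below, which compares Bloom filters represented as Python integers (as in the earlier encoding sketch); the 0.85 threshold is hypothetical and would in practice be derived empirically from the distribution of similarity scores for candidate pairs.

```python
# Hypothetical cutoff; in practice the threshold is derived from the observed
# distribution of similarity scores for candidate record pairs.
MATCH_THRESHOLD = 0.85

def dice_similarity(a, b):
    """Dice coefficient between two bit vectors stored as Python integers."""
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * common / total if total else 0.0

def classify_pair(a, b):
    """Label a candidate pair as linked or non-linked using the cutoff."""
    return "linked" if dice_similarity(a, b) >= MATCH_THRESHOLD else "non-linked"
```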
Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.