Project Name:

Utilizing Privacy Preserving Record Linkage to Link Data from Two Federal Statistical Agencies (NCSES/NCHS)

Contractor: HealthVerity, Inc.

Lessons Learned

The following lessons were learned during this period:

Lesson 1:
While a government agency cannot define in advance the exact documents and agreements that future linkage projects will require, asking the following questions should help define the needs of each project quickly. Once the questions are answered and the information is gathered, the government agency or designated entity can determine what needs to be created or revised, who needs to be involved, and the order in which activities must occur to get all data sharing parameters in place:

(1) Who are the stakeholders involved with the project (software owners, data owners, analytics providers, data platform, etc.)?
(2) For the stakeholders identified, does your organization have any established data use agreements or other materials that need to be used?
(3) Which entities from your agency should be involved in crafting, reviewing, approving, and signing any agreements (Office of General Counsel, Contracts, Procurement, etc.)?
(4) Are there any specific data elements, data uses, or other use cases that are restricted or require higher levels of approval?

Lesson 2:
Linkage projects should allow additional time to complete the agreements and should account for holiday and vacation schedules across the organizations and stakeholders.

Lesson 3:
Ensure that government agency leadership can sustain a similar level of involvement and engagement throughout the project. Decisions cannot be made unilaterally by one partner; they must be bilateral, and they need to be made in near real time to keep the project moving forward efficiently.

Lesson 4:
For data linking engagements, it is critical to understand where the data resides. The government agency should therefore consider asking these questions at the outset of the project:

(1) Where does the data reside?

(2) Is the environment cloud-based or on-premises?

(3) Who maintains the environment (contractor, government, other)?

(4) Can the government agency provide a system diagram that shows where the data resides and how it will be extracted for linkage?

(5) Who is responsible for approving the transfer of data outside of the environment?

Lesson 5:
For data linking engagements, it is critical to understand the details of the data that will be linked, because the data sharing agreement must adequately cover all data elements required for the linkage. Questions to consider:

(1) What PII fields does the data contain (e.g., name, address, SSN, ZIP code), and how is each field formatted?

(2) Are there unique features of the PII that the government agency needs to be aware of (e.g., multiple rows for one person; multiple variants of a name, such as Joseph, Joe, and Joey; columns with low fill rates; restrictions on certain fields)?

(3) If the PII is hashed or encrypted, does the government agency still consider it PII?

(4) Is there any sensitivity around the covariate/transactional data (e.g., does it need to stay on government agency servers, or could it contain quasi-identifiers)?

(5) What is the overall data layout that will be processed?

(6) What are the expected fill rates for each field that will be processed?

(7) What is the source of the data? Where is it being pulled from?
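The fill-rate questions above can be answered with a quick profiling pass before any agreement language is finalized. The following is a minimal sketch, not part of the project's tooling: the function name `fill_rates`, the sample field names, and the set of values treated as missing are all assumptions made for illustration.

```python
import csv
import io


def fill_rates(rows, fields, missing=("", "NA", "NULL")):
    """Return, for each field, the share of rows holding a non-missing value."""
    rows = list(rows)
    total = len(rows)
    rates = {}
    for field in fields:
        filled = sum(
            1 for row in rows
            if (row.get(field) or "").strip() not in missing
        )
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical extract; real layouts and missing-value codes will differ.
sample = io.StringIO(
    "first_name,last_name,ssn,zip\n"
    "Joseph,Smith,123456789,19104\n"
    "Joe,Smith,,19104\n"
    "Ana,Lopez,NULL,\n"
    "Li,Chen,987654321,02139\n"
)
rows = list(csv.DictReader(sample))
rates = fill_rates(rows, ["first_name", "last_name", "ssn", "zip"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%}")
```

Fields with low fill rates (here, the hypothetical `ssn` column) are candidates for discussion before they are relied on for linkage.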

During the quarter of April 2024 – June 2024, HealthVerity and Mathematica captured and documented several lessons learned:

  • Data sharing agreement (DSA)

  • PPRL
    • It is important to have discussions about the data flow. The team developed a plan to limit the exposure of covariate data to a trusted third party by adding a step that keeps the hashed personally identifiable information (PII) separate from the covariate data until a HealthVerity ID (HVID) has been assigned.
    • Data quality, particularly levels of missingness, should be considered prior to the development of hashed tokens.
    • It is important to have discussions about name fields and name variants prior to the creation of hashed tokens. Name standardization can aid the linking process, and both parties should agree on it before encryption tools are deployed.
  • Validation Statistics and Modeling
    • Prior to the linkage process, all parties should discuss the types of questions that can be addressed once the files are linked and identify the key covariates that will be used in the analyses. It is advantageous to share data dictionaries early in the project so that team members better understand the scope of the data. Doing so allows the team to brainstorm research questions and analyses that the available data can support, and it facilitates discussions on covariate selection before the analytics work starts, which is particularly important with PPRL.
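
The PPRL points above (agreeing on name standardization before tokens are created, and keeping hashed PII apart from covariate data) can be illustrated with a short sketch. This is not HealthVerity's tokenization method. HMAC-SHA256, the nickname table, the shared key, and every field and function name below are assumptions chosen only to show the idea.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical nickname table; the real list would be agreed on by both parties.
NICKNAMES = {"joe": "joseph", "joey": "joseph"}
# Hypothetical PII fields used to build the token.
PII_FIELDS = ("first", "last", "dob")


def standardize_name(name):
    """Lowercase, drop accents and non-letters, then map known name variants."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())
    cleaned = letters.lower()
    return NICKNAMES.get(cleaned, cleaned)


def make_token(key, first, last, dob):
    """Build a keyed hash (HMAC-SHA256, an assumption) over standardized fields."""
    message = "|".join([standardize_name(first), standardize_name(last), dob])
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def split_record(record, key):
    """Return (token_row, covariate_row); only row_id appears in both."""
    token_row = {
        "row_id": record["row_id"],
        "token": make_token(key, record["first"], record["last"], record["dob"]),
    }
    covariate_row = {k: v for k, v in record.items() if k not in PII_FIELDS}
    return token_row, covariate_row


key = b"hypothetical-shared-secret"
records = [
    {"row_id": "a1", "first": "Joe", "last": "Smith", "dob": "1980-01-31", "degree": "PhD"},
    {"row_id": "b7", "first": "Joseph", "last": "SMITH", "dob": "1980-01-31", "visits": 3},
]
(token_a, cov_a), (token_b, cov_b) = (split_record(r, key) for r in records)
print(token_a["token"] == token_b["token"])  # True: both name variants standardize alike
```

In this sketch only the token rows would go to the linking party. The covariate rows stay behind and are rejoined by `row_id` after a linked identifier has been assigned, which mirrors the separation step described in the data flow bullet.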

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.