Project Name:

Creation of Synthetic Data and Development and Use of Verification Metrics (Survey of Earned Doctorates)

Contractor: The Urban Institute

Lessons Learned

In preparation for the creation of synthetic SED data, the team worked together to prepare an outline of the gold standard file. The gold standard file will be used to inform the creation of a synthetic data file. The lesson learned is the importance of team consensus on the gold standard file. This includes building on the multidisciplinary nature of the team to account for different perspectives and expertise.

The main insights that emerged from the focus groups are that:

  • When creating a synthetic data file, it is important to talk with stakeholders to discuss key variables that are needed for a wide variety of analytic research.
  • Tiered access to data can be beneficial for educational uses such as teaching, and it gives users the opportunity to develop and test analysis code.
  • Users find detailed documentation about the methodology for creating a synthetic data file to be helpful. In addition, validation metrics could be beneficial for users conducting complex statistical analyses.
  • When planning for disclosure risk assessment of synthetic data, it is important to identify sources that may have been released in other formats and may increase disclosure risk.

From the “SEDSyn Data User Focus Group Report,” we learned that:

  • Users see potential for synthetic data use in education, training, debugging, developing initial research plans, and as an intermediate step before requesting secure data access.
  • User education materials and standards need to be put in place to ensure proper buy-in/adoption for other uses of synthetic data.
  • Verification/validation servers could help as another tier (not a replacement) of secure data access.

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.