Project Name: Synthetic Data Generation with Large, Real-World Data

Contractor: Westat

Lessons Learned

  • This project aims to generate synthetic data and to use computing space across multiple agencies. Because it involves multiple stakeholders and governance requirements, the project has highlighted the need for ongoing collaboration and effective communication. The team has addressed this by holding biweekly meetings with the full team, holding biweekly management meetings, and creating a detailed timeline with achievable milestones. The involvement of multiple entities and the corresponding legal agreements have required significant time for proper preparation and review of documentation. The team has recognized the need to document this process comprehensively for future reference. Clearly identifying obstacles and offering solutions or recommendations will be crucial for similar future projects.
  • Synthetic data generation, which relies on a truth source, requires careful selection of variables, including assessments of missingness and levels of granularity. In addition, the data quality of variables in the truth source may affect which variables are selected to inform the synthetic data generation. The team has addressed this by involving stakeholders and expert users of the truth data in the variable selection process.
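
As an illustration of the missingness and granularity assessments mentioned above, the following is a minimal Python sketch and not the project team's actual procedure. The function names, the record layout (a list of dictionaries), the toy records, and the thresholds (20% missingness, minimum cell size of 2) are all hypothetical assumptions chosen for the example.

```python
from collections import Counter


def profile_variables(rows, missing_values=(None, "")):
    """Summarize missingness and granularity for each variable.

    `rows` is a list of dicts, one per record (a hypothetical layout).
    """
    if not rows:
        return {}
    variables = sorted({key for row in rows for key in row})
    n = len(rows)
    profile = {}
    for var in variables:
        values = [row.get(var) for row in rows]
        observed = [v for v in values if v not in missing_values]
        counts = Counter(observed)
        profile[var] = {
            # Share of records with no usable value for this variable.
            "missing_rate": 1 - len(observed) / n,
            # Number of distinct observed values (granularity).
            "distinct_levels": len(counts),
            # Size of the rarest observed value; very small cells
            # suggest the variable may be too granular.
            "smallest_cell": min(counts.values()) if counts else 0,
        }
    return profile


def select_candidates(profile, max_missing_rate=0.2, min_cell_size=2):
    """Keep variables that pass both (assumed) screening thresholds."""
    return [
        var
        for var, stats in profile.items()
        if stats["missing_rate"] <= max_missing_rate
        and stats["smallest_cell"] >= min_cell_size
    ]


# Toy records invented purely for illustration.
rows = [
    {"age_group": "25-34", "state": "MD", "salary": 52000},
    {"age_group": "25-34", "state": "MD", "salary": None},
    {"age_group": "35-44", "state": "VA", "salary": 61000},
    {"age_group": "35-44", "state": "VA", "salary": None},
]

profile = profile_variables(rows)
candidates = select_candidates(profile)
print(candidates)
```

In this toy example, salary would be flagged for review because half of its values are missing and each observed value appears only once. A screen like this could only be a starting point; as the bullet above notes, stakeholders and expert users of the truth data would still need to inform the final selection.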

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.