Project Name: Creating and Validating Synthetic Data (NCSES/Census, Annual Business Survey)

Contractor: Knexus Research LLC

Lessons Learned

There are a variety of tools for generating synthetic data. The key lesson learned from this period is the importance of exploring the differences between these tools and understanding the strengths and limitations of each. Further work is needed to tune the algorithms to produce a robust final product.

This quarter focused on iteratively adding features to the target schema and then tuning the project’s two synthetic data generators, CenSyn and R Synthpop, to address utility concerns. We leveraged the evaluation modes each tool excels at: CenSyn’s k-marginal metric provides an overall data quality score that can be compared against sampling error. Synthpop’s pair-wise propensity heatmaps trace utility issues back to specific features and correlations. Finally, CenSyn’s stable feature analysis provides separate scores for every partition of a poorly performing feature, allowing us to track issues down to specific population subgroups. This is vital for ensuring high fidelity on the smaller, more complex subpopulations expected in the ABS innovation features.
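As a rough illustration of the first of these metrics, the sketch below computes a simplified k-marginal-style similarity score between an original and a synthetic categorical dataset. It is not CenSyn’s actual implementation or API, and it omits CenSyn’s comparison against sampling error; the 0-to-1 scoring scale and the usage shown are assumptions for demonstration only.

# Illustrative sketch only (not CenSyn's actual code or API): a simplified
# k-marginal-style comparison between original and synthetic categorical data.
# The 0-to-1 scoring scale and the example usage below are assumptions.
import itertools

import pandas as pd


def k_marginal_score(original: pd.DataFrame, synthetic: pd.DataFrame, k: int = 2) -> float:
    """Average similarity of all k-way marginal distributions (1.0 = identical)."""
    scores = []
    for cols in itertools.combinations(original.columns, k):
        orig_marginal = original.groupby(list(cols)).size() / len(original)
        syn_marginal = synthetic.groupby(list(cols)).size() / len(synthetic)
        # Total variation distance between the two k-way marginals.
        distance = orig_marginal.subtract(syn_marginal, fill_value=0).abs().sum() / 2
        scores.append(1.0 - distance)
    return sum(scores) / len(scores)


# Hypothetical usage, assuming two frames with the same categorical columns:
# score = k_marginal_score(original_df, synthetic_df, k=2)
# print(f"Mean 2-marginal similarity: {score:.3f}")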

Configuration of synthetic data generators can play a key role in both the usability of the synthesizers and the quality of the synthetic data products they generate. From a user’s perspective, a generator that requires less upfront configuration is often desirable, as it allows synthetic data to be generated more rapidly and with less time spent on knowledge engineering or other tasks that require an understanding of the generator’s internal process. However, an “out-of-the-box” synthetic generator may come at a cost, providing less freedom for an experienced user to optimize synthetic data generation for a specific dataset or respond to specific properties of the data (e.g., large datasets or datasets with a high-dimensional feature space). Both options, however, can provide comparable results in terms of synthetic data quality. Ultimately, a user must choose between a tool with lower upfront costs (e.g., configuration and data knowledge) that allows the generation process to begin more quickly, and a more configurable tool that requires more upfront effort but provides increased control over the synthesis process.
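To make this trade-off concrete, the hedged sketch below contrasts a default, out-of-the-box invocation with a more heavily configured one. The SyntheticGenerator class, its parameter names, and the ABS-flavored column names are invented stand-ins for illustration only, not the interfaces of CenSyn or Synthpop.

# Hypothetical illustration of the configuration trade-off; neither CenSyn's
# nor Synthpop's real API is shown. All names below are invented stand-ins.
from typing import Optional

import pandas as pd


class SyntheticGenerator:
    """Stand-in synthesizer used only to illustrate the two configuration styles."""

    def __init__(self, config: Optional[dict] = None):
        # An empty config is the "out-of-the-box" path: the tool infers
        # variable types, ordering, and models on its own.
        self.config = config or {}

    def fit_sample(self, data: pd.DataFrame, n: int) -> pd.DataFrame:
        # A real tool would model the joint distribution here; this stand-in
        # simply resamples rows so the example runs end to end.
        return data.sample(n=n, replace=True).reset_index(drop=True)


# Option 1: minimal upfront effort; generation can begin almost immediately.
quick = SyntheticGenerator()

# Option 2: more upfront knowledge engineering, but finer control, e.g. an
# explicit synthesis order and per-variable modeling choices.
tuned = SyntheticGenerator(config={
    "visit_sequence": ["naics_sector", "employment", "revenue"],
    "methods": {"employment": "cart", "revenue": "cart"},
    "min_cell_size": 10,  # coarsen rare categories before modeling
})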

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.