Project Name:

Evaluation of Noise Infusion for Large-Scale Demographic Sample Survey (Survey of Doctorate Recipients)

Contractor: Knexus Research LLC

Lessons Learned

The Survey of Doctorate Recipients (SDR) data has a relatively large number of features, in part because many categorical checkbox options (e.g., reason for pursuing a postdoc) are transformed into sets of trinary Y/N/LogicalSkip features. When exploring baseline solutions for privacy, one factor that isn’t always explicitly discussed is that each approach has a cutoff on the number of features it will handle gracefully. This quarter we’ve been gradually increasing the size of the demonstration feature set, focusing on features that are strongly correlated and of interest to stakeholders, and determining, even before we get to the full schema, which privacy solutions are truly viable candidates.
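
The checkbox expansion described above can be illustrated with a minimal Python sketch. The option names, the feature prefix, and the `applicable` flag are hypothetical and are not actual SDR variables; the sketch only shows why one checkbox question turns into several trinary features.

```python
# Hypothetical sketch: expand one multi-select checkbox question into
# trinary Y/N/LogicalSkip features. All names are illustrative only.

OPTIONS = ["training", "no_other_job", "work_with_person"]  # hypothetical options


def expand_checkbox(selected, applicable, prefix="postdoc_reason"):
    """Return one Y/N/LogicalSkip feature per checkbox option.

    selected   -- set of options the respondent ticked
    applicable -- False when the question was skipped by design
                  (e.g., the respondent never held a postdoc)
    """
    features = {}
    for option in OPTIONS:
        if not applicable:
            value = "LogicalSkip"   # question did not apply to this respondent
        elif option in selected:
            value = "Y"
        else:
            value = "N"
        features[f"{prefix}_{option}"] = value
    return features


row = expand_checkbox({"training"}, applicable=True)
skipped = expand_checkbox(set(), applicable=False)
print(row)
print(skipped)
```

Under this encoding a single question with k options contributes k features, which is one reason the schema grows quickly.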

Our deidentified data evaluation harness was originally designed for NIST benchmarking efforts that used a 24-feature schema. This quarter we tested it extensively on a curated 50-feature SDR schema subset, and then expanded it to work on more than 150 features. Evaluations must help users track data fidelity issues back to specific features and subpopulations; in large feature spaces this rapidly becomes both a metrology and a data visualization problem. A 150-feature schema has a 150 × 150 correlation matrix with 22,500 entries (11,175 distinct feature pairs). We used several tricks to make things manageable, including identifying a meaningful hierarchy of feature groups, and using PCA to reduce the dimensionality of some metrics.
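
The pair-count arithmetic, and the idea of summarizing pairwise results by feature group, can be sketched as follows. The group names, feature names, and correlation differences below are made up for illustration and are not taken from the SDR schema or from the harness itself; this is one possible way to roll up pairwise results, not a description of the actual implementation.

```python
from collections import defaultdict
from math import comb

n_features = 150
matrix_entries = n_features * n_features   # cells in the full correlation matrix
distinct_pairs = comb(n_features, 2)       # unordered pairs of different features
print(matrix_entries, distinct_pairs)

# Hypothetical feature-to-group mapping (illustrative names only).
GROUP_OF = {
    "salary": "employment",
    "job_sector": "employment",
    "degree_field": "education",
    "degree_year": "education",
}


def group_summary(pair_diffs):
    """Average absolute correlation difference for each pair of feature groups."""
    buckets = defaultdict(list)
    for (a, b), diff in pair_diffs.items():
        key = tuple(sorted((GROUP_OF[a], GROUP_OF[b])))
        buckets[key].append(abs(diff))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Toy values: (target correlation) minus (deidentified correlation) per pair.
pair_diffs = {
    ("salary", "job_sector"): 0.02,
    ("salary", "degree_field"): -0.10,
    ("salary", "degree_year"): 0.30,
    ("job_sector", "degree_field"): 0.05,
    ("job_sector", "degree_year"): -0.15,
    ("degree_field", "degree_year"): 0.01,
}
summary = group_summary(pair_diffs)
for groups, mean_diff in sorted(summary.items()):
    print(groups, round(mean_diff, 3))
```

Collapsing six feature pairs into three group pairs is trivial here, but the same roll-up lets a reader scan a few dozen group-level cells first and drill into individual features only where a group looks problematic.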

This quarter we completed our baseline solution evaluations and presented the results to our data user stakeholder panel, who then selected a small set of candidates to continue with for the second half of the project. Our selection of baseline solutions included traditional statistical disclosure control (cell suppression), non-differentially private synthetic data (CART, Gaussian Copula, and TVAE), and differentially private synthetic data (MST, AIM, and DPPGM). Our evaluations included fidelity metrics that directly measured data distribution properties (univariate, k-marginal, PCA, and pairwise correlation), utility metrics we designed in collaboration with our stakeholders (linear regression, logistic regression accuracy and feature importance), and privacy metrics (unique exact match and kNN membership inference). We provided a broad, robust picture of the space of possible privacy solutions for the SDR public microdata, and the metrology we developed during the process will go on to enable our stakeholders and NCSES staff to make an informed selection of the final solution during the next year of the project.
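
As an illustration of one of the privacy metrics named above, here is a minimal Python sketch of a unique exact match check. It assumes one common reading of the metric: among target records that are unique in the target data, the percentage that also appear exactly in the deidentified data. The function name, the choice of denominator, and the toy records are assumptions for illustration and may differ from the metric as implemented in the evaluation harness.

```python
from collections import Counter


def unique_exact_match_rate(target, deidentified):
    """Percent of unique target records that reappear exactly in the release.

    Assumption: the denominator is the number of target records that are
    unique within the target data; other definitions are possible.
    """
    target_rows = [tuple(r) for r in target]
    counts = Counter(target_rows)
    unique_rows = [r for r in target_rows if counts[r] == 1]
    if not unique_rows:
        return 0.0
    released = {tuple(r) for r in deidentified}
    matched = sum(1 for r in unique_rows if r in released)
    return 100.0 * matched / len(unique_rows)


# Toy records (made up): two target rows are unique, one of them is released.
target = [
    ("F", "PhD", "CA"),
    ("F", "PhD", "CA"),
    ("M", "PhD", "NY"),
    ("F", "EdD", "TX"),
]
deidentified = [
    ("M", "PhD", "NY"),
    ("F", "PhD", "CA"),
]
rate = unique_exact_match_rate(target, deidentified)
print(rate)
```

A high rate does not by itself prove a disclosure, since a synthetic record can coincide with a real one by chance, so a check like this is best read alongside the other privacy metrics.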

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.