ADC Lessons Learned – SDRN-23-N02 – ADC | America's Datahub Consortium

Project Name:

Evaluation of Noise Infusion for Large-Scale Demographic Sample Survey (Survey of Doctorate Recipients)

Contractor: Knexus Research LLC

Lessons Learned

Reporting Period: January - March 2024

The Survey of Doctorate Recipients (SDR) data has a relatively large number of features, in part because many categorical checkbox options (e.g., reason for pursuing a postdoc) are transformed into sets of trinary Y/N/LogicalSkip features. When exploring baseline solutions for privacy, one factor that isn’t always explicitly discussed is that each approach has a cutoff on the number of features it will handle gracefully. This quarter we gradually increased the size of the demonstration feature set, focusing on features with strong correlations and of interest to stakeholders, to determine, even before we reach the full schema, which privacy solutions are truly viable candidates.
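To illustrate why checkbox questions inflate the feature count, here is a minimal sketch of how one multi-select question might expand into one trinary feature per option. The question name, option names, and the `expand_checkbox` helper are hypothetical and are not taken from the SDR schema or the project's code.

```python
from typing import Dict, List, Optional, Set


def expand_checkbox(
    question: str, options: List[str], selected: Optional[Set[str]]
) -> Dict[str, str]:
    """Expand one checkbox question into one Y/N/LogicalSkip feature per option.

    `selected` is the set of ticked options, or None when the respondent
    was never asked the question (a logical skip).
    """
    if selected is None:
        return {f"{question}_{opt}": "LogicalSkip" for opt in options}
    return {
        f"{question}_{opt}": ("Y" if opt in selected else "N") for opt in options
    }


# Hypothetical question and options, for illustration only.
options = ["training", "no_other_job", "field_change"]
row_answered = expand_checkbox("postdoc_reason", options, {"training"})
row_skipped = expand_checkbox("postdoc_reason", options, None)

print(row_answered)
print(row_skipped)
```

Under this encoding a single question with k options contributes k features, so a handful of checkbox questions can account for a large share of the schema.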

Reporting Period: April - June 2024

Our deidentified data evaluation harness was originally designed for NIST benchmarking efforts that used a 24-feature schema. This quarter we tested it extensively on a curated 50-feature SDR schema subset, and then expanded it to work on more than 150 features. Evaluations must help users trace data fidelity issues back to specific features and subpopulations; in large feature spaces this rapidly becomes both a metrology and a data visualization problem. A 150-feature schema produces a 150 x 150 correlation matrix with 22,500 entries (11,175 distinct feature pairs). We used several techniques to make this manageable, including identifying a meaningful hierarchy of feature groups and using PCA to reduce the dimensionality of some metrics.
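The scale of the pairwise problem, and one way a feature-group hierarchy can tame it, can be sketched as follows. Note that 22,500 is the number of entries in a 150 x 150 correlation matrix; counting each unordered pair once gives 11,175. The feature names, group names, and helper functions below are assumptions for illustration, not the project's evaluation library.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pair_counts(n_features):
    """Return (full matrix entries, distinct unordered feature pairs)."""
    return n_features * n_features, n_features * (n_features - 1) // 2


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def group_correlation_gaps(original, deidentified, groups):
    """Average |corr_original - corr_deidentified| per pair of feature groups.

    `original` and `deidentified` map feature name -> list of numeric values;
    `groups` maps feature name -> group name.
    """
    gaps = defaultdict(list)
    for a, b in combinations(sorted(original), 2):
        gap = abs(
            pearson(original[a], original[b])
            - pearson(deidentified[a], deidentified[b])
        )
        gaps[tuple(sorted((groups[a], groups[b])))].append(gap)
    return {key: sum(vals) / len(vals) for key, vals in gaps.items()}


# Toy data with hypothetical feature and group names.
original = {"salary": [1, 2, 3, 4], "hours": [2, 4, 6, 8], "age": [4, 3, 2, 1]}
deidentified = {"salary": [1, 2, 3, 4], "hours": [2, 4, 7, 8], "age": [1, 3, 2, 4]}
groups = {"salary": "employment", "hours": "employment", "age": "demographics"}

print(pair_counts(150))
print(group_correlation_gaps(original, deidentified, groups))
```

Summarizing by group pair shrinks thousands of per-pair values into a table small enough to read, while still pointing an analyst toward the group where fidelity degrades.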

Reporting Period: July - September 2024

This quarter we completed our baseline solution evaluations and presented the results to our data user stakeholder panel, who then selected a small set of candidates to continue with for the second half of the project. Our selection of baseline solutions included traditional statistical disclosure control (cell suppression), non-differentially private synthetic data (CART, Gaussian Copula, and TVAE), and differentially private synthetic data (MST, AIM, and DPPGM). Our evaluations included fidelity metrics that directly measured data distribution properties (univariate, k-marginal, PCA, and pairwise correlation), utility metrics we designed in collaboration with our stakeholders (linear regression, logistic regression accuracy, and feature importance), and privacy metrics (unique exact match and kNN membership inference). We provided a broad, robust picture of the space of possible privacy solutions for the SDR public microdata, and the metrology we developed during the process will enable our stakeholders and NCSES staff to make an informed selection of the final solution during the next year of the project.
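As a rough illustration of the simplest privacy metric named above, the sketch below computes a unique exact match rate: the share of records that are unique in the target data and also appear verbatim in the deidentified data. This is a simplified reading of the metric, with hypothetical record values; it is not the project's implementation.

```python
from collections import Counter


def unique_exact_match_rate(target, deidentified):
    """Fraction of target singletons that reappear exactly in the deidentified data.

    Records are tuples of feature values. A singleton is a record that
    occurs exactly once in the target data.
    """
    counts = Counter(target)
    singletons = {record for record, n in counts.items() if n == 1}
    if not singletons:
        return 0.0
    released = set(deidentified)
    return len(singletons & released) / len(singletons)


# Hypothetical records: (degree, sex, field code).
target = [
    ("PhD", "F", 1),
    ("PhD", "F", 1),  # duplicated, so not a singleton
    ("PhD", "M", 2),  # singleton
    ("MD", "F", 3),   # singleton
]
deidentified = [("PhD", "M", 2), ("PhD", "F", 1)]

print(unique_exact_match_rate(target, deidentified))  # 0.5
```

A higher rate means more unique individuals' records survive unchanged, which is why the metric is useful as a quick first screen before more expensive checks such as membership inference.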

Reporting Period: October - December 2024

This quarter we focused on the metrology and data privacy solutions that our stakeholders identified as preferred candidates, and scaled both the evaluation library and the solutions up from handling our 50-feature demonstration data set to the full 250 features currently released as part of the Survey of Doctorate Recipients Public Use Microdata. We also refined our non-technical fact sheet documentation into a 2-page pamphlet that summarizes the project’s goals, key approaches, evaluation methods, and results for a general audience. In both cases (scaling technology up to a larger feature set and scaling documentation down to a succinct summary), the primary challenge was how to capture the dynamics of a complex problem space and distill them without losing accuracy or clarity. Both the algorithms and the documentation will be presented for stakeholder review at our next meeting at the end of this month.

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.

SPONSORED BY

The National Science Foundation’s (NSF)
National Center for Science and Engineering Statistics (NCSES)

MANAGED BY

Advanced Technology International (ATI)