Project Name:

Evaluation of Noise Infusion for Large-Scale Demographic Sample Survey (Survey of Doctorate Recipients)

Contractor: Knexus Research LLC

Lessons Learned

The Survey of Doctorate Recipients (SDR) data has a relatively large number of features, in part because many categorical checkbox options (e.g., reason for pursuing a postdoc) are transformed into sets of ternary Y/N/LogicalSkip features. When exploring baseline privacy solutions, one factor that isn’t always explicitly discussed is that each approach has a practical cutoff on the number of features it will handle gracefully. This quarter we gradually increased the size of the demonstration feature set, focusing on features that are strongly correlated and of interest to stakeholders, so that we can determine, even before we get to the full schema, which privacy solutions are truly viable candidates.
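To illustrate how a single checkbox question can multiply the feature count, here is a minimal Python sketch. It assumes a logical skip is recorded as None and that each option becomes its own three-valued column. The function name, question name, and option labels are hypothetical and are not taken from the SDR schema.

```python
def expand_checkbox(question, options, responses):
    """Expand one multi-select question into one Y/N/LogicalSkip column per option.

    Assumption: a response of None means the question was logically skipped
    (it did not apply to the respondent); otherwise the response is the set
    of options the respondent checked.
    """
    columns = {f"{question}_{opt}": [] for opt in options}
    for selected in responses:
        for opt in options:
            if selected is None:
                value = "LogicalSkip"
            elif opt in selected:
                value = "Y"
            else:
                value = "N"
            columns[f"{question}_{opt}"].append(value)
    return columns


# Hypothetical question with three options and four respondents.
options = ["training", "no_other_job", "work_with_person"]
responses = [{"training"}, None, {"training", "work_with_person"}, set()]
features = expand_checkbox("postdoc_reason", options, responses)

for name, values in features.items():
    print(name, values)
```

Under these assumptions, one question with k options contributes k features, which is why the schema grows quickly as more checkbox questions are included.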

Our deidentified data evaluation harness was originally designed for NIST benchmarking efforts that used a 24-feature schema. This quarter we tested it extensively on a curated 50-feature SDR schema subset, and then expanded it to work on more than 150 features. Evaluations must help users track data fidelity issues back to specific features and subpopulations; in large feature spaces this rapidly becomes both a metrology and a data visualization problem. A 150-feature schema has a 150 × 150 correlation matrix with 22,500 entries, of which 11,175 are distinct feature pairs. We used several tricks to make things manageable, including identifying a meaningful hierarchy of feature groups, and using PCA to reduce the dimensionality of some metrics.
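The scale of the pairwise problem, and the effect of grouping features, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness's actual implementation: the group names, feature names, and scores below are made-up toy values, and averaging is only one possible way to roll a pairwise metric up to the group level.

```python
from math import comb


def pair_counts(n_features):
    """Count full correlation-matrix entries and distinct unordered feature pairs."""
    return {
        "matrix_entries": n_features ** 2,
        "distinct_pairs": comb(n_features, 2),
    }


def summarize_by_group(pair_scores, feature_group):
    """Average a per-feature-pair score within each unordered pair of groups."""
    totals = {}
    for (a, b), score in pair_scores.items():
        key = tuple(sorted((feature_group[a], feature_group[b])))
        total, n = totals.get(key, (0.0, 0))
        totals[key] = (total + score, n + 1)
    return {key: total / n for key, (total, n) in totals.items()}


counts = pair_counts(150)
print(counts)

# Hypothetical two-group hierarchy over four toy features.
feature_group = {
    "salary": "employment",
    "sector": "employment",
    "field": "education",
    "degree_year": "education",
}

# Toy per-pair scores, e.g. absolute difference in correlation between
# original and deidentified data (illustrative values only).
pair_scores = {
    ("salary", "sector"): 0.02,
    ("salary", "field"): 0.10,
    ("salary", "degree_year"): 0.20,
    ("sector", "field"): 0.30,
    ("sector", "degree_year"): 0.40,
    ("field", "degree_year"): 0.06,
}

group_summary = summarize_by_group(pair_scores, feature_group)
for key, value in sorted(group_summary.items()):
    print(key, round(value, 3))
```

With g groups there are at most g(g+1)/2 group-level cells to display, so a reader can start from a small group-level view and drill down only into the cells that show a fidelity problem.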

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.