Project Name:
Federated Data Usage Platform
Contractor: Mathematica, Inc.
Lessons Learned
The following are the lessons learned for this quarter and the proposed actions to address them.
- Data usage statistics help stakeholders understand which data are available and how they are used by a wide variety of data users.
- Initial findings from stakeholder engagement show:
- Agencies and data users want to measure the impact of data usage as much as the count of uses.
- Federal agencies can leverage a Data Usage Platform (DUP) to compare the performance of their publicly shared data against that of similar agencies.
- Non-statistical federal agencies are interested in using a DUP to measure the impact of their publicly and non-publicly available data in published research and through media mentions.
- Academic researchers can use the DUP to identify other research of interest and locate the datasets cited in that research.
- There are challenges in tracking federal data through research publications alone, since many data assets lack standardized citation parameters and/or the data are shared on multiple platforms or mentioned only in gray literature.
The following is a summary of the lessons learned for this quarter.
- Perspectives of federal IT staff and post-secondary students are important to inform the eventual implementation of the DUP and its intended use.
- Outreach supported the use case of academic researchers using the DUP to identify other research of interest and locate the datasets cited in that research.
- ORCID is a persistent identifier for disambiguating researchers, serving a role for people analogous to that of a Digital Object Identifier (DOI) for research outputs. The US Government ORCID Consortium, launched by the Department of Energy (DOE) Office of Scientific and Technical Information (OSTI) in April 2020, brings together US government and DOE-affiliated organizations looking to use, adopt, and integrate with ORCID. Because the consortium is targeted primarily at research, the national laboratories make up most of its membership. More information can be found at https://www.osti.gov/pids/orcid-services/us-gov-orcid-consortium
- With respect to the DUP, ORCID iDs can be used as part of the metadata schema when publishing research data, or when researchers apply through the Standard Application Process (SAP) portal, to help with screening. For uniquely identifying the datasets themselves, a separate persistent identifier such as a DOI is required; a sketch of how the two identifiers sit together in a metadata record follows.
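As a minimal sketch of how an ORCID iD (researcher identifier) and a DOI (dataset identifier) might appear side by side in a metadata record, the example below uses a DataCite-style creator entry. All identifiers, names, and titles are placeholders, and the field layout is an illustrative assumption rather than a prescribed DUP schema.

```python
import json

# Hypothetical DataCite-style metadata record showing where an ORCID iD
# (researcher identifier) and a DOI (dataset identifier) each fit.
# All identifiers and titles below are placeholders, not real records.
record = {
    "identifiers": [
        {"identifier": "10.XXXX/example-dataset", "identifierType": "DOI"}
    ],
    "titles": [{"title": "Example Federal Survey Microdata"}],
    "creators": [
        {
            "name": "Doe, Jane",
            "nameIdentifiers": [
                {
                    "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                    "nameIdentifierScheme": "ORCID",
                }
            ],
        }
    ],
    "publisher": "Example Statistical Agency",
    "publicationYear": "2024",
}

print(json.dumps(record, indent=2))
```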
- “DOIs are a foundational requirement to unambiguously identify and access resources.” Persistent identifiers underpin the top principles for Findable and Accessible resources and are also key to tracking data usage (a small resolution example follows these principles):
- (Meta)data are assigned a globally unique and persistent identifier.
- (Meta)data are retrievable by their identifier using a standardized communication protocol.
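As a small illustration of the second principle (retrieval by identifier over a standardized protocol), the sketch below resolves a DOI through the public doi.org content-negotiation service over HTTPS. The DOI shown is a placeholder, and the request pattern is a common convention rather than a DUP requirement.

```python
import urllib.request

# Resolve a DOI over HTTPS using doi.org content negotiation, a standard way
# to retrieve machine-readable metadata by persistent identifier.
# The DOI below is a placeholder; substitute a real dataset DOI.
doi = "10.XXXX/example-dataset"
req = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    print(resp.read().decode("utf-8"))
```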
- Recruitment of key stakeholders. Despite the importance of feedback from data journalists, there was limited interest in engaging with the project team on the DUP prototype in the first round of usability testing. More investment in recruiting journalists is planned for the second and third rounds of usability testing. We received interest from a wide range of university research staff for the next round of DUP usability testing.
- Data sources. Data assets were identified via Data.gov and the Standard Application Process portal (ResearchDataGov). Data.gov provides a centralized location for identifying a broader range of federal data assets; however, a lack of consistency in naming conventions, formatting, and metadata tags presents challenges when using this source to identify assets. Beginning with a subset of agencies and asset types from Data.gov and allowing agencies to supplement these lists manually will improve the quality of asset identification and the usability of the DUP (see the sketch below).
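As one illustration of starting from a subset of agencies, the sketch below queries the catalog.data.gov CKAN search API for a single organization and applies a simple title normalization. The organization filter and normalization rules are assumptions for demonstration only, not the project's actual pipeline.

```python
import json
import re
import urllib.parse
import urllib.request

# Query the catalog.data.gov CKAN API for one agency's datasets and normalize
# titles, illustrating why inconsistent naming and metadata make asset
# identification hard. The organization slug is an assumed example.
params = urllib.parse.urlencode({
    "q": 'organization:"national-science-foundation"',
    "rows": 10,
})
url = f"https://catalog.data.gov/api/3/action/package_search?{params}"

with urllib.request.urlopen(url, timeout=30) as resp:
    results = json.load(resp)["result"]["results"]

def normalize(title: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace so that small
    # formatting differences do not look like distinct asset names.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", title.lower())).strip()

for ds in results:
    org = (ds.get("organization") or {}).get("title")
    print(normalize(ds["title"]), "|", org)
```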
- Data quality checking and annotation. When identifying data assets within publications, inconsistent naming conventions and abbreviations can lead to many false positives and false negatives. Manual review and iterative refinement are crucial for improving annotation accuracy, and balancing precision and recall requires careful tuning of matching algorithms. Proper handling of abbreviations and context is essential for accurate annotations (illustrated in the sketch below).
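A minimal sketch of the matching issue described above: a hypothetical alias list is matched against hand-labeled snippets, and precision and recall are computed as they would be when tuning. The aliases, snippets, and labels are illustrative, and the second snippet shows how an ambiguous abbreviation produces a false positive.

```python
import re

# Hypothetical alias list mapping a canonical dataset name to the variants and
# abbreviations that appear in publication text.
ALIASES = {
    "Survey of Earned Doctorates": ["Survey of Earned Doctorates", "SED"],
}

def find_mentions(text: str) -> set[str]:
    """Return canonical dataset names whose aliases appear as whole words."""
    hits = set()
    for canonical, aliases in ALIASES.items():
        for alias in aliases:
            if re.search(rf"\b{re.escape(alias)}\b", text):
                hits.add(canonical)
                break
    return hits

# Hand-labeled snippets: (text, set of datasets actually referenced).
# In the second snippet "SED" is an unrelated abbreviation (false positive).
labeled = [
    ("We analyze the Survey of Earned Doctorates microdata.",
     {"Survey of Earned Doctorates"}),
    ("Spectral energy distributions (SED) were fitted to each galaxy.", set()),
]

tp = fp = fn = 0
for text, truth in labeled:
    pred = find_mentions(text)
    tp += len(pred & truth)
    fp += len(pred - truth)
    fn += len(truth - pred)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```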
- Federal report identification. When scraping for federal reports, there are API rate limits for sites such as the US Government Publishing Office (GPO). In addition, reports come from diverse sources and in diverse formats across agencies. Flexibility in scraping approaches is necessary when identifying these diverse sources (see the sketch below), and working with federal departments and agencies individually to “onboard” them into a future platform is recommended.
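As a sketch of one way to respect API rate limits when harvesting reports, the snippet below retries a request with exponential backoff whenever the server returns HTTP 429. The endpoint shown is a placeholder, not a specific GPO or agency URL.

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url: str, max_retries: int = 5) -> bytes:
    """GET a URL, sleeping with exponential backoff when rate-limited (429)."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_retries - 1:
                time.sleep(delay)  # respect the source's rate limit
                delay *= 2
            else:
                raise
    raise RuntimeError("retries exhausted")

# Placeholder endpoint; a real harvester would substitute the agency's
# documented API URL and any required API key.
# data = fetch_with_backoff("https://example.gov/api/reports?page=1")
```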
- Leveraging closed-access academic journal articles. Incorporating paywalled or closed-access journal publications into a DUP requires coordination with vendors and can introduce significant costs and challenges. However, at least 50% of academic journal publications are now open access, and this share is increasing. For open-access publications, OpenAlex covers 96% of publications indexed by Scopus as of 2024. We recommend an approach that focuses on acquiring open-access journal articles and white papers through sources such as OpenAlex, PubMed, and arXiv, while noting potential coverage biases for specific disciplines (see the sketch below).
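As a small sketch of acquiring open-access works from OpenAlex, the query below searches for works mentioning a dataset name and filters to open-access results. The search term is illustrative, and the filter and field names follow the public OpenAlex REST API as of this writing and may change.

```python
import json
import urllib.parse
import urllib.request

# Search OpenAlex for open-access works that mention a dataset name.
# The search term is an illustrative example.
params = urllib.parse.urlencode({
    "search": '"Survey of Earned Doctorates"',
    "filter": "open_access.is_oa:true",
    "per-page": 5,
})
url = f"https://api.openalex.org/works?{params}"

with urllib.request.urlopen(url, timeout=30) as resp:
    works = json.load(resp)["results"]

for work in works:
    print(work.get("publication_year"), work.get("doi"), work.get("display_name"))
```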
Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.