Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

2024 Abstracts

The Data Transfer Service: FAIR Data Delivery Made Easy

Authors:

A. J. Ireland1*, Jeffrey Johnson2, Kjiersten Fagnan1, Elisha M. Wood-Charlson1 (PI)

Institutions:

1Lawrence Berkeley National Laboratory; 2Cohere Consulting LLC

Goals

Enable integrated scientific research across BER and beyond by facilitating the transfer of data and accompanying metadata between sample, data, and analysis platforms.

Abstract

Researchers working in the DOE BER program have access to a wide range of data and analysis resources across scientific domains. Each platform provides unique capabilities: user facilities generate data and preliminary analyses (e.g., DOE Joint Genome Institute and Environmental Molecular Sciences Laboratory), the National Microbiome Data Collaborative links sample metadata to standardized data analyses, the DOE Systems Biology Knowledgebase (KBase) enables complex reproducible analyses for publication, and the Environmental System Science Data Infrastructure for a Virtual Ecosystem serves as a BER project data repository. However, researchers typically need more than one platform for data integration, processing, publication, and/or collaboration with others, and so need to move data from one platform to another.

The lack of cross-platform coordination poses significant challenges to researchers. Data transfer is usually performed manually, by downloading from one platform and uploading to another, which tends to be time-consuming, error-prone, and difficult to automate or perform at scale; it also strips useful metadata, including citation information and data provenance. Credit information, provenance, and file metadata are established components of the FAIR (findable, accessible, interoperable, and reusable) data principles, so it is important that they are preserved; keeping this metadata also enables the tracking and reporting of the impact of samples and data generated by BER researchers and BER-funded data platforms.

This team is embarking on a new effort to build a Data Transfer Service (DTS) (Wood-Charlson et al. 2023) designed to streamline cross-platform research by providing a simple way to search, access, and transfer data between platforms. The DTS will leverage persistent identifiers (e.g., ORCID, DOI) and community-defined standards (e.g., Frictionless; PROV-O) for capturing file information and provenance. The data package delivered by the DTS will include the data file alongside file-level metadata, data citation, and funder information. These will be captured using the KBase Credit Metadata Schema (Ireland and Wood-Charlson 2023), which integrates with OSTI and DataCite publishing schemas. The aim is to streamline and incentivize the practice of citing datasets in the same way that one might cite a publication (Wood-Charlson et al. 2022). In the future, the DTS will support the movement not only of public data but also of private datasets with secure authentication, and the design is extensible enough that it could be used to connect resources outside BER (e.g., National Center for Biotechnology Information, National Aeronautics and Space Administration), making it a versatile tool for a wide range of scientific investigations.
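
As a rough illustration of the kind of data package described above, the following Python sketch assembles a Frictionless-style descriptor that keeps a file's size and checksum next to its citation and funder information, then uses the descriptor to check a file after transfer. It is not the DTS implementation or its API: the function names, the `citation` and `funding` keys, and all identifier values are hypothetical placeholders, and only the `name`, `path`, `bytes`, and `hash` resource properties are taken from the Frictionless Data Package conventions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_descriptor(data_file: Path, citation: dict, funder: str) -> dict:
    """Describe one file in a Frictionless-style data package (illustrative)."""
    payload = data_file.read_bytes()
    resource = {
        "name": data_file.stem.lower(),
        "path": data_file.name,
        "bytes": len(payload),
        # Checksum prefixed with its algorithm name.
        "hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
    }
    return {
        "name": "example-transfer",
        "resources": [resource],
        # Hypothetical keys: the real credit metadata layout is not shown here.
        "citation": citation,
        "funding": funder,
    }


def verify_transfer(received_file: Path, descriptor: dict) -> bool:
    """Check a received file against the size and checksum in the descriptor."""
    expected = descriptor["resources"][0]
    payload = received_file.read_bytes()
    actual_hash = "sha256:" + hashlib.sha256(payload).hexdigest()
    return len(payload) == expected["bytes"] and actual_hash == expected["hash"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "reads.fastq"
        source.write_bytes(b"@r1\nACGT\n+\nFFFF\n")
        descriptor = build_descriptor(
            source,
            # Placeholder identifiers, not real records.
            citation={"doi": "10.0000/placeholder", "orcid": "0000-0000-0000-0000"},
            funder="placeholder funder",
        )
        print(json.dumps(descriptor, indent=2))
        print("intact after transfer:", verify_transfer(source, descriptor))
```

Because the descriptor travels with the file, the receiving platform can confirm integrity and retain the credit information without a separate manual step.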

The ability to easily move data between BER platforms, without the hassle of manual transfers or the risk of losing valuable metadata and provenance information, will be a significant benefit to the many researchers who use more than one platform for data analysis, management, and publication. The DTS promises not only to improve the efficiency of data transfers and facilitate cross-platform collaboration, but also to enhance the integrity and usability of the data itself, paving the way for new insights and advances in scientific research.

References

Ireland, A. J., and E. M. Wood-Charlson. 2023. “KBase Credit Metadata Schema.” Accessed April 2024. KBase. DOI:10.25982/1984203.

Wood-Charlson, E. M., et al. 2022. “Ten Simple Rules for Getting and Giving Credit for Data,” PLoS Computational Biology 18(9), e1010476. DOI:10.1371/journal.pcbi.1010476.

Wood-Charlson, E. M., et al. 2023. “Data Transfer Service,” DMPTool. DOI:10.48321/D1W96D.

Funding Information

This work is supported as part of the Genomic Science program of BER. The Data Transfer Service project is funded by the DOE, Office of Science, BER program under award number DE-AC02-05CH11231.