Expanding Python Library Scikit-Bio for Efficient Multiomic Data Integration and Complex Community Modeling
Authors:
Qiyun Zhu1* ([email protected], PI), James Morton2, Daniel McDonald3, Matthew Aton1, Lars Hunger2,
J. Gregory Caporaso4, Rob Knight3
Institutions:
1Arizona State University; 2Gutz Analytics; 3University of California–San Diego; 4Northern Arizona University
URLs:
Goals
Project Goals: This team is expanding scikit-bio, a popular open-source Python library that offers an extensive range of bioinformatics functions to support microbiome research and beyond. The team is implementing functionality for large-scale multiomic data analysis to examine complex relationships between plants, microbes, and the environment. The team’s continuous efforts to optimize fundamental algorithms have enabled the analysis of extremely large communities. The Genomic Science Program (GSP) award has allowed the team to accelerate scikit-bio development since September 2023.
Abstract
Team Building: A competitive developer team has been successfully reassembled. Software engineer Matthew Aton and bioinformatician Dr. Lars Hunger have been recruited and are effectively working on the project with the senior members. Three undergraduate students from Arizona State University and University of California–San Diego have been engaged in the project. The revived scikit-bio has also attracted multiple community contributors. The team has been meeting monthly to ensure a cohesive overview of progress and plans.
Overall Advancements: In the first six months of this project, a total of 40 pull requests have been merged into the codebase. A redesigned website (scikit.bio) is online, featuring reorganized documentation for users and contributors. The codebase has been rigorously refactored to match modern standards. For example, Ruff was adopted to standardize code style across the project. Support for the latest Python ecosystem, including Python 3.12 and SciPy 1.12, has been enabled. A new release of scikit-bio is anticipated by the time of the meeting.
Sparse Matrix: The Biological Observation Matrix (BIOM) library has been integrated into scikit-bio, marking a significant enhancement in its ability to represent and manipulate sparse data matrices, which are characteristic of various omic data types. This integration not only streamlines the handling of large-scale omic data but also optimizes computational resources by focusing on the non-zero values. Efforts are ongoing to further optimize algorithms to take advantage of sparse matrices.
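The benefit of storing only non-zero values can be illustrated with a minimal, library-free sketch. It does not use the BIOM or scikit-bio API; the table contents, the `(feature, sample)` key layout, and the function names `sample_totals` and `density` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy feature table: only non-zero counts are stored,
# keyed by (feature ID, sample ID). Absent keys are implicit zeros.
nonzero_counts = {
    ("OTU_1", "S1"): 12,
    ("OTU_1", "S3"): 4,
    ("OTU_2", "S2"): 7,
    ("OTU_3", "S1"): 1,
    ("OTU_3", "S2"): 9,
}
n_features, n_samples = 3, 3


def sample_totals(entries):
    """Sum counts per sample, visiting only the stored (non-zero) entries."""
    totals = defaultdict(int)
    for (_feature, sample), count in entries.items():
        totals[sample] += count
    return dict(totals)


def density(entries, n_rows, n_cols):
    """Fraction of matrix cells that are actually stored."""
    return len(entries) / (n_rows * n_cols)


totals = sample_totals(nonzero_counts)
print(totals)
print(f"density: {density(nonzero_counts, n_features, n_samples):.2f}")
```

The work done by `sample_totals` scales with the number of stored entries rather than with the full matrix size, which is the property that makes sparse representations attractive for omic tables dominated by zeros.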
Metadata Object: The team adapted and augmented the metadata module from the popular QIIME 2 package. This improved module supports a wider range of metadata types, extending from numeric and categorical to also include Boolean, ordinal, temporal, and free text, among others. Additionally, it introduces standardization for sample identifiers, like specimen ID and host subject ID. Efforts are underway to develop a data dictionary object, which will provide essential context for metadata values, facilitating harmonization of data across omic layers and studies.
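One way to picture typed metadata columns and a data dictionary entry is the sketch below. It is not the scikit-bio or QIIME 2 metadata API; the `ColumnSpec` class, its field names, and the set of `kind` strings are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical data dictionary entry describing one metadata column."""

    name: str
    kind: str  # one of: numeric, categorical, boolean, ordinal, temporal, text
    description: str = ""
    levels: Optional[Sequence[str]] = None  # ordered levels, ordinal columns only

    def parse(self, raw: str):
        """Convert a raw string into a typed value according to the column kind."""
        if self.kind == "numeric":
            return float(raw)
        if self.kind == "boolean":
            lowered = raw.strip().lower()
            if lowered not in ("true", "false"):
                raise ValueError(f"{self.name}: not a Boolean value: {raw!r}")
            return lowered == "true"
        if self.kind == "ordinal":
            if self.levels is None:
                raise ValueError(f"{self.name}: ordinal column needs levels")
            # The rank within the ordered levels; unknown levels raise ValueError.
            return list(self.levels).index(raw)
        if self.kind == "temporal":
            return date.fromisoformat(raw)
        # Categorical and free-text values are kept as strings.
        return raw


severity = ColumnSpec("severity", "ordinal", "Symptom severity",
                      levels=("mild", "moderate", "severe"))
collected = ColumnSpec("collection_date", "temporal", "Sampling date")

print(severity.parse("moderate"))
print(collected.parse("2024-03-01"))
```

Keeping the description and the permitted levels next to the column type is what lets values from different studies be checked against the same definition before they are combined.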
Diversity Metrics: Multiple phylogeny-aware diversity metrics such as balance weighted phylogenetic diversity (BWPD) have been implemented to facilitate modeling of complex communities in light of the evolutionary relationships among microbes. Meanwhile, the team refined the implementation and documentation of existing metrics.
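As an illustration of how a balance-weighted metric combines branch lengths with abundances, the sketch below assumes the unrooted form BWPD_theta = sum over branches of length * (2 * min(D, 1 - D)) ** theta, where D is the fraction of reads falling on one side of the branch. This is a library-free sketch, not the scikit-bio implementation; the `bwpd` function and the branch-list representation of the tree are hypothetical.

```python
def bwpd(branches, counts, theta=1.0):
    """Balance-weighted phylogenetic diversity under the assumed unrooted form.

    branches: iterable of (length, tips), where tips is the set of taxa on
              one side of the branch (hypothetical tree representation).
    counts:   mapping of taxon -> read count in one community.
    theta:    0 ignores abundance; 1 applies the full balance weighting.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    result = 0.0
    for length, tips in branches:
        fraction = sum(counts.get(tip, 0) for tip in tips) / total
        balance = 2.0 * min(fraction, 1.0 - fraction)
        # Branches with all reads on one side contribute nothing,
        # including when theta is 0.
        if balance > 0.0:
            result += length * balance ** theta
    return result


# Toy tree ((A:1,B:2):1,(C:3,D:1):2), listed branch by branch.
toy_branches = [
    (1.0, {"A"}),
    (2.0, {"B"}),
    (1.0, {"A", "B"}),
    (3.0, {"C"}),
    (1.0, {"D"}),
    (2.0, {"C", "D"}),
]
toy_counts = {"A": 2, "B": 2, "C": 4, "D": 0}

print(bwpd(toy_branches, toy_counts, theta=1.0))
print(bwpd(toy_branches, toy_counts, theta=0.0))
```

With theta set to 0 the weighting disappears and the result is the total length of branches that separate observed reads; intermediate values of theta interpolate between that and the fully weighted metric.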
Biological Sequences: The team has expanded the functionality of biological sequences. The sequence alignment function is being redesigned to improve efficiency and usability. Sequences can be converted into tokens to facilitate feature annotation using machine learning frameworks.
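Converting sequences into tokens can be sketched as follows. This example does not call the scikit-bio sequence API; the four-letter alphabet, the reserved index for unknown characters, and the function names `tokenize` and `kmer_ids` are assumptions made for illustration.

```python
ALPHABET = "ACGT"
UNKNOWN = len(ALPHABET)  # index 4 is reserved for any non-ACGT character
CHAR_TO_INDEX = {char: index for index, char in enumerate(ALPHABET)}


def tokenize(sequence):
    """Map each character to an integer index (one token per position)."""
    return [CHAR_TO_INDEX.get(char, UNKNOWN) for char in sequence.upper()]


def kmer_ids(sequence, k):
    """Encode overlapping k-mers as base-4 integers.

    A k-mer containing an unknown character is encoded as -1.
    """
    tokens = tokenize(sequence)
    ids = []
    for start in range(len(tokens) - k + 1):
        window = tokens[start:start + k]
        if UNKNOWN in window:
            ids.append(-1)
            continue
        value = 0
        for token in window:
            value = value * len(ALPHABET) + token
        ids.append(value)
    return ids


print(tokenize("GATTACA"))
print(kmer_ids("ACGT", 2))
```

Integer tokens of this kind can be passed directly to an embedding layer or converted to one-hot vectors, which is the usual entry point for sequence models in machine learning frameworks.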
Workshop: The team’s proposal to host a full-day tutorial on scikit-bio at the Intelligent Systems for Molecular Biology 2024 conference in July has been accepted. The team anticipates enrolling up to 40 mentees, including researchers, educators, and developers. These efforts aim to make scikit-bio increasingly useful to the science community.
Funding Information
This project is currently supported by DOE’s Office of Science under DE-SC0024320, awarded to Dr. Qiyun Zhu (lead principal investigator), Dr. James Morton, and Dr. Rob Knight.