Introduction
Existing biomedical benchmarks do not provide end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data and support a broad range of machine learning tasks in therapeutics. We present PyTDC, an open-source machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. PyTDC unifies distributed, heterogeneous, continuously updated data sources and model weights and standardizes benchmarking and inference endpoints.
The components of PyTDC include:
- API-first dataset model: A collection of multimodal, continually updated heterogeneous data sources is unified under the introduced "API-first-dataset" architecture. Inspired by API-first design, this microservice architecture uses the model-view-controller design pattern to enable multimodal data views.
- PyTDC Model Server: PyTDC presents open-source model retrieval and deployment software that streamlines AI inferencing and exposes state-of-the-art, research-ready models and training setups for biomedical representation learning models across modalities.
- PyTDC Model Hub: PyTDC provides a model hub that serves as a repository for pre-trained models and training setups. The model hub is designed to be extensible, allowing users to add new models and training setups easily.
- PyTDC Model Classes: PyTDC provides source code implementations of SoTA biomedical foundation models, enabling easy integration of models into existing applications and workflows.
- PyTDC Model Server API: PyTDC provides an interface for retrieving model weights and source code for ease of model deployment.
- PyTDC Model Benchmarking APIs: Domain-specific benchmarking modules for evaluating foundation models on downstream therapeutic tasks.
- Research-ready (sc)FMs: Starting with single-cell foundation models, PyTDC presents up-to-date, research-ready implementations of foundation models for accelerating research workflows.
- (Py)TDC Datasets and tasks: Building on Therapeutic Data Commons, PyTDC introduces ML tasks across therapeutics with corresponding datasets and benchmarking modules.
- ML tasks: PyTDC provides a collection of ML tasks across therapeutics, including single-instance prediction, multi-instance prediction, and generation tasks.
- Datasets: PyTDC provides a collection of datasets for each ML task, with multiple splits for training, validation, and testing.
- Benchmarking modules: PyTDC provides benchmarking modules for evaluating model performance on each ML task and dataset; see the sketch following this list.
- Single-cell therapeutics: We integrate single-cell analysis with multimodal machine learning in therapeutics via the introduction of contextualized tasks, ranging from perturbation-response prediction to drug-target nomination.
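For example, a typical benchmarking loop follows the pattern below. This is a minimal sketch assuming the TDC benchmark-group API (admet_group, get_train_valid_split, evaluate_many) carries over to the tdc_ml namespace; the mean-value predictor is a placeholder for a real model.
from tdc_ml.benchmark_group import admet_group  # assumes the TDC benchmark-group API under the tdc_ml namespace

group = admet_group(path='data/')  # download and cache the ADMET benchmark group
predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    # train/valid split for model development (unused by the placeholder predictor)
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # train a model on train/valid here; as a placeholder, predict the training-set mean
    y_pred_test = [train_val['Y'].mean()] * len(test)
    predictions_list.append({name: y_pred_test})
results = group.evaluate_many(predictions_list)  # metric mean and std across seeds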
PyTDC is designed to be user-friendly, offering a seamless experience for researchers and developers. The platform is built on top of popular machine-learning libraries such as PyTorch and Hugging Face Transformers, so it integrates readily into existing workflows, and it is extensible, allowing users to add new models, datasets, and tasks. PyTDC is open source and available on GitHub, enabling the community to contribute to the project and share their work. With its focus on multimodal data, usability, and extensibility, PyTDC is poised to become a leading platform for machine learning in therapeutics.
Below is example code for running a standard inference workflow using PyTDC. The example runs the Geneformer model on the scperturb_drug_AissaBenevolenskaya2021 dataset, which contains single-cell RNA-seq data from a chemical perturbation experiment. In other libraries, this workflow would involve thousands of lines of code and multiple tedious steps; PyTDC provides a streamlined interface for loading the dataset, tokenizing the data, and running inference with the model. Here the workflow is implemented in under 30 lines of code and extracts the hidden states of the model for further analysis or downstream tasks.
from tdc_ml.multi_pred.perturboutcome import PerturbOutcome
from tdc_ml.model_server.tokenizers.geneformer import GeneformerTokenizer
from tdc_ml import tdc_hf_interface
import torch

dataset = "scperturb_drug_AissaBenevolenskaya2021"
data = PerturbOutcome(dataset)  # load dataset for chemical perturbation-response prediction
adata = data.adata
tokenizer = GeneformerTokenizer(max_input_size=3)  # initialize the model tokenizer
adata.var["feature_id"] = adata.var.index.map(
    lambda x: tokenizer.gene_name_id_dict.get(x, 0))  # map gene names to tokenizer feature IDs
x = tokenizer.tokenize_cell_vectors(adata, ensembl_id="feature_id", ncounts="counts")  # tokenize the custom dataset
cells, _ = x
geneformer = tdc_hf_interface("Geneformer")
model = geneformer.load()  # load the Geneformer model
batch = torch.tensor(cells)  # in practice, split the inputs into smaller batches
attention_mask = torch.tensor(
    [[token != 0 for token in cell] for cell in batch])  # mask out padding tokens
outputs = model(batch,
                attention_mask=attention_mask,
                output_hidden_states=True)
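The returned hidden states can then be pooled into fixed-size per-cell embeddings for downstream analysis. The snippet below is a minimal sketch assuming outputs follows the standard Hugging Face convention of exposing a hidden_states tuple; mean-pooling over non-padding tokens is one common choice, not a PyTDC requirement.
# assumes Hugging Face-style outputs with a hidden_states tuple
last_hidden = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_dim)
mask = attention_mask.unsqueeze(-1).float()  # shape: (batch, seq_len, 1)
cell_embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over valid tokens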

Figure 2. AI inferencing and model evaluation components. The PyTDC model server (sections 3.2 and C) streamlines retrieval, inferencing, and training setup for an array of context-aware biological foundation models and models spanning multiple modalities. A model store retrieval API provides unified access to model weights stored in the Hugging Face Model Hub, Chan-Zuckerberg CELLxGENE Census fine-tuned models, and TDC (Huang et al., 2021; 2022; Velez-Arce et al., 2024) storage. The model server also provides access to model classes, tokenizer functions, and inference endpoints supporting PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). Extracted embeddings, from either model server inference or pre-computed embedding storage, are ready for downstream use by task-specific benchmarking modules.
Tiered Design of Therapeutics Data Commons: “Problem – ML Task – Dataset”
TDC has a unique three-tiered hierarchical structure. At the highest level, its resources are organized to support three types of problems. For each problem, we provide a collection of ML tasks, and for each task, we provide a series of datasets.
The Commons outlines three major problems in the first tier:
- Single-instance prediction (single_pred): making predictions involving individual biomedical entities.
- Multi-instance prediction (multi_pred): making predictions about multiple biomedical entities.
- Generation (generation): generating new biomedical entities with desirable properties.
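Each problem class maps to an import namespace in the package. As a brief illustration (ADME, DTI, and MolGen are representative task classes from each namespace, and the dataset name is illustrative):
# representative task classes; the dataset name below is an illustrative pick
from tdc_ml.single_pred import ADME  # single-instance prediction
from tdc_ml.multi_pred import DTI  # multi-instance prediction
from tdc_ml.generation import MolGen  # generation

data = MolGen(name='ZINC')  # e.g., load a molecule corpus for generative modeling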

At the second tier, TDC is organized into ML tasks. Researchers across disciplines can use these tasks for numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.
In the third tier, we provide multiple datasets for each task, along with several splits into training, validation, and test sets for evaluating model performance.

Installation
To install the PyTDC Python package, use the following:
pip install pytdc-nextml
Installation is hassle-free, with minimal dependencies on external packages.
Data Loaders
TDC provides intuitive, high-level APIs for both beginners and experts to create ML models in Python. Building on the modular "Problem – ML Task – Dataset" structure, TDC provides a three-layer API to access any ML task and dataset.

As an example, to obtain the Caco2 dataset from the ADME task in the single-instance prediction problem, do the following:
from tdc_ml.single_pred import ADME
data = ADME(name='Caco2_Wang')  # load the Caco2 (Wang) dataset from the ADME task
df = data.get_data()  # full dataset as a Pandas DataFrame
splits = data.get_split()  # dictionary of train/val/test DataFrames
The variable df is a Pandas object holding the entire dataset. By default, the variable splits is a dictionary with keys train, val, and test, whose values are Pandas DataFrames with Drug IDs, SMILES strings, and labels. For detailed information about outputs, see the Datasets documentation.
The user only needs to specify "Problem – ML Task – Dataset." TDC then automatically retrieves the processed, ML-ready dataset from the TDC server and generates a data object, exposing numerous data functions that can be directly applied to the dataset.
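For instance, under the assumption that the standard TDC data functions are available on the data object (availability may vary by dataset):
# assumes standard TDC data functions; availability may vary by dataset
data.print_stats()  # print summary statistics of the dataset
scaffold = data.get_split(method='scaffold')  # scaffold split for out-of-distribution evaluation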
Ecosystem of Data Functions, Tools, Libraries, and Community Resources
TDC includes numerous data functions to support the development of novel ML methods and theory:
- Model Evaluation: TDC implements a series of metrics and performance functions to debug ML models, evaluate model performance for any task in TDC, and assess whether model predictions generalize to out-of-distribution datasets.
- Dataset Splits: Therapeutic applications require ML models to generalize to out-of-distribution samples. TDC implements various data splits to reflect realistic learning settings.
- Data Processing: As therapeutics ML covers a wide range of data modalities and requires numerous repetitive processing functions, TDC implements wrappers and useful data helpers for them.
- Molecule Generation Oracles: Molecular design tasks require oracle functions to measure the quality of generated entities. TDC implements over 17 molecule generation oracles, representing the most comprehensive collection of molecule oracles. Each oracle is tailored to measure the quality of AI-generated molecules along a specific dimension; see the sketch after this list.
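As a minimal sketch of the oracle interface, assuming TDC's Oracle class carries over to the tdc_ml namespace (QED, a drug-likeness score, is one of the available oracles):
from tdc_ml import Oracle  # assumes TDC's Oracle class under the tdc_ml namespace

qed = Oracle(name='QED')  # quantitative estimate of drug-likeness
scores = qed(['CC(=O)OC1=CC=CC=C1C(=O)O', 'CCO'])  # score a list of SMILES strings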