BioML-bench¶

A benchmark suite for evaluating LLM agents on biomedical machine learning tasks.


📄 Paper: BioML-bench: Evaluation of AI Agents for End-to-End Biomedical ML

BioML-bench builds on MLE-bench and provides a framework for benchmarking LLM agents on biomedical machine learning (BioML) tasks, including protein engineering, drug discovery, single-cell omics, medical imaging, and clinical biomarkers.

Agents autonomously read task descriptions, analyze biomedical data, design appropriate ML approaches, and implement complete solutions from scratch.

🧬 Features¶

Diverse Biomedical Tasks: Protein engineering, drug discovery, single-cell omics, medical imaging, clinical biomarkers
Agent-Agnostic Evaluation: Any LLM agent that can read task descriptions and produce file/folder submissions can be evaluated
Human Baselines: Built-in human performance benchmarks for comparison
Extensible Framework: Easy to add new biomedical tasks
Biomedical Libraries: Pre-installed RDKit, BioPython, and other domain-specific tools for use by agents

🚀 Quick Start¶

Install the package (requires uv):

# Install with uv
git clone https://github.com/science-machine/biomlbench.git
cd biomlbench
uv sync

Pull prebuilt Docker images and run a benchmark:

# Pull prebuilt images (recommended - saves build time)
./scripts/pull_prebuilt_images.sh

# Prepare a task
biomlbench prepare -t polarishub/tdcommons-caco2-wang

# Run an agent (in this case, a dummy agent that returns null predictions)
biomlbench run-agent --agent dummy --task-id polarishub/tdcommons-caco2-wang

# Grade results (submission.jsonl is auto-generated)
biomlbench grade --submission <run-group-dir>/submission.jsonl --output-dir results/
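
The three steps above can be collected in a small wrapper. The following is a minimal sketch, not part of BioML-bench: the function name `quickstart_cmds` is hypothetical, and the run-group directory is passed in as an argument because its actual location is only known after the agent run. The function prints the commands (a dry run) so they can be reviewed before execution.

```shell
# Hypothetical helper (not part of BioML-bench): print the quick-start
# commands for a single task so they can be reviewed before running.
# Usage: quickstart_cmds <task-id> <agent> <run-group-dir>
quickstart_cmds() {
    local task_id="$1"
    local agent="$2"
    local run_dir="$3"   # assumed to be supplied by the user after the run

    # Step 1: prepare the task data
    printf 'biomlbench prepare -t %s\n' "$task_id"
    # Step 2: run the agent on the task
    printf 'biomlbench run-agent --agent %s --task-id %s\n' "$agent" "$task_id"
    # Step 3: grade the auto-generated submission.jsonl
    printf 'biomlbench grade --submission %s/submission.jsonl --output-dir results/\n' "$run_dir"
}
```

For example, `quickstart_cmds polarishub/tdcommons-caco2-wang dummy <run-group-dir>` prints the same three commands shown above, with the placeholder left for you to fill in.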

Full v0.1a Benchmark¶

To evaluate an agent on the complete 24-task benchmark from the preprint:

# Prepare all v0.1a tasks (downloads several GB of data)
./scripts/prepare_tasks_from_file.sh experiments/biomlbench_v0.1a.txt

# Run agents on the full benchmark
biomlbench run-agent --agent aide --task-list experiments/biomlbench_v0.1a.txt
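
To run only part of the benchmark, one option is to filter the task list and pass each task to `--task-id`. The sketch below rests on assumptions: the helper `tasks_with_prefix` is hypothetical, and it assumes the task-list file holds one task ID per line, with blank lines and lines starting with `#` ignored. Check the actual file format before relying on it.

```shell
# Hypothetical helper (not part of BioML-bench): print the task IDs in a
# task-list file that start with a given prefix.
# ASSUMPTION: one task ID per line; blank lines and '#' comments are skipped.
# Usage: tasks_with_prefix <task-list-file> <prefix>
tasks_with_prefix() {
    local file="$1"
    local prefix="$2"
    # NF skips blank lines; index(...) == 1 keeps IDs beginning with the prefix
    awk -v p="$prefix" 'NF && $1 !~ /^#/ && index($1, p) == 1 { print $1 }' "$file"
}

# Example usage (commented out so that sourcing this file runs nothing):
# tasks_with_prefix experiments/biomlbench_v0.1a.txt polarishub/ |
#     while read -r task; do
#         biomlbench run-agent --agent aide --task-id "$task"
#     done
```

Running tasks one at a time this way trades the convenience of `--task-list` for finer control over which tasks are executed.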

📊 Available Tasks¶

Medical Imaging¶

manual/histopathologic-cancer-detection: Cancer detection in histopathology patches

Drug Discovery¶

polarishub/tdcommons-caco2-wang: Molecular property prediction (intestinal permeability)
polarishub/*: 80+ drug discovery and molecular property prediction tasks from Polaris

🏗️ Architecture¶

BioML-bench follows a modular architecture:

Core Framework (biomlbench/) - Task management, grading, data handling
Agent System (agents/) - Agent registry and execution framework
Environment (environment/) - Containerized execution environment
Tasks (biomlbench/tasks/) - Individual biomedical benchmark tasks, each containing at least one dataset
Scripts (scripts/) - Build, test, and deployment automation

📚 Documentation Structure¶

Quick Start - Get up and running
API Reference - Complete API documentation
Components - Deep dive into system components
Developer Guide - Extend the framework

🤝 Contributing¶

We welcome contributions! See our Contributing Guide for details on:

Adding new biomedical tasks
Creating custom agents
Extending data sources
Improving documentation