BioML-bench
A benchmark suite for evaluating LLM agents on biomedical machine learning tasks.
📄 Paper: BioML-bench: Evaluation of AI Agents for End-to-End Biomedical ML
BioML-bench is built on top of MLE-bench and provides a framework for benchmarking LLM agents on biomedical machine learning (BioML) tasks, including protein engineering, drug discovery, single-cell omics, medical imaging, and clinical biomarkers.
Agents autonomously read task descriptions, analyze biomedical data, design appropriate ML approaches, and implement complete solutions from scratch.
🧬 Features
- Diverse Biomedical Tasks: Protein engineering, drug discovery, single-cell omics, medical imaging, clinical biomarkers
- Agent-Agnostic Evaluation: Any LLM agent that can read task descriptions and produce file/folder submissions can be evaluated
- Human Baselines: Built-in human performance benchmarks for comparison
- Extensible Framework: Easy to add new biomedical tasks
- Biomedical Libraries: Pre-installed RDKit, BioPython, and other domain-specific tools for use by agents
🚀 Quick Start
Install the package (requires uv):
Pull prebuilt Docker images and run a benchmark:
# Pull prebuilt images (recommended - saves build time)
./scripts/pull_prebuilt_images.sh
# Prepare a task
biomlbench prepare -t polarishub/tdcommons-caco2-wang
# Run an agent (in this case, a dummy agent that returns null predictions)
biomlbench run-agent --agent dummy --task-id polarishub/tdcommons-caco2-wang
# Grade results (submission.jsonl is auto-generated)
biomlbench grade --submission <run-group-dir>/submission.jsonl --output-dir results/
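The same three steps can be assembled programmatically. The following is a minimal sketch that uses only the commands and flags shown above; the helper names `quickstart_commands` and `run_all` and the `dry_run` switch are hypothetical and not part of BioML-bench. The run-group directory is produced by `run-agent`, so it is passed in as a parameter rather than guessed.

```python
import subprocess
from typing import List


def quickstart_commands(task_id: str, agent: str, run_group_dir: str,
                        output_dir: str = "results/") -> List[List[str]]:
    """Build the prepare, run-agent, and grade invocations from the Quick Start.

    Hypothetical helper: it only assembles the CLI calls shown above.
    """
    return [
        ["biomlbench", "prepare", "-t", task_id],
        ["biomlbench", "run-agent", "--agent", agent, "--task-id", task_id],
        ["biomlbench", "grade",
         "--submission", f"{run_group_dir}/submission.jsonl",
         "--output-dir", output_dir],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "<run-group-dir>" is left as a placeholder, exactly as in the text above.
    run_all(quickstart_commands("polarishub/tdcommons-caco2-wang", "dummy",
                                "<run-group-dir>"))
```

With the default `dry_run=True` the script only prints the commands, which is a convenient way to check them before replacing the placeholder with the directory reported by `run-agent`.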
Full v0.1a Benchmark
To evaluate on our complete 24-task benchmark from the preprint:
# Prepare all v0.1a tasks (downloads several GB of data)
./scripts/prepare_tasks_from_file.sh experiments/biomlbench_v0.1a.txt
# Run agents on the full benchmark
biomlbench run-agent --agent aide --task-list experiments/biomlbench_v0.1a.txt
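For orientation, here is a sketch of how a task list such as `experiments/biomlbench_v0.1a.txt` could be expanded into one `biomlbench prepare -t` command per task. It assumes the file holds one task ID per line and that blank lines and lines starting with `#` can be skipped. Neither the file format nor the behaviour of `prepare_tasks_from_file.sh` is described here, so both the format and the function names are assumptions.

```python
from typing import List


def parse_task_list(text: str) -> List[str]:
    """Return task IDs from task-list text.

    Assumed format: one task ID per line; blank lines and '#' comments ignored.
    """
    task_ids = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        task_ids.append(line)
    return task_ids


def prepare_commands(task_ids: List[str]) -> List[List[str]]:
    """Build one `biomlbench prepare -t <task-id>` command per task."""
    return [["biomlbench", "prepare", "-t", task_id] for task_id in task_ids]


# Illustrative input using the two task IDs named in this document.
example = """
# example task list (hypothetical contents)
manual/histopathologic-cancer-detection
polarishub/tdcommons-caco2-wang
"""

for cmd in prepare_commands(parse_task_list(example)):
    print(" ".join(cmd))
```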
📊 Available Tasks
Medical Imaging
- manual/histopathologic-cancer-detection: Cancer detection in histopathology patches
Drug Discovery
- polarishub/tdcommons-caco2-wang: Molecular property prediction (intestinal permeability)
- polarishub/*: 80+ drug discovery and molecular property prediction tasks from Polaris
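The task IDs listed above take the form `source/name` (for example `manual/...` or `polarishub/...`). The sketch below groups IDs by their source prefix. It assumes every ID has a single leading source segment followed by `/`, and `group_by_source` is a hypothetical helper rather than part of the BioML-bench API.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_source(task_ids: List[str]) -> Dict[str, List[str]]:
    """Group task IDs of the form 'source/name' by their source prefix."""
    groups = defaultdict(list)
    for task_id in task_ids:
        # partition splits on the first '/' only
        source, sep, name = task_id.partition("/")
        if not sep or not source or not name:
            raise ValueError(f"expected 'source/name', got {task_id!r}")
        groups[source].append(name)
    return dict(groups)


tasks = [
    "manual/histopathologic-cancer-detection",
    "polarishub/tdcommons-caco2-wang",
]
print(group_by_source(tasks))
```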
🏗️ Architecture
BioML-bench follows a modular architecture:
- Core Framework (`biomlbench/`) - Task management, grading, data handling
- Agent System (`agents/`) - Agent registry and execution framework
- Environment (`environment/`) - Containerized execution environment
- Tasks (`biomlbench/tasks/`) - Individual biomedical benchmark tasks, each of which contains at least one dataset
- Scripts (`scripts/`) - Build, test, and deployment automation
📚 Documentation Structure
- Quick Start - Get up and running
- API Reference - Complete API documentation
- Components - Deep dive into system components
- Developer Guide - Extend the framework
🤝 Contributing
We welcome contributions! See our Contributing Guide for details on:
- Adding new biomedical tasks
- Creating custom agents
- Extending data sources
- Improving documentation