BioML-bench

A benchmark suite for evaluating LLM agents on biomedical machine learning tasks.

BioML-bench Overview

📄 Paper: BioML-bench: Evaluation of AI Agents for End-to-End Biomedical ML

BioML-bench is built on top of MLE-bench and provides a framework for benchmarking LLM agents on biomedical machine learning (BioML) tasks, including protein engineering, drug discovery, single-cell omics, medical imaging, and clinical biomarkers.

Agents autonomously read task descriptions, analyze biomedical data, design appropriate ML approaches, and implement complete solutions from scratch.

🧬 Features

  • Diverse Biomedical Tasks: Protein engineering, drug discovery, single-cell omics, medical imaging, and clinical biomarkers
  • Agent-Agnostic Evaluation: Any LLM agent that can read task descriptions and produce file/folder submissions can be evaluated
  • Human Baselines: Built-in human performance benchmarks for comparison
  • Extensible Framework: Easy to add new biomedical tasks
  • Biomedical Libraries: Pre-installed RDKit, BioPython, and other domain-specific tools for use by agents

🚀 Quick Start

Install the package (requires uv):

# Clone the repository and install dependencies with uv
git clone https://github.com/science-machine/biomlbench.git
cd biomlbench
uv sync

Pull prebuilt Docker images and run a benchmark:

# Pull prebuilt images (recommended - saves build time)
./scripts/pull_prebuilt_images.sh

# Prepare a task
biomlbench prepare -t polarishub/tdcommons-caco2-wang

# Run an agent (in this case, a dummy agent that returns null predictions)
biomlbench run-agent --agent dummy --task-id polarishub/tdcommons-caco2-wang

# Grade results (submission.jsonl is auto-generated)
biomlbench grade --submission <run-group-dir>/submission.jsonl --output-dir results/
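
Before grading, it can help to sanity-check the generated file. The sketch below assumes only that submission.jsonl is a JSON Lines file (one JSON value per line); it makes no assumption about the fields BioML-bench writes, and `summarize_jsonl` is a hypothetical helper, not part of the biomlbench package.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file.

    Hypothetical helper: it inspects structure only and does not know
    the schema that BioML-bench uses for submission.jsonl.
    """
    count = 0
    keys = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys
```

Calling `summarize_jsonl("<run-group-dir>/submission.jsonl")` with the real run group directory substituted returns the number of records and how often each top-level key appears, which is usually enough to spot an empty or truncated file.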

Full v0.1a Benchmark

To evaluate an agent on the complete 24-task benchmark from the preprint:

# Prepare all v0.1a tasks (downloads several GB of data)
./scripts/prepare_tasks_from_file.sh experiments/biomlbench_v0.1a.txt

# Run agents on the full benchmark
biomlbench run-agent --agent aide --task-list experiments/biomlbench_v0.1a.txt
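
To try an agent on only a few tasks from the list, the per-task commands from the Quick Start can be generated from the task file. This is a rough sketch that assumes the file holds one task ID per line and that blank lines and `#` comments can be skipped; neither is a documented property of the task-list format. `read_task_ids`, `build_commands`, and `print_plan` are hypothetical helpers, and the sketch only prints commands rather than running them.

```python
import shlex
from pathlib import Path


def read_task_ids(path):
    """Read task IDs from a text file, assuming one ID per line.

    Skipping blank lines and '#' comments is an assumption, not a
    documented property of the BioML-bench task-list format.
    """
    task_ids = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            task_ids.append(line)
    return task_ids


def build_commands(task_id, agent):
    """Return the prepare and run-agent argv lists for one task.

    Only the flags shown in the Quick Start are used.
    """
    return [
        ["biomlbench", "prepare", "-t", task_id],
        ["biomlbench", "run-agent", "--agent", agent, "--task-id", task_id],
    ]


def print_plan(task_list_path, agent, limit=None):
    """Print the commands for the first `limit` tasks without running them."""
    for task_id in read_task_ids(task_list_path)[:limit]:
        for argv in build_commands(task_id, agent):
            print(shlex.join(argv))
```

For example, `print_plan("experiments/biomlbench_v0.1a.txt", "aide", limit=3)` prints the prepare and run-agent commands for the first three tasks, which can be reviewed and then pasted into a shell. For the full benchmark, the script and `--task-list` option above remain the simpler route.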

📊 Available Tasks

Medical Imaging

  • manual/histopathologic-cancer-detection: Cancer detection in histopathology patches

Drug Discovery

  • polarishub/tdcommons-caco2-wang: Molecular property prediction (intestinal permeability)
  • polarishub/*: More than 80 drug discovery and molecular property prediction tasks from Polaris

🏗️ Architecture

BioML-bench follows a modular architecture:

  • Core Framework (biomlbench/) - Task management, grading, data handling
  • Agent System (agents/) - Agent registry and execution framework
  • Environment (environment/) - Containerized execution environment
  • Tasks (biomlbench/tasks/) - Individual biomedical benchmark tasks, each containing at least one dataset
  • Scripts (scripts/) - Build, test, and deployment automation

🤝 Contributing

We welcome contributions! See our Contributing Guide for details on:

  • Adding new biomedical tasks
  • Creating custom agents
  • Extending data sources
  • Improving documentation