
Grading API

Evaluation and scoring functionality for agent submissions on biomedical tasks.

High-level grading functionality

extract_agent_from_submission_path

extract_agent_from_submission_path(submission_path)

Extract agent ID from submission path by parsing the run group directory structure.
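
A minimal sketch of calling this helper. The run-group layout it parses is not documented on this page, so the path below (borrowed from the multi-task example further down) is illustrative only, and the import location is an assumption.

from pathlib import Path

# Import location assumed; adjust if the helper lives elsewhere.
from biomlbench.grade import extract_agent_from_submission_path

# Hypothetical submission path inside a run group directory.
agent_id = extract_agent_from_submission_path(Path("runs/my-run-group/submission.jsonl"))
print(agent_id)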

grade_jsonl

grade_jsonl(
    path_to_submissions,
    output_dir,
    registry=DEFAULT_REGISTRY,
)

Grades multiple submissions stored in a JSONL file. Saves the aggregated report as a JSON file. Also saves individual task reports for easier access.

grade_submission

grade_submission(path_to_submission, task)

Grades a submission file (CSV or h5ad) for the given task.

validate_submission

validate_submission(submission, task)

Validates a submission for the given task by actually running the task grader. This is designed for end users rather than grader developers: it assumes the task grader is correctly implemented and uses the grader to validate the submission, not the submission to validate the grader.
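
A minimal sketch of validating a submission before grading it. It assumes validate_submission and InvalidSubmissionError are importable from biomlbench.grade, that the submission argument accepts a path, and that a failed validation surfaces as an InvalidSubmissionError; check the actual module layout and error-handling contract in your install.

from pathlib import Path

# Import locations assumed; adjust to your install.
from biomlbench.grade import validate_submission, InvalidSubmissionError
from biomlbench.registry import registry

task = registry.get_task("caco2-wang")

try:
    # Assumes the submission argument accepts a path to the submission file.
    validate_submission(Path("submission.csv"), task)
    print("Submission looks gradable.")
except InvalidSubmissionError as err:
    print(f"Submission cannot be graded: {err}")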

aggregate_reports

aggregate_reports(task_reports)

Builds the summary report from a list of task reports.
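
A sketch of building a summary report by hand from individually graded tasks. Import locations are assumptions, the second task ID is a placeholder, and the structure of the returned summary is not documented here.

from pathlib import Path

# Import locations assumed; adjust to your install.
from biomlbench.grade import grade_submission, aggregate_reports
from biomlbench.registry import registry

# Grade a couple of submissions, then aggregate the resulting TaskReports.
# Task IDs other than "caco2-wang" are placeholders.
submissions = {
    "caco2-wang": Path("submissions/caco2-wang.csv"),
    "another-task": Path("submissions/another-task.csv"),
}

task_reports = [
    grade_submission(path, registry.get_task(task_id))
    for task_id, path in submissions.items()
]

summary = aggregate_reports(task_reports)
print(summary)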

calculate_human_performance_metrics

calculate_human_performance_metrics(
    agent_score, human_df, is_lower_better
)

Calculate how an agent's performance compares to human baselines.

Args:
- agent_score: The agent's score to compare
- human_df: DataFrame with human baseline scores (must have a 'score' column)
- is_lower_better: Whether lower scores are better for this metric

Returns: Tuple of (beats_human, human_percentile) where:
- beats_human: True if the agent beats median human performance
- human_percentile: Percentile of human performance that the agent achieves (0-100)
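
A small illustration of comparing an agent score against a set of human baselines. The human scores are made-up numbers, and the import location is an assumption.

import pandas as pd

# Import location assumed; adjust to your install.
from biomlbench.grade import calculate_human_performance_metrics

# Hypothetical human baseline scores for a lower-is-better metric (e.g. MAE).
human_df = pd.DataFrame({"score": [0.52, 0.48, 0.61, 0.45, 0.70]})

beats_human, human_percentile = calculate_human_performance_metrics(
    agent_score=0.47,
    human_df=human_df,
    is_lower_better=True,
)

print(f"Beats median human: {beats_human}")
print(f"Human percentile: {human_percentile:.1f}")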

Helper classes related to grading

Grader

Grader(name, grade_fn)
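
A sketch of constructing a custom Grader. The signature of grade_fn is not documented on this page; the example assumes it receives the submission and the ground-truth answers as aligned DataFrames and returns a float score, and that Grader is importable from biomlbench.grade.

# Import location and the grade_fn contract are assumptions.
from biomlbench.grade import Grader

def mae_grade_fn(submission, answers):
    """Assumed contract: aligned DataFrames with 'prediction' / 'target' columns."""
    return (submission["prediction"] - answers["target"]).abs().mean()

grader = Grader(name="mean-absolute-error", grade_fn=mae_grade_fn)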

is_lower_better

is_lower_better(leaderboard)

Determines if a lower score is better based on the leaderboard. Returns True if lower scores are better, False otherwise.

rank_score

rank_score(score, leaderboard)

Ranks a score based on the leaderboard. Returns a dictionary with the following keys:
- gold_medal: bool
- silver_medal: bool
- bronze_medal: bool
- above_median: bool
- gold_threshold: float
- silver_threshold: float
- bronze_threshold: float
- median_threshold: float
- leaderboard_percentile: float or None (0-100, higher = better performance)
- leaderboard_size: int (total number of entries on the leaderboard)
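
A sketch combining is_lower_better and rank_score. The leaderboard representation here (a pandas DataFrame of public scores) and the import locations are assumptions; use whatever leaderboard object your task or registry actually provides.

import pandas as pd

# Import locations and leaderboard representation are assumptions.
from biomlbench.grade import is_lower_better, rank_score

# Hypothetical public leaderboard scores.
leaderboard = pd.DataFrame({"score": [0.91, 0.88, 0.85, 0.80, 0.72]})

lower_better = is_lower_better(leaderboard)
result = rank_score(0.86, leaderboard)

print(f"Lower is better: {lower_better}")
print(f"Gold: {result['gold_medal']}, percentile: {result['leaderboard_percentile']}")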

TaskReport dataclass

TaskReport(
    task_id,
    score,
    gold_threshold,
    silver_threshold,
    bronze_threshold,
    median_threshold,
    any_medal,
    gold_medal,
    silver_medal,
    bronze_medal,
    above_median,
    submission_exists,
    valid_submission,
    is_lower_better,
    created_at,
    submission_path,
    leaderboard_percentile=None,
    leaderboard_size=None,
    beats_human=None,
    human_percentile=None,
)

Report for a single biomedical task evaluation.

Extended from MLE-bench CompetitionReport with biomedical-specific fields.
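
Since TaskReport is a dataclass, a graded report can be inspected field by field or serialized with dataclasses.asdict. A minimal sketch, with the import location of grade_submission assumed:

import json
from dataclasses import asdict
from pathlib import Path

from biomlbench.grade import grade_submission  # import location assumed
from biomlbench.registry import registry

task = registry.get_task("caco2-wang")
report = grade_submission(Path("submission.csv"), task)

# TaskReport is a standard dataclass, so asdict() works for serialization.
print(json.dumps(asdict(report), indent=2, default=str))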

InvalidSubmissionError

Bases: Exception

A custom exception for when the agent submission cannot be graded.

Medal System

BioML-bench uses a Kaggle-style medal system that varies based on leaderboard size:

For small leaderboards (1-99 teams):
- 🥇 Gold: Top 10% of submissions
- 🥈 Silver: Top 20% (but not gold)
- 🥉 Bronze: Top 40% (but not silver/gold)

For medium leaderboards (100-249 teams):
- 🥇 Gold: Top 10 positions (fixed)
- 🥈 Silver: Top 20% (but not gold)
- 🥉 Bronze: Top 40% (but not silver/gold)

For large leaderboards (250-999 teams):
- 🥇 Gold: Top (10 + 0.2% of teams) positions
- 🥈 Silver: Top 50 positions (fixed)
- 🥉 Bronze: Top 100 positions (fixed)

For very large leaderboards (1000+ teams):
- 🥇 Gold: Top (10 + 0.2% of teams) positions
- 🥈 Silver: Top 5% of submissions
- 🥉 Bronze: Top 10% of submissions

Medal thresholds follow the official Kaggle competition progression system.
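
The tiers above can be summarized as position cutoffs as a function of leaderboard size. The sketch below simply restates the rules listed here and is not BioML-bench's implementation; in particular, the rounding behaviour (ceil vs. floor) is an assumption.

import math

def medal_position_cutoffs(n_teams: int) -> dict:
    """Worst 1-indexed position that still earns each medal (rounding assumed)."""
    if n_teams < 100:
        gold, silver, bronze = (math.ceil(f * n_teams) for f in (0.10, 0.20, 0.40))
    elif n_teams < 250:
        gold = 10
        silver = math.ceil(0.20 * n_teams)
        bronze = math.ceil(0.40 * n_teams)
    elif n_teams < 1000:
        gold = 10 + math.floor(0.002 * n_teams)
        silver, bronze = 50, 100
    else:
        gold = 10 + math.floor(0.002 * n_teams)
        silver = math.ceil(0.05 * n_teams)
        bronze = math.ceil(0.10 * n_teams)
    return {"gold": gold, "silver": silver, "bronze": bronze}

print(medal_position_cutoffs(500))  # {'gold': 11, 'silver': 50, 'bronze': 100}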

Usage Examples

Single Task Evaluation

from pathlib import Path

from biomlbench.grade import grade_submission
from biomlbench.registry import registry

# Grade a single submission
task = registry.get_task("caco2-wang")
submission_path = Path("submission.csv")

report = grade_submission(submission_path, task)

print(f"Score: {report.score}")
print(f"Medal: {'🥇' if report.gold_medal else '🥈' if report.silver_medal else '🥉' if report.bronze_medal else '❌'}")
print(f"Beats human: {report.beats_human}")

Multi-Task Evaluation

from pathlib import Path

from biomlbench.grade import grade_jsonl

# Grade multiple tasks from submission.jsonl
grade_jsonl(
    path_to_submissions=Path("runs/my-run-group/submission.jsonl"),
    output_dir=Path("results/")
)