
Grading API

Evaluation and scoring functionality for agent submissions on biomedical tasks.

High-level grading functionality

extract_agent_from_submission_path

extract_agent_from_submission_path(submission_path)

Extract agent ID from submission path by parsing the run group directory structure.
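
A minimal sketch of calling this helper. The run-group layout it parses is not documented on this page, so the path below (borrowed from the multi-task example further down) is illustrative only, and the import location is an assumption.

from pathlib import Path

# Import location assumed; adjust if the helper lives elsewhere.
from biomlbench.grade import extract_agent_from_submission_path

# Hypothetical submission path inside a run group directory.
agent_id = extract_agent_from_submission_path(Path("runs/my-run-group/submission.jsonl"))
print(agent_id)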

grade_jsonl

grade_jsonl(
    path_to_submissions,
    output_dir,
    registry=DEFAULT_REGISTRY,
)

Grades multiple submissions stored in a JSONL file. Saves the aggregated report as a JSON file. Also saves individual task reports for easier access.

grade_submission

grade_submission(path_to_submission, task)

Grades a submission file (CSV or h5ad) for the given task.

validate_submission

validate_submission(submission, task)

Validates a submission for the given task by actually running the task grader. This is designed for end users rather than grader developers: it assumes the task grader is correctly implemented and uses the grader to validate the submission, not the submission to validate the grader.
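
A minimal sketch of validating a submission before grading it. It assumes validate_submission and InvalidSubmissionError are importable from biomlbench.grade, that the submission argument accepts a path, and that a failed validation surfaces as an InvalidSubmissionError; check the actual module layout and error-handling contract in your install.

from pathlib import Path

# Import locations assumed; adjust to your install.
from biomlbench.grade import validate_submission, InvalidSubmissionError
from biomlbench.registry import registry

task = registry.get_task("caco2-wang")

try:
    # Assumes the submission argument accepts a path to the submission file.
    validate_submission(Path("submission.csv"), task)
    print("Submission looks gradable.")
except InvalidSubmissionError as err:
    print(f"Submission cannot be graded: {err}")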

aggregate_reports

aggregate_reports(task_reports)

Builds the summary report from a list of task reports.
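
A sketch of building a summary report by hand from individually graded tasks. Import locations are assumptions, the second task ID is a placeholder, and the structure of the returned summary is not documented here.

from pathlib import Path

# Import locations assumed; adjust to your install.
from biomlbench.grade import grade_submission, aggregate_reports
from biomlbench.registry import registry

# Grade a couple of submissions, then aggregate the resulting TaskReports.
# Task IDs other than "caco2-wang" are placeholders.
submissions = {
    "caco2-wang": Path("submissions/caco2-wang.csv"),
    "another-task": Path("submissions/another-task.csv"),
}

task_reports = [
    grade_submission(path, registry.get_task(task_id))
    for task_id, path in submissions.items()
]

summary = aggregate_reports(task_reports)
print(summary)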

calculate_human_performance_metrics

calculate_human_performance_metrics(
    agent_score, human_df, is_lower_better
)

Calculate how an agent's performance compares to human baselines.

Args:
- agent_score: The agent's score to compare
- human_df: DataFrame with human baseline scores (must have a 'score' column)
- is_lower_better: Whether lower scores are better for this metric

Returns: Tuple of (beats_human, human_percentile) where:
- beats_human: True if the agent beats median human performance
- human_percentile: Percentile of human performance that the agent achieves (0-100)
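
A small illustration of comparing an agent score against a set of human baselines. The human scores are made-up numbers, and the import location is an assumption.

import pandas as pd

# Import location assumed; adjust to your install.
from biomlbench.grade import calculate_human_performance_metrics

# Hypothetical human baseline scores for a lower-is-better metric (e.g. MAE).
human_df = pd.DataFrame({"score": [0.52, 0.48, 0.61, 0.45, 0.70]})

beats_human, human_percentile = calculate_human_performance_metrics(
    agent_score=0.47,
    human_df=human_df,
    is_lower_better=True,
)

print(f"Beats median human: {beats_human}")
print(f"Human percentile: {human_percentile:.1f}")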

Helper classes related to grading

Grader

Grader(name, grade_fn)
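
A sketch of constructing a custom Grader. The signature of grade_fn is not documented on this page; the example assumes it receives the submission and the ground-truth answers as aligned DataFrames and returns a float score, and that Grader is importable from biomlbench.grade.

# Import location and the grade_fn contract are assumptions.
from biomlbench.grade import Grader

def mae_grade_fn(submission, answers):
    """Assumed contract: aligned DataFrames with 'prediction' / 'target' columns."""
    return (submission["prediction"] - answers["target"]).abs().mean()

grader = Grader(name="mean-absolute-error", grade_fn=mae_grade_fn)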

is_lower_better

is_lower_better(leaderboard)

Determines if a lower score is better based on the leaderboard. Returns True if lower scores are better, False otherwise.

rank_score

rank_score(score, leaderboard)

Ranks a score based on the leaderboard. Returns a dictionary with the following keys:
- gold_medal: bool
- silver_medal: bool
- bronze_medal: bool
- above_median: bool
- gold_threshold: float
- silver_threshold: float
- bronze_threshold: float
- median_threshold: float
- leaderboard_percentile: float or None (0-100, higher = better performance)
- leaderboard_size: int (total number of entries on the leaderboard)
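
A sketch combining is_lower_better and rank_score. The leaderboard representation here (a pandas DataFrame of public scores) and the import locations are assumptions; use whatever leaderboard object your task or registry actually provides.

import pandas as pd

# Import locations and leaderboard representation are assumptions.
from biomlbench.grade import is_lower_better, rank_score

# Hypothetical public leaderboard scores.
leaderboard = pd.DataFrame({"score": [0.91, 0.88, 0.85, 0.80, 0.72]})

lower_better = is_lower_better(leaderboard)
result = rank_score(0.86, leaderboard)

print(f"Lower is better: {lower_better}")
print(f"Gold: {result['gold_medal']}, percentile: {result['leaderboard_percentile']}")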

TaskReport dataclass

TaskReport(
    task_id,
    score,
    gold_threshold,
    silver_threshold,
    bronze_threshold,
    median_threshold,
    any_medal,
    gold_medal,
    silver_medal,
    bronze_medal,
    above_median,
    submission_exists,
    valid_submission,
    is_lower_better,
    created_at,
    submission_path,
    leaderboard_percentile=None,
    leaderboard_size=None,
    beats_human=None,
    human_percentile=None,
)

Report for a single biomedical task evaluation.

Extended from MLE-bench CompetitionReport with biomedical-specific fields.
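
Since TaskReport is a dataclass, a graded report can be inspected field by field or serialized with dataclasses.asdict. A minimal sketch, with the import location of grade_submission assumed:

import json
from dataclasses import asdict
from pathlib import Path

from biomlbench.grade import grade_submission  # import location assumed
from biomlbench.registry import registry

task = registry.get_task("caco2-wang")
report = grade_submission(Path("submission.csv"), task)

# TaskReport is a standard dataclass, so asdict() works for serialization.
print(json.dumps(asdict(report), indent=2, default=str))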

InvalidSubmissionError

Bases: Exception

A custom exception for when the agent submission cannot be graded.

Medal System

BioML-bench uses a Kaggle-style medal system that varies based on leaderboard size:

For small leaderboards (1-99 teams):
- 🥇 Gold: Top 10% of submissions
- 🥈 Silver: Top 20% (but not gold)
- 🥉 Bronze: Top 40% (but not silver/gold)

For medium leaderboards (100-249 teams):
- 🥇 Gold: Top 10 positions (fixed)
- 🥈 Silver: Top 20% (but not gold)
- 🥉 Bronze: Top 40% (but not silver/gold)

For large leaderboards (250-999 teams):
- 🥇 Gold: Top (10 + 0.2% of teams) positions
- 🥈 Silver: Top 50 positions (fixed)
- 🥉 Bronze: Top 100 positions (fixed)

For very large leaderboards (1000+ teams):
- 🥇 Gold: Top (10 + 0.2% of teams) positions
- 🥈 Silver: Top 5% of submissions
- 🥉 Bronze: Top 10% of submissions

Medal thresholds follow the official Kaggle competition progression system.
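
The tiers above can be summarized as position cutoffs as a function of leaderboard size. The sketch below simply restates the rules listed here and is not BioML-bench's implementation; in particular, the rounding behaviour (ceil vs. floor) is an assumption.

import math

def medal_position_cutoffs(n_teams: int) -> dict:
    """Worst 1-indexed position that still earns each medal (rounding assumed)."""
    if n_teams < 100:
        gold, silver, bronze = (math.ceil(f * n_teams) for f in (0.10, 0.20, 0.40))
    elif n_teams < 250:
        gold = 10
        silver = math.ceil(0.20 * n_teams)
        bronze = math.ceil(0.40 * n_teams)
    elif n_teams < 1000:
        gold = 10 + math.floor(0.002 * n_teams)
        silver, bronze = 50, 100
    else:
        gold = 10 + math.floor(0.002 * n_teams)
        silver = math.ceil(0.05 * n_teams)
        bronze = math.ceil(0.10 * n_teams)
    return {"gold": gold, "silver": silver, "bronze": bronze}

print(medal_position_cutoffs(500))  # {'gold': 11, 'silver': 50, 'bronze': 100}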

Usage Examples

Single Task Evaluation

from pathlib import Path

from biomlbench.grade import grade_submission
from biomlbench.registry import registry

# Grade a single submission
task = registry.get_task("caco2-wang")
submission_path = Path("submission.csv")

report = grade_submission(submission_path, task)

print(f"Score: {report.score}")
print(f"Medal: {'🥇' if report.gold_medal else '🥈' if report.silver_medal else '🥉' if report.bronze_medal else '❌'}")
print(f"Beats human: {report.beats_human}")

Multi-Task Evaluation

from pathlib import Path

from biomlbench.grade import grade_jsonl

# Grade multiple tasks from submission.jsonl
grade_jsonl(
    path_to_submissions=Path("runs/my-run-group/submission.jsonl"),
    output_dir=Path("results/")
)