Tasks API¶

Task preparation, management, and data handling functionality.

Data preparation and management for BioML-bench.

This module provides high-level functions for downloading and preparing datasets from various sources (Kaggle, Polaris, etc.) using a pluggable data source architecture.

create_prepared_dir ¶

create_prepared_dir(task)

Create public and private directories for a task.
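As a rough illustration of what creating the two directories could involve, here is a minimal sketch. The helper name `make_public_private_dirs` and the `prepared_root` argument are hypothetical; the real function takes a `task` object whose attributes are not documented on this page.

```python
from pathlib import Path
from typing import Tuple


def make_public_private_dirs(prepared_root: Path) -> Tuple[Path, Path]:
    """Hypothetical stand-in: create `public` and `private` under `prepared_root`."""
    public_dir = prepared_root / "public"
    private_dir = prepared_root / "private"
    # parents=True creates missing intermediate directories;
    # exist_ok=True makes repeated calls harmless.
    public_dir.mkdir(parents=True, exist_ok=True)
    private_dir.mkdir(parents=True, exist_ok=True)
    return public_dir, private_dir
```

Because `exist_ok=True` is used, calling the helper twice on the same root leaves existing contents untouched.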

download_and_prepare_dataset ¶

download_and_prepare_dataset(
    task,
    keep_raw=True,
    overwrite_checksums=False,
    overwrite_leaderboard=False,
    skip_verification=False,
)

Download and prepare a dataset using the appropriate data source.

Args:
    task: Task to prepare
    keep_raw: Whether to keep raw downloaded data
    overwrite_checksums: Whether to overwrite existing checksums
    overwrite_leaderboard: Whether to overwrite existing leaderboard
    skip_verification: Whether to skip checksum verification

is_dataset_prepared ¶

is_dataset_prepared(task, grading_only=False)

Checks if the task has non-empty public and private directories with the expected files.
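The non-empty directory check can be sketched as follows. This is an illustration only: the helper name `dirs_look_prepared` is hypothetical, it takes the two directories directly rather than a `task`, and it does not check for specific expected files. The handling of `grading_only` (skipping the public directory) is an assumption, since this page does not describe that flag.

```python
from pathlib import Path


def dirs_look_prepared(
    public_dir: Path, private_dir: Path, grading_only: bool = False
) -> bool:
    """Hypothetical stand-in for the directory part of the check."""

    def non_empty(directory: Path) -> bool:
        # A directory counts as prepared only if it exists and has at least one entry.
        return directory.is_dir() and any(directory.iterdir())

    # Assumption: grading needs only the private data, so the public
    # directory is checked only when grading_only is False.
    if not grading_only and not non_empty(public_dir):
        return False
    return non_empty(private_dir)
```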

ensure_leaderboard_exists ¶

ensure_leaderboard_exists(task, force=False)

Ensures the leaderboard for a given task exists.

Args:
    task: Task to ensure leaderboard for
    force: Whether to force download/update of leaderboard

Returns:
    Path to the leaderboard file

Raises:
    FileNotFoundError: If leaderboard cannot be found or created

is_valid_prepare_fn ¶

is_valid_prepare_fn(preparer_fn)

Checks if the preparer_fn takes three arguments: raw, public and private, in that order.
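One way such a signature check could be written is with `inspect.signature` from the standard library. The sketch below is not taken from the BioML-bench source; the helper name `takes_raw_public_private` is hypothetical, and it assumes the check is on parameter names as well as count.

```python
import inspect
from typing import Any


def takes_raw_public_private(fn: Any) -> bool:
    """Hypothetical stand-in: True if fn's parameters are exactly (raw, public, private)."""
    try:
        signature = inspect.signature(fn)
    except (TypeError, ValueError):
        # Raised for non-callables and for some builtins without introspectable signatures.
        return False
    # `parameters` is an ordered mapping, so the list preserves declaration order.
    return list(signature.parameters) == ["raw", "public", "private"]
```

A prepare function declared as `def prepare(raw, public, private): ...` passes this check, while one that reorders or renames the parameters does not.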

generate_checksums ¶

generate_checksums(target_dir, exts=None, exclude=None)

Generate checksums for the files directly under the target directory with the specified extensions.

Args:
    target_dir: Directory to generate checksums for.
    exts: List of file extensions to generate checksums for.
    exclude: List of file paths to exclude from checksum generation.

Returns:
    A dictionary mapping each file to its checksum.
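To make the "files directly under the target directory" behavior concrete, here is a minimal sketch. The helper name `checksum_directory` is hypothetical. Treating `exts=None` as "no extension filter", keying the result by file name, and using MD5 as the hash are assumptions of this sketch; the page does not state what the real function does in those respects.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Optional


def checksum_directory(
    target_dir: Path,
    exts: Optional[List[str]] = None,
    exclude: Optional[List[Path]] = None,
) -> Dict[str, str]:
    """Hypothetical stand-in: MD5 checksums of files directly under target_dir."""
    # Accept extensions written as "csv" or ".csv".
    wanted = None if exts is None else {e.lstrip(".") for e in exts}
    excluded = {Path(p).resolve() for p in (exclude or [])}

    checksums: Dict[str, str] = {}
    # iterdir() lists direct children only, so subdirectories are not descended into.
    for fpath in sorted(Path(target_dir).iterdir()):
        if not fpath.is_file() or fpath.resolve() in excluded:
            continue
        if wanted is not None and fpath.suffix.lstrip(".") not in wanted:
            continue
        checksums[fpath.name] = hashlib.md5(fpath.read_bytes()).hexdigest()
    return checksums
```

Reading each file fully with `read_bytes()` keeps the sketch short; for large datasets a chunked read would use less memory.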

get_last_modified ¶

get_last_modified(fpath)

Return the last modified time of a file.

file_cache ¶

file_cache(fn)

A decorator that caches results of a function with a Path argument, invalidating the cache when the file is modified.
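The invalidation idea can be sketched by storing each result alongside the file's modification time and recomputing when that time changes. The decorator name `mtime_cache` is hypothetical, and this is one possible approach rather than the module's actual implementation.

```python
import functools
import os
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def mtime_cache(fn: Callable[[Path], Any]) -> Callable[[Path], Any]:
    """Hypothetical stand-in: cache fn(path) until the file's mtime changes."""
    cache: Dict[Path, Tuple[float, Any]] = {}

    @functools.wraps(fn)
    def wrapper(fpath: Path) -> Any:
        key = Path(fpath).resolve()
        mtime = os.path.getmtime(key)
        hit = cache.get(key)
        if hit is not None and hit[0] == mtime:
            # File unchanged since the cached call: reuse the stored result.
            return hit[1]
        result = fn(fpath)
        cache[key] = (mtime, result)
        return result

    return wrapper
```

A cache keyed on modification time will miss edits that leave the timestamp unchanged, which is the usual trade-off for avoiding a full re-read of the file.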

get_checksum ¶

get_checksum(fpath)

Compute MD5 checksum of a file.
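For reference, an MD5 file checksum can be computed with the standard library's `hashlib`, as in the sketch below. The helper name `md5_of_file` and the `chunk_size` parameter are hypothetical and not part of the documented signature; reading in chunks is a choice made here so that large dataset files are not loaded into memory at once.

```python
import hashlib
from pathlib import Path


def md5_of_file(fpath: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical stand-in: hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(fpath, "rb") as handle:
        # iter(callable, sentinel) keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

MD5 is adequate for detecting accidental corruption or changed files, but it is not collision-resistant and should not be relied on for security purposes.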

get_leaderboard ¶

get_leaderboard(task)

Load leaderboard data for a task.

prepare_human_baselines ¶

prepare_human_baselines(task, force=False)

Prepare human baseline data for a task.

Args:
    task: Task to prepare human baselines for
    force: Whether to force re-download of human baselines

Returns:
    Path to human baselines CSV file, or None if not available