Tasks API¶
Task preparation, management, and data handling functionality.
Data preparation and management for BioML-bench.
This module provides high-level functions for downloading and preparing datasets from various sources (Kaggle, Polaris, etc.) using a pluggable data source architecture.
download_and_prepare_dataset ¶
download_and_prepare_dataset(
task,
keep_raw=True,
overwrite_checksums=False,
overwrite_leaderboard=False,
skip_verification=False,
)
Download and prepare a dataset using the appropriate data source.
Args:
    task: Task to prepare
    keep_raw: Whether to keep raw downloaded data
    overwrite_checksums: Whether to overwrite existing checksums
    overwrite_leaderboard: Whether to overwrite existing leaderboard
    skip_verification: Whether to skip checksum verification
is_dataset_prepared ¶
Checks if the task has non-empty public and private directories with the expected files.
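A minimal sketch of this kind of check, not the library's implementation. The helper name `dirs_look_prepared` and its two path parameters are hypothetical, and the "expected files" part is left out because this page does not say which files are expected.

```python
from pathlib import Path


def dirs_look_prepared(public_dir: Path, private_dir: Path) -> bool:
    """Return True if both directories exist and each holds at least one entry.

    Hypothetical helper for illustration only; it does not check for
    specific file names.
    """
    for directory in (public_dir, private_dir):
        if not directory.is_dir():
            return False
        # any() stops at the first entry, so large directories stay cheap to check.
        if not any(directory.iterdir()):
            return False
    return True
```

A stricter version could also compare the directory contents against a list of required file names before returning True.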
ensure_leaderboard_exists ¶
Ensures the leaderboard for a given task exists.
Args:
    task: Task to ensure leaderboard for
    force: Whether to force download/update of leaderboard
Returns: Path to the leaderboard file
Raises: FileNotFoundError: If leaderboard cannot be found or created
is_valid_prepare_fn ¶
Checks if the preparer_fn takes three arguments: raw, public and private, in that order.
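One way such a check could be written uses `inspect.signature` from the standard library. This is a sketch under assumptions: the name `takes_raw_public_private` is hypothetical, and the real function may accept or reject edge cases (defaults, keyword-only parameters) differently.

```python
import inspect
from typing import Callable


def takes_raw_public_private(fn: Callable) -> bool:
    """Return True if fn's parameter names are exactly raw, public, private.

    Hypothetical helper for illustration; only names and order are compared.
    """
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Not callable, or a callable whose signature cannot be introspected.
        return False
    # parameters is an ordered mapping, so list() preserves declaration order.
    return list(params) == ["raw", "public", "private"]
```

Comparing the ordered parameter names catches both a wrong count and a wrong order in a single step.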
generate_checksums ¶
Generate checksums for the files directly under the target directory with the specified extensions.
Args:
    target_dir: Directory to generate checksums for.
    exts: List of file extensions to generate checksums for.
    exclude: List of file paths to exclude from checksum generation.
Returns: A dictionary mapping each file to its checksum.
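The behaviour described above can be sketched with `hashlib` and `pathlib`. Several details here are assumptions, not facts from this page: the function name `checksums_for_dir`, the choice of SHA-256 as the hash, and the use of the bare file name as the dictionary key.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Optional


def checksums_for_dir(
    target_dir: Path,
    exts: Iterable[str],
    exclude: Optional[Iterable[Path]] = None,
) -> Dict[str, str]:
    """Hash files directly under target_dir whose extension is in exts.

    Hypothetical sketch: SHA-256 and file-name keys are assumptions.
    """
    wanted = {e.lower().lstrip(".") for e in exts}
    skipped = {Path(p).resolve() for p in (exclude or [])}
    result: Dict[str, str] = {}
    # iterdir() is not recursive, so only direct children are considered.
    for path in sorted(target_dir.iterdir()):
        if not path.is_file():
            continue
        if path.suffix.lower().lstrip(".") not in wanted:
            continue
        if path.resolve() in skipped:
            continue
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large files are never loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        result[path.name] = digest.hexdigest()
    return result
```

Reading in chunks keeps memory use flat regardless of file size, which matters for large dataset archives.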
file_cache ¶
A decorator that caches results of a function with a Path argument, invalidating the cache when the file is modified.
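A minimal sketch of a decorator with this behaviour, assuming that "modified" is detected through the file's modification time. The name `mtime_cache` is hypothetical, and the real `file_cache` may key or invalidate its cache differently.

```python
import functools
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def mtime_cache(fn: Callable[[Path], Any]) -> Callable[[Path], Any]:
    """Cache fn(path) per file and recompute when the file's mtime changes.

    Hypothetical sketch; invalidation by modification time is an assumption.
    """
    cache: Dict[Path, Tuple[int, Any]] = {}

    @functools.wraps(fn)
    def wrapper(path: Path) -> Any:
        key = Path(path).resolve()
        mtime = key.stat().st_mtime_ns
        hit = cache.get(key)
        if hit is not None and hit[0] == mtime:
            # File unchanged since the cached call: reuse the stored result.
            return hit[1]
        value = fn(path)
        cache[key] = (mtime, value)
        return value

    return wrapper
```

Storing the modification time next to the value means a stale entry is simply overwritten on the next call, so no separate eviction step is needed.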
prepare_human_baselines ¶
Prepare human baseline data for a task.
Args:
    task: Task to prepare human baselines for
    force: Whether to force re-download of human baselines
Returns: Path to human baselines CSV file, or None if not available