Data Sources API

Data source integration for downloading datasets from external platforms.

Data sources package for BioML-bench.

This package provides pluggable data source implementations for different platforms including Kaggle, Polaris, HTTP, and local files.

The factory pattern allows easy registration and creation of data sources based on configuration.

DataSource

Bases: ABC

Abstract base class for data sources.

download abstractmethod

download(source_config, data_dir)

Download data from the source.

Args:
    source_config: Configuration specific to this data source
    data_dir: Directory to download data to

Returns: Path to downloaded data (or None if no single file)

Raises: DataSourceError: If download fails

get_leaderboard abstractmethod

get_leaderboard(source_config)

Get the public leaderboard for the benchmark.

Args: source_config: Configuration specific to this data source

Returns: DataFrame with columns: teamName, score, submissionDate

Raises: DataSourceError: If leaderboard cannot be retrieved

get_human_baselines

get_human_baselines(source_config)

Extract human baseline performance data.

Args: source_config: Configuration specific to this data source

Returns: DataFrame with columns: team_name, score, human_type, source. Returns None if no human baselines are available.

Raises: DataSourceError: If human baseline extraction fails

supports_human_baselines

supports_human_baselines()

Check if this data source can provide human baseline data.
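
The contract above can be sketched as follows. This is a minimal stand-in, not the package's actual code: the type hints, the default return values of `get_human_baselines` and `supports_human_baselines` (`None` and `False`), and the `InMemoryDataSource` subclass are all assumptions made for illustration. The leaderboard is returned as a list of dicts here to keep the sketch dependency-free, whereas the documented method returns a DataFrame.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class DataSource(ABC):
    """Illustrative stand-in for the documented abstract base class."""

    @abstractmethod
    def download(self, source_config: Dict[str, Any], data_dir: Path) -> Optional[Path]:
        """Fetch data into data_dir; return its path, or None if no single file."""

    @abstractmethod
    def get_leaderboard(self, source_config: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return leaderboard rows (the documented method returns a DataFrame)."""

    def get_human_baselines(self, source_config: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
        # Assumed default: a source without human baselines returns None.
        return None

    def supports_human_baselines(self) -> bool:
        # Assumed default: subclasses opt in by overriding this.
        return False


class InMemoryDataSource(DataSource):
    """Hypothetical subclass that writes a tiny CSV instead of downloading."""

    def download(self, source_config, data_dir):
        target = Path(data_dir) / "data.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("id,value\n1,0.5\n")
        return target

    def get_leaderboard(self, source_config):
        # Keys mirror the documented leaderboard columns.
        return [{"teamName": "example", "score": 0.9, "submissionDate": "2024-01-01"}]
```

Because `download` and `get_leaderboard` are abstract, instantiating `DataSource` directly raises `TypeError`; only subclasses that implement both can be created.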

DataSourceConfigError

DataSourceConfigError(message, source_type=None)

Bases: DataSourceError

Exception raised when data source configuration is invalid.

DataSourceError

DataSourceError(message, source_type=None)

Bases: Exception

Exception raised for data source related errors.

DataSourceNotFoundError

DataSourceNotFoundError(message, source_type=None)

Bases: DataSourceError

Exception raised when a data source type is not found.
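
The three exceptions form a small hierarchy rooted at `DataSourceError`, so a single `except DataSourceError` clause catches all of them. The sketch below reproduces the documented signatures; storing the arguments as `message` and `source_type` attributes is an assumption, as is the `describe` helper.

```python
from typing import Optional


class DataSourceError(Exception):
    """Stand-in for the documented base exception."""

    def __init__(self, message: str, source_type: Optional[str] = None):
        super().__init__(message)
        # Attribute names are assumed; the reference only shows the signature.
        self.message = message
        self.source_type = source_type


class DataSourceConfigError(DataSourceError):
    """Raised when a data source configuration is invalid."""


class DataSourceNotFoundError(DataSourceError):
    """Raised when a data source type is not found."""


def describe(exc: DataSourceError) -> str:
    """Hypothetical helper: format any data source error for logging."""
    origin = exc.source_type or "unknown source"
    return f"[{origin}] {type(exc).__name__}: {exc.message}"


try:
    raise DataSourceConfigError("missing 'benchmark_id'", source_type="polaris")
except DataSourceError as exc:  # the base class catches both subclasses
    summary = describe(exc)
```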

DataSourceFactory

Factory for creating data source instances.

Automatically discovers and registers data source classes, then creates instances based on source type configuration.

register classmethod

register(source_type, source_class)

Register a data source class.

Args:
    source_type: String identifier for the source type
    source_class: Data source class to register

create classmethod

create(source_type)

Create a data source instance.

Args: source_type: Type of data source to create

Returns: Configured data source instance

Raises: DataSourceNotFoundError: If source type is not registered

list_available classmethod

list_available()

List all available data source types.

Returns: List of registered source type strings

is_available classmethod

is_available(source_type)

Check if a data source type is available.

Args: source_type: Source type to check

Returns: True if source type is registered
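
A registry-backed factory with these four class methods could look like the sketch below. It is an illustration of the pattern, not the package's implementation: the `_registry` dictionary, the sorted output of `list_available`, the error message, and the `EchoSource` class are assumptions, and the automatic discovery step mentioned above is left out.

```python
from typing import Dict, List


class DataSourceNotFoundError(Exception):
    """Stand-in for the documented exception."""


class DataSourceFactory:
    """Illustrative registry-based factory."""

    # Maps source type strings to classes (assumed internal storage).
    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, source_type: str, source_class: type) -> None:
        cls._registry[source_type] = source_class

    @classmethod
    def create(cls, source_type: str):
        try:
            source_class = cls._registry[source_type]
        except KeyError:
            raise DataSourceNotFoundError(
                f"Unknown data source type: {source_type!r}"
            ) from None
        return source_class()

    @classmethod
    def list_available(cls) -> List[str]:
        return sorted(cls._registry)

    @classmethod
    def is_available(cls, source_type: str) -> bool:
        return source_type in cls._registry


class EchoSource:
    """Hypothetical data source used only to exercise the factory."""


DataSourceFactory.register("echo", EchoSource)
source = DataSourceFactory.create("echo")
```

Calling `create` with an unregistered type raises `DataSourceNotFoundError` rather than returning `None`, which matches the documented behaviour of `create`.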

KaggleDataSource

Bases: DataSource

Data source for Kaggle competitions.

Downloads competition data and leaderboards using the Kaggle API.

validate_config

validate_config(source_config)

Validate Kaggle source configuration.

Args: source_config: Should contain 'benchmark_id' key

Returns: True if valid

Raises: DataSourceConfigError: If configuration is invalid
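
The actual validation rules are not shown in this reference. As a rough sketch, a check of this kind might verify that the configuration is a dictionary and that the required key holds a non-empty string. The key is passed as a parameter because the Kaggle methods on this page mention both 'benchmark_id' and 'competition_id'; the function below is a free-standing illustration, not the method itself.

```python
from typing import Any, Dict


class DataSourceConfigError(Exception):
    """Stand-in for the documented exception."""


def validate_config(source_config: Dict[str, Any], required_key: str = "benchmark_id") -> bool:
    """Return True if the config holds a non-empty string under required_key."""
    if not isinstance(source_config, dict):
        raise DataSourceConfigError("source_config must be a dictionary")
    value = source_config.get(required_key)
    if not isinstance(value, str) or not value.strip():
        raise DataSourceConfigError(f"source_config must contain a non-empty {required_key!r}")
    return True
```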

download

download(source_config, data_dir)

Download competition data from Kaggle.

Args:
    source_config: Must contain 'competition_id'
    data_dir: Directory to download data to

Returns: Path to downloaded zip file

Raises: DataSourceError: If download fails

get_leaderboard

get_leaderboard(source_config)

Get leaderboard from Kaggle competition.

Args: source_config: Must contain 'benchmark_id'

Returns: DataFrame with leaderboard data

Raises: DataSourceError: If leaderboard cannot be retrieved

get_human_baselines

get_human_baselines(source_config)

Extract human baselines from Kaggle public leaderboard.

Filters the public leaderboard to identify likely human participants and categorizes them by performance level.

supports_human_baselines

supports_human_baselines()

Kaggle supports human baseline extraction from public leaderboards.

ManualDataSource

Bases: DataSource

Data source for manual tasks that handle their own data preparation.

This data source is used for tasks where the data preparation is handled entirely by the task's prepare.py script, without needing to download data from external sources.

download

download(source_config, data_dir)

Manual tasks don't download data; preparation is handled by prepare.py.

Args:
    source_config: Configuration (not used for manual tasks)
    data_dir: Directory where data would be stored (not used)

Returns: None since no download occurs

get_leaderboard

get_leaderboard(source_config)

Manual tasks don't have external leaderboards.

Args: source_config: Configuration (not used for manual tasks)

Returns: Empty DataFrame with expected leaderboard columns

PolarisDataSource

Bases: DataSource

Data source for Polaris Hub benchmarks.

Downloads benchmark data and provides leaderboard information from the Polaris platform using the polarishub conda environment.

validate_config

validate_config(source_config)

Validate Polaris source configuration.

Args: source_config: Should contain 'benchmark_id' key

Returns: True if valid

Raises: DataSourceConfigError: If configuration is invalid

download

download(source_config, data_dir)

Download benchmark data from Polaris Hub.

Args:
    source_config: Must contain 'benchmark_id'
    data_dir: Directory to save data to

Returns: Path to the data directory (Polaris doesn't use zip files)

Raises: DataSourceError: If download fails

get_leaderboard

get_leaderboard(source_config)

Get leaderboard from Polaris Hub by scraping the website.

Args: source_config: Must contain 'benchmark_id'

Returns: DataFrame with leaderboard data (teamName, score, submissionDate columns)

Raises: DataSourceError: If leaderboard cannot be retrieved

ProteinGymDMSDataSource

Bases: DataSource

Data source for the ProteinGym deep mutational scanning (DMS) benchmark.

validate_config

validate_config(source_config)

Validate ProteinGym source configuration.

Args: source_config: May contain optional 'dataset_name' key

Returns: True if valid

Raises: ValueError: If configuration is invalid

download

download(source_config, data_dir)

Download ProteinGym DMS data (single substitutions, multi substitutions, and indels) and extract to a shared location. The zip files are downloaded once and extracted once to a shared directory that all tasks can access.

Args:
    source_config: Configuration for the data source, must contain 'benchmark_id'
    data_dir: Directory to download data to (not used for extraction)

Returns: Path to the task's raw directory (containing the specific dataset CSV file).

Raises: DataSourceError: If download fails
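
The "extract once to a shared directory" behaviour can be illustrated with a marker file: after the first successful extraction the marker is written, and later calls return immediately. This is one possible approach under stated assumptions, not necessarily how ProteinGymDMSDataSource does it; the `extract_once` function and the `.extracted` marker name are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path: Path, shared_dir: Path) -> Path:
    """Extract zip_path into shared_dir unless a previous call already did."""
    marker = shared_dir / ".extracted"  # hypothetical marker file
    if marker.exists():
        # A previous call already extracted the archive; reuse its output.
        return shared_dir
    shared_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(shared_dir)
    # Written last, so an interrupted extraction is retried on the next call.
    marker.touch()
    return shared_dir
```

Every task can then call `extract_once` with the same shared directory and read its own CSV from the result without triggering a second extraction.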

get_leaderboard

get_leaderboard(source_config)

Get leaderboard from ProteinGym.

Args: source_config: Configuration for the data source, must contain 'benchmark_id'

Returns: DataFrame with columns: teamName, score, submissionDate. If dataset_name is specified, returns only that dataset's row.

Raises: DataSourceError: If leaderboard cannot be retrieved

register_data_source

register_data_source(source_type)

Decorator to automatically register data source classes.

Args: source_type: String identifier for the source type

list_available_sources

list_available_sources()

Convenience function to list all available data source types.

Returns: List of registered data source type strings

create_data_source

create_data_source(source_type)

Convenience function to create a data source instance.

Args: source_type: Type of data source to create

Returns: Configured data source instance

Raises: DataSourceNotFoundError: If source type is not registered