Data Sources API¶
Data source integration for downloading datasets from external platforms.
Data sources package for BioML-bench.
This package provides pluggable data source implementations for different platforms including Kaggle, Polaris, HTTP, and local files.
A factory registers each data source class under a source-type string and creates instances of it based on configuration.
DataSource ¶
Bases: ABC
Abstract base class for data sources.
download (abstractmethod) ¶
Download data from the source.
Args: source_config: Configuration specific to this data source; data_dir: Directory to download data to
Returns: Path to downloaded data (or None if no single file)
Raises: DataSourceError: If download fails
get_leaderboard (abstractmethod) ¶
Get the public leaderboard for the benchmark.
Args: source_config: Configuration specific to this data source
Returns: DataFrame with columns: teamName, score, submissionDate
Raises: DataSourceError: If leaderboard cannot be retrieved
get_human_baselines ¶
Extract human baseline performance data.
Args: source_config: Configuration specific to this data source
Returns: DataFrame with columns: team_name, score, human_type, source; or None if no human baselines are available
Raises: DataSourceError: If human baseline extraction fails
supports_human_baselines ¶
Check if this data source can provide human baseline data.
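A new source would subclass the base class and implement both abstract methods. The sketch below is illustrative only: `DataSource` and `DataSourceError` here are simplified stand-ins rather than the package's own classes, `LocalFileSource` and its `path` config key are hypothetical, the default return values of `supports_human_baselines` and `get_human_baselines` are assumptions, and the leaderboard is a list of row dicts instead of a DataFrame to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class DataSourceError(Exception):
    """Stand-in for the package's DataSourceError."""


class DataSource(ABC):
    """Simplified stand-in for the abstract base class described above."""

    @abstractmethod
    def download(self, source_config: Dict[str, Any], data_dir: Path) -> Optional[Path]:
        """Download data and return its path, or None if there is no single file."""

    @abstractmethod
    def get_leaderboard(self, source_config: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return leaderboard rows (the documented method returns a DataFrame)."""

    def supports_human_baselines(self) -> bool:
        # Assumed default: sources opt in by overriding this method.
        return False

    def get_human_baselines(self, source_config: Dict[str, Any]) -> None:
        # Assumed default: no human baselines available.
        return None


class LocalFileSource(DataSource):
    """Hypothetical source that copies a file that is already on disk."""

    def download(self, source_config: Dict[str, Any], data_dir: Path) -> Optional[Path]:
        src = Path(source_config["path"])  # 'path' is a made-up config key
        if not src.is_file():
            raise DataSourceError(f"No such file: {src}")
        data_dir.mkdir(parents=True, exist_ok=True)
        dest = data_dir / src.name
        dest.write_bytes(src.read_bytes())
        return dest

    def get_leaderboard(self, source_config: Dict[str, Any]) -> List[Dict[str, Any]]:
        # A local file has no external leaderboard, so return no rows.
        return []
```

Because both methods are marked abstract, instantiating `DataSource` directly raises `TypeError`; only subclasses that implement `download` and `get_leaderboard` can be created.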
DataSourceConfigError ¶
DataSourceError ¶
Bases: Exception
Exception raised for data source related errors.
DataSourceNotFoundError ¶
DataSourceFactory ¶
Factory for creating data source instances.
Automatically discovers and registers data source classes, then creates instances based on source type configuration.
register (classmethod) ¶
Register a data source class.
Args: source_type: String identifier for the source type; source_class: Data source class to register
create (classmethod) ¶
Create a data source instance.
Args: source_type: Type of data source to create
Returns: Configured data source instance
Raises: DataSourceNotFoundError: If source type is not registered
list_available (classmethod) ¶
List all available data source types.
Returns: List of registered source type strings
is_available (classmethod) ¶
Check if a data source type is available.
Args: source_type: Source type to check
Returns: True if source type is registered
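The four class methods above amount to a registry keyed by source-type string. The following is a minimal sketch of that pattern, not the package's implementation: the `_sources` dict, the sorted listing, the error message format, and `DummySource` are all assumptions made for illustration, and automatic discovery is omitted.

```python
from typing import Dict, List, Type


class DataSourceNotFoundError(Exception):
    """Stand-in for the package's exception of the same name."""


class DataSourceFactory:
    """Simplified registry keyed by source-type string."""

    # Assumed storage: a class-level dict shared by all callers.
    _sources: Dict[str, Type] = {}

    @classmethod
    def register(cls, source_type: str, source_class: Type) -> None:
        cls._sources[source_type] = source_class

    @classmethod
    def create(cls, source_type: str):
        if source_type not in cls._sources:
            available = ", ".join(sorted(cls._sources)) or "none"
            raise DataSourceNotFoundError(
                f"Unknown source type '{source_type}' (available: {available})"
            )
        # Instantiate the registered class with no arguments.
        return cls._sources[source_type]()

    @classmethod
    def list_available(cls) -> List[str]:
        # Sorting is a choice made here for stable output.
        return sorted(cls._sources)

    @classmethod
    def is_available(cls, source_type: str) -> bool:
        return source_type in cls._sources


class DummySource:
    """Hypothetical source used only to exercise the registry."""


DataSourceFactory.register("dummy", DummySource)
```

With this sketch, `DataSourceFactory.create("dummy")` returns a `DummySource` instance, while an unregistered type raises `DataSourceNotFoundError`.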
KaggleDataSource ¶
Bases: DataSource
Data source for Kaggle competitions.
Downloads competition data and leaderboards using the Kaggle API.
validate_config ¶
Validate Kaggle source configuration.
Args: source_config: Should contain 'benchmark_id' key
Returns: True if valid
Raises: DataSourceConfigError: If configuration is invalid
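A validation step of this kind typically checks that the required key is present before any network call is made. The sketch below assumes the check is "key exists and holds a non-empty string"; the actual rules applied by `validate_config` are not documented here, and the standalone function and `DataSourceConfigError` stand-in are for illustration only.

```python
from typing import Any, Dict


class DataSourceConfigError(Exception):
    """Stand-in for the package's DataSourceConfigError."""


def validate_config(source_config: Dict[str, Any], required_key: str = "benchmark_id") -> bool:
    """Return True if required_key holds a non-empty string, else raise."""
    if not isinstance(source_config, dict):
        raise DataSourceConfigError("source_config must be a dict")
    value = source_config.get(required_key)
    # Assumed rule: the identifier must be a non-blank string.
    if not isinstance(value, str) or not value.strip():
        raise DataSourceConfigError(
            f"source_config must contain a non-empty '{required_key}'"
        )
    return True
```

Raising a dedicated configuration error, instead of returning False, lets callers distinguish a malformed task definition from a failed download.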
download ¶
Download competition data from Kaggle.
Args: source_config: Must contain 'competition_id'; data_dir: Directory to download data to
Returns: Path to downloaded zip file
Raises: DataSourceError: If download fails
get_leaderboard ¶
Get leaderboard from Kaggle competition.
Args: source_config: Must contain 'benchmark_id'
Returns: DataFrame with leaderboard data
Raises: DataSourceError: If leaderboard cannot be retrieved
get_human_baselines ¶
Extract human baselines from Kaggle public leaderboard.
Filters the public leaderboard to identify likely human participants and categorizes them by performance level.
supports_human_baselines ¶
Kaggle supports human baseline extraction from public leaderboards.
ManualDataSource ¶
Bases: DataSource
Data source for manual tasks that handle their own data preparation.
This data source is used for tasks where the data preparation is handled entirely by the task's prepare.py script, without needing to download data from external sources.
download ¶
Manual tasks don't download data; preparation is handled by prepare.py.
Args: source_config: Configuration (not used for manual tasks); data_dir: Directory where data would be stored (not used)
Returns: None since no download occurs
get_leaderboard ¶
Manual tasks don't have external leaderboards.
Args: source_config: Configuration (not used for manual tasks)
Returns: Empty DataFrame with expected leaderboard columns
PolarisDataSource ¶
Bases: DataSource
Data source for Polaris Hub benchmarks.
Downloads benchmark data and provides leaderboard information from the Polaris platform using the polarishub conda environment.
validate_config ¶
Validate Polaris source configuration.
Args: source_config: Should contain 'benchmark_id' key
Returns: True if valid
Raises: DataSourceConfigError: If configuration is invalid
download ¶
Download benchmark data from Polaris Hub.
Args: source_config: Must contain 'benchmark_id'; data_dir: Directory to save data to
Returns: Path to the data directory (Polaris doesn't use zip files)
Raises: DataSourceError: If download fails
get_leaderboard ¶
Get leaderboard from Polaris Hub by scraping the website.
Args: source_config: Must contain 'benchmark_id'
Returns: DataFrame with leaderboard data (teamName, score, submissionDate columns)
Raises: DataSourceError: If leaderboard cannot be retrieved
ProteinGymDMSDataSource ¶
Bases: DataSource
Data source for the ProteinGym deep mutational scanning (DMS) benchmark.
validate_config ¶
Validate ProteinGym source configuration.
Args: source_config: May contain optional 'dataset_name' key
Returns: True if valid
Raises: ValueError: If configuration is invalid
download ¶
Download ProteinGym DMS data (single substitutions, multiple substitutions, and indels) and extract it to a shared location. The zip files are downloaded once and extracted once to a shared directory that all tasks can access.
Args: source_config: Configuration for the data source, must contain 'benchmark_id'; data_dir: Directory to download data to (not used for extraction)
Returns: Path to the task's raw directory (containing the specific dataset CSV file).
Raises: DataSourceError: If download fails
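The "download once, extract once" behaviour described above can be sketched with a marker file that records a completed extraction. This is an assumption about one way to implement it, not the package's code: `ensure_extracted`, the `.extracted` marker name, and the caller-supplied `fetch` callable (which stands in for the real download step) are all hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Callable


def ensure_extracted(zip_path: Path, shared_dir: Path, fetch: Callable[[Path], None]) -> Path:
    """Fetch zip_path if missing, extract it into shared_dir once, return shared_dir."""
    marker = shared_dir / ".extracted"
    if marker.exists():
        # A previous task already extracted the archive; reuse it.
        return shared_dir
    if not zip_path.exists():
        zip_path.parent.mkdir(parents=True, exist_ok=True)
        fetch(zip_path)  # caller-supplied download step writes the zip here
    shared_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(shared_dir)
    # Written last, so an interrupted extraction is retried on the next call.
    marker.touch()
    return shared_dir
```

Each task can then call `ensure_extracted` with the same paths and pick its own dataset CSV out of the shared directory; only the first call pays for the download and extraction.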
get_leaderboard ¶
Get leaderboard from ProteinGym.
Args: source_config: Configuration for the data source, must contain 'benchmark_id'
Returns: DataFrame with columns: teamName, score, submissionDate. If dataset_name is specified, returns only that dataset's row
Raises: DataSourceError: If leaderboard cannot be retrieved
register_data_source ¶
Decorator to automatically register data source classes.
Args: source_type: String identifier for the source type
list_available_sources ¶
Convenience function to list all available data source types.
Returns: List of registered data source type strings
create_data_source ¶
Convenience function to create a data source instance.
Args: source_type: Type of data source to create
Returns: Configured data source instance
Raises: DataSourceNotFoundError: If source type is not registered