Add BoTorch kernel preset, which uses dimensions-scaled prior #483

Draft · wants to merge 21 commits into base: main

Commits (21):
39ad74c · Add direct arylation benchmark for TL with temperature as a task (Hrovatin, Feb 19, 2025)
863541f · Update changelog (Hrovatin, Feb 19, 2025)
520eab8 · remove random seed that was set in the paper as it is redundant with … (Hrovatin, Feb 19, 2025)
fd990b6 · Benchmark for transfer learning on arylhalides with dissimilar susbst… (Hrovatin, Feb 20, 2025)
4219451 · Transfer learning benchmark with inverted Hartmann functions as tasks (Hrovatin, Feb 20, 2025)
64475fd · Add non-transfer learning campaign and transfer learning campaign wit… (Hrovatin, Feb 20, 2025)
d47b295 · Transfer learning benchmark with noisy Michalewicz functions as tasks (Hrovatin, Feb 21, 2025)
8d20313 · Transfer learning benchmark with noisy Easom functions as tasks. (Hrovatin, Feb 21, 2025)
ad8cbe1 · Move data to benchmark folder (Hrovatin, Feb 21, 2025)
a4469a2 · restructure benchmark data access (Hrovatin, Feb 21, 2025)
efcf7af · Make data paths general (Hrovatin, Feb 21, 2025)
ac8d371 · Use csv instead of xlsx (Hrovatin, Feb 21, 2025)
deeda82 · Add BoTorch kernel preset, which uses dimensions-scaled prior (Hrovatin, Feb 11, 2025)
66a63cf · pre-commit fixes (Hrovatin, Feb 12, 2025)
6d929e5 · add to changelog (Hrovatin, Feb 12, 2025)
afcd803 · Add a few botorch kernel preset benchmarks and adapt scripts for a te… (Hrovatin, Feb 21, 2025)
88044e4 · Set N repeat iterations (Hrovatin, Feb 21, 2025)
0b44ff7 · Added more benchmarks (Hrovatin, Feb 21, 2025)
eec73a7 · Add benchmark (Hrovatin, Feb 21, 2025)
c1bb99e · Define benchmarks to run (Hrovatin, Feb 21, 2025)
1b46aa6 · Reduce number of replicates to speed up benchmark time (Hrovatin, Feb 21, 2025)
2 changes: 2 additions & 0 deletions CHANGELOG.md
@AdrianSosic (Collaborator) commented on Feb 28, 2025:
Hi @Hrovatin, just a few comments upfront, to save us some work later:

  • The point of the botorch preset should not be to re-create the same prior logic in our code, but to make sure that the GPs are simply called without any additional arguments. The noticeable difference is that the latter automatically i) avoids any potential bugs on our end and ii) adapts whenever botorch makes changes.
  • The preset should affect not only the kernel priors but all priors. This is also automatically fulfilled by not passing anything to the constructor.
  • I suggest splitting the PR into two: one that brings the preset (which we want anyway, regardless of benchmark performance) and another that adds the benchmarks. Otherwise, this will be a ton of code to review.

Cheers 🙃
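[Editor's note: the first bullet's "just don't pass anything" principle can be sketched abstractly. The classes below are dummies for illustration only, not baybe or botorch API:]

```python
class UpstreamModel:
    """Stand-in for a third-party model whose defaults may evolve upstream."""

    def __init__(self, prior=None):
        # When no prior is given, the library chooses its current default.
        self.prior = prior if prior is not None else "library-default-prior"


# Re-implementing the default locally risks silently drifting out of sync:
duplicated = UpstreamModel(prior="hand-copied-prior")

# Passing nothing delegates the choice and tracks upstream changes for free:
delegated = UpstreamModel()
print(delegated.prior)  # "library-default-prior"
```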

@Hrovatin (Collaborator, Author) replied:
Ah, then I completely misunderstood the aim. Let's align during the meeting next week.

Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `BCUT2D` encoding for `SubstanceParameter`
- Stored benchmarking results now include the Python environment and version
- `qPSTD` acquisition function
- Additional benchmarks
- BoTorch kernel presets

### Changed
- Acquisition function indicator `is_mc` has been removed in favor of new indicators
Expand Down
7 changes: 7 additions & 0 deletions baybe/kernels/basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from attrs.converters import optional as optional_c
from attrs.validators import ge, gt, in_, instance_of
from attrs.validators import optional as optional_v
from gpytorch.constraints import Interval
from typing_extensions import override

from baybe.kernels.base import BasicKernel
Expand Down Expand Up @@ -180,6 +181,12 @@ class RBFKernel(BasicKernel):
)
"""An optional initial value for the kernel lengthscale."""

# TODO replace with baybe constraint if possible
lengthscale_constraint: Interval | None = field(
default=None, validator=optional_v(instance_of(Interval))
)
"""An optional prior on the kernel lengthscale constraint."""


@define(frozen=True)
class RFFKernel(BasicKernel):
Expand Down
2 changes: 2 additions & 0 deletions baybe/surrogates/gaussian_process/presets/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"""Gaussian process surrogate presets."""

from baybe.surrogates.gaussian_process.presets.botorch import BotorchKernelFactory
from baybe.surrogates.gaussian_process.presets.core import (
GaussianProcessPreset,
make_gp_from_preset,
Expand All @@ -10,6 +11,7 @@
__all__ = [
"DefaultKernelFactory",
"EDBOKernelFactory",
"BotorchKernelFactory",
"make_gp_from_preset",
"GaussianProcessPreset",
]
56 changes: 56 additions & 0 deletions baybe/surrogates/gaussian_process/presets/botorch.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
"""Presets adapted from BoTorch."""

from __future__ import annotations

from math import log, sqrt
from typing import TYPE_CHECKING

from attrs import define
from gpytorch.constraints import GreaterThan
from typing_extensions import override

from baybe.kernels.basic import RBFKernel
from baybe.parameters import TaskParameter
from baybe.priors.basic import LogNormalPrior
from baybe.searchspace import SearchSpace
from baybe.surrogates.gaussian_process.kernel_factory import KernelFactory

if TYPE_CHECKING:
from torch import Tensor

from baybe.kernels.base import Kernel


@define
class BotorchKernelFactory(KernelFactory):
"""A kernel factory for Gaussian process surrogates adapted from BoTorch.

References:
* https://github.com/pytorch/botorch/blob/a018a5ffbcbface6229d6c39f7ac6ef9baf5765e/botorch/models/multitask.py#L220
* https://github.com/pytorch/botorch/blob/a018a5ffbcbface6229d6c39f7ac6ef9baf5765e/botorch/models/utils/gpytorch_modules.py#L100

"""

@override
def __call__(
self, searchspace: SearchSpace, train_x: Tensor, train_y: Tensor
) -> Kernel:
ard_num_dims = train_x.shape[-1] - len(
[
param
for param in searchspace.discrete.parameters
if isinstance(param, TaskParameter)
]
)
lengthscale_prior = LogNormalPrior(
loc=sqrt(2) + log(ard_num_dims) * 0.5, scale=sqrt(3)
)

return RBFKernel(
lengthscale_prior=lengthscale_prior,
lengthscale_constraint=GreaterThan(
2.5e-2,
transform=None,
initial_value=lengthscale_prior.to_gpytorch().mode,
),
)
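[Editor's note: the factory above builds a dimension-scaled LogNormal lengthscale prior with loc = √2 + log(d)/2 and scale = √3, matching the referenced BoTorch utilities. A small standalone sketch (not the PR's code) shows how the prior's mode, used above as the initial lengthscale, grows with the input dimension d:]

```python
from math import exp, log, sqrt


def dim_scaled_prior_params(d: int) -> tuple[float, float]:
    """(loc, scale) of the dimension-scaled LogNormal lengthscale prior."""
    return sqrt(2) + 0.5 * log(d), sqrt(3)


def lognormal_mode(loc: float, scale: float) -> float:
    """Mode of LogNormal(loc, scale), i.e. exp(loc - scale**2)."""
    return exp(loc - scale**2)


# The mode scales as sqrt(d) * exp(sqrt(2) - 3): larger search spaces start
# with larger lengthscales, i.e. smoother prior assumptions per dimension.
for d in (1, 4, 16):
    loc, scale = dim_scaled_prior_params(d)
    print(d, lognormal_mode(loc, scale))
```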
4,133 changes: 4,133 additions & 0 deletions benchmarks/data/ArylHalides/data.csv

Large diffs are not rendered by default.

4,600 changes: 4,600 additions & 0 deletions benchmarks/data/ArylHalides/data_raw.csv

Large diffs are not rendered by default.

1,729 changes: 1,729 additions & 0 deletions benchmarks/data/DirectArylation/data.csv

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions benchmarks/data/utils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
"""Utils for reading data."""

import os

DATA_PATH = os.path.dirname(__file__) + os.sep
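[Editor's note: the split/join expression in `utils.py` is equivalent to taking the file's parent directory and re-appending the separator. A quick check under an assumed, hypothetical file path:]

```python
import os
from pathlib import Path

# Hypothetical __file__ value, for illustration only.
fake_file = os.sep.join(["repo", "benchmarks", "data", "utils.py"])

# The PR's split/join approach:
legacy = os.sep.join(fake_file.split(os.sep)[:-1]) + os.sep

# Equivalent spellings via os.path and pathlib:
via_dirname = os.path.dirname(fake_file) + os.sep
via_pathlib = str(Path(fake_file).parent) + os.sep

print(legacy == via_dirname == via_pathlib)  # True
```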
20 changes: 18 additions & 2 deletions benchmarks/domains/__init__.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,26 @@
"""Benchmark domains."""

from benchmarks.definition.base import Benchmark
from benchmarks.domains.synthetic_2C1D_1C import synthetic_2C1D_1C_benchmark
from benchmarks.domains.kernel_presets.arylhalides_tl_substance import (
arylhalides_tl_substance_benchmark,
)
from benchmarks.domains.kernel_presets.direct_arylation_tl_temp import (
direct_arylation_tl_temp_benchmark,
)
from benchmarks.domains.kernel_presets.easom_tl_noise import easom_tl_noise_benchmark
from benchmarks.domains.kernel_presets.hartmann_tl_inverted_noise import (
hartmann_tl_inverted_noise_benchmark,
)
from benchmarks.domains.kernel_presets.michalewicz_tl_noise import (
michalewicz_tl_noise_benchmark,
)

BENCHMARKS: list[Benchmark] = [
synthetic_2C1D_1C_benchmark,
hartmann_tl_inverted_noise_benchmark,
easom_tl_noise_benchmark,
michalewicz_tl_noise_benchmark,
arylhalides_tl_substance_benchmark,
direct_arylation_tl_temp_benchmark,
]

__all__ = ["BENCHMARKS"]
165 changes: 165 additions & 0 deletions benchmarks/domains/arylhalides_tl_substance.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
"""Benchmark on ArylHalides data with two distinct arylhalides as TL tasks."""

from __future__ import annotations

import os

import pandas as pd

from baybe.campaign import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import SubstanceParameter, TaskParameter
from baybe.searchspace import SearchSpace
from baybe.simulation import simulate_scenarios
from baybe.targets import NumericalTarget
from benchmarks.data.utils import DATA_PATH
from benchmarks.definition import (
ConvergenceBenchmark,
ConvergenceBenchmarkSettings,
)


def get_data() -> pd.DataFrame:
"""Load data for benchmark.

Returns:
Data for benchmark.
"""
data_path = DATA_PATH + "ArylHalides" + os.sep
data = pd.read_table(data_path + "data.csv", sep=",")
data_raw = pd.read_table(data_path + "data_raw.csv", sep=",")
for substance in ["base", "ligand", "additive"]:
data[substance + "_smiles"] = data[substance].map(
dict(zip(data_raw[substance], data_raw[substance + "_smiles"]))
)
return data
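[Editor's note: the `dict(zip(...))` pattern in `get_data` builds a name-to-SMILES lookup from the raw table and applies it to the processed table. A toy stand-in with hypothetical labels (not the dataset's real values) makes the mechanics explicit:]

```python
# Column in `data`: substance names, possibly repeated across rows.
data_names = ["b1", "b2", "b1", "b3"]
# Columns in `data_raw`: one row per substance, name paired with its SMILES.
raw_names = ["b1", "b2", "b3"]
raw_smiles = ["smiles_b1", "smiles_b2", "smiles_b3"]

# dict(zip(...)) turns the two raw columns into a lookup table,
# and mapping it over the data column mirrors pandas' Series.map(dict(...)).
lookup = dict(zip(raw_names, raw_smiles))
mapped = [lookup[name] for name in data_names]
print(mapped)
```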


data = get_data()

test_task = "1-iodo-4-methoxybenzene"
source_task = [
# Dissimilar source task
"1-chloro-4-(trifluoromethyl)benzene"
]


def space_data() -> tuple[
SingleTargetObjective,
SearchSpace,
SearchSpace,
pd.DataFrame,
pd.DataFrame,
]:
"""Definition of search space, objective, and data.

Returns:
Objective, TL search space, non-TL search space,
pre-measured task data (source task),
and lookup for the active (target) task.
"""
data_params = [
SubstanceParameter(
name=substance,
data=dict(zip(data[substance], data[f"{substance}_smiles"])),
encoding="MORDRED",
)
for substance in ["base", "ligand", "additive"]
]

task_param = TaskParameter(
name="aryl_halide",
values=[test_task] + source_task,
active_values=[test_task],
)

objective = SingleTargetObjective(NumericalTarget(name="yield", mode="MAX"))
searchspace = SearchSpace.from_product(parameters=[*data_params, task_param])
searchspace_nontl = SearchSpace.from_product(parameters=data_params)

lookup = data.query(f'aryl_halide=="{test_task}"').copy(deep=True)
initial_data = data.query("aryl_halide.isin(@source_task)", engine="python").copy(
deep=True
)

return objective, searchspace, searchspace_nontl, initial_data, lookup


def arylhalides_tl_substance(settings: ConvergenceBenchmarkSettings) -> pd.DataFrame:
"""Benchmark function comparing TL and non-TL campaigns.

Inputs:
base Discrete substance with numerical encoding
ligand Discrete substance with numerical encoding
additive Discrete substance with numerical encoding
aryl_halide Discrete task parameter
Output: continuous
Objective: Maximization
Optimal Inputs: [
{
base MTBD
ligand AdBrettPhos
additive N,N-dibenzylisoxazol-3-amine
}
]
Optimal Output: 68.24812709999999
"""
objective, searchspace, searchspace_nontl, initial_data, lookup = space_data()

campaign = Campaign(
searchspace=searchspace,
objective=objective,
)

results = []
for p in [0.01, 0.02, 0.05, 0.1, 0.2]:
results.append(
simulate_scenarios(
{f"{int(100 * p)}": campaign},
lookup,
initial_data=[
initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
],
batch_size=settings.batch_size,
n_doe_iterations=settings.n_doe_iterations,
impute_mode="error",
)
)
# No training data
results.append(
simulate_scenarios(
{"0": campaign},
lookup,
batch_size=settings.batch_size,
n_doe_iterations=settings.n_doe_iterations,
n_mc_iterations=settings.n_mc_iterations,
impute_mode="error",
)
)
# Non-TL campaign
results.append(
simulate_scenarios(
{"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
lookup,
batch_size=settings.batch_size,
n_doe_iterations=settings.n_doe_iterations,
n_mc_iterations=settings.n_mc_iterations,
impute_mode="error",
)
)
results = pd.concat(results)
return results


benchmark_config = ConvergenceBenchmarkSettings(
batch_size=2,
n_doe_iterations=10,
n_mc_iterations=100,
)

arylhalides_tl_substance_benchmark = ConvergenceBenchmark(
function=arylhalides_tl_substance,
optimal_target_values={"yield": 68.24812709999999},
settings=benchmark_config,
)
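[Editor's note: the benchmark sweeps source-data fractions, drawing `n_mc_iterations` independent subsamples per fraction and labeling each scenario with the percentage. The bookkeeping can be sketched with the standard library (pool size and repeat count are hypothetical; `random.sample` stands in for `DataFrame.sample(frac=p)`):]

```python
import random

n_source = 1000                        # hypothetical source-task pool size
fractions = [0.01, 0.02, 0.05, 0.1, 0.2]
n_mc = 3                               # Monte Carlo repeats per fraction

pool = list(range(n_source))

# For each fraction p, draw n_mc independent subsamples of the source pool,
# keyed by the percentage label used for the scenario name.
subsamples = {
    f"{int(100 * p)}": [random.sample(pool, int(n_source * p)) for _ in range(n_mc)]
    for p in fractions
}
print(sorted(subsamples, key=int))
```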