Datasets
Olympus provides various datasets from across the natural sciences that form the basis of realistic and challenging benchmarks for optimization algorithms. Models trained on these datasets provide Emulators that are used to simulate an experimental campaign.
Besides loading pre-trained Emulators based on these datasets, you can load the datasets themselves with the Dataset class:
from olympus.datasets import Dataset
dataset = Dataset(kind='snar')
The datasets currently available are the following:
No. | Dataset | Kind Keyword | Objective | Goal |
---|---|---|---|---|
1 | | alkox | reaction rate | Max |
2 | | colors_bob | green-ness | Min |
3 | | colors_n9 | green-ness | Min |
4 | | fullerenes | yield of X1+X2 | Max |
5 | | hplc | peak area | Max |
6 | | photo_pce10 | stability | Min |
7 | | photo_wf3 | stability | Min |
8 | | snar | e_factor | Min |
9 | | benzylation | e_factor | Min |
10 | | suzuki | yield | Max |
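The objectives and goals listed in the table can be captured in a small lookup, which is handy when a benchmark script needs to know whether a lower or higher value counts as an improvement. This is a minimal sketch in plain Python: the names `DATASET_GOALS` and `is_improvement` are hypothetical helpers written for illustration and are not part of Olympus.

```python
# Objective and goal per kind keyword, copied from the table above.
# DATASET_GOALS and is_improvement are illustrative names, not Olympus API.
DATASET_GOALS = {
    "alkox": ("reaction rate", "max"),
    "colors_bob": ("green-ness", "min"),
    "colors_n9": ("green-ness", "min"),
    "fullerenes": ("yield of X1+X2", "max"),
    "hplc": ("peak area", "max"),
    "photo_pce10": ("stability", "min"),
    "photo_wf3": ("stability", "min"),
    "snar": ("e_factor", "min"),
    "benzylation": ("e_factor", "min"),
    "suzuki": ("yield", "max"),
}


def is_improvement(kind, new_value, best_value):
    """Return True if new_value beats best_value for the given dataset kind."""
    _objective, goal = DATASET_GOALS[kind]
    if goal == "max":
        return new_value > best_value
    return new_value < best_value
```

For example, `is_improvement("snar", 0.5, 1.0)` is true because the e_factor is minimized, while the same values for `"suzuki"` give false because the yield is maximized.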
In addition to the Olympus datasets, you can load your own custom ones:
from olympus.datasets import Dataset
import pandas as pd
# Read the custom data from a CSV file into a DataFrame
mydata = pd.read_csv('mydata.csv')
# Wrap the DataFrame in an Olympus Dataset
dataset = Dataset(data=mydata)
Dataset Class
class olympus.datasets.Dataset(kind=None, data=None, columns=None, target_ids=None, test_frac=0.2, num_folds=5, random_seed=None)
A Dataset object stores the data of a dataset by wrapping a pandas.DataFrame in its data attribute, provides additional information on the dataset, and provides convenience methods to access features and targets as well as to generate training/validation/test splits.
Parameters
kind (str) – kind of the Olympus dataset to load.
data (array) – custom dataset. Same input as for pandas.DataFrame.
columns (list) – column names. Same input as for pandas.DataFrame.
target_ids (list) – list of column indices, or names if provided, that identify the targets for the predictions.
test_frac (float) – fraction of the data to be used as test set.
num_folds (int) – number of cross validation folds the training set will be split into.
random_seed (int) – random seed for numpy. Setting a seed makes the random splits reproducible.
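To make the interplay of test_frac, num_folds, and random_seed concrete, the sketch below holds out a test set and divides the remaining rows into cross-validation folds. It is an illustration in plain Python under stated assumptions, not the Olympus implementation: the function name `make_splits`, the rounding of the test-set size, and the use of Python's `random` module (the parameter description mentions numpy) are all choices made here for the example.

```python
import random


def make_splits(num_rows, test_frac=0.2, num_folds=5, random_seed=None):
    """Return (test, folds); folds is a list of (train, valid) index lists.

    Illustrative only: Olympus may shuffle, round, and split differently.
    """
    rng = random.Random(random_seed)  # a fixed seed gives reproducible splits
    indices = list(range(num_rows))
    rng.shuffle(indices)

    # Hold out a fraction of the rows as the test set.
    num_test = int(round(test_frac * num_rows))
    test = sorted(indices[:num_test])
    train_pool = indices[num_test:]

    # Each fold uses every num_folds-th remaining row for validation
    # and the rest of the remaining rows for training.
    folds = []
    for k in range(num_folds):
        valid = sorted(train_pool[k::num_folds])
        valid_set = set(valid)
        train = sorted(i for i in train_pool if i not in valid_set)
        folds.append((train, valid))
    return test, folds
```

With 100 rows and the default arguments, this yields 20 test rows and five folds of 64 training and 16 validation rows each. Calling the function twice with the same random_seed returns identical splits.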
Methods
dataset_info() – Provide summary info about dataset.
set_param_space(param_space) – Define the parameter space of the dataset.
get_cv_fold(fold) – Get the data for a specific cross-validation fold.
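The methods above could be combined as in the following sketch, which needs Olympus to be installed when the function is called. Only the constructor arguments and method names come from this page. The wrapper name `inspect_dataset` is hypothetical, the structure of the value returned by get_cv_fold is not documented here, and passing 0 as the fold assumes zero-based fold numbering.

```python
def inspect_dataset(kind="snar", fold=0):
    """Load an Olympus dataset, show its summary, and fetch one CV fold.

    Hypothetical helper; the import is deferred so that defining the
    function does not itself require Olympus.
    """
    from olympus.datasets import Dataset

    # num_folds and random_seed are constructor parameters listed above;
    # fixing the seed keeps the splits reproducible between runs.
    dataset = Dataset(kind=kind, num_folds=5, random_seed=0)

    # Whether the summary is printed or returned is not stated on this page.
    dataset.dataset_info()

    # The returned object is passed through unchanged because its exact
    # structure is not documented here.
    return dataset.get_cv_fold(fold)
```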