Datasets

Olympus provides various datasets from across the natural sciences that form the basis of realistic and challenging benchmarks for optimization algorithms. Models trained on these datasets provide Emulators that are used to simulate an experimental campaign.

Pre-trained Emulators based on these datasets can be loaded directly, but you can also load the datasets themselves with the Dataset class:

from olympus.datasets import Dataset
dataset = Dataset(kind='snar')

The datasets currently available are the following:

No.	Dataset	Kind Keyword	Objective	Goal
1	Alkoxylation	alkox	reaction rate	Max
2	Colors Bob	colors_bob	green-ness	Min
3	Colors N9	colors_n9	green-ness	Min
4	Buckminsterfullerene adducts	fullerenes	yield of X1+X2	Max
5	HPLC	hplc	peak area	Max
6	Photobleaching PCE10	photo_pce10	stability	Min
7	Photobleaching WF3	photo_wf3	stability	Min
8	SnAr reaction	snar	e_factor	Min
9	N-benzylation	benzylation	e_factor	Min
10	Suzuki reaction	suzuki	yield	Max
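The Goal column gives the direction in which each objective is optimized. As a minimal sketch that does not depend on Olympus, the table can be held as a plain mapping from kind keyword to goal. The names `GOALS` and `is_improvement` are hypothetical helpers introduced here for illustration and are not part of the Olympus API.

```python
# Goal for each kind keyword, copied from the table above.
GOALS = {
    "alkox": "max",
    "colors_bob": "min",
    "colors_n9": "min",
    "fullerenes": "max",
    "hplc": "max",
    "photo_pce10": "min",
    "photo_wf3": "min",
    "snar": "min",
    "benzylation": "min",
    "suzuki": "max",
}


def is_improvement(kind, new_value, best_value):
    """Return True if new_value beats best_value for the dataset's goal."""
    if GOALS[kind] == "max":
        return new_value > best_value
    return new_value < best_value


# A lower e_factor is better for 'snar'; a lower yield is worse for 'suzuki'.
print(is_improvement("snar", 0.5, 0.8))
print(is_improvement("suzuki", 0.5, 0.8))
```

Keeping the direction explicit avoids comparing results from a minimization benchmark as if larger values were better.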

In addition to the Olympus datasets, you can load your own custom ones:

from olympus.datasets import Dataset
import pandas as pd

mydata = pd.read_csv('mydata.csv')
dataset = Dataset(data=mydata)
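According to the constructor signature documented in the Dataset Class section, custom data can also be passed as raw rows together with `columns` and `target_ids`. The sketch below shows one way such inputs might be prepared using only the standard library. The CSV content, the column names, and the helpers `read_table` and `to_olympus_dataset` are assumptions made up for illustration, and the Olympus call itself is not exercised here.

```python
import csv
import io

# Hypothetical CSV content standing in for 'mydata.csv'; the column names
# and values are invented for this example.
CSV_TEXT = """temperature,concentration,yield
25.0,0.10,0.42
40.0,0.25,0.61
60.0,0.50,0.77
"""


def read_table(text):
    """Parse CSV text into (column names, list of numeric rows)."""
    reader = csv.reader(io.StringIO(text))
    columns = next(reader)  # the first line holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return columns, rows


def to_olympus_dataset(columns, rows, target_ids):
    """Build a Dataset from raw rows (requires Olympus to be installed)."""
    from olympus.datasets import Dataset  # imported lazily on purpose

    return Dataset(data=rows, columns=columns, target_ids=target_ids)


columns, rows = read_table(CSV_TEXT)
print(columns)
print(len(rows), "rows")
```

With Olympus installed, calling `to_olympus_dataset(columns, rows, target_ids=["yield"])` would mark the hypothetical `yield` column as the target and treat the remaining columns as features.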

Dataset Class

class olympus.datasets.Dataset(kind=None, data=None, columns=None, target_ids=None, test_frac=0.2, num_folds=5, random_seed=None)

A Dataset object stores the data of a dataset by wrapping a pandas.DataFrame in its data attribute. It also provides additional information on the dataset, as well as convenience methods to access features and targets and to generate training/validation/test splits.

Parameters

kind (str) – kind of the Olympus dataset to load.
data (array) – custom dataset. Same input as for pandas.DataFrame.
columns (list) – column names. Same input as for pandas.DataFrame.
target_ids (list) – list of column indices, or column names if these are provided, that identify the targets for the predictions.
test_frac (float) – fraction of the data to be used as test set.
num_folds (int) – number of cross validation folds the training set will be split into.
random_seed (int) – random seed for numpy. Setting a seed makes the random splits reproducible.
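The interplay of test_frac, num_folds, and random_seed can be illustrated with a small standalone sketch. This is not the Olympus implementation (which, per the description above, seeds numpy); it uses the standard library `random` module instead, and the function names `split_indices` and `cv_fold` are hypothetical.

```python
import random


def split_indices(n_rows, test_frac=0.2, num_folds=5, random_seed=None):
    """Shuffle row indices, hold out a test set, and fold the remainder."""
    rng = random.Random(random_seed)  # a fixed seed gives reproducible splits
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_test = int(round(n_rows * test_frac))
    test = indices[:n_test]
    train = indices[n_test:]

    # Deal the training indices round-robin into num_folds validation folds.
    folds = [train[i::num_folds] for i in range(num_folds)]
    return train, test, folds


def cv_fold(train, folds, fold):
    """Return (training indices, validation indices) for one fold."""
    valid = folds[fold]
    held_out = set(valid)
    fold_train = [i for i in train if i not in held_out]
    return fold_train, valid


train, test, folds = split_indices(50, test_frac=0.2, num_folds=5, random_seed=42)
fold_train, fold_valid = cv_fold(train, folds, fold=0)
print(len(train), len(test), len(fold_train), len(fold_valid))
```

With 50 rows and the default settings, 10 rows are held out for testing and the remaining 40 are divided into five validation folds of 8 rows each. Repeating the call with the same seed yields the same split.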

Methods

`dataset_info()`	Provide summary info about dataset.
`set_param_space(param_space)`	Define the parameter space of the dataset.
`get_cv_fold(fold)`	Get the data for a specific cross-validation fold.