Custom Dataset¶
In this example, we will load a dataset from scikit-learn and use it to create a custom Dataset object in Olympus.
[1]:
import pandas as pd
import numpy as np
from olympus import Dataset
[2]:
# load the boston dataset from sklearn
from sklearn.datasets import load_boston
boston = load_boston()
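Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the import above fails on recent versions. A minimal sketch of a workaround, adapted from the scikit-learn deprecation notice and assuming the original CMU StatLib source is still reachable, is to assemble an equivalent boston dictionary by hand:
# sketch: rebuild the Boston housing data without load_boston (scikit-learn >= 1.2)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = {
    'data': np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),   # 13 features
    'target': raw_df.values[1::2, 2],                                      # median house value
    'feature_names': feature_names,
}
The rest of the notebook only accesses boston['data'], boston['target'] and boston['feature_names'], so this dictionary can be used in place of the scikit-learn Bunch object.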
[3]:
# concatenate the features and targets into a single array and use it to create a pandas dataframe
data = np.c_[boston['data'], boston['target']]
columns = list(boston['feature_names'])
columns.append('target')
df = pd.DataFrame(data=data, columns=columns)
df
[3]:
  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
[4]:
# pass the DataFrame as the data argument for Dataset and specify which column is the target variable
dataset = Dataset(data=df, target_ids=['target'])
Now dataset is an instance of the Olympus class Dataset. However, before we can use it to train a custom Emulator, we need to specify the parameter space for this dataset/problem.
[5]:
from olympus import ParameterSpace, Parameter
# initialise a parameter space object
param_space = ParameterSpace()
# add all features in the dataset as a variable in the parameter space
for feature in dataset.features:
    low = np.min(dataset.data[feature])    # take the min in the data
    high = np.max(dataset.data[feature])   # take the max in the data
    param = Parameter(kind='continuous', name=feature, low=low, high=high)
    param_space.add(param)
dataset.set_param_space(param_space)
Note that in the above code we set the bounds of the parameters based on the min/max values found in the dataset. This can also be achieved by using the infer_param_space method of Dataset, as follows:
[6]:
dataset.infer_param_space()
However, most often you will want these bounds to depend on the details of your problem, in which case you can explicitly specify the bounds for all parameters, as sketched below.
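For instance, a minimal sketch that mixes hand-picked bounds with data-driven ones, using only the Parameter and ParameterSpace API shown above (the numbers in custom_bounds are purely hypothetical and only illustrate the pattern):
# purely illustrative bounds for a few features; all other features fall back
# to the min/max observed in the data
custom_bounds = {'CRIM': (0.0, 100.0), 'RM': (3.0, 9.0), 'NOX': (0.3, 0.9)}

param_space = ParameterSpace()
for feature in dataset.features:
    low, high = custom_bounds.get(
        feature, (np.min(dataset.data[feature]), np.max(dataset.data[feature]))
    )
    param_space.add(Parameter(kind='continuous', name=feature, low=low, high=high))
dataset.set_param_space(param_space)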
Now we define a small Bayesian Neural Network and test its performance in emulating this dataset. Note that, by default, Dataset creates 5 random folds for cross-validation and reserves 20% of the data for testing; a hedged sketch of how to adjust these defaults follows.
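If you need a different split, these defaults can presumably be changed when constructing the Dataset. The test_frac and num_folds keyword arguments below are assumptions, so check the Dataset signature in your Olympus version:
# assumption: test_frac and num_folds control the held-out fraction and the
# number of cross-validation folds (verify against your Olympus version)
dataset = Dataset(data=df, target_ids=['target'], test_frac=0.2, num_folds=5)
dataset.set_param_space(param_space)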
[7]:
from olympus import Emulator
from olympus.models import BayesNeuralNet
mymodel = BayesNeuralNet(hidden_depth=2, hidden_nodes=12, hidden_act='leaky_relu', out_act='relu',
                         batch_size=50, reg=0.005, max_epochs=10000)
emulator = Emulator(dataset=dataset, model=mymodel, feature_transform='normalize', target_transform='normalize')
[8]:
emulator.train()
[INFO] >>> Training model on 80% of the dataset, testing on 20%...
[INFO] =======================================================================
[INFO] Epoch Train R2 Train RMSD Test R2 Test RMSD
[INFO] =======================================================================
[INFO] 0 -4.011 0.483 -5.083 0.404 *
[INFO] 100 -4.011 0.483 -5.083 0.404
[INFO] 200 -4.011 0.483 -5.083 0.404
[INFO] 300 -4.011 0.483 -5.083 0.404
[INFO] 400 -4.011 0.483 -5.083 0.404
[INFO] 500 -4.011 0.483 -5.083 0.404
[INFO] 600 -4.011 0.483 -5.083 0.404
[INFO] 700 -4.011 0.483 -5.083 0.404
[INFO] 800 -4.011 0.483 -5.083 0.404
[INFO] 900 -4.011 0.483 -5.083 0.404
[INFO] 1000 -4.011 0.483 -5.083 0.404
[INFO] 1100 -4.011 0.483 -5.083 0.404
[INFO] 1200 -3.730 0.470 -4.805 0.394 *
[INFO] 1300 0.392 0.168 0.536 0.111 *
[INFO] 1400 0.579 0.140 0.655 0.096 *
[INFO] 1500 0.617 0.134 0.721 0.086 *
[INFO] 1600 0.696 0.119 0.753 0.081 *
[INFO] 1700 0.665 0.125 0.788 0.075 *
[INFO] 1800 0.681 0.122 0.796 0.074 *
[INFO] 1900 0.704 0.118 0.795 0.074
[INFO] 2000 0.769 0.104 0.787 0.076
[INFO] 2100 0.792 0.099 0.796 0.074
[INFO] 2200 0.795 0.098 0.800 0.073 *
[INFO] 2300 0.807 0.095 0.782 0.076
[INFO] 2400 0.775 0.102 0.829 0.068 *
[INFO] 2500 0.801 0.096 0.805 0.072
[INFO] 2600 0.836 0.088 0.808 0.072
[INFO] 2700 0.834 0.088 0.807 0.072
[INFO] 2800 0.825 0.090 0.812 0.071
[INFO] 2900 0.841 0.086 0.796 0.074
[INFO] 3000 0.830 0.089 0.805 0.072
[INFO] 3100 0.842 0.086 0.821 0.069
[INFO] 3200 0.848 0.084 0.824 0.069
[INFO] 3300 0.843 0.086 0.805 0.072
[INFO] 3400 0.862 0.080 0.811 0.071
[INFO] 3500 0.861 0.080 0.794 0.074
[INFO] 3600 0.880 0.075 0.796 0.074
[INFO] 3700 0.872 0.077 0.805 0.072
[INFO] 3800 0.883 0.074 0.806 0.072
[INFO] 3900 0.881 0.074 0.810 0.071
[INFO] 4000 0.877 0.076 0.816 0.070
[INFO] 4100 0.882 0.074 0.824 0.069
[INFO] 4200 0.887 0.073 0.815 0.070
[INFO] 4300 0.882 0.074 0.815 0.070
[INFO] 4400 0.861 0.080 0.806 0.072
[INFO] 4500 0.888 0.072 0.813 0.071
[INFO] 4600 0.889 0.072 0.809 0.071
[INFO] 4700 0.901 0.068 0.820 0.069
[INFO] 4800 0.887 0.073 0.825 0.068
[INFO] 4900 0.899 0.069 0.805 0.072
[INFO] 5000 0.909 0.065 0.819 0.070
[INFO] 5100 0.906 0.066 0.822 0.069
[INFO] 5200 0.919 0.061 0.827 0.068
[INFO] 5300 0.914 0.063 0.824 0.069
[INFO] 5400 0.917 0.062 0.833 0.067 *
[INFO] 5500 0.915 0.063 0.835 0.066 *
[INFO] 5600 0.920 0.061 0.846 0.064 *
[INFO] 5700 0.926 0.059 0.851 0.063 *
[INFO] 5800 0.927 0.058 0.844 0.065
[INFO] 5900 0.929 0.058 0.851 0.063
[INFO] 6000 0.922 0.060 0.846 0.064
[INFO] 6100 0.932 0.057 0.848 0.064
[INFO] 6200 0.931 0.057 0.861 0.061 *
[INFO] 6300 0.929 0.057 0.863 0.061 *
[INFO] 6400 0.930 0.057 0.866 0.060 *
[INFO] 6500 0.934 0.056 0.865 0.060
[INFO] 6600 0.930 0.057 0.857 0.062
[INFO] 6700 0.930 0.057 0.878 0.057 *
[INFO] 6800 0.935 0.055 0.859 0.061
[INFO] 6900 0.930 0.057 0.876 0.058
[INFO] 7000 0.930 0.057 0.890 0.054 *
[INFO] 7100 0.927 0.058 0.882 0.056
[INFO] 7200 0.938 0.054 0.872 0.059
[INFO] 7300 0.935 0.055 0.870 0.059
[INFO] 7400 0.932 0.056 0.887 0.055
[INFO] 7500 0.936 0.055 0.872 0.059
[INFO] 7600 0.928 0.058 0.886 0.055
[INFO] 7700 0.934 0.055 0.876 0.058
[INFO] 7800 0.946 0.050 0.866 0.060
[INFO] 7900 0.944 0.051 0.880 0.057
[INFO] 8000 0.941 0.052 0.878 0.057
[INFO] 8100 0.942 0.052 0.884 0.056
[INFO] 8200 0.934 0.056 0.873 0.058
[INFO] 8300 0.939 0.053 0.879 0.057
[INFO] 8400 0.935 0.055 0.872 0.059
[INFO] 8500 0.941 0.053 0.879 0.057
[INFO] 8600 0.943 0.051 0.890 0.054 *
[INFO] 8700 0.931 0.057 0.896 0.053 *
[INFO] 8800 0.936 0.055 0.889 0.055
[INFO] 8900 0.934 0.056 0.881 0.056
[INFO] 9000 0.942 0.052 0.880 0.057
[INFO] 9100 0.940 0.053 0.879 0.057
[INFO] 9200 0.938 0.054 0.877 0.057
[INFO] 9300 0.944 0.051 0.883 0.056
[INFO] 9400 0.937 0.054 0.882 0.056
[INFO] 9500 0.932 0.056 0.904 0.051 *
[INFO] 9600 0.942 0.052 0.889 0.055
[INFO] 9700 0.946 0.050 0.885 0.055
[INFO] 9800 0.944 0.051 0.890 0.054
[INFO] 9900 0.939 0.053 0.883 0.056
[INFO] Training completed in 10.18 seconds.
[INFO] ===========================================================================
[INFO] Train R2 Score: 0.9322
[INFO] Test R2 Score: 0.9044
[INFO] Train RMSD Score: 0.0562
[INFO] Test RMSD Score: 0.0506
[8]:
{'train_r2': 0.932159417784379,
'test_r2': 0.9044223215989219,
'train_rmsd': 0.05624565811984018,
'test_rmsd': 0.05059158705037693}
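The trained emulator can now be queried like an experiment. As a small sketch, assuming Emulator.run accepts an array of feature values (check the Emulator documentation in your Olympus version), we can ask for a prediction at the first sample in the dataset:
# take the feature values of the first sample and request an emulated measurement;
# the exact return format of run() may vary between Olympus versions
sample = df.drop(columns=['target']).iloc[0].to_numpy()
prediction = emulator.run(sample)
print(prediction)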
Let’s now say you would like to share this dataset with the community by adding it to the collection of Olympus datasets. You can do this with the upload command line tool in Olympus, as described in the documentation. However, you first need to prepare the dataset in the expected format. One easy way to do this is to use the to_disk method available to Dataset objects.
[9]:
# save dataset to disk
dataset.to_disk('custom_dataset')
[10]:
!ls custom_dataset/
config.json data.csv description.txt