Datasets¶
Utilities for downloading or generating datasets, splitting data, and computing accuracy metrics.
- equadratures.datasets.gen_friedman(n_observations=100, n_dim=5, noise=0.0, random_seed=None, normalise=False)[source]¶
Generates the Friedman regression problem described by Friedman [1] and Breiman [2]. Inspired by sklearn.datasets.make_friedman1. The function has n_dim=5, and choosing n_dim>5 adds irrelevant input dimensions. Gaussian noise with standard deviation noise is added.
- Parameters
n_observations (int, optional) – The number of observations (samples).
n_dim (int, optional) – The total number of dimensions. n_dim>=5, with n_dim>5 adding irrelevant input dimensions.
noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.
random_seed (int, optional) – Random number generator seed.
normalise (bool, optional) – Normalise y to lie between -1 and 1.
- Returns
Tuple (X,y) containing two numpy.ndarray’s; one with shape (n_observations,n_dim) containing the inputs, and one with shape (n_observations,1) containing the outputs/targets.
- Return type
tuple
References
[1] Friedman, “Multivariate adaptive regression splines”, The Annals of Statistics 19 (1), pages 1-67, 1991.
[2] Breiman, “Bagging predictors”, Machine Learning 24, pages 123-140, 1996.
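The Friedman #1 problem has a fixed closed form, with only the first five input dimensions influencing the output. A minimal numpy sketch of that underlying function (the name `friedman_sketch` is illustrative, not the library routine) might look like:

```python
import numpy as np

def friedman_sketch(n_observations=100, n_dim=5, noise=0.0, random_seed=None):
    # Illustrative sketch of the Friedman #1 function: only the first
    # five columns of X affect y; any extra columns are irrelevant.
    rng = np.random.default_rng(random_seed)
    X = rng.uniform(size=(n_observations, n_dim))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + noise * rng.standard_normal(n_observations))
    return X, y.reshape(-1, 1)

X, y = friedman_sketch(n_observations=50, n_dim=7, noise=0.1, random_seed=1)
print(X.shape, y.shape)  # (50, 7) (50, 1)
```

Columns 5 and 6 here play the role of the "irrelevant" dimensions added when n_dim>5.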
- equadratures.datasets.gen_linear(n_observations=100, n_dim=5, n_relevent=5, bias=0.0, noise=0.0, random_seed=None)[source]¶
Generates a synthetic linear dataset for regression. Data is generated using a random linear regression model with n_relevent input dimensions. The remaining dimensions are “irrelevant” noise, i.e. they do not affect the output. Gaussian noise with standard deviation noise is added.
- Parameters
n_observations (int, optional) – The number of observations (samples).
n_dim (int, optional) – The total number of dimensions.
n_relevent (int, optional) – The number of relevant input dimensions, i.e. the number of features used to build the linear model used to generate the output.
bias (float, optional) – The bias term in the underlying linear model.
noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.
random_seed (int, optional) – Random number generator seed.
- Returns
Tuple (X,y) containing two numpy.ndarray’s; one with shape (n_observations,n_dim) containing the inputs, and one with shape (n_observations,1) containing the outputs/targets.
- Return type
tuple
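A random linear model of this kind can be sketched in a few lines of numpy (the helper name `linear_sketch` is illustrative, not the library implementation): random weights are drawn for the first n_relevent columns and the rest are left at zero, so they never influence the output.

```python
import numpy as np

def linear_sketch(n_observations=100, n_dim=5, n_relevent=5,
                  bias=0.0, noise=0.0, random_seed=None):
    # Illustrative sketch: random weights on the first n_relevent
    # dimensions; the remaining columns do not affect the output.
    rng = np.random.default_rng(random_seed)
    X = rng.uniform(size=(n_observations, n_dim))
    w = np.zeros(n_dim)
    w[:n_relevent] = rng.standard_normal(n_relevent)
    y = X @ w + bias + noise * rng.standard_normal(n_observations)
    return X, y.reshape(-1, 1)
```

With noise=0 and n_relevent=0 the output collapses to the bias term, which makes the role of each argument easy to check.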
- equadratures.datasets.load_eq_dataset(dataset, data_dir=None)[source]¶
Loads the requested dataset from the equadratures dataset repository.
Visit the aforementioned repo for a description of the available datasets.
The requested dataset can either be downloaded directly upon request, or, to minimise downloads, the repo can be cloned once by the user and the local repo directory given via data_dir (see examples).
- Parameters
dataset (str) – The name of the dataset to load.
data_dir (str, optional) – Path to a locally cloned copy of the dataset repository.
- Returns
NpzFile instance (see numpy.lib.format) containing the dataset. Contents can be accessed in the usual way, e.g. X = NpzFile['X'].
- Return type
NpzFile
Examples
- Loading from remote repository
>>> # Load the naca0012 aerofoil dataset
>>> data = eq.datasets.load_eq_dataset('naca0012')
>>> print(data.files)
['X', 'Cp', 'Cl', 'Cd']
>>> X = data['X']
>>> y = data['Cp']
- Loading from a locally cloned repository
>>> git clone https://github.com/Effective-Quadratures/data-sets.git
>>> data = eq.datasets.load_eq_dataset('naca0012', data_dir='/Users/user/Documents/data-sets')
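Since the return value is a standard numpy NpzFile, it behaves the same as any archive produced by numpy.savez. This standalone sketch (using a temporary file rather than the real repository) shows the access pattern:

```python
import os
import tempfile
import numpy as np

# Build a small .npz archive standing in for a downloaded dataset;
# the keys 'X' and 'Cp' mirror the naca0012 example above.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.npz')
np.savez(tmp, X=np.ones((4, 2)), Cp=np.zeros((4, 1)))

data = np.load(tmp)            # returns an NpzFile instance
print(sorted(data.files))      # ['Cp', 'X']
X = data['X']                  # arrays are accessed by key, like a dict
```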
- equadratures.datasets.score(y_true, y_pred, metric='r2', X=None)[source]¶
Evaluates the accuracy/error score between predictions and the truth, according to the given accuracy metric.
- Parameters
y_true (numpy.ndarray) – Array with shape (number_of_observations, 1) containing the true data.
y_pred (numpy.ndarray) – Array with shape (number_of_observations, 1) containing predictions.
metric (str, optional) – The scoring metric to use. Available options are: `adjusted_r2`, `r2`, `mae`, `rmse`, or `normalised_mae`.
X (numpy.ndarray, optional) – The input data associated with y_pred. Required if metric=`adjusted_r2`.
- Returns
The accuracy or error score.
- Return type
float
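The `r2` and `adjusted_r2` metrics follow standard definitions; a minimal numpy sketch of both (the helper names are illustrative, not the library implementation) clarifies why `adjusted_r2` needs X — the adjustment penalises by the number of input dimensions:

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2_sketch(y_true, y_pred, X):
    # Adjusted r2 penalises by p, the number of input dimensions
    # (columns of X), hence X is required for this metric.
    n, p = X.shape
    r2 = r2_sketch(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

Perfect predictions give a score of 1, while predicting the mean of y_true gives an r2 of 0.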
- equadratures.datasets.train_test_split(X, y, train=0.7, random_seed=None, shuffle=True)[source]¶
Split arrays or matrices into random train and test subsets.
Inspired by sklearn.model_selection.train_test_split.
- Parameters
X (numpy.ndarray) – Array with shape (n_observations,n_dim) containing the inputs.
y (numpy.ndarray) – Array with shape (n_observations,1) containing the outputs/targets.
train (float, optional) – Fraction between 0.0 and 1.0, representing the proportion of the dataset to include in the train split.
random_seed (int, optional) – Seed for random number generator.
shuffle (bool, optional) – Whether to shuffle the rows of data when splitting.
- Returns
Tuple (X_train, X_test, y_train, y_test) containing the split data, output as numpy.ndarray’s.
- Return type
tuple
Example
>>> X_train, X_test, y_train, y_test = eq.datasets.train_test_split(X, y,
...                                                                 train=0.8, random_seed=42)
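The mechanics of such a split are simple to sketch in numpy (the helper name `split_sketch` is illustrative, not the library routine): optionally shuffle the row indices, then slice at the train fraction.

```python
import numpy as np

def split_sketch(X, y, train=0.7, random_seed=None, shuffle=True):
    # Optionally shuffle row indices, then slice at the train fraction.
    n = X.shape[0]
    idx = np.arange(n)
    if shuffle:
        np.random.default_rng(random_seed).shuffle(idx)
    n_train = int(round(train * n))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], X[te], y[tr], y[te]
```

Indexing X and y with the same permuted index array keeps each input row paired with its output after shuffling.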