Datasets

Utilities for downloading or generating datasets, splitting data, and computing accuracy metrics.

equadratures.datasets.gen_friedman(n_observations=100, n_dim=5, noise=0.0, random_seed=None, normalise=False)[source]

Generates the Friedman regression problem described by Friedman [1] and Breiman [2].

Inspired by sklearn.datasets.make_friedman1. The function involves five input dimensions; choosing n_dim>5 adds irrelevant input dimensions. Gaussian noise with standard deviation noise is added to the output.

Parameters
  • n_observations (int, optional) – The number of observations (samples).

  • n_dim (int, optional) – The total number of dimensions. Must satisfy n_dim>=5; choosing n_dim>5 adds irrelevant input dimensions.

  • noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.

  • random_seed (int, optional) – Random number generator seed.

  • normalise (bool, optional) – Normalise y to lie between -1 and 1.

Returns

Tuple (X, y) containing two numpy.ndarrays: one with shape (n_observations, n_dim) containing the inputs, and one with shape (n_observations, 1) containing the outputs/targets.

Return type

tuple

References

    1. J. Friedman, “Multivariate adaptive regression splines”, The Annals of Statistics 19 (1), pages 1-67, 1991.

    2. L. Breiman, “Bagging predictors”, Machine Learning 24, pages 123-140, 1996.

equadratures.datasets.gen_linear(n_observations=100, n_dim=5, n_relevent=5, bias=0.0, noise=0.0, random_seed=None)[source]

Generate a synthetic linear dataset for regression.

Data is generated using a random linear regression model with n_relevent input dimensions. The remaining dimensions are “irrelevant” noise, i.e., they do not affect the output. Gaussian noise with standard deviation noise is added.

Parameters
  • n_observations (int, optional) – The number of observations (samples).

  • n_dim (int, optional) – The total number of dimensions.

  • n_relevent (int, optional) – The number of relevant input dimensions, i.e., the number of features used to build the linear model which generates the output.

  • bias (float, optional) – The bias term in the underlying linear model.

  • noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.

  • random_seed (int, optional) – Random number generator seed.

Returns

Tuple (X, y) containing two numpy.ndarrays: one with shape (n_observations, n_dim) containing the inputs, and one with shape (n_observations, 1) containing the outputs/targets.

Return type

tuple

equadratures.datasets.load_eq_dataset(dataset, data_dir=None)[source]

Loads the requested dataset from the equadratures dataset repository.

Visit the aforementioned repo for a description of the available datasets.

The requested dataset can either be downloaded directly upon request, or, to minimise downloads, the repo can be cloned once by the user and the local repo directory given via data_dir (see examples).

Parameters
  • dataset (str) – The dataset to download. Options are `naca0012`, `blade_envelopes`, `probes`, `3Dfan_blades`.

  • data_dir (str, optional) – Directory name where a local clone of the data-sets repo is located. If given, the dataset will be loaded from here instead of downloading from the remote repo.

Returns

NpzFile instance (see numpy.lib.format) containing the dataset. Contents can be accessed in the usual way e.g. X = NpzFile['X'].

Return type

NpzFile

Examples

Loading from remote repository
>>> # Load the naca0012 aerofoil dataset
>>> data = eq.datasets.load_eq_dataset('naca0012')
>>> print(data.files)
['X', 'Cp', 'Cl', 'Cd']
>>> X = data['X']
>>> y = data['Cp']
Loading from a locally cloned repository
$ git clone https://github.com/Effective-Quadratures/data-sets.git
>>> data = eq.datasets.load_eq_dataset('naca0012', data_dir='/Users/user/Documents/data-sets')
equadratures.datasets.score(y_true, y_pred, metric='r2', X=None)[source]

Evaluates the accuracy/error score between predictions and the truth, according to the given accuracy metric.

Parameters
  • y_true (numpy.ndarray) – Array with shape (number_of_observations, 1) containing the true data.

  • y_pred (numpy.ndarray) – Array with shape (number_of_observations, 1) containing the predictions.

  • metric (str, optional) – The scoring metric to use. Available options are: `adjusted_r2`, `r2`, `mae`, `rmse`, or `normalised_mae`.

  • X (numpy.ndarray, optional) – The input data associated with y_pred. Required when metric=`adjusted_r2`.

Returns

The accuracy or error score.

Return type

float

equadratures.datasets.train_test_split(X, y, train=0.7, random_seed=None, shuffle=True)[source]

Split arrays or matrices into random train and test subsets.

Inspired by sklearn.model_selection.train_test_split.

Parameters
  • X (numpy.ndarray) – Array with shape (n_observations,n_dim) containing the inputs.

  • y (numpy.ndarray) – Array with shape (n_observations,1) containing the outputs/targets.

  • train (float, optional) – Fraction between 0.0 and 1.0, representing the proportion of the dataset to include in the train split.

  • random_seed (int, optional) – Seed for random number generator.

  • shuffle (bool, optional) – Whether to shuffle the rows of data when splitting.

Returns

Tuple (X_train, X_test, y_train, y_test) containing the split data, output as numpy.ndarrays.

Return type

tuple

Example

>>> X_train, X_test, y_train, y_test = eq.datasets.train_test_split(X, y,
...                                    train=0.8, random_seed=42)