# Datasets

Generates the Friedman regression problem described by Friedman [1] and Breiman [2].

Inspired by `sklearn.datasets.make_friedman1`. The underlying function uses five input dimensions (`n_dim=5`); choosing `n_dim>5` adds irrelevant input dimensions that do not affect the output. Gaussian noise with standard deviation `noise` is added.

Parameters
• n_observations (int, optional) – The number of observations (samples).

• n_dim (int, optional) – The total number of dimensions. Must satisfy n_dim>=5; n_dim>5 adds irrelevant input dimensions.

• noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.

• random_seed (int, optional) – Random number generator seed.

• normalise (bool, optional) – Normalise y to lie between -1 and 1.

Returns

Tuple (X, y) containing two numpy.ndarray objects: one with shape (n_observations, n_dim) containing the inputs, and one with shape (n_observations, 1) containing the outputs/targets.

Return type

tuple

References

1. Friedman, “Multivariate adaptive regression splines”, The Annals of Statistics 19 (1), pages 1-67, 1991.

2. Breiman, “Bagging predictors”, Machine Learning 24, pages 123-140, 1996.
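
As a rough illustration of what such a generator does, the sketch below follows the formula documented for `sklearn.datasets.make_friedman1` (inputs uniform on [0, 1], output built from the first five columns only). The function name `friedman1_sketch` is hypothetical, and this is not the library's own source; it omits the `normalise` option.

```python
import numpy as np

def friedman1_sketch(n_observations=100, n_dim=5, noise=0.0, random_seed=None):
    """Hypothetical helper following the make_friedman1 formula (not library code)."""
    if n_dim < 5:
        raise ValueError("n_dim must be at least 5")
    rng = np.random.default_rng(random_seed)
    # Inputs are uniform on [0, 1]; columns beyond the fifth never enter y.
    X = rng.uniform(0.0, 1.0, size=(n_observations, n_dim))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + noise * rng.standard_normal(n_observations))
    # Return the targets as a column vector, matching the documented shape.
    return X, y.reshape(-1, 1)

X, y = friedman1_sketch(n_observations=200, n_dim=8, noise=0.1, random_seed=0)
print(X.shape, y.shape)
```

With `n_dim=8`, the last three columns of `X` are the irrelevant dimensions: they are sampled like the others but play no part in `y`.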

equadratures.datasets.gen_linear(n_observations=100, n_dim=5, n_relevent=5, bias=0.0, noise=0.0, random_seed=None)

Generate a synthetic linear dataset for regression.

Data is generated using a random linear regression model with `n_relevent` input dimensions. The remaining dimensions are “irrelevant” noise, i.e. they do not affect the output. Gaussian noise with standard deviation `noise` is added.

Parameters
• n_observations (int, optional) – The number of observations (samples).

• n_dim (int, optional) – The total number of dimensions.

• n_relevent (int, optional) – The number of relevant input dimensions, i.e., the number of features used to build the linear model that generates the output.

• bias (float, optional) – The bias term in the underlying linear model.

• noise (float, optional) – The standard deviation of the Gaussian noise applied to the output.

• random_seed (int, optional) – Random number generator seed.

Returns

Tuple (X, y) containing two numpy.ndarray objects: one with shape (n_observations, n_dim) containing the inputs, and one with shape (n_observations, 1) containing the outputs/targets.

Return type

tuple
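
The idea of relevant versus irrelevant dimensions can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the name `linear_sketch`, the standard-normal inputs, the uniform coefficients, and the random choice of which columns are relevant are all hypothetical. The coefficient vector is returned as an extra value purely so the effect can be inspected.

```python
import numpy as np

def linear_sketch(n_observations=100, n_dim=5, n_relevent=5,
                  bias=0.0, noise=0.0, random_seed=None):
    """Hypothetical helper; the sampling distributions are assumptions."""
    rng = np.random.default_rng(random_seed)
    X = rng.standard_normal((n_observations, n_dim))
    # Only n_relevent columns receive a non-zero coefficient; the rest stay at
    # zero and therefore cannot influence the output.
    coef = np.zeros(n_dim)
    relevant = rng.choice(n_dim, size=n_relevent, replace=False)
    coef[relevant] = rng.uniform(-1.0, 1.0, size=n_relevent)
    y = X @ coef + bias + noise * rng.standard_normal(n_observations)
    return X, y.reshape(-1, 1), coef

X, y, coef = linear_sketch(n_observations=50, n_dim=6, n_relevent=2,
                           bias=1.0, random_seed=0)
print(X.shape, y.shape, np.count_nonzero(coef))
```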

Visit the aforementioned repo for a description of the available datasets.

The requested dataset can either be downloaded directly on request or, to minimise downloads, loaded from a local clone of the repo whose directory is given via `data_dir` (see examples).

Parameters
• dataset (str) – The dataset to download. Options are `naca0012`, `blade_envelopes`, `probes`, `3Dfan_blades`.

• data_dir (str, optional) – Directory name where a local clone of the data-sets repo is located. If given, the dataset will be loaded from here instead of downloading from the remote repo.

• verbose (bool, optional) – Option to print verbose messages to screen.

Returns

NpzFile instance (see numpy.lib.format) containing the dataset. Contents can be accessed in the usual way e.g. `X = NpzFile['X']`.

Return type

NpzFile

Examples

```
>>> # Load the naca0012 aerofoil dataset
>>> data = eq.datasets.load_eq_dataset('naca0012')
>>> print(data.files)
['X', 'Cp', 'Cl', 'Cd']
>>> X = data['X']
>>> y = data['Cp']
```
```
$ git clone https://github.com/Effective-Quadratures/data-sets.git
```
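
To show how an `NpzFile` behaves without downloading anything, the sketch below writes a small stand-in archive with `numpy.savez` and reads it back with `numpy.load`. The array names mirror two of the naca0012 keys printed above, but the file name and values are placeholders, not the real dataset.

```python
import os
import tempfile
import numpy as np

# Write a small stand-in .npz archive (placeholder values, hypothetical file name).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_dataset.npz")
np.savez(path, X=np.zeros((4, 2)), Cp=np.ones((4, 1)))

# numpy.load returns an NpzFile for .npz archives; arrays are read on access.
with np.load(path) as data:
    keys = sorted(data.files)  # names of the stored arrays
    X = data["X"]
    y = data["Cp"]

print(keys, X.shape, y.shape)
```

Arrays pulled out of the archive are ordinary `numpy.ndarray` objects and remain usable after the file is closed.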

Evaluates the accuracy/error score between predictions and the truth, according to the given accuracy metric.

Parameters
• y_pred (numpy.ndarray) – Array with shape (number_of_observations, 1), containing predictions.

• y_true (numpy.ndarray) – Array with shape (number_of_observations, 1) containing the true data.

• metric (str, optional) – The scoring metric to use. Available options are: `adjusted_r2`, `r2`, `mae`, `rmse`, or `normalised_mae`.

• X (numpy.ndarray) – The input data associated with y_pred. Required if `metric='adjusted_r2'`.

Returns

The accuracy or error score.

Return type

float
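
For orientation, the listed metrics can be written out with their textbook definitions as below. This is a sketch, not the library's code: the function name `score_sketch` and its argument order are hypothetical, R² is computed as one minus the ratio of residual to total sum of squares, and dividing the MAE by the standard deviation of `y_true` for `normalised_mae` is an assumption. It does show why `X` is needed for `adjusted_r2`: the correction depends on the number of input dimensions.

```python
import numpy as np

def score_sketch(y_true, y_pred, metric="r2", X=None):
    """Hypothetical helper using textbook metric definitions (not library code)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    resid = y_true - y_pred
    if metric == "mae":
        return float(np.mean(np.abs(resid)))
    if metric == "normalised_mae":
        # Assumption: normalise by the standard deviation of the true values.
        return float(np.mean(np.abs(resid)) / np.std(y_true))
    if metric == "rmse":
        return float(np.sqrt(np.mean(resid ** 2)))
    if metric in ("r2", "adjusted_r2"):
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
        if metric == "r2":
            return float(r2)
        if X is None:
            raise ValueError("X is required for adjusted_r2")
        n, d = X.shape  # d = number of input dimensions
        return float(1.0 - (1.0 - r2) * (n - 1) / (n - d - 1))
    raise ValueError(f"unknown metric: {metric}")

y_true = np.array([[1.0], [2.0], [3.0], [4.0]])
y_pred = np.array([[1.5], [2.0], [2.5], [4.0]])
X = np.zeros((4, 1))
mae = score_sketch(y_true=y_true, y_pred=y_pred, metric="mae")
r2 = score_sketch(y_true=y_true, y_pred=y_pred, metric="r2")
adj_r2 = score_sketch(y_true=y_true, y_pred=y_pred, metric="adjusted_r2", X=X)
print(mae, r2, adj_r2)
```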

Split arrays or matrices into random train and test subsets.

Parameters
• X (numpy.ndarray) – Array with shape (n_observations,n_dim) containing the inputs.

• y (numpy.ndarray) – Array with shape (n_observations,1) containing the outputs/targets.

• train (float, optional) – Fraction between 0.0 and 1.0, representing the proportion of the dataset to include in the train split.

• random_seed (int, optional) – Seed for random number generator.

• shuffle (bool, optional) – Whether to shuffle the rows of data when splitting.

Returns

Tuple (X_train, X_test, y_train, y_test) containing the split data, output as numpy.ndarray objects.

Return type

tuple

Example

```
>>> X_train, X_test, y_train, y_test = eq.datasets.train_test_split(X, y,
...                                    train=0.8, random_seed=42)
```
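
The mechanics of such a split can be sketched with a row permutation, as below. This is an illustration only: the name `split_sketch`, the default `train=0.7`, and the rounding of the training-set size are assumptions and need not match the library.

```python
import numpy as np

def split_sketch(X, y, train=0.7, random_seed=None, shuffle=True):
    """Hypothetical helper: split rows into train and test subsets."""
    n = X.shape[0]
    idx = np.arange(n)
    if shuffle:
        # Permute the row indices so the split is random but reproducible.
        rng = np.random.default_rng(random_seed)
        idx = rng.permutation(n)
    n_train = int(round(train * n))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    # Index X and y with the same indices so rows stay paired.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float).reshape(10, 1)
X_train, X_test, y_train, y_test = split_sketch(X, y, train=0.8, random_seed=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Because the same index arrays are applied to `X` and `y`, each input row keeps its matching target after shuffling.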