Performance metrics

FairBench reports implement several definitions of fairness. These quantify imbalances between groups of people (e.g., different genders) in terms of the different assessments that the groups obtain under base performance metrics. The assessments are often further reduced across groups of samples with different sensitive attribute values.

Here, we present the base metrics that reports use to assess AI systems. All metrics are computed on a subset of 'sensitive samples', which form the group being examined each time. Outputs are wrapped into explainable objects that keep track of relevant metadata.

Classification
Ranking
Regression

Classification

Classification metrics assess binary predictions. Unless stated otherwise, the following arguments need to be provided:

Argument	Role	Values
predictions	system output	binary array
labels	prediction target	binary array
sensitive	sensitive attribute	fork of arrays with elements in \([0,1]\) (either binary or fuzzy)

`accuracy`

Computes the accuracy of binary predictions against the provided binary labels for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, true predictions

`pr`

Computes the positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\). This metric does not use the labels argument.

Explanation: number of samples, positive predictions

`positives`

Computes the number of positive predictions for sensitive data samples. Returns a float in the range \([0,\infty)\). This metric does not use the labels argument.

Explanation: number of samples
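The three label-light metrics above can be illustrated with a minimal NumPy sketch. This is not FairBench's implementation: the function names are hypothetical, a plain array stands in for one branch of the sensitive fork, and it is assumed that fuzzy membership values act as per-sample weights.

```python
import numpy as np

def _ratio(numerator, denominator):
    # Assumption: an empty group yields 0 instead of raising an error.
    return float(numerator / denominator) if denominator > 0 else 0.0

def group_accuracy(predictions, labels, sensitive):
    # Weighted share of group samples whose prediction equals the label.
    p, y, s = (np.asarray(a, dtype=float) for a in (predictions, labels, sensitive))
    correct = (p == y).astype(float)
    return _ratio((correct * s).sum(), s.sum())

def group_positive_rate(predictions, sensitive):
    # Weighted share of group samples that receive a positive prediction.
    p, s = (np.asarray(a, dtype=float) for a in (predictions, sensitive))
    return _ratio((p * s).sum(), s.sum())

def group_positives(predictions, sensitive):
    # Weighted count of positive predictions within the group.
    p, s = (np.asarray(a, dtype=float) for a in (predictions, sensitive))
    return float((p * s).sum())

predictions = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 1]
women = [1, 1, 1, 0, 0, 0]  # hypothetical binary group membership
men = [0, 0, 0, 1, 1, 1]

print(group_accuracy(predictions, labels, women))
print(group_accuracy(predictions, labels, men))
print(group_positive_rate(predictions, women))
print(group_positives(predictions, women))
```

In this toy data both groups receive the same positive rate, yet accuracy differs between them, which is the kind of imbalance that reports go on to quantify.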

`tpr`

Computes the true positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, number of positives, number of true positives

`tnr`

Computes the true negative rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, number of negatives, number of true negatives

`fpr`

Computes the false positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, number of negatives, number of false positives

`fnr`

Computes the false negative rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, number of positives, number of false negatives
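As a rough illustration of how the four rates relate, the sketch below computes them from the standard confusion-matrix definitions (tpr and fnr are normalized by the group's positive labels, tnr and fpr by its negative labels). The helper name `group_rates` is hypothetical, and treating fuzzy sensitive values as sample weights is an assumption rather than a statement about FairBench internals.

```python
import numpy as np

def group_rates(predictions, labels, sensitive):
    p, y, s = (np.asarray(a, dtype=float) for a in (predictions, labels, sensitive))
    positives = (y * s).sum()        # weighted positive labels in the group
    negatives = ((1 - y) * s).sum()  # weighted negative labels in the group
    tp = (p * y * s).sum()
    tn = ((1 - p) * (1 - y) * s).sum()
    fp = (p * (1 - y) * s).sum()
    fn = ((1 - p) * y * s).sum()

    def ratio(numerator, denominator):
        # Assumption: a missing class in the group yields 0.
        return float(numerator / denominator) if denominator > 0 else 0.0

    return {
        "tpr": ratio(tp, positives),
        "tnr": ratio(tn, negatives),
        "fpr": ratio(fp, negatives),
        "fnr": ratio(fn, positives),
    }

predictions = [1, 0, 1, 1, 0, 0]
labels = [1, 1, 0, 1, 0, 0]
sensitive = [1, 1, 1, 1, 0.5, 0]  # fuzzy membership: the fifth sample counts half

rates = group_rates(predictions, labels, sensitive)
print(rates)
```

Whenever the group contains both classes, tpr and fnr sum to one, and so do tnr and fpr.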

Ranking

Ranking metrics assess scores that aim to approach provided labels. The following arguments need to be provided:

Argument	Role	Values
scores	system output	array with elements in \([0,1]\)
labels	prediction target	binary array
sensitive	sensitive attribute	fork of arrays with elements in \([0,1]\) (either binary or fuzzy)

`auc`

Computes the area under the receiver operating characteristic (ROC) curve for sensitive data samples. Returns a float in the range \([0,1]\).

Explanation: number of samples, the receiver operating characteristic curve
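For intuition, the area under the ROC curve equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below applies that pairwise formulation within one group. The function name is hypothetical, only binary group membership is handled, and the library may compute the curve differently.

```python
import numpy as np

def group_auc(scores, labels, sensitive):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    member = np.asarray(sensitive) > 0  # assumption: binary membership only
    pos = scores[member & (labels == 1)]
    neg = scores[member & (labels == 0)]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined without both classes in the group
    diff = pos[:, None] - neg[None, :]  # every (positive, negative) pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.5]
labels = [1, 0, 1, 1, 0, 0]
group_a = [1, 1, 1, 0, 0, 0]
group_b = [0, 0, 0, 1, 1, 1]

print(group_auc(scores, labels, group_a))
print(group_auc(scores, labels, group_b))
```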

`phi`

Computes the score mass of sensitive data samples compared to the total scores. Returns a float in the range \([0,1]\).

Explanation: number of samples, sensitive scores
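Read literally, the description above is a ratio of two sums. A minimal sketch under that reading follows; the function name is hypothetical and fuzzy membership values are assumed to weight each score.

```python
import numpy as np

def score_mass(scores, sensitive):
    scores = np.asarray(scores, dtype=float)
    s = np.asarray(sensitive, dtype=float)
    total = scores.sum()
    # Assumption: all-zero scores yield 0 instead of a division error.
    return float((scores * s).sum() / total) if total > 0 else 0.0

scores = [0.5, 0.25, 0.25, 1.0]
sensitive = [1, 1, 0, 0]
print(score_mass(scores, sensitive))  # 0.375
```

Here the group holds half of the samples but only 37.5% of the total score mass.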

`tophr`

Computes the hit rate, i.e., precision, for a set number of top scores for sensitive data samples. This is used to assess recommendation systems. By default, the top-3 hit rate is analysed. Returns a float in the range \([0,1]\).

Explanation: number of samples, top scores, true top scores

Optional argument	Role	Values
top	parameter	integer in the range \([1,\infty)\)

`toprec`

Computes the recall for a set number of top scores for sensitive data samples. This is used to assess recommendation systems. By default, the top-3 recall is analysed. Returns a float in the range \([0,1]\).

Explanation: number of samples, top scores, true top scores

Optional argument	Role	Values
top	parameter	integer in the range \([1,\infty)\)

`topf1`

Computes the f1-score for a set number of top scores for sensitive data samples. This is the harmonic mean of the hit rate (`tophr`) and the recall (`toprec`), and is used to assess recommendation systems. By default, the top-3 f1 is analysed. Returns a float in the range \([0,1]\).

Explanation: number of samples, top scores, true top scores

Optional argument	Role	Values
top	parameter	integer in the range \([1,\infty)\)
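The relationship between the three top-k metrics can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, the top scores are assumed to be selected among all samples before counting is restricted to the sensitive group, and recall is assumed to be normalized by the number of positive labels in the group.

```python
import numpy as np

def top_k_metrics(scores, labels, sensitive, top=3):
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    s = np.asarray(sensitive, dtype=float)

    def ratio(numerator, denominator):
        # Assumption: an empty denominator yields 0.
        return float(numerator / denominator) if denominator > 0 else 0.0

    top_idx = np.argsort(-scores, kind="stable")[:top]  # indices of the top scores
    true_top = (s[top_idx] * y[top_idx]).sum()          # group hits among the top
    hit_rate = ratio(true_top, s[top_idx].sum())        # precision over group items in the top
    recall = ratio(true_top, (s * y).sum())             # share of group positives retrieved
    f1 = ratio(2 * hit_rate * recall, hit_rate + recall)
    return {"hr": hit_rate, "rec": recall, "f1": f1}

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 1]
sensitive = [1, 1, 0, 1, 1, 1]

result = top_k_metrics(scores, labels, sensitive, top=3)
print(result)
```

In this example two group members appear in the top 3 and one of them is a hit, so the hit rate is 1/2. The group has three positives overall, so recall is 1/3 and the harmonic mean is 0.4.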

`tophr`

Computes the average hit rate, i.e., average precision, across different numbers of top scores with correct predictions. By default, the top-3 average precision is computed. Returns a float in the range \([0,1]\).

Explanation: number of samples, top scores, hr curve

Optional argument	Role	Values
top	parameter	integer in the range \([1,\infty)\)

Regression

Regression metrics assess scores that aim to reproduce desired target scores. The following arguments need to be provided:

Argument	Role	Values
scores	system output	any float array
targets	prediction target	any float array
sensitive	sensitive attribute	fork of arrays with elements in \([0,1]\) (either binary or fuzzy)

`max_error`

Computes the maximum absolute error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).

Explanation: ---

`mae`

Computes the mean of the absolute error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).

Explanation: number of samples, sum of absolute errors

`mse`

Computes the mean of the squared error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).

Explanation: number of samples, sum of square errors

`rmse`

Computes the square root of `mse`, i.e., the root mean squared error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).

Explanation: number of samples, sum of square errors
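The four error metrics above differ only in how absolute errors are aggregated, which the following sketch makes explicit. The function name is hypothetical. It is assumed that fuzzy membership values weight the means and that the maximum is taken over samples with non-zero membership.

```python
import numpy as np

def group_errors(scores, targets, sensitive):
    x, y, s = (np.asarray(a, dtype=float) for a in (scores, targets, sensitive))
    samples = s.sum()
    if samples == 0:
        raise ValueError("the sensitive group has no samples")
    abs_err = np.abs(x - y)
    mse = float((abs_err ** 2 * s).sum() / samples)
    return {
        "max_error": float(abs_err[s > 0].max()),  # assumption: any member counts fully
        "mae": float((abs_err * s).sum() / samples),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }

scores = [1.0, 2.0, 4.0, 8.0]
targets = [1.0, 4.0, 1.0, 8.0]
sensitive = [1, 1, 1, 0]

errors = group_errors(scores, targets, sensitive)
print(errors)
```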

`r2`

Computes the r2 score between scores and target values, adjusted for the provided degrees of freedom (default is zero). Returns a float in the range \((-\infty,1]\), where larger values correspond to better estimation and models that output the mean are evaluated to zero.

Explanation: number of samples, sum of square errors, degrees of freedom

Optional argument	Role	Values
deg_freedom	parameter	integer in the range \([0,\infty)\)
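A small sketch can show why a model that outputs the mean target scores zero and how a degrees-of-freedom correction may enter. The function name is hypothetical, and the adjustment used here is the common textbook form \(1-(1-R^2)\frac{n-1}{n-1-d}\) for \(n\) samples and \(d\) degrees of freedom, assumed rather than taken from the library.

```python
import numpy as np

def group_r2(scores, targets, sensitive, deg_freedom=0):
    x, y, s = (np.asarray(a, dtype=float) for a in (scores, targets, sensitive))
    samples = s.sum()
    # Note: this sketch does not guard against empty groups or constant targets.
    mean_target = (y * s).sum() / samples
    sse = ((x - y) ** 2 * s).sum()            # sum of squared errors
    sst = ((y - mean_target) ** 2 * s).sum()  # spread of targets around their mean
    r2 = 1.0 - sse / sst
    # Assumed adjustment; it leaves r2 unchanged when deg_freedom is zero.
    return float(1.0 - (1.0 - r2) * (samples - 1) / (samples - 1 - deg_freedom))

targets = [1.0, 2.0, 3.0, 4.0]
sensitive = [1, 1, 1, 1]

good = group_r2([1.1, 1.9, 3.2, 3.8], targets, sensitive)
mean_model = group_r2([2.5, 2.5, 2.5, 2.5], targets, sensitive)
adjusted = group_r2([1.1, 1.9, 3.2, 3.8], targets, sensitive, deg_freedom=1)
print(good, mean_model, adjusted)
```

With these numbers the unadjusted score is about 0.98, the mean-predicting model scores 0, and one degree of freedom lowers the first score to about 0.97.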

`pinball`

Computes the pinball deviation between scores and target values for a balance parameter alpha (default is 0.5). Returns a float in the range \([0,\infty)\), where smaller values correspond to better estimation.

Explanation: number of samples

Optional argument	Role	Values
alpha	parameter	float in the range \([0,1]\)
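To make the role of alpha concrete, the sketch below uses the usual pinball (quantile) loss, which weights under-estimation by alpha and over-estimation by 1 - alpha. The function name is hypothetical and the sensitive values are assumed to act as sample weights; the library may differ in details.

```python
import numpy as np

def group_pinball(scores, targets, sensitive, alpha=0.5):
    x, y, s = (np.asarray(a, dtype=float) for a in (scores, targets, sensitive))
    samples = s.sum()
    if samples == 0:
        return 0.0  # assumption: an empty group yields 0
    under = np.maximum(y - x, 0.0)  # score falls below the target
    over = np.maximum(x - y, 0.0)   # score exceeds the target
    loss = alpha * under + (1 - alpha) * over
    return float((loss * s).sum() / samples)

scores = [2.0, 2.0, 1.0]
targets = [1.0, 2.0, 3.0]
sensitive = [1, 1, 1]

print(group_pinball(scores, targets, sensitive))             # 0.5
print(group_pinball(scores, targets, sensitive, alpha=0.9))
```

At the default alpha of 0.5 the result is half the mean absolute error. Raising alpha penalizes under-estimation more heavily.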