Performance metrics

FairBench reports implement several definitions of fairness that quantify imbalances between groups of people (e.g., different genders) in terms of how differently base performance metrics assess them. These assessments are often further reduced across groups of samples with different sensitive attribute values.

Here, we present the base metrics that reports use to assess AI systems. All metrics are computed on a subset of 'sensitive samples', which form the group being examined each time. Outputs are wrapped in explainable objects that keep track of relevant metadata.

  1. Classification
  2. Ranking
  3. Regression

Classification

Classification metrics assess binary predictions. Unless stated otherwise, the following arguments need to be provided:

| Argument | Role | Values |
|---|---|---|
| predictions | system output | binary array |
| labels | prediction target | binary array |
| sensitive | sensitive attribute | fork of arrays with elements in \([0,1]\) (either binary or fuzzy) |

accuracy

Computes the accuracy of binary predictions against the provided binary labels for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, true predictions
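As a rough illustration of what "computed on sensitive samples" means, the sketch below weights each sample by its group membership before measuring accuracy. It is a plain-Python approximation and not FairBench's implementation; the function, the variable `women`, and the choice to return zero for an empty group are assumptions made for this example.

```python
def accuracy(predictions, labels, sensitive):
    """Membership-weighted fraction of predictions that match the labels."""
    samples = sum(sensitive)  # (fuzzy) number of sensitive samples
    correct = sum(s for p, y, s in zip(predictions, labels, sensitive) if p == y)
    # Assumption for this sketch: an empty group is scored as zero.
    return correct / samples if samples else 0.0


predictions = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 1, 1]
women = [1, 1, 1, 0, 0, 0]  # hypothetical binary membership of one group

# Two of the three sensitive samples are predicted correctly.
print(accuracy(predictions, labels, women))
```

Fuzzy membership values in \([0,1]\) work the same way: each sample contributes to both the numerator and the denominator in proportion to its membership.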

pr

Computes the positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\). This metric does not use the labels argument.
Explanation: number of samples, positive predictions

positives

Computes the number of positive predictions for sensitive data samples. Returns a float in the range \([0,\infty)\). This metric does not use the labels argument.
Explanation: number of samples
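The relation between the two label-free metrics above can be sketched as follows: the positive rate is the positive count divided by the group size. This is an illustrative plain-Python version under the assumption that samples are weighted by their membership; it is not the library's code, and the group names are hypothetical.

```python
def positives(predictions, sensitive):
    """Membership-weighted count of positive predictions."""
    return sum(p * s for p, s in zip(predictions, sensitive))


def pr(predictions, sensitive):
    """Positive rate: positive predictions divided by group size."""
    samples = sum(sensitive)
    # Assumption for this sketch: an empty group is scored as zero.
    return positives(predictions, sensitive) / samples if samples else 0.0


predictions = [1, 1, 1, 0, 0, 1]
women = [1, 1, 1, 0, 0, 0]  # hypothetical group memberships
men = [0, 0, 0, 1, 1, 1]

print(positives(predictions, women), pr(predictions, women))
print(positives(predictions, men), pr(predictions, men))
```

Comparing `pr` between the two groups is the kind of imbalance that fairness reports later reduce into a single value.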

tpr

Computes the true positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, number of positives, number of true positives

tnr

Computes the true negative rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, number of negatives, number of true negatives

fpr

Computes the false positive rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, number of negatives, number of false positives

fnr

Computes the false negative rate of binary predictions for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, number of positives, number of false negatives
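The four rates above follow from a group-restricted confusion matrix. The sketch below uses the standard textbook definitions (tpr and fnr are normalised by the group's positive labels, tnr and fpr by its negative labels) and weights samples by membership. The helper name `rates` and the zero fallback for empty denominators are assumptions of this example, not FairBench behaviour.

```python
def rates(predictions, labels, sensitive):
    """Confusion-matrix rates restricted to (weighted by) one group."""
    tp = fp = tn = fn = 0.0
    for p, y, s in zip(predictions, labels, sensitive):
        tp += s * p * y
        fp += s * p * (1 - y)
        tn += s * (1 - p) * (1 - y)
        fn += s * (1 - p) * y
    pos, neg = tp + fn, fp + tn  # positive and negative labels in the group
    return {
        "tpr": tp / pos if pos else 0.0,
        "fnr": fn / pos if pos else 0.0,
        "tnr": tn / neg if neg else 0.0,
        "fpr": fp / neg if neg else 0.0,
    }


predictions = [1, 0, 1, 1, 0, 0]
labels = [1, 1, 0, 0, 1, 0]
sensitive = [1, 1, 1, 1, 0, 0]

print(rates(predictions, labels, sensitive))
```

Under these definitions, tpr and fnr sum to one whenever the group contains positive labels, and likewise tnr and fpr for negative labels.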

Ranking

Ranking metrics assess scores that aim to approach provided labels. The following arguments need to be provided:

| Argument | Role | Values |
|---|---|---|
| scores | system output | array with elements in \([0,1]\) |
| labels | prediction target | binary array |
| sensitive | sensitive attribute | fork of arrays with elements in \([0,1]\) (either binary or fuzzy) |

auc

Computes the area under the receiver operating characteristic curve for sensitive data samples. Returns a float in the range \([0,1]\).
Explanation: number of samples, the receiver operating characteristic curve
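One way to read this value is through the equivalent pairwise formulation: the probability that a positive sample of the group receives a higher score than a negative sample of the group, with ties counted as half. The sketch below uses this formulation instead of tracing the curve, and it assumes binary membership for simplicity; it is an illustration, not the library's implementation.

```python
def auc(scores, labels, sensitive):
    """Pairwise (Mann-Whitney) AUC over samples with membership 1."""
    pos = [x for x, y, s in zip(scores, labels, sensitive) if s == 1 and y == 1]
    neg = [x for x, y, s in zip(scores, labels, sensitive) if s == 1 and y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both positive and negative samples in the group")
    # Count correctly ordered (positive, negative) pairs; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7]
labels = [1, 0, 1, 0, 1, 0]
sensitive = [1, 1, 1, 1, 0, 0]

print(auc(scores, labels, sensitive))
```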

phi

Computes the score mass of sensitive data samples compared to the total scores. Returns a float in the range \([0,1]\).
Explanation: number of samples, sensitive scores
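Read literally, the score mass is the share of all scores that is held by the group. A minimal sketch, assuming membership-weighted sums and a zero fallback when all scores are zero (both assumptions of this example):

```python
def phi(scores, sensitive):
    """Share of the total score mass that belongs to the group."""
    total = sum(scores)
    group_mass = sum(x * s for x, s in zip(scores, sensitive))
    return group_mass / total if total else 0.0


scores = [0.5, 0.25, 0.25, 1.0]
sensitive = [1, 1, 0, 0]

print(phi(scores, sensitive))  # 0.75 of a total mass of 2.0
```

Unlike the other ranking metrics, this sketch does not use the labels at all.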

tophr

Computes the hit rate, i.e., precision, for a set number of top scores for sensitive data samples. This is used to assess recommendation systems. By default, the top-3 hit rate is analysed. Returns a float in the range \([0,1]\).
Explanation: number of samples, top scores, true top scores
| Optional argument | Role | Values |
|---|---|---|
| top | parameter | integer in the range \([1,\infty)\) |

toprec

Computes the recall for a set number of top scores for sensitive data samples. This is used to assess recommendation systems. By default, the top-3 recall is analysed. Returns a float in the range \([0,1]\).
Explanation: number of samples, top scores, true top scores
| Optional argument | Role | Values |
|---|---|---|
| top | parameter | integer in the range \([1,\infty)\) |

topf1

Computes the f1-score for a set number of top scores for sensitive data samples. This is the harmonic mean between tophr and toprec and is used to assess recommendation systems. By default, the top-3 f1 is analysed. Returns a float in the range \([0,1]\).
Explanation: number of samples, top scores, true top scores
| Optional argument | Role | Values |
|---|---|---|
| top | parameter | integer in the range \([1,\infty)\) |
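The three top-score metrics share the same ingredients, which the sketch below computes together. It assumes binary membership, ranks only the group's samples, and divides hits by `top` for the hit rate; how the library treats fuzzy membership or groups smaller than `top` is not covered here. The helper name `top_metrics` is hypothetical.

```python
def top_metrics(scores, labels, sensitive, top=3):
    """Hit rate, recall and f1 over the group's `top` highest scores."""
    group = [(x, y) for x, y, s in zip(scores, labels, sensitive) if s == 1]
    ranked = sorted(group, key=lambda pair: pair[0], reverse=True)[:top]
    hits = sum(y for _, y in ranked)      # true top scores
    relevant = sum(y for _, y in group)   # all positive labels in the group
    hr = hits / top
    rec = hits / relevant if relevant else 0.0
    f1 = 2 * hr * rec / (hr + rec) if hr + rec else 0.0
    return hr, rec, f1


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 1, 0]
sensitive = [1, 1, 1, 1, 1, 0]

hr, rec, f1 = top_metrics(scores, labels, sensitive)
print(hr, rec, f1)
```

In this example two of the top three scores are hits, out of four positive labels in the group, so the hit rate is 2/3 and the recall is 1/2.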

avghr

Computes the average hit rate/precision across different numbers of top scores with correct predictions. By default, the top-3 average precision is computed. Returns a float in the range \([0,1]\).
Explanation: number of samples, top scores, hr curve
| Optional argument | Role | Values |
|---|---|---|
| top | parameter | integer in the range \([1,\infty)\) |
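The averaging described above can be sketched as the usual average precision: walk down the ranked list, and at every cutoff whose item is a correct prediction record the hit rate up to that cutoff, then average the recorded values. The function name `average_hit_rate`, the restriction to binary membership, and the zero fallback when there are no hits are assumptions of this example.

```python
def average_hit_rate(scores, labels, sensitive, top=3):
    """Mean hit rate over cutoffs (up to `top`) that end in a correct prediction."""
    group = [(x, y) for x, y, s in zip(scores, labels, sensitive) if s == 1]
    ranked = sorted(group, key=lambda pair: pair[0], reverse=True)[:top]
    hits = 0
    curve = []  # hit rate at each cutoff whose item is a hit
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            curve.append(hits / k)
    return sum(curve) / len(curve) if curve else 0.0


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 1, 0]
sensitive = [1, 1, 1, 1, 1, 0]

print(average_hit_rate(scores, labels, sensitive))
```

Here the top three labels are 1, 0, 1, so the recorded hit rates are 1/1 and 2/3, and their mean is 5/6.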

Regression

Regression metrics assess scores that aim to reproduce desired target scores. The following arguments need to be provided:

| Argument | Role | Values |
|---|---|---|
| scores | system output | any float array |
| targets | prediction target | any float array |
| sensitive | sensitive attribute | fork of arrays with elements in \([0,1]\) (either binary or fuzzy) |

max_error

Computes the maximum absolute error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).
Explanation: ---

mae

Computes the mean of the absolute error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).
Explanation: number of samples, sum of absolute errors

mse

Computes the mean of the square error between scores and targets for sensitive data samples. Returns a float in the range \([0,\infty)\).
Explanation: number of samples, sum of square errors

rmse

Computes the root of mse. Returns a float in the range \([0,\infty)\).
Explanation: number of samples, sum of square errors
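The four error metrics above are closely related, as the following plain-Python sketch shows. It assumes membership-weighted sums for the means and takes the maximum over samples with non-zero membership; these choices, the helper name `error_metrics`, and the error raised for an empty group are assumptions of this example rather than documented library behaviour.

```python
import math


def error_metrics(scores, targets, sensitive):
    """max_error, mae, mse and rmse restricted to (weighted by) one group."""
    samples = sum(sensitive)
    if samples == 0:
        raise ValueError("the group contains no samples")
    abs_errors = [abs(x - t) for x, t in zip(scores, targets)]
    sum_abs = sum(s * e for s, e in zip(sensitive, abs_errors))
    sum_sq = sum(s * e * e for s, e in zip(sensitive, abs_errors))
    mse = sum_sq / samples
    return {
        "max_error": max(e for s, e in zip(sensitive, abs_errors) if s > 0),
        "mae": sum_abs / samples,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


scores = [2.0, 4.0, 7.0, 1.0]
targets = [1.0, 4.0, 4.0, 9.0]
sensitive = [1, 1, 1, 0]  # the last, badly predicted sample is outside the group

print(error_metrics(scores, targets, sensitive))
```

The large error of the last sample does not affect any of the values, because that sample does not belong to the examined group.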

r2

Computes the r2 score between scores and target values, adjusted for the provided degrees of freedom (default is zero). Returns a float in the range \((-\infty,1]\), where larger values correspond to better estimation and models that always output the mean of the targets evaluate to zero.
Explanation: number of samples, sum of square errors, degrees of freedom
| Optional argument | Role | Values |
|---|---|---|
| deg_freedom | parameter | integer in the range \([0,\infty)\) |
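The sketch below illustrates the properties stated above using the common adjusted r2 formula, \(1-(1-R^2)\frac{n-1}{n-1-d}\) for \(n\) samples and \(d\) degrees of freedom. Whether FairBench applies exactly this adjustment is an assumption of this example, as is the restriction to binary membership.

```python
def r2(scores, targets, sensitive, deg_freedom=0):
    """Adjusted r2 over samples with membership 1 (standard formula)."""
    group = [(x, t) for x, t, s in zip(scores, targets, sensitive) if s == 1]
    n = len(group)
    mean = sum(t for _, t in group) / n
    ss_res = sum((t - x) ** 2 for x, t in group)     # sum of square errors
    ss_tot = sum((t - mean) ** 2 for _, t in group)  # spread of the targets
    plain = 1 - ss_res / ss_tot
    return 1 - (1 - plain) * (n - 1) / (n - 1 - deg_freedom)


targets = [1.0, 2.0, 3.0, 4.0, 100.0]
sensitive = [1, 1, 1, 1, 0]
perfect = [1.0, 2.0, 3.0, 4.0, 0.0]
mean_model = [2.5, 2.5, 2.5, 2.5, 0.0]  # always outputs the group's target mean

print(r2(perfect, targets, sensitive))
print(r2(mean_model, targets, sensitive))
print(r2(mean_model, targets, sensitive, deg_freedom=1))
```

With zero degrees of freedom the adjustment has no effect, and increasing `deg_freedom` lowers the score of any imperfect model. The sketch needs at least two group samples with distinct targets.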

pinball

Computes the pinball deviation between scores and target values for a balance parameter alpha (default is 0.5). Returns a float in the range \([0,\infty)\), where smaller values correspond to better estimation.
Explanation: number of samples
| Optional argument | Role | Values |
|---|---|---|
| alpha | parameter | float in the range \([0,1]\) |
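For reference, the usual pinball (quantile) loss penalises under-estimation by `alpha` and over-estimation by `1 - alpha`. The sketch below averages that loss over the group with membership weights; it follows the textbook definition and is not taken from FairBench's source, so the sign convention and the handling of the sensitive attribute are assumptions.

```python
def pinball(scores, targets, sensitive, alpha=0.5):
    """Membership-weighted mean pinball loss (textbook definition)."""
    samples = sum(sensitive)
    if samples == 0:
        raise ValueError("the group contains no samples")
    total = 0.0
    for x, t, s in zip(scores, targets, sensitive):
        diff = t - x  # positive when the score under-estimates the target
        total += s * (alpha * diff if diff >= 0 else (alpha - 1) * diff)
    return total / samples


scores = [2.0, 4.0, 7.0, 1.0]
targets = [1.0, 4.0, 4.0, 9.0]
sensitive = [1, 1, 1, 0]

print(pinball(scores, targets, sensitive))             # half of the group's mae
print(pinball(scores, targets, sensitive, alpha=0.9))  # over-estimates cost little
```

With the default `alpha` of 0.5 both error directions are penalised equally, so the value equals half of the mean absolute error for the same group.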