Fairness in machine learning has become essential and has attracted the attention of many researchers in recent years, due to the need to tackle discrimination introduced by algorithms and to respect the fundamental rights of the people involved. At LIMPID, we work on a set of new fairness metrics to assess and correct for potential discriminatory biases. We also propose the von Mises-Fisher (vMF) mixture model as the theoretical foundation of debiasing mechanisms.
In 2018, the American Civil Liberties Union (ACLU) spotted bias in the Amazon face recognition software “Rekognition”, which incorrectly matched 28 members of the US Congress with other people who had been arrested for a crime. These false matches included legislators of all ages and from all across the country, men and women. However, they were disproportionately associated with people of color.
The expected outcome of the experiment should have been that, regardless of the group of the query, the False Positive Identification Rate (FPIR) would be the same. Indeed, for an unbiased face recognition algorithm, if, say, 20% of the people in the dataset to check are people of color (as in the US Congress), then about 20% of the false matches would involve people of color. In the ACLU experiment, 39% of the false matches involved people of color: the FPIR was not equal across ethnic groups. In a real identification scenario under such biased rates, a person’s fundamental rights might be affected without their consent and/or awareness.
Understandably, and as a consequence of this experiment, most of the concerns expressed about facial recognition afterwards focused on identification, and numerous reports and studies on this topic were funded in the US and in Europe. It became a major subject for the H2020 European research program and for the French National Research Agency.
In the United States, the National Institute of Standards and Technology (NIST) has a long record of organizing the Face Recognition Vendor Test (FRVT), going back to the Face Recognition Technology (FERET) program in 1993. The initial goals of the FERET program were “to establish the viability of automatic face recognition algorithms and to establish a performance baseline against which to measure future progress”. Seven years later, face recognition technology had moved from university and research lab prototypes to the commercial market. The FRVT 2000 program was then launched to evaluate the capabilities of these commercial systems.
Since then, several FRVT campaigns have been conducted, sometimes with precise objectives such as correctly detecting and recognizing children’s faces, and some of them yielded public reports.
Nowadays, the NIST FRVT activity is organized on a continuing basis: developers submit their algorithms to NIST whenever they are ready. Among the available tests, the so-called FRVT 1:1 track is devoted to the verification scenario (deciding whether two face images correspond to the same identity or not), while FRVT 1:N is devoted to one-to-many identification algorithms (finding the specific identity of a probe face among several previously enrolled identities). The evaluations use very large (sub)sets of facial imagery (mugshots, visas, border-crossing photos…) to measure advancements in the accuracy and speed of the algorithms.
In 2018, NIST ran 200 face recognition algorithms through a test to quantify demographic differences. Four collections of photographs with more than 18 million images of more than 8 million people were used. A report on demographic effects (NISTIR 8280) was then issued, focusing on the stability of the FPIR across ethnic groups and genders. It stated that “some developers supplied identification algorithms for which false positive differentials [were] undetectable”. In November 2020, NIST published NISTIR 8331, the second report in a series aimed at quantifying face recognition accuracy for people wearing masks (the effect of masks on both false negative and false positive match rates).
Even if the False Positive Identification Rate is regarded as the most critical component of bias in identification, false negatives, measured by the False Negative Identification Rate (FNIR) and the False Rejection Rate (FRR), can also be correlated with discrimination. In a physical access control scenario, a false positive means that a non-authorized user gets access, which is a security issue, while a false negative means that an authorized person does not get access, a situation which can be considered a form of discrimination.
The question is: how should these different kinds of biases be weighed, assessed and compared?
Indeed, many factors impact face recognition: expression, pose, wearing glasses, wearing a mask, age, gender, ethnicity… From the algorithm’s point of view, they all need to be handled. However, some of them are merely limiting factors for performance, while others exhibit unacceptable discriminating effects and must be monitored when the system is in use.
As an illustration of the balance to be preserved, security automation in an airport requires a very low FAR (False Acceptance Rate) while keeping a reasonable FRR (False Rejection Rate) to ensure a pleasant user experience.
FRR and FAR are crucial quantities for evaluating a given algorithm in the context of face recognition, a use case intrinsically linked to biometric applications, where the usual accuracy metric is not sufficient to assess the quality of a learned decision rule. We will now see how the commonly used metrics can be improved.
Measuring the fairness of face recognition systems is a very challenging task, and we introduce new metrics that respond to the needs of both security and equity. But first of all, let us recall some principles.
We want to decide whether two face images correspond to the same identity or not. To do so, the closeness between two face embeddings \(z_i\) and \(z_j\) (vectors of biometric features) is quantified with a similarity measure \(s(z_{i},z_{j})\) (usually the cosine similarity measure). A threshold \(t \in [-1,1]\) is chosen to classify a pair \((z_{i},z_{j})\) as genuine (same identity) if \(s \geq t\) and impostor (distinct identities) otherwise. Denoting by \(\mathcal{G}\) the set of genuine pairs and by \(\mathcal{I}\) the set of impostor pairs in a given test set, the False Acceptance and False Rejection Rates are defined along \(t\) as follows:
$$\textrm{FAR}(t) := \frac{\#\left\{ (z_{i},z_{j}) \in \mathcal{I} : s(z_{i},z_{j}) \geq t \right\}}{\#\left\{ (z_{i},z_{j}) \in \mathcal{I}\right\}}, \qquad \textrm{FRR}(t) := \frac{\#\left\{ (z_{i},z_{j}) \in \mathcal{G} : s(z_{i},z_{j}) < t \right\}}{\#\left\{ (z_{i},z_{j}) \in \mathcal{G}\right\}}$$
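As a minimal illustration (this is not code from the FRVT evaluations or from IDEMIA, and the array names are hypothetical), both rates can be estimated empirically from the cosine similarities of labelled impostor and genuine pairs:

```python
import numpy as np

def cosine_similarity(z_i, z_j):
    """Cosine similarity s(z_i, z_j) between two face embeddings."""
    return float(np.dot(z_i, z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j)))

def far_frr(impostor_scores, genuine_scores, t):
    """FAR(t): share of impostor pairs scored >= t; FRR(t): share of genuine pairs scored < t."""
    far = float(np.mean(np.asarray(impostor_scores) >= t))
    frr = float(np.mean(np.asarray(genuine_scores) < t))
    return far, frr
```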
The most widely used metric consists in first fixing the threshold \(t\) so that the FAR is equal to a pre-defined value \(\alpha \in [0,1]\), and then computing the FRR at this threshold. We use the canonical FR notation to denote the resulting quantity:
$$\textrm{FRR}@(\textrm{FAR} = \alpha) := \textrm{FRR}(t) \textrm{ with } t \textrm{ such that FAR}(t) = \alpha$$
The FAR level \(\alpha\) determines the operational point of the face recognition system and corresponds to the security risk one is ready to take. Depending on the use case, it is typically set to \(10^{-i}\) with \(i \in \left\{ 1,\cdots,6 \right\}\).
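A short sketch of this operating-point metric, under the same assumptions as above (arrays of impostor and genuine similarity scores), picks the empirical threshold reaching the target FAR and reports the FRR there:

```python
import numpy as np

def frr_at_far(impostor_scores, genuine_scores, alpha=1e-4):
    """Threshold t with empirical FAR(t) ~= alpha, and the FRR at that threshold."""
    impostor_scores = np.asarray(impostor_scores)
    t = np.quantile(impostor_scores, 1.0 - alpha)   # FAR(t) = P(impostor >= t) ~= alpha
    frr = float(np.mean(np.asarray(genuine_scores) < t))
    return t, frr
```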
However, these settings are not sufficient to correctly measure the fairness of the face recognition system, as we will now see.
Let us consider, for instance, the case of a sensitive attribute \(a\) with two distinct values (male/female, represented by 0/1). The demographic parity criterion, a well-known fairness criterion, requires the prediction to be independent of the sensitive attribute, which amounts to equalizing the likelihood of being predicted genuine conditional on \(a=0\) and \(a=1\) (this is formalized below). However, besides heavily depending on the number and quality of impostor and genuine pairs within subgroups, this criterion does not take the FARs and FRRs into account. Many attempts to incorporate these quantities have been made in the past, based on the choice of a threshold achieving a global FAR. We consider that this choice is not entirely relevant, for it depends on the relative proportions of, in our example, females and males in the considered dataset, together with the relative proportions of intra-group impostors and genuines.
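In our notation, the demographic parity criterion mentioned above can be written as follows (this formulation is ours, added for clarity):

$$\mathbb{P}\left( s(z_{i},z_{j}) \geq t \mid a = 0 \right) = \mathbb{P}\left( s(z_{i},z_{j}) \geq t \mid a = 1 \right)$$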
We thus proposed two new metrics that alleviate this dependence on proportions and make it possible to monitor the risk one is willing to take within each subgroup: for a pre-defined rate \(\alpha\) deemed acceptable, one typically wants to compare performance among subgroups at a threshold for which each subgroup satisfies \(\textrm{FAR}_a \leq \alpha\). Our two resulting metrics are, for a sensitive attribute with two distinct values (they generalize naturally to more than two values):
$$\textrm{BFRR}(\alpha) := \frac{\max_{a\in\left\{ 0,1 \right\}}\textrm{FRR}_a(t)}{\min_{a\in\left\{ 0,1 \right\}}\textrm{FRR}_a(t)}\textrm{ with } t\textrm{ such that }\max\limits_{a\in\left\{ 0,1 \right\}}\textrm{FAR}_a(t) = \alpha$$
and
$$\textrm{BFAR}(\alpha) := \frac{\max_{a\in\left\{ 0,1 \right\}}\textrm{FAR}_a(t)}{\min_{a\in\left\{ 0,1 \right\}}\textrm{FAR}_a(t)}\textrm{ with } t\textrm{ such that }\max\limits_{a\in\left\{ 0,1 \right\}}\textrm{FAR}_a(t) = \alpha$$
One can read the above acronyms as “Bias in FRR/FAR”.
In addition to being more demanding in terms of security than previously known metrics, BFRR and BFAR are more amenable to interpretation: the ratio of FRRs or FARs corresponds to how many times more mistakes the algorithm makes on the discriminated subgroup than on the other.
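The following sketch only illustrates the definitions above (the per-subgroup score dictionaries are assumptions, not an existing API): it computes BFRR and BFAR at the threshold where the largest subgroup FAR equals \(\alpha\).

```python
import numpy as np

def bfrr_bfar(impostors_by_group, genuines_by_group, alpha=1e-4):
    """BFRR(alpha) and BFAR(alpha) for a sensitive attribute with any number of values."""
    # Per-group threshold reaching FAR_a = alpha; taking the max guarantees
    # that every subgroup satisfies FAR_a(t) <= alpha.
    t = max(np.quantile(np.asarray(s), 1.0 - alpha)
            for s in impostors_by_group.values())
    fars = {a: np.mean(np.asarray(s) >= t) for a, s in impostors_by_group.items()}
    frrs = {a: np.mean(np.asarray(s) < t) for a, s in genuines_by_group.items()}
    # In practice one should guard against empty groups and zero denominators.
    bfar = max(fars.values()) / min(fars.values())
    bfrr = max(frrs.values()) / min(frrs.values())
    return bfrr, bfar
```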
A facial recognition algorithm transforms faces into vectors in a high-dimensional space, as shown in the figure below. For algorithms that do not take bias into account, the spread of a specific group in this feature space is directly linked to its proportion in the learning database.
The main goal of the loss function used in face recognition algorithms is to project each identity into a specific region of this space so as to maximize classification accuracy on the learning database. One way to correctly deal with bias is to choose the right loss function. However, the classic softmax loss function is not good at discriminating identities.
Several new loss functions have been proposed, all with the same objective of improving performance: learning large-margin face features while maximizing inter-class variance and minimizing intra-class variance.
Until 2018, most of the popular FR loss functions were of the form:
$$\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{e^{\kappa \boldsymbol{\mu}_{y_i}^\intercal \boldsymbol{z}_i}}{\sum_{k=1}^{K}e^{\kappa \boldsymbol{\mu}_{k}^\intercal \boldsymbol{z}_i}} \right)$$
where the \(\boldsymbol{\mu}_{k}\)’s are the fully-connected layer’s parameters, \(\kappa > 0\) is the inverse temperature of the softmax function used in brackets and \(n\) is the batch size. Early works took \(\kappa = 1\) and used a bias term in the fully-connected layer, but it was soon shown that this bias term degrades the performance of the model.
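For concreteness, here is a minimal NumPy sketch of this loss as a forward computation only (no training loop); the array names and the value of \(\kappa\) are illustrative assumptions:

```python
import numpy as np

def softmax_face_loss(Z, Mu, y, kappa=30.0):
    """Loss above: Z (n, d) L2-normalized embeddings, Mu (K, d) L2-normalized
    class centers, y (n,) integer identity labels, kappa the inverse temperature."""
    logits = kappa * Z @ Mu.T                       # kappa * mu_k^T z_i, shape (n, K)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(log_probs[np.arange(len(y)), y]))
```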
The authors of the AM-Softmax paper studied several different loss functions and proposed their own objective for deep face verification, a “conceptually simple and geometrically interpretable objective function: additive margin Softmax (AM-Softmax)”.
However, in order to control bias, we also want to control the local density of the feature space in terms of ethnicity, pose, gender or image quality. In other words, wherever a face image is projected in the feature space, we want to have the same FAR (the same probability of having other identities nearby).
In 2017, joint research between Safran Identity & Security (now IDEMIA) and École Centrale de Lyon proposed the von Mises-Fisher (vMF) mixture model as the theoretical foundation of an effective deep-learning implementation of such directional features. The authors derived a novel vMF mixture loss and the corresponding vMF deep features. It was a powerful new way to discriminate between identities, before the advent of large-margin losses.
The proposed vMF feature learning achieved the characteristics of discriminative learning, i.e., compacting instances of the same class while increasing the distance between instances of different classes.
In the feature space (of dimension \(d\)), a person is represented by a center \(\boldsymbol{\mu}\) and a concentration \(\kappa\). This model provides a direct link between a person in the learning database and the size of the region they are supposed to occupy in the feature space. It makes it possible to handle databases that are unbalanced in terms of ethnicity, gender, age or image quality. The vMF distribution is a probability measure defined on the hypersphere of the feature space by the following density:
$$V_d(x;\boldsymbol{\mu},\kappa) := C_d(\kappa)e^{\kappa \boldsymbol{\mu}^\intercal x}$$
with
$$C_d(\kappa) = \frac{\kappa^{\frac{d}{2}-1}}{(2\pi)^\frac{d}{2}I_{\frac{d}{2}-1}(\kappa)}$$
\(C_d(\kappa)\) is the normalizing constant, which can be computed with high precision (\(I_\nu\) is the modified Bessel function of the first kind of order \(\nu\)).
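As an illustration, the logarithm of \(C_d(\kappa)\) and of the vMF density can be evaluated stably with SciPy’s exponentially scaled Bessel function `ive`; the function names below are ours, not from the cited paper:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: ive(nu, k) = I_nu(k) * exp(-k)

def log_C_d(d, kappa):
    """Log of the vMF normalizing constant C_d(kappa)."""
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) = log ive(nu, kappa) + kappa, which stays finite for large kappa
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def vmf_log_density(x, mu, kappa):
    """log V_d(x; mu, kappa) for unit-norm x and mu."""
    return log_C_d(len(x), kappa) + kappa * float(np.dot(mu, x))
```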
Let us now see how the general form of the vMF loss on the face embeddings, defined as:
$$\mathcal{L}_\mathrm{vMF}=-\frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{C_d(\kappa_{y_i})e^{\kappa_{y_i} \boldsymbol{\mu}_{y_i}^\intercal \boldsymbol{z}_i}}{\sum_{k=1}^{K}C_d(\kappa_{k})e^{\kappa_{k} \boldsymbol{\mu}_{k}^\intercal \boldsymbol{z}_i}} \right)$$
can be used to reach the objectives of discriminative learning. Modifying the \(\kappa\) of male and female identities in the loss leads to various trade-offs between overall performance and bias in FRR and FAR.
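A possible NumPy sketch of this loss (forward computation only) makes the per-class concentrations explicit; setting different \(\kappa\) values per subgroup is how the trade-off mentioned above would be explored. The array names are assumptions, and `log_C_d` is the same helper as in the previous snippet:

```python
import numpy as np
from scipy.special import ive

def log_C_d(d, kappa):
    """Log of the vMF normalizing constant (same as in the previous snippet)."""
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def vmf_loss(Z, Mu, kappas, y):
    """vMF loss above: Z (n, d) embeddings, Mu (K, d) unit-norm class centers,
    kappas (K,) per-class concentrations, y (n,) integer identity labels."""
    kappas = np.asarray(kappas, dtype=float)
    log_norm = np.array([log_C_d(Z.shape[1], k) for k in kappas])    # log C_d(kappa_k)
    logits = log_norm[None, :] + Z @ (Mu * kappas[:, None]).T        # (n, K)
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(log_probs[np.arange(len(y)), y]))
```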
Biases do exist, but their consequences differ vastly depending on the application and the environment. They are a consequence of the misrepresentation of certain categories of the population, both at training time and during operations, but this is only part of the problem. To assess bias in biometric algorithms, we have proposed new metrics that measure the worst bias ratio between classes. We also proposed to train a network that transforms the deep embeddings of a pre-trained biometric model in order to give more representation power to the discriminated subgroups. Its training is supervised by the von Mises-Fisher loss, whose hyperparameters make it possible to control the space allocated to each subgroup in the latent space.
As reported by NIST, the problem of biases in face recognition is solved for the 1:N (identification) case and has more recently been strongly mitigated for the 1:1 (verification) case in terms of FAR management.
Even if there are ways to mitigate these effects for false acceptances, we still want simultaneous control of FRR and FAR bias while maximizing overall accuracy: that is where our research currently stands, in March 2021.
[This article is derived from a conference talk held by Vincent Despiegel, Research Team Leader at IDEMIA, during the European Association for Biometrics events series in March 2021.]
Amazon’s Face Recognition Falsely Matched 28 Members of Congress With Mugshots. By Jacob Snow, Technology & Civil Liberties Attorney, ACLU of Northern California. July 26, 2018 https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28
FRVT NIST activities are presented on https://www.nist.gov/programs-projects/face-recognition-vendor-test-frvt
Hasnat, M. A., Bohné, J., Milgram, J., Gentric, S., & Chen, L. (2017). von Mises-Fisher Mixture Model-based Deep Learning: Application to Face Verification. arXiv. DOI: https://doi.org/10.48550/arxiv.1706.04264
Grother, P., Ngan, M., & Hanaoka, K. (2019). Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects (NISTIR 8280). National Institute of Standards and Technology. doi:10.6028/nist.ir.8280
Ngan, M., Grother, P., & Hanaoka, K. (2020). Ongoing Face Recognition Vendor Test (FRVT) Part 6B: Face Recognition Accuracy with Face Masks Using Post-COVID-19 Algorithms (NISTIR 8331). National Institute of Standards and Technology. doi:10.6028/nist.ir.8331
Wang, F., Cheng, J., Liu, W., & Liu, H. (2018). Additive Margin Softmax for Face Verification. IEEE Signal Processing Letters. doi:10.1109/lsp.2018.2822810