For facial recognition, the main challenge is fairness: the algorithm should behave in the same way regardless of the visible characteristics of the person (age, sex, skin color, nationality, etc.).

Facial recognition raises highly specific fairness issues in law, owing to the potential for discrimination based on skin color, gender, disability or other legally protected attributes, as well as the legal consequences that flow from refusing a user access to services. Indeed, operators of some facial recognition systems already include in their specifications the requirement that systems be fair, and that this fairness can be proven.

Designing novel strategies for debiasing datasets and estimating the debiasing weights

We develop original strategies for identifying and correcting selection bias.

Representativeness issues do not vanish simply because the training set is large. Selection bias, that is, situations where the samples available for learning a predictive rule are not distributed like those to which the rule will be applied once deployed, is now the subject of much attention in the literature.

Data gathered in many repositories available on the Web have not been collected by means of a rigorous experimental design, but rather opportunistically.

Depending on the nature of the mechanism causing the sample selection bias and on that of the statistical information available for learning a decision rule, special cases have been considered in the literature, for which dedicated approaches have been developed. Most of these methods boil down to reweighting the training observations appropriately, following the Inverse Probability Weighting (IPW) technique.

For instance, these weights are the inverses of the first-order inclusion probabilities when the data are acquired by means of a survey sampling plan, see (Clémençon et al., 2017).
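As an illustration, here is a minimal sketch of this reweighting scheme in Python, assuming the first-order inclusion probabilities are known for each training point (the synthetic data, the logistic model and all variable names are illustrative choices, not part of the cited work):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                               # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # synthetic labels
pi = rng.uniform(0.2, 1.0, size=500)                        # known first-order inclusion probabilities

# Inverse Probability Weighting: each observation contributes to the empirical
# risk with weight 1 / pi_i, passed to sklearn through sample_weight.
clf = LogisticRegression()
clf.fit(X, y, sample_weight=1.0 / pi)
```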

In the context of regression under random censorship, a weighted version of the empirical risk can also be considered, the weights corresponding to the inverses of estimates of the probability of not being censored, see (Ausset et al., 2019). In general, side information about the cause of the selection bias is crucial for deriving explicit forms for the appropriate weights and for designing ways of estimating them from the observations available. In many situations, however, the selection bias mechanism is too complex to derive fully explicit forms for the weights that would allow the target distribution to be mimicked from the observations available.
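For the simpler censored-regression case mentioned above, the weights can be obtained by plugging in a Kaplan-Meier estimate of the censoring distribution. A minimal sketch, assuming the lifelines library and hypothetical numpy arrays durations (observed times) and delta (1 if the outcome was observed, 0 if censored):

```python
import numpy as np
from lifelines import KaplanMeierFitter

def ipcw_weights(durations, delta):
    # Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t):
    # censoring plays the role of the "event", hence event_observed = 1 - delta.
    km = KaplanMeierFitter()
    km.fit(durations, event_observed=1 - delta)
    G = km.survival_function_at_times(durations).to_numpy()
    # uncensored observations are reweighted by 1 / G(T_i); censored ones get weight 0
    return np.where(delta == 1, 1.0 / np.clip(G, 1e-8, None), 0.0)
```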

See also our past and present publications: bias issues in AI are reviewed in (Bertail et al., 2019).

Our work

In line with the aforementioned approaches, a preliminary framework for tackling problems where the biasing mechanism at work is very general, provided certain identifiability hypotheses are satisfied, has been developed in (Clémençon and Laforgue, 2020). Promising preliminary experiments on synthetic and real data have also provided empirical evidence of the relevance of the approach. The case where the biasing mechanism is only approximately known is of considerable interest in practice, and one of the main objectives of our fairness studies is to investigate to what extent the established statistical learning guarantees are preserved. In (Clémençon et al., 2020) and (Ausset et al., 2019), several situations (e.g. positive-unlabeled learning, learning under random censorship) where the biasing mechanism depends on a few (possibly functional) parameters that can be estimated have been exhibited.

Our goal is to understand the conditions under which plugging in estimates of the biasing functions at work does not compromise the accuracy of the methodology proposed in (Clémençon and Laforgue, 2020).

Checking the absence of selection bias by testing homogeneity between high-dimensional data samples

When a (small) sample of observations drawn from the target/test distribution is available, a natural way to check whether the training data form an unbiased sample, i.e. to check the absence of bias in the training sample, is to test the homogeneity of the two samples. The "two-sample problem" arises in a wide variety of applications, ranging from bioinformatics and database attribute matching to psychometrics, for instance. However, it still raises challenging questions in a (highly) multidimensional setting, where the notion of rank is far from straightforward. A recent and original approach initiated in (Vogel et al., 2020) investigates how bipartite ranking methods for multivariate data can be exploited to extend rank-based homogeneity tests to a multivariate framework. Offering an appealing alternative to parametric approaches, plug-in techniques or Maximum Mean Discrepancy methods, the idea promoted is simple and avoids the curse of dimensionality: the data are split into two subsamples (with approximately the same proportion of positive instances), so that an empirical maximizer of the chosen rank-based criterion (e.g. the popular area under the ROC curve, AUC) can be learned on the first subsample and then used to score the data of the second one. Easy to formulate, this promising approach remains to be studied at length.
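A minimal sketch of this general idea in Python, using a logistic model as a simple stand-in for the learned scoring function and a Mann-Whitney rank statistic on the held-out scores (sample_p and sample_q are hypothetical arrays of multivariate observations; this illustrates the principle, not the exact procedure of the cited work):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def two_sample_rank_test(sample_p, sample_q, random_state=0):
    # pool the two samples and label them as "positive" vs "negative"
    X = np.vstack([sample_p, sample_q])
    y = np.concatenate([np.ones(len(sample_p)), np.zeros(len(sample_q))])
    # split while keeping roughly the same proportion of positive instances
    X1, X2, y1, y2 = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=random_state
    )
    # learn a scoring function oriented towards a rank-based criterion on the
    # first subsample (a logistic model is used here as a simple surrogate)
    scorer = LogisticRegression().fit(X1, y1)
    s = scorer.decision_function(X2)
    # rank statistic on the held-out scores: under homogeneity, the two score
    # samples share the same distribution
    return mannwhitneyu(s[y2 == 1], s[y2 == 0], alternative="two-sided")
```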

Our work

Our objective is to investigate the approach described above from both a theoretical and an experimental perspective. If it can be successfully implemented, this approach could also offer an alternative way to debias training samples, by designing a procedure that removes the data causing the rejection of the homogeneity assumption, or that weights them in an appropriate fashion.

Inserting fairness constraints in the algorithm

Machine learning systems that make crucial decisions affecting humans should guarantee that they do not penalize certain groups of individuals. Fairness, like other trustworthiness properties, can be enforced during learning through appropriate constraints.
Fairness constraints are generally modeled by means of a (qualitative) sensitive variable indicating membership in a certain group (e.g., ethnicity, gender). The vast majority of the work dedicated to algorithmic fairness in machine learning focuses on binary classification. In this context, fairness constraints force classifiers to have the same true positive rate (or false positive rate) across the sensitive groups.
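As an illustration, one common way to impose such a constraint is to penalize the gap between (smoothed) true positive rates across groups during training. A minimal sketch in Python, assuming hypothetical arrays X (features), y (binary labels), and group (binary sensitive attribute, with positives in both groups); relaxing the TPR as the mean predicted probability among positives is an illustrative choice, not a method from the cited works:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_loss(theta, X, y, group, lam=1.0):
    """Logistic loss plus a squared penalty on the gap between the (soft)
    true positive rates of the two sensitive groups."""
    p = sigmoid(X @ theta)
    log_loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    pos_a = (y == 1) & (group == 0)
    pos_b = (y == 1) & (group == 1)
    tpr_gap = p[pos_a].mean() - p[pos_b].mean()  # differentiable surrogate of the TPR gap
    return log_loss + lam * tpr_gap ** 2

# usage sketch:
# theta_hat = minimize(penalized_loss, np.zeros(X.shape[1]), args=(X, y, group, 1.0)).x
```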
However, a fairness constraint can degrade the performance of the algorithm in other respects: for example, it may lead the algorithm to pay particular attention to small differences in eye shape among persons from a certain under-represented ethnic group, in order to ensure that its quality for that group is equivalent to its quality for the other groups of the population. But this constraint may in turn decrease the performance of the algorithm for another group. Trade-offs are thus necessary.
The intensity of the fairness constraints and the resulting trade-offs depend on the legal/ethical requirements for face recognition. This is developed in cooperation with the institutional stakeholders.

Methods for mitigating biases

DCNNs for face encoding learn how to project face images into a high-dimensional space. Similarities between images are then computed as the cosine distance between templates in this N-dimensional space. Ideally, this projection should be evenly distributed regardless of the individual's soft biometrics (men and women should each occupy half of the space, with no bias as a function of nationality, eye color, etc.). (Das et al., 2018) propose ways to mitigate such biases for soft-biometrics estimation. For face encoding, losses such as the von Mises-Fisher loss of (Hasnat et al., 2017) could help balance these biases, provided they are properly estimated.
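For concreteness, a minimal sketch of how two face templates are typically compared (the template names are hypothetical and the threshold value is purely illustrative):

```python
import numpy as np

def cosine_similarity(template_a, template_b):
    """Cosine similarity between two N-dimensional face templates
    (the cosine distance is 1 minus this value)."""
    a = template_a / np.linalg.norm(template_a)
    b = template_b / np.linalg.norm(template_b)
    return float(a @ b)

# a match decision is usually obtained by thresholding the similarity, e.g.:
# is_same_person = cosine_similarity(template_a, template_b) > 0.4
```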
A previous collaboration between Idemia and Télécom Paris was also devoted to algorithmic fairness: learning scoring functions from binary-labeled data. This statistical learning task, referred to as bipartite ranking, is of considerable importance in applications for which fairness requirements are a major concern (credit scoring in banking, pathology scoring in medicine or recidivism scoring in criminal justice).

Evaluating performance is itself a challenge

The gold standard, the ROC (Receiver Operating Characteristic) curve, is highly relevant for evaluating the fairness of face recognition algorithms, but such a functional criterion raises serious computational problems, which is why most of the literature focuses on the maximization of scalar summaries, e.g. the AUC (area under the ROC curve) criterion. A thorough study of fairness in bipartite ranking has been proposed in (Vogel et al., 2020), where the goal is to guarantee that sensitive variables (such as skin color) have little impact on the rankings induced by a scoring function. The limitations of the AUC as a fairness measure have motivated richer definitions of fairness for scoring functions, based on the ROC curves themselves. These definitions have strong implications for fair classification: classifiers obtained by thresholding such fair scoring functions approximately satisfy definitions of classification fairness for a wide range of thresholds.
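A minimal sketch of such a group-wise ROC comparison, assuming hypothetical arrays scores (similarity or classification scores), labels (ground-truth labels) and group (sensitive attribute):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def groupwise_roc(scores, labels, group):
    """Compute one ROC curve (and its AUC summary) per sensitive group."""
    out = {}
    for g in np.unique(group):
        m = group == g
        fpr, tpr, _ = roc_curve(labels[m], scores[m])
        out[g] = {"fpr": fpr, "tpr": tpr, "auc": roc_auc_score(labels[m], scores[m])}
    return out

# fairness can then be assessed by comparing the per-group curves pointwise,
# rather than through the scalar AUC summaries alone
```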

Our work

We investigate to what extent the accuracy of decision rules learned by machine learning for face recognition tasks can be preserved under fairness constraints. We also aim to extend the concepts and methods of fair bipartite ranking to similarity ranking, a variant of bipartite ranking that covers key applications such as scoring for face recognition.

References

Ausset et al., 2019 G. Ausset, S. Clémençon, and F. Portier, “Empirical Risk Minimization under Random Censorship: Theory and Practice.” https://arxiv.org/abs/1906.01908.

Bertail et al., 2019 P. Bertail, D. Bounie, S. Clémençon, and P. Waelbroeck, “Algorithmes : Biais, Discrimination et Équité,” Feb. 2019. https://hal.telecom-paris.fr/hal-02077745.

Clémençon et al., 2017 S. Clémençon, P. Bertail, and E. Chautru, “Sampling and empirical risk minimization,” Statistics, vol. 51, no. 1, pp. 30–42, 2017. https://hal.archives-ouvertes.fr/hal-01468905/.

Clémençon and Laforgue, 2020 S. Clémençon and P. Laforgue, “Statistical Learning from Biased Training Samples,” http://arxiv.org/abs/1906.12304.

Hasnat et al., 2017 Md. A. Hasnat, J. Bohné, J. Milgram, S. Gentric, and L. Chen, “von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification.” https://arxiv.org/abs/1706.04264.

Vogel et al., 2020 R. Vogel, A. Bellet, and S. Clémençon, “Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints.” https://arxiv.org/abs/2002.08159.