A Framework to Learn with Interpretation

NeurIPS 2021

Abstract

To tackle interpretability in deep learning, we present a novel framework to jointly
learn a predictive model and its associated interpretation model. The interpreter
provides both local and global interpretability of the predictive model in terms of
human-understandable high-level attribute functions, with minimal loss of accuracy.
This is achieved by a dedicated architecture and well-chosen regularization penalties.
We seek a small dictionary of high-level attribute functions that take
as inputs the outputs of selected hidden layers and whose outputs feed a linear
classifier. We impose strong conciseness on the activation of attributes with an
entropy-based criterion while enforcing fidelity to both inputs and outputs of
the predictive model. A detailed pipeline to visualize the learnt features is also
developed. Moreover, besides generating interpretable models by design, our
approach can be specialized to provide post-hoc interpretations for a pre-trained
neural network. We validate our approach against several state-of-the-art methods
on multiple datasets and show its efficacy on both kinds of tasks.
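
The ingredients named in the abstract (attribute activations feeding a linear classifier, an entropy-based conciseness criterion, and fidelity to the predictor's inputs and outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax normalization of activations, the cross-entropy form of output fidelity, and the squared-error form of input fidelity are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def interpreter_logits(attributes, weights, bias):
    """Linear classifier applied to the attribute activations (hypothetical helper)."""
    return [sum(w * a for w, a in zip(row, attributes)) + b
            for row, b in zip(weights, bias)]


def conciseness_penalty(attributes):
    """Assumed entropy-based criterion: low when a few attributes dominate."""
    return entropy(softmax(attributes))


def output_fidelity(predictor_probs, interpreter_probs, eps=1e-12):
    """Assumed form: cross-entropy of the interpreter's output against the predictor's."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(predictor_probs, interpreter_probs))


def input_fidelity(x, x_reconstructed):
    """Assumed form: mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed)) / len(x)


# One dominant attribute scores a lower penalty than a uniform activation pattern.
concise = conciseness_penalty([5.0, 0.0, 0.0, 0.0])
diffuse = conciseness_penalty([1.0, 1.0, 1.0, 1.0])
print(concise < diffuse)
```

In a joint training setup, terms like these would be weighted and added to the predictor's own loss; the weights and the exact penalty forms are design choices that this sketch does not take from the paper.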

Jayneel Parekh, Pavlo Mozharovskyi, Florence d’Alché-Buc, “A Framework to Learn with Interpretation.” Advances in Neural Information Processing Systems 34 (NeurIPS 2021).
