NeurIPS 2021

Abstract

To tackle interpretability in deep learning, we present a novel framework to jointly
learn a predictive model and its associated interpretation model. The interpreter
provides both local and global interpretability of the predictive model in terms of
human-understandable high-level attribute functions, with minimal loss of accuracy.
This is achieved by a dedicated architecture and well-chosen regularization penalties.
We seek a small dictionary of high-level attribute functions that take
as inputs the outputs of selected hidden layers and whose outputs feed a linear
classifier. We impose strong conciseness on the attribute activations via an
entropy-based criterion while enforcing fidelity to both the inputs and outputs of
the predictive model. A detailed pipeline to visualize the learnt features is also
developed. Moreover, besides generating interpretable models by design, our
approach can be specialized to provide post-hoc interpretations for a pre-trained
neural network. We validate our approach against several state-of-the-art methods
on multiple datasets and show its efficacy on both kinds of tasks.
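To make the described setup concrete, below is a minimal PyTorch sketch of the architecture and losses the abstract outlines: a predictor whose selected hidden-layer output feeds a small dictionary of attribute functions, a linear classifier over those attributes, an entropy-based conciseness penalty, and fidelity terms on both the predictor's inputs (via a decoder) and outputs. All module names (`Predictor`, `Interpreter`, `interpretability_losses`), the choice of a single tapped hidden layer, and the loss weights are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only; module names, dimensions, and loss weights
# are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Predictive model f; exposes a selected hidden layer's output."""
    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.body(x)          # hidden representation tapped by the interpreter
        return self.head(h), h

class Interpreter(nn.Module):
    """Small dictionary of attribute functions over the hidden output,
    whose activations feed a linear classifier; a decoder branch
    supports input fidelity."""
    def __init__(self, hidden=256, n_attrs=24, n_classes=10, in_dim=784):
        super().__init__()
        self.attributes = nn.Sequential(nn.Linear(hidden, n_attrs), nn.ReLU())
        self.classifier = nn.Linear(n_attrs, n_classes)  # linear on attributes
        self.decoder = nn.Linear(n_attrs, in_dim)        # input-fidelity branch

    def forward(self, h):
        a = self.attributes(h)    # non-negative attribute activations
        return self.classifier(a), self.decoder(a), a

def interpretability_losses(logits_f, logits_g, x, x_rec, a):
    # Output fidelity: the interpreter should mimic the predictor's outputs.
    out_fid = F.kl_div(F.log_softmax(logits_g, -1),
                       F.softmax(logits_f, -1), reduction="batchmean")
    # Input fidelity: attributes should retain enough to reconstruct the input.
    in_fid = F.mse_loss(x_rec, x)
    # Conciseness: low entropy of the normalized attribute activations,
    # so that few attributes fire per sample.
    p = a / (a.sum(dim=1, keepdim=True) + 1e-8)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return out_fid + in_fid + 0.1 * entropy   # weights are placeholders
```

A joint training step would then combine the usual prediction loss with these penalties, for example:

```python
f_net, g_net = Predictor(), Interpreter()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits_f, h = f_net(x)
logits_g, x_rec, a = g_net(h)
loss = F.cross_entropy(logits_f, y) + interpretability_losses(
    logits_f, logits_g, x, x_rec, a)
loss.backward()
```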


Jayneel Parekh, Pavlo Mozharovskyi, Florence d’Alché-Buc, “A Framework to Learn with Interpretation.” Advances in Neural Information Processing Systems 34 (NeurIPS 2021).