Explaining Neural Networks: Post-hoc and Natural Language Explanations

If you have a question about this talk, please contact Robert Peharz.

In this talk, we discuss two paradigms of explainability in neural networks: post-hoc explanations, and neural networks that generate natural language explanations for their own decisions.

For the first paradigm, we present two issues with existing post-hoc explanatory methods. The first issue is that two prevalent perspectives on explanations, feature-additivity and feature-selection, lead to fundamentally different instance-wise explanations. Yet in the literature, explainers from the two perspectives are currently compared directly, despite their distinct explanation goals. The second issue is that current post-hoc explainers have been thoroughly validated only on simple models, such as linear regression; when they are applied to real-world neural networks, they are commonly evaluated under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. To our knowledge, our framework is the first evaluation test based on a non-trivial real-world neural network for which we are able to provide guarantees on its inner workings. We show several failure modes of current explainers, such as LIME, SHAP, and L2X. (based on …)

For the paradigm of neural networks that explain their own decisions in natural language, we introduce e-SNLI, a large dataset of human-annotated explanations for the ground-truth relations of SNLI. With 570K instances, the corpus is, to our knowledge, the largest dataset of free-form natural language explanations. We present a series of models trained on e-SNLI. (based on …)

We further show that this class of models is prone to outputting inconsistent explanations, such as “A dog is an animal” and “A dog is not an animal”, which are likely to decrease users’ trust in these systems. To detect such inconsistencies, we introduce a simple but effective adversarial framework that searches for inputs causing a model to generate a given complete target sequence, a scenario that has not been addressed so far. Finally, we apply our framework to the best model trained on e-SNLI and show that this model is capable of generating a significant number of inconsistent explanations. (based on …)
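The first issue above, that feature-additivity and feature-selection explainers pursue different goals, can be seen on a toy example. The sketch below is illustrative only and is not the verification framework from the talk. It assumes a hypothetical three-feature binary model, a zero baseline for "removing" a feature, exact Shapley values with respect to that baseline as a simplified stand-in for the feature-additivity perspective (the kind of quantity SHAP-style explainers estimate), and a brute-force minimal sufficient subset as a stand-in for the feature-selection perspective, in the spirit of instance-wise selectors such as L2X. All function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical toy model: outputs 1 if feature 0 OR feature 1 is on; feature 2 is ignored.
    return float(x[0] or x[1])


def masked(x, baseline, keep):
    # Keep the features whose indices are in `keep`; replace all others with the baseline value.
    return [x[i] if i in keep else baseline[i] for i in range(len(x))]


def shapley_values(f, x, baseline):
    """Feature-additivity view: exact Shapley values w.r.t. a fixed baseline.

    The attributions sum to f(x) - f(baseline).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                s = set(subset)
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                gain = f(masked(x, baseline, s | {i})) - f(masked(x, baseline, s))
                phi[i] += weight * gain
    return phi


def minimal_sufficient_subset(f, x, baseline):
    """Feature-selection view: smallest feature set that, kept alone, preserves f(x).

    Ties are broken by enumeration order (lowest indices first).
    """
    n = len(x)
    target = f(x)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if f(masked(x, baseline, set(subset))) == target:
                return set(subset)
    return set(range(n))


if __name__ == "__main__":
    x, baseline = [1, 1, 1], [0, 0, 0]
    phi = shapley_values(model, x, baseline)
    print("Feature-additivity (Shapley):", [round(v, 6) for v in phi])
    # prints: Feature-additivity (Shapley): [0.5, 0.5, 0.0]
    print("Feature-selection (minimal subset):", minimal_sufficient_subset(model, x, baseline))
    # prints: Feature-selection (minimal subset): {0}
```

On the same instance, the additive view splits the credit equally between features 0 and 1, whereas the selection view returns a single feature that on its own suffices for the prediction (feature 0 rather than feature 1 only because of enumeration order). Neither answer is wrong; they answer different questions, which is why comparing the two kinds of explainers directly can be misleading. Both functions enumerate every feature subset, so they scale exponentially and are only meant for tiny illustrations like this one.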

This talk is part of the Machine Learning @ CUED series.

© 2006-2024, University of Cambridge.