[Wild Thoughts] Interpretable Models: What Are We Talking About When We Discuss Interpretable Models?
September 11, 2024

As someone who started my research career in deep learning, I wasn't initially enthusiastic about explaining models. Deep learning models are often very "deep," and even when they achieve the desired goals, such as classifying images or predicting certain properties, they may not operate according to the logic we expect. I have therefore always felt it is somewhat arrogant for humans to try to understand deep learning models in our own terms.
Interpreting Machine Learning Models from a Human Perspective: Why and Why Not?
However, it is quite common to understand new things in terms of one's own experience. For instance, when learning a new language, people tend to establish correspondences between the new language and their native language through translation.
But is the common choice always the right one? I often think of the advice in the book "Word Power Made Easy": learn new words the way a child learns to speak. In this way of learning, our understanding of each word is grounded entirely in the new language itself, rather than in a simple mapping onto a language we already know. After all, there is no exact one-to-one correspondence between languages.
For example, the Chinese word "缘故" is difficult to translate with a single English word. Rendering it as "cause" or "reason" does not fully capture its nuance, which often implies a more subtle or indirect reason behind an event or situation. Subtle differences like this, which no single English word captures precisely, sometimes lead to new terms or longer explanations being coined to convey the intended meaning.
Understanding Models in Terms of Models
Currently, interpretable models can be divided into three categories [1]:

- The first category designs the model structure around known knowledge structures, so the model can be explained directly through its parameters (a minimal example is sketched just below this list).
- The second category covers gradient- or contribution-based methods, where various mechanisms measure how much the inputs and network components contribute to the output.
- The third category covers perturbation-based methods, which determine the contribution of each part by changing the inputs and observing how the outputs change.
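To make the first category concrete, here is a minimal sketch, assuming scikit-learn and a purely synthetic dataset, of a model whose parameters are themselves the explanation: in a logistic regression, each learned coefficient says which class a feature pushes toward and how strongly.

```python
# Minimal sketch of the first category: a model whose parameters ARE the explanation.
# Assumes scikit-learn; the synthetic data and feature names are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A toy binary classification problem with four features.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

model = LogisticRegression().fit(X, y)

# Each coefficient is a direct, global explanation: its sign says which class the
# feature pushes toward, and its magnitude says how strongly (for comparable inputs).
for name, coef in zip(["feature_0", "feature_1", "feature_2", "feature_3"],
                      model.coef_[0]):
    print(f"{name}: weight = {coef:+.3f}")
```

The appeal is that no post-hoc machinery is needed: the explanation is read directly off the fitted parameters.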
First, for complex tasks, especially those where our knowledge is incomplete, this first approach provides interpretability but inevitably sacrifices accuracy. That is clearly not a promising route toward explainable general artificial intelligence.
Perturbation-based methods, meanwhile, attempt to explain deep learning models in terms of human understanding. Suppose you have a model that classifies cats and dogs. You mask the ear region of an image and observe how the model's predictions change (a minimal occlusion sketch follows below); if the predictions change significantly, you might infer that the ears are highly important for distinguishing cats from dogs. In reality, however, the model may rely more on overall visual context, such as body shape and color features, so masking the ears alone can give incomplete or misleading explanations. The actual situation may be even more complex, since the data may contain high-dimensional features that we cannot understand. So while perturbation-based methods are intuitive, they may not fully capture the internal mechanisms a deep learning model uses when processing complex data.
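To illustrate the masking idea, here is a minimal sketch of occlusion sensitivity, assuming a PyTorch image classifier. The `TinyCNN`, the gray fill value, and the patch and stride sizes are all illustrative assumptions rather than part of any specific published method.

```python
# Minimal sketch of a perturbation (occlusion) explanation for an image classifier.
# TinyCNN is an illustrative stand-in; any PyTorch model with the same interface works.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

@torch.no_grad()
def occlusion_map(model, image, target_class, patch=8, stride=8):
    """Slide a gray patch over the image and record how much the target score drops."""
    model.eval()
    base_score = model(image.unsqueeze(0))[0, target_class].item()
    _, H, W = image.shape
    rows = range(0, H - patch + 1, stride)
    cols = range(0, W - patch + 1, stride)
    heatmap = torch.zeros(len(rows), len(cols))
    for i, top in enumerate(rows):
        for j, left in enumerate(cols):
            occluded = image.clone()
            occluded[:, top:top + patch, left:left + patch] = 0.5  # gray patch
            score = model(occluded.unsqueeze(0))[0, target_class].item()
            heatmap[i, j] = base_score - score  # large drop = region looks "important"
    return heatmap

# Usage with a random image; for the cat/dog example, large values over the ears would
# suggest (but not prove) that the model relies on them.
model = TinyCNN()
image = torch.rand(3, 64, 64)
print(occlusion_map(model, image, target_class=0))
```

As argued above, such a heatmap only shows how the score responds to this particular perturbation: a small drop over the ears does not prove the model ignores them, and a large drop does not prove the ears are the whole story.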
Current gradient- or contribution-based methods have limitations of their own: they typically consider only the independent effect of each input or hidden-layer neuron on the output, ignoring interactions between neurons, and they are less robust than perturbation-based explanations [2]. Even so, they seem closer to revealing the internal workings of the model.
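As a minimal sketch of the simplest member of this family, the code below computes a plain input-gradient saliency map in PyTorch. The stand-in classifier and the tensor shapes are illustrative assumptions; more elaborate contribution rules, such as those surveyed in [1], build on the same basic quantity.

```python
# Minimal sketch of a gradient-based attribution: a plain input-gradient saliency map.
# The stand-in classifier is illustrative; any differentiable PyTorch model works.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))

def saliency(model, image, target_class):
    """Gradient of the target-class score with respect to each input pixel."""
    model.eval()
    x = image.unsqueeze(0).clone().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    # Aggregate gradient magnitude over color channels: larger values mark pixels whose
    # small changes move the target score the most (a first-order, per-pixel contribution).
    return x.grad[0].abs().sum(dim=0)

image = torch.rand(3, 64, 64)
heatmap = saliency(model, image, target_class=0)
print(heatmap.shape)  # torch.Size([64, 64])
```

Because such a map is computed at a single input point, small changes to the input can alter it substantially, which is exactly the fragility documented in [2].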
Final Thoughts
Although I believe gradient- and contribution-based explanation methods are the 'ultimate' way to open the black box of machine learning, explaining models through human knowledge remains a pathway to building trust in deep learning, to analyzing data, and even to discovering new scientific principles grounded in human understanding. Interpretable neural networks are not only about explaining the model itself; they are also about building trust, and more.
References
- [1] Montavon, G., Samek, W. and Müller, K.-R., 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73, pp. 1-15. [link]
- [2] Ghorbani, A., Abid, A. and Zou, J., 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), pp. 3681-3688. [link]