Interpretability

Interpretability refers to the study of how to understand the decisions of machine learning systems, and how to design systems whose decisions are easily understood, or interpretable. It is closely related to explainability.

Mechanistic Interpretability #

Attempts to interpret an ML model by analysing and describing the internal structure of the model itself. https://www.serimats.org/mechanistic-interpretability

Non-Mechanistic Interpretability #

Alternative, non-mechanistic approaches for neural network models (surveyed in [1]) are based on saliency, which identifies which elements of an input were most important in leading to a particular result, and feature synthesis, which uses the model to generate an input that would lead to a specific output.

[1] Benchmarking Interpretability Tools for Deep Neural Networks arXiv:2302.10894
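
As an illustration of the saliency idea, the sketch below computes a simple gradient-based saliency map. This is a generic example assuming PyTorch and torchvision; the model choice and the random input tensor are placeholders, not one of the specific tools benchmarked in [1].

```python
# Minimal gradient-based saliency sketch (assumes PyTorch + torchvision).
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

image = torch.rand(1, 3, 224, 224)   # stand-in for a preprocessed input image
image.requires_grad_(True)           # track gradients with respect to the input

logits = model(image)
target_class = logits[0].argmax().item()   # explain the model's predicted class

# Backpropagate the chosen logit to the input; the gradient magnitude shows
# how strongly each pixel influenced that prediction.
logits[0, target_class].backward()
saliency_map = image.grad.abs().max(dim=1).values   # collapse colour channels
```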

Interpreting Generative Large Language Models #

There are several tools focussed on analysing the internals of large language models:

Logit Lens #

The idea of the logit lens is to apply the model's output embedding layer to the hidden states of intermediate layers, decoding what the model would predict at each depth. It was found that in many cases the next token is already identifiable at an early layer, and the way the prediction evolves through the layers appears meaningful. The original implementation worked for GPT-2 but not for GPT-Neo and some subsequent models; an adjustment to the visualisation replicated similar results for these models.
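
A minimal sketch of the idea (not the original implementation), assuming the Hugging Face transformers version of GPT-2: each intermediate hidden state is passed through the final layer norm and the output embedding to read off the top next-token prediction at that depth.

```python
# Logit lens sketch for GPT-2 (assumes the Hugging Face transformers library).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq, d_model).
for layer, hidden in enumerate(outputs.hidden_states):
    # Decode the hidden state with the final layer norm and output embedding.
    logits = model.lm_head(model.transformer.ln_f(hidden))
    token_id = logits[0, -1].argmax().item()   # top prediction at the last position
    print(f"layer {layer:2d}: {tokenizer.decode(token_id)!r}")
```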

Ecco #

Ecco packages several LLM analysis and visualisation techniques, including a version of the Logit Lens.

Transformer Lens #