Annotate Transformer

Summary #

This project aims to develop annotations for the internal representations of natural language text as it passes through a generative transformer model.

Lines of work #

  • SymbolicTransformer
    • A Julia project
      • Notation for the components of a transformer and the calculations it performs
      • Abstract representations of some of those calculations
  • Annotate Transformer
    • A Python project using TransformerLens and the Pythia models (see the loading sketch below)
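
As a starting point for the Python line of work, a minimal sketch of loading a Pythia checkpoint with TransformerLens; the choice of pythia-70m and the prompt are illustrative, not fixed by the project:

```python
# Minimal sketch: load a Pythia checkpoint with TransformerLens and run a
# forward pass, caching all intermediate activations for later annotation.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")
logits, cache = model.run_with_cache("The quick brown fox")

print(model.cfg.n_layers, model.cfg.d_model)  # layer count and residual width
print(sorted(cache.keys())[:5])               # a few of the cached hook names
```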

Embed/Unembed #

  • Calculating distances between random input/output vectors and comparing them with bigram statistics computed from the Pile (see the sketch below)
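
A hedged sketch of one such comparison: per “A Mathematical Framework for Transformer Circuits”, the direct $W_E W_U$ path approximates bigram statistics, so scores like the one below can be compared against bigram counts computed separately from the Pile. The token pair here is an arbitrary example:

```python
# Sketch: score the direct (embed -> unembed) path for a token pair, the
# zero-layer quantity to compare against Pile bigram statistics.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")
W_E = model.W_E  # (d_vocab, d_model) embedding matrix
W_U = model.W_U  # (d_model, d_vocab) unembedding matrix

a = model.to_single_token(" the")   # input token
b = model.to_single_token(" same")  # candidate next token
score = torch.dot(W_E[a], W_U[:, b])  # direct-path logit contribution for a -> b
print(score.item())
```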

Latent Space of Large Language Models #

Also known as the residual space, this is the vector space which holds information from the context window passed through a transformer model during inference. It is defined by, but distinct from, the parameter space, which is altered during model training. One of the goals of this work is to better understand the parallels and distinctions between information held by the model from pre-training and information provided in the context window.

EleutherAI’s Pythia project released a series of transformer models which support interpretability research through standardised training data and procedures and the release of checkpoints from throughout training. See the Pythia repo on GitHub, the model card on Hugging Face and the original Pythia paper on arXiv.

The structure of the latent space serves three main purposes (based on the training objectives):

  1. Predict the next token
    • The output of the final layer is transformed into logits by the unembedding matrix; the training objective during pre-training is a loss comparing these logits with the actual subsequent tokens.
  2. Attract attention
    • These are patterns which match attention head projections from later positions.
  3. Change later token predictions
    • These are patterns which are projected into the residual spaces of later positions.

Looking at the interactions between these may allow us to find higher-level primitives for analysing points in the latent space.
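
As an illustration of purpose 1, a logit-lens-style sketch which unembeds the residual stream after each block to watch the next-token prediction form. Hook names follow TransformerLens conventions; the prompt is arbitrary:

```python
# Sketch: project intermediate residual stream states through the final
# LayerNorm and unembedding to see the prediction evolve layer by layer.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")
tokens = model.to_tokens("The capital of France is")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]   # residual vector, final position
    logits = model.ln_final(resid) @ model.W_U  # final LayerNorm, then unembed
    top = logits.argmax().item()
    print(layer, repr(model.to_single_str_token(top)))
```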

Notation #

Following “A Mathematical Framework for Transformer Circuits”, a transformer can be defined as follows.

$$ r^0 = W_E t$$

$$ z^{i} = r^{i-1} + \sum_{h^{i,j} \in H^i} h^{i,j}(r^{i-1})$$

$$ r^{i} = z^{i} + m^{i}(z^{i})$$

$$ T(t) = W_U r^l$$

Where

  • $r^i$ are the vectors in the residual stream emerging from transformer block $i$

  • $z^i$ are the vectors in the residual stream after the attention outputs of block $i$ have been added, which are passed into the MLP $m^i$ of transformer block $i$

  • $H^i$ is the set of attention heads at layer $i$, with elements $h^{i,j}$, attention head $j$ in transformer block $i$; $h^{i,j}(x)$ is the operation of that attention head

  • $m^i$ is the MLP at layer $i$

  • $t$ is the vector of one-hot encoded tokens

  • $W_E$ is the embedding matrix

  • $W_U$ is the unembedding matrix
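
To make the notation concrete, a toy torch sketch of these equations with random weights, anticipating the $(A \otimes W_O W_V)$ head decomposition given below. It uses the row-vector convention (so products are transposed relative to the column-vector equations above) and omits LayerNorm, biases and positional embeddings, as in the simplified framework:

```python
# Toy implementation of r^0 = W_E t, z^i = r^{i-1} + sum_j h^{i,j}(r^{i-1}),
# r^i = z^i + m^i(z^i), T(t) = W_U r^l, with random weights throughout.
import torch

d_model, d_mlp, d_vocab, n_layers, n_heads, seq = 64, 256, 100, 2, 4, 8
W_E = torch.randn(d_vocab, d_model)
W_U = torch.randn(d_model, d_vocab)

def head(r):
    # One attention head h^{i,j}: (A ⊗ W_O W_V) · r, in row-vector form
    A = torch.softmax(torch.randn(seq, seq), dim=-1)  # attention pattern A
    W_V = torch.randn(d_model, d_model)
    W_O = torch.randn(d_model, d_model)
    return A @ r @ W_V @ W_O

def mlp(z):
    # One MLP m^i
    W_in, W_out = torch.randn(d_model, d_mlp), torch.randn(d_mlp, d_model)
    return torch.relu(z @ W_in) @ W_out

t = torch.nn.functional.one_hot(torch.randint(d_vocab, (seq,)), d_vocab).float()
r = t @ W_E                                       # r^0 = W_E t
for i in range(n_layers):
    z = r + sum(head(r) for _ in range(n_heads))  # z^i
    r = z + mlp(z)                                # r^i
logits = r @ W_U                                  # T(t) = W_U r^l
print(logits.shape)                               # (seq, d_vocab)
```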

$T$ can be decomposed into residual blocks (6 for pythia-70m), labelled $T^1 … T^{6}$.

Attention $h^{i,j}$ can be decomposed as

$$ h^{i,j}(x) = (A \otimes W_O W_V) \cdot x$$

Where $A$ is the attention matrix, $W_O$ is the output matrix, and $W_V$ is the value matrix.
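
In the same framework, the attention pattern $A$ is itself computed from the query and key matrices (autoregressive masking and the $1/\sqrt{d_k}$ scaling are elided here):

$$ A = \operatorname{softmax}(x^\top W_Q^\top W_K\, x)$$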

See notebooks such as https://github.com/prior-technology/annotate-transformer/blob/main/notebooks/Interpreting%20Transformer%20Layers.ipynb and https://github.com/prior-technology/annotate-transformer/blob/main/notebooks/transformer-lens/Annotate.ipynb for updates.

References #
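
  • Elhage, N., Nanda, N., Olsson, C., et al. (2021). “A Mathematical Framework for Transformer Circuits”. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html

  • Biderman, S., Schoelkopf, H., et al. (2023). “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling”. arXiv:2304.01373. https://arxiv.org/abs/2304.01373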