Transformer

Transformer Learning Model #

The transformer neural network pattern was first described in arXiv:1706.03762 [cs.CL] “Attention Is All You Need” and https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html. It adapts the [attention] mechanism previously used with recurrent neural networks, but applies it in a non-recurrent way which is much simpler to implement at scale. Recurrent networks are those which work on a sequence of inputs, using the results of previous steps to influence subsequent operations. This limits the ability to perform the calculation in parallel across processing nodes, as one step cannot continue until the previous one has finished. In non-recurrent models such as the transformer, each result is calculated based only on a block of inputs, so different blocks can be operated on in parallel. Jay Alammar provides an accessible and thorough description of transformers in [https://jalammar.github.io/illustrated-transformer/].
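To make the parallelism point concrete, here is a minimal sketch of the scaled dot-product attention from “Attention Is All You Need”. The function name and tensor shapes are illustrative assumptions, not taken from any particular implementation: every query position attends to every key position in a single matrix multiply, with no step-by-step recurrence over the sequence.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim)
    d = q.size(-1)
    # One matrix multiply compares every query with every key, so all
    # sequence positions are handled in parallel rather than sequentially.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                # (batch, seq_len, head_dim)
```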

Transformers in Generative Language Models #

The GPT family of language models uses layers of transformer blocks as its core component. This document is based on the implementation of a transformer layer in GPT-NeoX.

Text input to the algorithm is represented as a sequence of s tokens, each representing a word or part of a word. The algorithm works on batches containing b such sequences in parallel. Each token is mapped to an embedding vector of h numbers (i.e. the vector has h dimensions, where h is the hidden size).
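The following sketch (not GPT-NeoX code) shows how a batch of token ids is mapped to embedding vectors. The values of s, b and h are arbitrary example sizes chosen here for illustration:

```python
import torch
import torch.nn as nn

s, b, h = 128, 4, 512          # sequence length, batch size, hidden size (example values)
vocab_size = 50304             # assumed vocabulary size, for illustration only

embedding = nn.Embedding(vocab_size, h)
token_ids = torch.randint(0, vocab_size, (s, b))   # one token id per position per sequence
hidden_states = embedding(token_ids)               # shape (s, b, h)
print(hidden_states.shape)                         # torch.Size([128, 4, 512])
```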

The transformer layer takes in a tensor of size (s, b, h) and outputs a tensor of the same size.
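A schematic transformer layer is sketched below only to show that the output tensor keeps the same (s, b, h) shape as the input. The layout (layer norms, self-attention, MLP, residual connections) follows a generic pre-norm GPT-style block; the details of GPT-NeoX's actual implementation may differ.

```python
import torch
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    def __init__(self, h, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(h)
        self.attn = nn.MultiheadAttention(h, num_heads)   # expects (s, b, h) input by default
        self.ln2 = nn.LayerNorm(h)
        self.mlp = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):                 # x: (s, b, h)
        y = self.ln1(x)
        a, _ = self.attn(y, y, y)         # self-attention over the sequence
        x = x + a                         # residual connection, still (s, b, h)
        x = x + self.mlp(self.ln2(x))     # residual connection, still (s, b, h)
        return x                          # same shape as the input

layer = TransformerLayerSketch(h=64, num_heads=8)
x = torch.randn(16, 2, 64)                # (s=16, b=2, h=64)
print(layer(x).shape)                     # torch.Size([16, 2, 64])
```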