Large Language Models

Language Models #

A language model is a representation of the structure of a language. Given some incomplete text, a language model can determine which words are most likely to fill the gap based on the rules of the language. Language models have been around for decades, used in speech recognition and OCR. The most familiar to many of us is the predictive text on our phones.
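A minimal sketch of this idea is a bigram model: count which word follows which in some training text, then predict the most frequent follower. This toy example is far simpler than any modern language model, but it captures the core task of predicting likely next words.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(model, word):
    """Return the most frequent next word, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

Phone keyboards use considerably more sophisticated models, but the shape of the problem is the same.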

Large Language Models #

From 2017 researchers began to find interesting properties as they scaled language models up 1 2. One of the first algorithms with significant impact was BERT, created and published in 2018 and used by Google as part of their search engine. The term "Bertology" came to describe research attempting to explain how the algorithm is able to perform certain tasks. Another key algorithm was [GPT]({{ relref GPT}}), which is trained to predict the next word in a text based only on the earlier words, where many other algorithms are trained to predict words from the context both before and after. This resulted in an algorithm able to produce more natural-sounding output.

There are a variety of language model implementations available, and understanding which one to use for a particular goal can be challenging. Different aspects of the models available today are described below.

Algorithm #

The algorithm used for a model is a fundamental choice which dictates what options are available in the other aspects described here.

  • Bidirectional Encoder Representations from Transformers (BERT) is a Transformer model developed by Google which is trained to predict a particular fragment of text (a token) based on the context in both directions.
  • Generative Pre-trained Transformer (GPT) is a family of algorithms developed by OpenAI. The pre-training task is to predict the next fragment of text based on the earlier context.
  • Language Model for Dialogue Applications (LaMDA) is the most recent language model algorithm developed by Google as of June 2022.

Model Size - Number of Parameters #

The number of parameters in a model largely determines the size of the model in memory. Larger models can recognise more complicated patterns in text and learn a greater volume of patterns; smaller models respond more quickly. Many algorithms are released in a selection of different sizes.
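The relationship between parameter count and memory is roughly parameters times bytes per parameter. A quick back-of-the-envelope estimate, using approximate published parameter counts for a few well-known models:

```python
def model_memory_gb(n_parameters, bytes_per_param=4):
    """Rough memory footprint: parameters x bytes each (float32 = 4 bytes)."""
    return n_parameters * bytes_per_param / 1e9

# Approximate published parameter counts.
for name, params in [("BERT-base", 110e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: ~{model_memory_gb(params):.1f} GB at float32")
```

Real deployments often use lower-precision weights (2 bytes or fewer per parameter), which shrinks these figures, but the linear relationship holds.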

Model Size - Number of Layers #

The structure of a particular model influences what types of patterns a model can recognise and how effectively they can be trained on larger volumes of data. Counting the number of layers gives a rough summary of the structure, in general models with more layers can recognise more complex patterns.

Context Window #

The current LLMs are developed to work with a block of text at a time. Text generated by the algorithm can only depend on the text in the context window and the generalised pre-training. The size of the window is measured in tokens; many recent models use the Byte Pair Encoding algorithm from GPT-2. This works out to roughly one token per word, with less common words split into multiple tokens.
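The splitting behaviour can be illustrated with a greedy longest-match tokeniser over a fixed vocabulary. This is a toy stand-in for Byte Pair Encoding, not the real algorithm, but it shows how a common word stays whole while a rarer word becomes several tokens:

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position.
    A toy illustration only; real BPE learns its merges from data."""
    tokens = []
    for word in text.split():
        while word:
            for length in range(len(word), 0, -1):
                piece = word[:length]
                if piece in vocab or length == 1:
                    tokens.append(piece)
                    word = word[length:]
                    break
    return tokens

# Hypothetical vocabulary: whole common words plus sub-word fragments.
vocab = {"the", "model", "token", "iz", "ation", "s"}
print(tokenize("the model tokenization", vocab))
# -> ['the', 'model', 'token', 'iz', 'ation']
```

Counting tokens rather than words matters in practice, since rare or technical terms consume the context window faster.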

Using a fixed-size context window makes it easier to run calculations in parallel. This is in contrast to recurrent algorithms, which in theory can refer arbitrarily far back in the input to generate an output, but which in practice have so far not been scaled to the size of recent LLMs.

Pre-Training #

LLMs are trained on a huge volume of text without targeting any specific objective. This is a time-consuming and expensive process. The result is a general-purpose language model which can already perform well on many tasks.

Prompt Engineering #

Running inference on the model generates new text based on a block of input text. The algorithm is always trying to complete a pattern, so we can set up the input text so that the natural continuation is to perform some task. For example, the input “The phrase ‘General Purpose Language Model’ translated into French is” results in the output “‘Modèle de Langage à Usage Général’”. More complex tasks can be performed by providing several examples of the task being performed.
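Assembling such a few-shot prompt is plain string construction. A minimal sketch, using a hypothetical English-to-French translation task: the worked examples establish the pattern, and the trailing label invites the model to complete it.

```python
def few_shot_prompt(examples, query):
    """Build a prompt from worked examples so the model continues the pattern."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model's completion should follow this label
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("Hello", "Bonjour"), ("Thank you", "Merci")],
    "Good evening",
)
print(prompt)
```

Sending this prompt to any completion-style model should yield a French translation as the continuation.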

Prompt engineering is the term used to describe devising prompts that can reliably perform specific tasks. Since both the prompt and the response need to fit in the context window, being able to come up with short, effective prompts is useful.

Sometimes it is necessary to ask nicely to make inappropriate responses less likely. Since the pre-training data for the language model is not curated, there is a risk of inappropriate responses being generated. An example from one LLM provider's documentation describes a hypothetical customer service application: if the model generates a response to aggressive language from a customer, it is likely to respond in kind. The suggestion is to end the prompt with “Polite Response:” to reduce this risk.
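A sketch of that technique, assuming a hypothetical customer service application: the prompt ends with a cue label so the natural continuation is in a polite register, following the provider example described above.

```python
def customer_service_prompt(customer_message):
    """Frame the customer's message and end with a politeness cue.
    The 'Polite Response:' label steers the completion's tone."""
    return (
        "Customer: " + customer_message + "\n"
        "Polite Response:"
    )

print(customer_service_prompt("This product is terrible and I want a refund!"))
```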

Fine Tuning #

The requirement to include examples in the context window can be avoided through fine tuning. This means continuing to train the model, but on text which demonstrates a particular task being performed. This is still computationally expensive, but far less so than the original pre-training.
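The training text for fine tuning is typically a set of prompt/completion pairs. As a sketch, assuming a provider that accepts one JSON object per line (a common JSONL convention, though formats vary between providers):

```python
import json

# Hypothetical fine-tuning examples: each pairs a prompt with the
# completion the model should learn to produce.
examples = [
    {"prompt": "Translate to French: Hello", "completion": " Bonjour"},
    {"prompt": "Translate to French: Goodbye", "completion": " Au revoir"},
]

# Write one JSON object per line (JSONL).
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

Once fine-tuned on enough such pairs, the model can perform the task from a bare prompt, with no worked examples taking up context window space.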