GPT2

Experimenting with GPT2 #

GPT2 is a machine learning model trained to generate text that continues from a supplied fragment. It is the precursor to GPT3 - more general details are described there.

This page explores the internals of how the basic GPT algorithm works, and suggests ideas for exploring the [latent space](/notes/docs/latent-space/).

This builds on work from Ecco, a Python library that supports exploration of NLP language models, and from Hugging Face, an organisation that provides a range of resources giving access to NLP models in a simple, consistent way.

Ecco provides a very straightforward method to use gpt2 to generate text.

First Attempt #

import ecco
lm = ecco.from_pretrained('distilgpt2')
text= "The four Provinces in Ireland are; Ulster, Munster,""
output = lm.generate(text, generate=10, do_sample=True)

Result: ‘The four Provinces in Ireland are; Ulster, Munster, Dublin, Cork Northamptonshire, Co’

Distilgpt2 is a lightweight model, developed by distilling the basic gpt2 model. Completing the list of Irish provinces appears to be beyond its capabilities, but it has recognised that it should return a list and returns some places in Ireland.

The largest currently available OpenAI GPT2 model is gpt2-xl. This is approximately a 6GB download, and is too large to run on the Tesla M60 GPU I was using.

import ecco
lm = ecco.from_pretrained('gpt2-xl', gpu=False)
text= "The 26 counties in the Republic of Ireland are; Ulster, Munster,"
output = lm.generate(text, generate=10, do_sample=True)

Result: ‘The four Provinces in Ireland are; Ulster, Munster, Connacht and Leinster.\n\n"’

What is this code doing? #

Inputs #

The generate method works by tokenizing the string using the tokenizer associated with the model retrieved from Hugging Face. This results in a list of token ids (inside a PyTorch tensor), with approximately one token per word.

tokenized = lm.tokenizer(text, return_tensors="pt")['input_ids'][0]
tokenized.shape, tokenized
(torch.Size([14]),
 tensor([  464,  1440,  1041,  7114,   728,   287,  7517,   389,    26, 50026,
            11, 12107,  1706,    11]))

These ids are references to entries in the vocabulary built as part of training the model.

len(lm.tokenizer.get_vocab())
50257

The vocabulary size is limited, so not every word can be represented. Unknown words are handled by breaking them into fragments using Byte Pair Encoding. Mapping the token ids back to strings shows this process.

lm.tokenizer.batch_decode(tokenized)
['The',
 ' four',
 ' Pro',
 'vin',
 'ces',
 ' in',
 ' Ireland',
 ' are',
 ';',
 ' Ulster',
 ',',
 ' Mun',
 'ster',
 ',']

Using a nonsense word, we can see how words outside the vocabulary are handled:

lm.tokenizer.batch_decode(lm.tokenizer('The dsfsmallCdiff in', return_tensors="pt")['input_ids'][0])
['The', ' d', 'sf', 'small', 'C', 'diff', ' in']

This splitting is known as Byte Pair Encoding; a more detailed description can be found at https://leimao.github.io/blog/Byte-Pair-Encoding/.
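
As a rough illustration of the idea (a simplified sketch, not GPT2's actual byte-level implementation), the algorithm starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry. The toy corpus and helper functions below are purely hypothetical.

from collections import Counter

# Toy corpus: each word is a tuple of symbols with a frequency, starting from single characters.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across the corpus and return the most common one.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(corpus)   # ('w', 'e') in this toy corpus
corpus = merge_pair(corpus, pair)   # 'w' and 'e' now appear as the single symbol 'we'

Repeating this merge step a fixed number of times builds the merge table that a real tokenizer then applies to new text.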

Transformer Algorithm #

The core language model takes a sequence of token ids and predicts the next token. Wrapper and helper classes are available from both Hugging Face and Ecco for specific tasks, including generating an ongoing sequence of text, and these do an important part of the work here. The model also includes steps which can be changed or bypassed: an embedding layer that interprets the input ids, and a language modelling head that generates predictions which are interpreted using token ids. Even though these are optional, I consider them part of the model since they include parameters learned during the original training.
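
To make this concrete, the sketch below bypasses Ecco and calls the underlying Hugging Face model directly: the token ids pass through the embedding layer and transformer blocks, the language modelling head returns a score for every vocabulary entry at each position, and the highest-scoring id at the final position gives the predicted next token. This assumes the distilgpt2 model from the earlier example; the actual prediction will vary by model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

text = "The four Provinces in Ireland are; Ulster, Munster,"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    outputs = model(input_ids)

# logits has shape (batch, sequence_length, vocab_size):
# one score per vocabulary entry for every input position.
logits = outputs.logits

# The prediction for the next token comes from the last position.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))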

Token ids are simply integers; they have no meaning beyond the tokens they represent. The embedding layer transforms each input id into an embedding in an N-dimensional space, where N depends on the model used. These embeddings evolve as the model is trained, becoming organised in a way that provides a useful input to the task of predicting the next token.
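
One way to see this (a sketch based on the Hugging Face GPT2 classes; the transformer.wte attribute and the 768 dimensions apply to distilgpt2 and differ for larger models) is to look a token id up in the embedding matrix directly.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# The token embedding matrix has one row per vocabulary entry.
wte = model.transformer.wte.weight
print(wte.shape)        # torch.Size([50257, 768]) for distilgpt2

# Token id 464 ('The' in the earlier example) maps to a 768-dimensional vector.
print(wte[464].shape)   # torch.Size([768])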