Big picture:
- Most state-of-the-art NLP models today are based on attention mechanisms, specifically multi-layer self-attention, also known as “Transformer” architectures.
- Landmark models include
- The original Transformer model (Google Brain, original paper)
- For the first time, a complex natural language model handles multiple tasks (machine translation, English constituency parsing) without using convolutions or recurrence (the previous state of the art). Instead, a multi-layer self-attention mechanism generates increasingly powerful contextual embedding representations for every token in the input, which can be used for any task (see the sketch below).
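As a rough illustration of the core operation, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The shapes, names, and random weights are purely illustrative (a real Transformer uses many heads, multiple layers, and learned parameters):

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Names and shapes are illustrative only, not taken from any specific implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_q/W_k/W_v: learned projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise token similarities
    weights = softmax(scores, axis=-1)         # attention distribution per token
    return weights @ V                         # contextual embedding per token

# Toy usage: 4 tokens with 8-dimensional embeddings and random "learned" weights
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
context = self_attention(X, W_q, W_k, W_v)     # shape (4, 8)
```

Each output row is a weighted mixture of all value vectors, so every token's representation is informed by the whole input sequence; stacking such layers is what yields the increasingly powerful contextual embeddings.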
- BERT (developed by Google, original paper, blog post). Key components include:
- General natural language encoding model; any decoding layer can be added on top and fine-tuned for a specific downstream task (e.g. question answering, sentence classification)
- Multi-layer, bidirectional self-attention mechanism
- Pre-training of the encoder on extremely large corpora of natural language, using masked word prediction and next-sentence prediction as objectives -> general model (see the usage sketch below)
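As a concrete example of what the pre-trained encoder can do out of the box, here is a small sketch of masked word prediction with a public BERT checkpoint. It assumes the Hugging Face `transformers` library and the `bert-base-uncased` model, neither of which is named in the notes above:

```python
# Sketch: masked word prediction with a pre-trained BERT via the Hugging Face
# transformers library (assumed installed; not specified in the notes above).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in the [MASK] token using bidirectional context from the whole sentence.
for prediction in fill_mask("The transformer architecture relies on [MASK] mechanisms."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Fine-tuning for a downstream task works the same way in spirit: the pre-trained encoder stays, and a small task-specific head is trained on top of it.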
- Generative Pre-Trained Transformers (GPT, developed by OpenAI)
- Standard language modeling objective (next-token prediction) as pre-training for a powerful transformer-based language model
- Task conditioning as auxiliary input in natural-language form to the model (e.g. naming the task and providing a few examples)
- This allows few-shot or zero-shot learning, i.e. the model can essentially be applied directly to any new task, without fine-tuning or changing parameters, by providing a description of the task as part of the input (see the prompt sketch after this list)
- GPT-3 performs well on seemingly unrelated tasks such as writing code from natural language descriptions, or generating subject-specific text that looks like it was written by a human
- GPT-2 and GPT-3 build on this by employing larger text corpora and larger models with longer training
- Parameters: GPT: ~117M; GPT-2 (paper, blog): 1.5B; GPT-3 (blog): 175B (!!!)
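To make the task-conditioning idea concrete, here is a sketch of how a few-shot prompt can be assembled. The translation task and examples are chosen for illustration; the resulting string is simply fed to the model as ordinary input text, and the model's continuation is the answer:

```python
# Sketch of few-shot task conditioning: the task description and examples live
# inside the input text itself; no model parameters are changed.
# Task and examples below are illustrative, not taken from the notes above.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

prompt = "Translate English to French.\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"
prompt += "English: plush giraffe\nFrench:"   # the model is expected to continue from here

print(prompt)  # this string would be sent to the model as-is
```

With zero-shot prompting, the examples are simply omitted and only the task description remains.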
- Limitations:
- Handling long contexts and summarizing long documents remain difficult
- Unidirectional (left-to-right) training creates limitations compared to bidirectional encoders such as BERT
- Models inherit the biases of the corpora they were trained on
- Inference is expensive because of the sheer model size
- GPT-2 and GPT-3 are extremely large models that are difficult to train without the computational resources of a cash-flooded company