ChatGPT: How It Works

The Model Behind the Bot

This is a brief overview of the methodology and intuition behind ChatGPT, the GPT-3-based chatbot you cannot stop hearing about.

 

Large Language Models

ChatGPT belongs to a class of machine-learning Natural Language Processing models known as Large Language Models (LLMs). LLMs digest large amounts of text data and infer relationships between the words in that text. These models have grown rapidly over the past few years as computing power has improved, and their capabilities increase as their input datasets and parameter spaces grow.

The most basic form of training a language model is predicting a word from a sequence of words. Most commonly, this takes the form of either next-token prediction or masked language modeling.
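As a rough illustration of the difference, here is a toy Python sketch that turns one sentence into training pairs for each objective. The whitespace tokenizer and example sentence are stand-ins; real models train on sub-word tokens over enormous corpora.

```python
# Toy illustration of the two most common language-modeling objectives.
# The whitespace "tokenizer" and the example sentence are placeholders;
# real LLMs operate on sub-word tokens over web-scale corpora.

sentence = "the cat sat on the mat"
tokens = sentence.split()

# Next-token prediction (GPT-style): each prefix predicts the token that follows it.
next_token_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (['the'], 'cat'), (['the', 'cat'], 'sat'), ...

# Masked language modeling (BERT-style): hide a token and predict it
# from the surrounding context on both sides.
masked_index = 2
masked_input = tokens[:masked_index] + ["[MASK]"] + tokens[masked_index + 1:]
mlm_example = (masked_input, tokens[masked_index])
# e.g. (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')

print(next_token_examples)
print(mlm_example)
```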

 

GPT and self-attention

GPT models use the transformer architecture. A transformer pairs an encoder, which processes the input sequence, with a decoder, which generates the output sequence; both rely on multi-head self-attention, a mechanism that lets the model differentially weigh parts of the sequence to infer context and meaning. (GPT itself is a decoder-only variant trained with next-token prediction, while masked language modeling is the related objective that encoder-style models use to learn the relationships between words.)

 

GPT's self-attention mechanism works by converting tokens (pieces of text, which can be a word, a sub-word, or another chunk of characters) into vectors that represent the importance of each token in the input sequence. To do this, the model:

  1. Creates a query, key, and value vector for each token in the input sequence.
  2. Calculates the similarity between the query vector from step 1 and the key vector of every other token by taking their dot product.
  3. Feeds the output of step 2 into a softmax function to generate normalized attention weights.
  4. Multiplies the weights from step 3 by the value vector of each token, producing a final vector that represents the token's importance in the sequence.
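A minimal NumPy sketch of those four steps for a single attention head might look like the following. The random matrices stand in for learned projection weights, and causal masking is omitted for brevity.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X       : (seq_len, d_model) token embeddings
    W_q/k/v : (d_model, d_head) projection matrices (learned in a real model)
    """
    Q = X @ W_q                                  # step 1: query vectors
    K = X @ W_k                                  # step 1: key vectors
    V = X @ W_v                                  # step 1: value vectors

    scores = Q @ K.T / np.sqrt(K.shape[-1])      # step 2: scaled dot-product similarity

    # step 3: softmax over each row -> normalized attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V                           # step 4: weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings and an 8-dimensional head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (4, 8)
```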

 

GPT's 'multi-head' attention mechanism is an evolution of self-attention. Rather than performing steps 1 through 4 once, the model iterates the mechanism several times in parallel, generating a new linear projection of the query, key, and value vectors each time. By expanding self-attention in this way, the model can grasp sub-meanings of the input data and more complex relationships.
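Building on the single-head sketch above, a rough illustration of multi-head attention simply runs that routine once per head with different projections and concatenates the results. A real transformer also applies a learned output projection afterwards, which is omitted here.

```python
import numpy as np
# Assumes the self_attention function from the previous sketch is defined.

def multi_head_attention(X, heads):
    """Run self_attention once per head and concatenate the results.

    heads: list of (W_q, W_k, W_v) projection triples, one per head.
    (A real transformer also applies a learned output projection afterwards.)
    """
    outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 2)) for _ in range(3)) for _ in range(4)]
print(multi_head_attention(X, heads).shape)   # (4, 8): 4 heads x 2 dims per head
```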

 

ChatGPT

ChatGPT is a spinoff of InstructGPT, which introduced a novel approach to incorporating human feedback into the training process so that model outputs better align with user intent. Reinforcement Learning from Human Feedback (RLHF) is described in depth in OpenAI's 2022 paper, Training language models to follow instructions with human feedback, and is summarized in the three steps below.

 

Step 1: Supervised fine-tuning (SFT) model

Development started with fine-tuning the GPT-3 model. Forty contractors were hired to create a supervised training dataset in which every input has a known output for the model to learn from.

Inputs (prompts) were collected from actual user entries into the OpenAI API. Labelers then wrote an appropriate response to each prompt, creating a known output for every input. The GPT-3 model was fine-tuned on this new supervised dataset; the result is known as the SFT model.

To maximize the diversity of the prompt dataset, no more than 200 prompts could come from any one user ID, and prompts that shared long common prefixes were removed. In addition, all prompts containing personally identifiable information (PII) were removed.
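A rough sketch of such a filtering pass is shown below. The 200-prompt cap and the removal of prompts sharing a long common prefix come from the paper, while the prefix-length threshold, data layout, and function name here are invented for illustration.

```python
from collections import defaultdict

MAX_PER_USER = 200        # cap described in the paper
PREFIX_CHARS = 40         # illustrative threshold, not from the paper

def filter_prompts(records):
    """records: iterable of (user_id, prompt_text) pairs."""
    per_user = defaultdict(int)
    seen_prefixes = set()
    kept = []
    for user_id, prompt in records:
        if per_user[user_id] >= MAX_PER_USER:
            continue                      # limit the number of prompts per user ID
        prefix = prompt[:PREFIX_CHARS]
        if prefix in seen_prefixes:
            continue                      # drop prompts sharing a long common prefix
        per_user[user_id] += 1
        seen_prefixes.add(prefix)
        kept.append((user_id, prompt))
    return kept

# PII removal would be a separate scrubbing step applied to the kept prompts.
print(filter_prompts([("u1", "Tell me about the moon"), ("u1", "Tell me about the moon landing")]))
```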

 

Because prompts aggregated from the OpenAI API left some categories with very little data, labelers were also asked to write sample prompts for those categories. The categories of interest were:

  • Plain prompts: any arbitrary ask.
  • Few-shot prompts: instructions that contain multiple query/response pairs.
  • User-based prompts: prompts corresponding to a specific use case that was requested for the OpenAI API.

 

When generating responses, labelers were asked to do their best to infer what the user's instruction was. The paper describes the three main ways prompts request information:

  1. Direct: "Tell me about ..."
  2. Few-shot: Given these two examples of a story, write another story about the same topic.
  3. Continuation: Given the start of a story, finish it.

In total, 13,000 input/output samples were compiled from prompts submitted through the OpenAI API and written by labelers, for use in building the supervised model.
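One common way to turn a prompt/response pair into a supervised training example is to concatenate the two and mask the prompt tokens out of the loss. The toy tokenizer and the -100 ignore-label convention below are illustrative (they follow typical PyTorch-style fine-tuning code) rather than details taken from the paper.

```python
# Illustrative construction of one supervised fine-tuning example.
# A real pipeline uses the model's sub-word tokenizer; this toy version
# maps whitespace-separated words to integer ids.

IGNORE_INDEX = -100   # common convention: labels with this value are excluded from the loss

def build_sft_example(prompt, response, vocab):
    prompt_ids = [vocab.setdefault(w, len(vocab)) for w in prompt.split()]
    response_ids = [vocab.setdefault(w, len(vocab)) for w in response.split()]

    input_ids = prompt_ids + response_ids
    # Only the response tokens contribute to the next-token loss;
    # the prompt tokens are masked out with IGNORE_INDEX.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

vocab = {}
input_ids, labels = build_sft_example(
    "Explain the moon landing to a six year old .",
    "People went to the moon and ...",
    vocab,
)
print(input_ids)
print(labels)
```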

 

[Figure: SFT model (image source)]

 

 

Step 2: Reward Model

After the training in step 1, the SFT model produces responses that are better aligned with user prompts. The next refinement comes in the form of a reward model: its input is a sequence consisting of a prompt and a response, and its output is a scalar value called a reward. The reward model is needed in order to leverage reinforcement learning, in which a model learns to produce outputs that maximize its reward (see step 3).

 

To train the reward model, labelers are presented with 4 to 9 SFT model outputs for a single input prompt. They rank these outputs from best to worst, which creates combinations of output rankings (every pairwise comparison that can be drawn from the ranking).

Including each combination in the model as a separate data point led to overfitting (failure to extrapolate beyond the data seen in training). Instead, the model was built by leveraging each group of rankings as a single batch data point.
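The reward model is trained with a pairwise ranking loss over those comparisons. A minimal PyTorch sketch of that loss, with hand-picked scores standing in for real reward-model outputs, could look like this:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def ranking_loss(rewards_ranked):
    """Pairwise ranking loss for the responses to one prompt.

    rewards_ranked: 1-D tensor of scalar rewards for K responses, ordered
    from best to worst according to the labeler's ranking. For every
    (better, worse) pair the loss is -log(sigmoid(r_better - r_worse)),
    averaged over all K-choose-2 pairs, so the whole group is treated
    as a single batch element.
    """
    pairs = list(combinations(range(len(rewards_ranked)), 2))
    losses = [
        -F.logsigmoid(rewards_ranked[i] - rewards_ranked[j])
        for i, j in pairs          # i precedes j in the ranking, so i is preferred
    ]
    return torch.stack(losses).mean()

# Toy example: 4 ranked responses for one prompt
# (in practice the scores come from the reward model itself)
rewards = torch.tensor([2.1, 1.3, 0.4, -0.7], requires_grad=True)
loss = ranking_loss(rewards)
loss.backward()
print(loss.item())
```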

 

[Figure: GPT-3 reward model (image source)]

 

Step 3: Reinforcement Learning Model

In the final stage, the model is presented with a random prompt and returns a response. The response is generated using the 'policy' that the model has learned, where a policy is the strategy the machine has learned to use to achieve its goal: in this case, maximizing its reward. The reward model developed in step 2 assigns a scalar reward value to each prompt and response pair, and that reward then feeds back into the model to evolve the policy.

 

In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), the method used here to update the model's policy as each response is generated. PPO incorporates a per-token Kullback–Leibler (KL) penalty from the SFT model. KL divergence measures the similarity of two distribution functions and penalizes extreme distances.

Here, the KL penalty limits how far the responses can drift from the SFT model outputs trained in step 1. This is done to avoid over-optimizing the reward model and deviating too drastically from the human intention dataset.
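A hedged sketch of how the reward-model score and the per-token KL penalty are typically combined during this stage is shown below. The variable names and the beta coefficient are illustrative, and this is only the reward-shaping piece, not a full PPO implementation.

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Combine the reward model score with a per-token KL penalty.

    rm_score        : scalar reward from the reward model for the full response
    policy_logprobs : (T,) log-probs of the generated tokens under the current policy
    sft_logprobs    : (T,) log-probs of the same tokens under the frozen SFT model
    beta            : KL penalty coefficient (illustrative value)

    The per-token penalty -beta * (log pi(token) - log pi_SFT(token)) discourages
    the policy from drifting too far from the SFT model; the reward-model score
    is added at the final token of the response.
    """
    kl_penalty = -beta * (policy_logprobs - sft_logprobs)   # one value per token
    rewards = kl_penalty.clone()
    rewards[-1] += rm_score                                  # RM reward at the last token
    return rewards

# Toy example with made-up log-probabilities for a 5-token response
policy_lp = torch.tensor([-1.2, -0.8, -2.0, -0.5, -1.0])
sft_lp = torch.tensor([-1.0, -1.1, -1.8, -0.9, -1.2])
print(shaped_rewards(rm_score=1.5, policy_logprobs=policy_lp, sft_logprobs=sft_lp))
```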

 

[Figure: PPO reinforcement learning (image source)]

 

Evaluation of the Model

The model's performance is evaluated by setting aside, during training, a test set that the model has never seen. A series of evaluations are then run on this test set to determine whether the model is better aligned than its predecessor, GPT-3, along three criteria:

Helpfulness: the model's ability to infer and follow user instructions.

Truthfulness: the model's tendency to avoid hallucinations. When evaluated on the TruthfulQA dataset, the PPO model's outputs showed minor increases in truthfulness and informativeness.

Harmlessness: the model's ability to avoid inappropriate, derogatory, and denigrating content. Harmlessness was tested using the RealToxicityPrompts dataset.

 

ChatGPT is a powerful tool that can revolutionize how businesses interact with customers. Its natural, human-like text generation automates repetitive tasks and produces valuable data, helping companies improve customer satisfaction, efficiency, growth, and productivity. ChatGPT can also help you take your marketing efforts to new heights.

We will help you find the right solution for your business. Contact us today!
