GopherCite: Teaching models to support answers
Introduction
In this blog, we discuss GopherCite: teaching models to support answers. GopherCite, a 280-billion-parameter model, can provide responses backed by supporting evidence and can abstain from answering when it is unsure. To assess how well GopherCite performs, human raters evaluate its responses to a set of questions drawn from the NaturalQuestions and ELI5 datasets. The model's responses are judged to be high quality 80% of the time on the NaturalQuestions subset and 67% of the time on the ELI5 subset.
What is GopherCite
GopherCite is an Inline Evidence system built by tuning the 280B-parameter Gopher language model with supervised learning and Reinforcement Learning from Human Preferences. Given an input query, the system uses Google Search to retrieve relevant documents and then supplies the language model with a large context assembled from several of them. Although the system relies on these sources, this version of the work does not explicitly guard against unreliable ones: all retrieved documents are passed to the model, regardless of their source.
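To make the retrieval step concrete, here is a minimal sketch of assembling a large prompt context from search results. The function names (search_google, build_context) and the character budget are illustrative assumptions, not GopherCite's actual implementation.

```python
# Sketch of the retrieval and context-building step described above.
# The names and limits here are assumptions for exposition only.

def search_google(query: str, num_results: int = 10) -> list[dict]:
    """Return a list of documents, each with a 'title' and 'text' field.

    A real system would call a search API here; this stub only shows the
    expected shape of the data.
    """
    raise NotImplementedError("plug in a search API client here")

def build_context(query: str, documents: list[dict], max_chars: int = 20000) -> str:
    """Concatenate retrieved documents into one large prompt context."""
    parts = [f"Question: {query}\n"]
    used = 0
    for doc in documents:
        snippet = f"Title: {doc['title']}\n{doc['text']}\n\n"
        if used + len(snippet) > max_chars:
            break
        parts.append(snippet)
        used += len(snippet)
    return "".join(parts)
```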
The language model then generates a Self-Supported Question Answering (SQA) response, quoting one of these documents verbatim as evidence. During reinforcement learning, GopherCite optimizes the score from a "reward model". This model predicts human pairwise preferences between two candidate responses and is also trained with an auxiliary classification loss for whether a response is plausible and supported.
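As an illustration of what quoting a document's exact words requires, the sketch below checks that a quoted span appears verbatim in its named source document. The SQAResponse structure is a simplified, assumed representation rather than GopherCite's actual quote syntax.

```python
# Minimal sketch of verifying that quoted evidence is verbatim, in the spirit
# of the inline-evidence format described above. The data structure is an
# illustrative assumption.

from dataclasses import dataclass

@dataclass
class SQAResponse:
    answer: str        # the claim made by the model
    source_title: str  # title of the document the quote was taken from
    quote: str         # the supporting quote, expected to be verbatim

def quote_is_verbatim(response: SQAResponse, documents: dict[str, str]) -> bool:
    """Return True if the quote appears word-for-word in the named source."""
    source_text = documents.get(response.source_title, "")
    return response.quote in source_text
```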
Training GopherCite: Teaching models to support answers
Step 1:
Collect data from the best models we currently have and have it evaluated by people. Model outputs are shown to human labelers both for quality ratings of individual responses and as pairwise comparisons between answers. These provide data for supervised fine-tuning and reward model training, respectively. In the initial iteration, the underlying Gopher model is bootstrapped with few-shot prompting.
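The sketch below shows one plausible way to represent the two kinds of labels collected in this step; the field names and rating scale are assumptions for exposition only.

```python
# Illustrative records for the two label types gathered in Step 1:
# individual quality ratings (used for SFT) and pairwise preference
# judgments (used for reward model training).

from dataclasses import dataclass

@dataclass
class QualityRating:
    question: str
    response: str
    rating: int          # e.g. 1 (poor) to 5 (excellent); scale assumed

@dataclass
class PreferenceJudgment:
    question: str
    response_a: str
    response_b: str
    preferred: str       # "a" or "b", as chosen by the labeler
```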
Step 2:
Train a supervised fine-tuning (SFT) model: We fine-tune a pre-trained Gopher model on the samples that labelers rated most highly. The goals of the supervised fine-tuning stage are to teach the model to produce verbatim quotes in our syntax and to give it a baseline level of self-supported question-answering ability.
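A minimal sketch of assembling the SFT set from the top-rated samples, assuming the quality ratings from Step 1 are stored as plain dictionaries and using an illustrative rating threshold:

```python
def build_sft_dataset(ratings: list[dict], min_rating: int = 5) -> list[tuple[str, str]]:
    """Keep only (question, response) pairs that labelers rated highly.

    `ratings` is assumed to be a list of dicts like
    {"question": ..., "response": ..., "rating": int}; the scale and
    threshold are illustrative, not GopherCite's actual values.
    """
    return [(r["question"], r["response"]) for r in ratings if r["rating"] >= min_rating]
```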
Step 3:
Train a reward model (RM): Both reinforcement learning and reranking of model outputs require a scalar "overall quality" score for each output. We obtain this from a reward model trained on the comparison data, in which labelers choose the better of two answers to the same question.
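Pairwise comparison data of this kind is typically fit with a Bradley-Terry style objective; the sketch below shows that loss in PyTorch. Treat it as an illustration of the idea rather than GopherCite's exact training code: reward_model is an assumed callable mapping a (question, response) pair to a scalar score, and the auxiliary plausible/supported classification loss mentioned earlier would be added on top.

```python
import torch.nn.functional as F

def preference_loss(reward_model, question, preferred_response, rejected_response):
    """Encourage the preferred response to score higher than the rejected one."""
    r_preferred = reward_model(question, preferred_response)  # scalar tensor
    r_rejected = reward_model(question, rejected_response)    # scalar tensor
    # Bradley-Terry style objective: -log sigmoid(r_preferred - r_rejected)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```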
Step 4:
Optimize a reinforcement learning (RL) policy against the reward model: During the RL fine-tuning step, the model's quoting behavior is adjusted to reflect human preferences.
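One common way to turn the reward model's score into an RL training signal is sketched below. The optional KL penalty against the SFT model and its coefficient are assumptions borrowed from standard RLHF practice, not details stated in this post.

```python
import torch

def rl_reward(reward_model, question, sampled_response,
              policy_logprob: torch.Tensor, sft_logprob: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Score a sampled answer with the reward model, minus an (assumed) KL penalty.

    The penalty discourages the policy from drifting too far from the SFT model.
    """
    score = reward_model(question, sampled_response)
    kl_penalty = kl_coef * (policy_logprob - sft_logprob)
    return score - kl_penalty
```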
Step 5:
Repeat from Step 1, now collecting data from the improved models.
Evaluation Metric
To assess the quality of generated responses on the Self-Supported Question Answering (SQA) task, human raters are asked to judge whether each response is plausible and whether it is supported by the accompanying quoted evidence. The first metric, "plausible", measures whether the answer is reasonable and on-topic, as if it were given in conversation. The second metric, "supported", indicates whether the quoted evidence is sufficient to verify the accuracy of the answer. Producing SQA responses that are both plausible and supported requires a nontrivial effort to align the language model with human preferences.
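Aggregating these judgments into the headline numbers quoted in the introduction amounts to computing the fraction of responses rated both plausible and supported; a minimal sketch, assuming each rating is stored as a small dictionary:

```python
def fraction_plausible_and_supported(ratings: list[dict]) -> float:
    """`ratings` is a list of {"plausible": bool, "supported": bool} records."""
    if not ratings:
        return 0.0
    hits = sum(1 for r in ratings if r["plausible"] and r["supported"])
    return hits / len(ratings)
```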
Read – GopherCite: Teaching language models to support answers with verified quotes