RLHF
Framework
- Step 0: Pretraining LLM base model
- Step 1: Collecting human feedback
- Step 2: Fitting the reward model
- Step 3: Optimizing the policy with RL (see the sketch below for how the steps fit together)
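A minimal sketch of how steps 1-3 fit together after pretraining (step 0). The three callables are hypothetical placeholders standing in for the sub-pipelines, not a specific library API.

```python
def rlhf_pipeline(base_model, prompts, collect_preferences, train_reward_model, run_rl_finetuning):
    """Glue the three post-pretraining RLHF steps together (illustrative only)."""
    comparisons = collect_preferences(base_model, prompts)          # Step 1: collect human feedback
    reward_model = train_reward_model(comparisons)                  # Step 2: fit the reward model
    policy = run_rl_finetuning(base_model, reward_model, prompts)   # Step 3: optimize the policy with RL
    return policy
```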
Feedback Collection
LLMs trained with RLHF
OpenAI GPT-4
Anthropic Claude
Google Bard
Meta Llama
Policy Optimization
Reward Modeling
Evaluation
Challenges
Misaligned evaluators
Difficulty of oversight
Data quality
Feedback type limitations
What is the format of the feedback?
The choice of format has implications for the expressivity of the feedback, the ease of its collection, and how it can be used to improve systems.
Numerical scores: Although easy to leverage, providing numerical feedback is generally a hard and ill-defined task for humans, leading to a costly collection process and problems of subjectivity and variance. Extensively used for evaluation.
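One illustrative way (not prescribed here) to mitigate the subjectivity and variance of numerical scores is to z-normalize each annotator's ratings before averaging them per output:

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_numerical_scores(ratings):
    """ratings: list of (annotator_id, output_id, score) tuples.

    Z-normalize scores per annotator (to reduce individual bias/scale effects),
    then average the normalized scores per output.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_annotator.items()}

    per_output = defaultdict(list)
    for annotator, output, score in ratings:
        mu, sigma = stats[annotator]
        per_output[output].append((score - mu) / sigma)
    return {output: mean(zs) for output, zs in per_output.items()}

scores = aggregate_numerical_scores([
    ("ann1", "out_a", 4), ("ann1", "out_b", 2),
    ("ann2", "out_a", 5), ("ann2", "out_b", 5),
])
```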
What is its objective?
The purpose of collecting feedback is to align the model's behavior with some (often ill-defined) goal behavior.
Ranking-based: Easier to collect. Tends to be collected to improve model behavior rather than just for evaluation.
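Ranking feedback is commonly turned into a training signal with a pairwise (Bradley-Terry-style) loss over scores assigned by a model fit to the rankings; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_preferred, r_rejected):
    """Bradley-Terry-style loss for ranking feedback.

    r_preferred, r_rejected: tensors of model scores for the output the
    annotator ranked higher and lower, respectively. Minimizing this pushes
    the preferred output's score above the rejected one's.
    """
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Example: scores for three preference pairs
loss = pairwise_ranking_loss(torch.tensor([1.2, 0.3, 0.8]),
                             torch.tensor([0.7, 0.9, -0.1]))
```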
Qualitative natural language explanations: Typically provide more detailed information, either highlighting the shortcomings of the current output or suggesting specific actions for improvement.
Helpfulness: A necessary (but not sufficient) condition for a helpful system is that it performs the task well, so feedback related to task performance generally falls under this umbrella.
- Machine translation: quality of the translation
- Summarization: relevance, consistency, and accuracy
- Ability to follow instructions
Harmlessness: We want models not to produce certain types of output or violate certain norms. A common approach is to define a set of rules and ask humans whether the model's outputs violate them.
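A sketch of how such rule-based harmlessness annotations might be recorded; the rules and field names below are illustrative, not taken from a specific system:

```python
from dataclasses import dataclass, field

# Illustrative rule set; real deployments define their own norms.
RULES = [
    "Do not provide instructions for illegal activities",
    "Do not produce hate speech or harassment",
    "Do not reveal private personal information",
]

@dataclass
class HarmlessnessAnnotation:
    prompt: str
    output: str
    violated_rules: list = field(default_factory=list)  # indices into RULES

    @property
    def is_harmless(self) -> bool:
        return not self.violated_rules
```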
When is it used?
Used at training time to optimize the model parameters directly
Used at inference time to guide the decoding process
Feedback-based imitation learning: Supervised learning on a dataset of positively-labeled generations, maximizing the likelihood of the model's answers that humans labeled as correct.
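A sketch of one training step under this scheme, assuming a hypothetical `log_likelihood(model, prompt, answer)` helper that returns a differentiable log-likelihood (e.g., a PyTorch scalar):

```python
def imitation_learning_step(model, optimizer, batch, log_likelihood):
    """One supervised fine-tuning step on positively-labeled generations only.

    batch: list of (prompt, answer, human_label) tuples, where human_label is
    True if a human marked the answer as correct/acceptable.
    """
    positives = [(p, a) for p, a, label in batch if label]
    if not positives:
        return None
    # Maximize likelihood of human-approved answers = minimize the average NLL.
    loss = -sum(log_likelihood(model, p, a) for p, a in positives) / len(positives)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```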
Joint-feedback modeling: Leverages all the collected information by directly using human feedback to optimize the model. Some works simply train the model to predict the feedback given to each generation, while others train it to predict both the generations and the corresponding human feedback.
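One simple, illustrative instantiation is to serialize the feedback as extra text and train the LM on the concatenation, so it learns to predict both the generation and its feedback (the tags below are made up, not from a specific paper):

```python
def build_joint_training_example(prompt, generation, feedback):
    """Format a training string for joint-feedback modeling.

    The model is trained with an ordinary LM objective on this text, so it
    learns to predict both the generation and the human feedback attached to it.
    """
    return (
        f"{prompt}\n"
        f"<response> {generation} </response>\n"
        f"<feedback> {feedback} </feedback>"
    )

example = build_joint_training_example(
    "Summarize the article.",
    "The article argues that ...",
    "Mostly accurate, but omits the main counterargument.",
)
```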
Reinforcement learning: A more versatile approach, allowing for direct optimization of a model's parameters based on human feedback.
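RLHF systems typically use PPO with a KL penalty; the sketch below uses a simpler REINFORCE-style update just to make the optimization target concrete. The `sample_with_logprob` and `reward_model` callables are assumed interfaces.

```python
def reinforce_step(policy, optimizer, prompts, sample_with_logprob, reward_model):
    """One simplified policy-gradient step (REINFORCE with a mean baseline).

    sample_with_logprob(policy, prompt) -> (text, log_prob), where log_prob is a
    differentiable scalar; reward_model(prompt, text) -> float.
    """
    samples = [(p, *sample_with_logprob(policy, p)) for p in prompts]
    rewards = [reward_model(p, text) for p, text, _ in samples]
    baseline = sum(rewards) / len(rewards)  # simple variance-reduction baseline

    # Policy gradient: increase log-probability of above-average outputs.
    loss = -sum((r - baseline) * logp
                for (_, _, logp), r in zip(samples, rewards)) / len(samples)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```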
Feedback memory: Maintaining a repository of feedback from prior sessions. Then, when processing new inputs, the system uses relevant feedback from similar past inputs to guide the model toward generating more desirable outputs based on past experience.
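A minimal sketch of such a memory; the string-similarity retrieval and prompt template are illustrative stand-ins for whatever a real system would use (e.g., embedding search):

```python
from difflib import SequenceMatcher

class FeedbackMemory:
    """Store (input, feedback) pairs and retrieve feedback for similar new inputs."""

    def __init__(self):
        self.entries = []  # list of (input_text, feedback_text)

    def add(self, input_text, feedback_text):
        self.entries.append((input_text, feedback_text))

    def retrieve(self, new_input, k=3):
        # Cheap string similarity for illustration only.
        scored = sorted(
            self.entries,
            key=lambda e: SequenceMatcher(None, e[0], new_input).ratio(),
            reverse=True,
        )
        return [feedback for _, feedback in scored[:k]]

    def augment_prompt(self, new_input, k=3):
        # Prepend relevant past feedback so the model can avoid repeating mistakes.
        hints = "\n".join(f"- {f}" for f in self.retrieve(new_input, k))
        return f"Relevant feedback from past sessions:\n{hints}\n\nInput: {new_input}"
```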
Iterative output refinement: Users can provide feedback on intermediate responses, enabling the model to adjust its outputs until they meet the user's satisfaction.
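A sketch of the interaction loop, with `generate` and `get_user_feedback` as assumed callables (the latter returning None once the user is satisfied):

```python
def refine_until_satisfied(prompt, generate, get_user_feedback, max_rounds=5):
    """Iteratively refine an output using user feedback on intermediate responses."""
    current_prompt = prompt
    output = generate(current_prompt)
    for _ in range(max_rounds):
        feedback = get_user_feedback(output)
        if feedback is None:  # user accepts the output
            break
        # Fold the feedback back into the prompt and try again.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{output}\n\n"
            f"User feedback:\n{feedback}\n\nRevised answer:"
        )
        output = generate(current_prompt)
    return output
```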
Feedback models: Sampling a large number of candidate generations and reranking them according to the feedback model.
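A sketch of best-of-n reranking with a feedback model, where `sample` and `feedback_model` are assumed interfaces standing in for an LLM sampler and a trained feedback/reward model:

```python
def best_of_n(prompt, sample, feedback_model, n=16):
    """Sample n candidate generations and return the one the feedback model prefers.

    sample(prompt) -> candidate text; feedback_model(prompt, text) -> score.
    """
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: feedback_model(prompt, c))
```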
How is it modeled?
Direct feedback from humans
Surrogate models that approximate human preferences: Models trained to predict or approximate human preferences, which can stand in for direct human feedback.
Challenges
RL difficulties
Policy misgeneralization
Distributional challenges
Joint RM/Policy training challenges
Problem misspecification
Misgeneralization/Hacking
Evaluation difficulty