RLHF

Framework

- Step 0: Pretraining the LLM base model

- Step 1: Collecting human feedback

- Step 2: Fitting the reward model

- Step 3: Optimizing the policy with RL (an end-to-end sketch follows below)
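
A minimal end-to-end sketch of how these four steps fit together; every function below is a hypothetical placeholder standing in for a full training procedure, not a real library API:

```python
# High-level sketch of the four RLHF steps above (all placeholders).

def pretrain(corpus):
    # Step 0: pretrain the base LLM on a large unlabeled corpus (placeholder).
    return "base_model"

def collect_human_feedback(model, prompts):
    # Step 1: sample outputs and ask humans to compare or score them (placeholder).
    return [{"prompt": p, "chosen": "output A", "rejected": "output B"} for p in prompts]

def fit_reward_model(feedback):
    # Step 2: fit a model that scores generations the way humans did (placeholder).
    return lambda prompt, output: 0.0

def optimize_policy_with_rl(model, reward_model, prompts):
    # Step 3: maximize the reward model's score with RL, usually with a KL
    # penalty that keeps the policy close to the base model (placeholder).
    return model

base = pretrain(corpus=["..."])
feedback = collect_human_feedback(base, prompts=["..."])
reward_model = fit_reward_model(feedback)
policy = optimize_policy_with_rl(base, reward_model, prompts=["..."])
```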

Feedback Collection

LLMs trained with RLHF

OpenAI GPT-4

Anthropic Claude

Google Bard

Meta Llama

Policy Optimization

Reward Modeling

Evaluation

Challenges

Misaligned evaluators

Difficulty of oversight

Data quality

Feedback type limitations

What is the format of the feedback?
The choice of format has implications for the expressivity of the feedback, the ease of its collection, and how it can be used to improve systems.

Numerical scores: Although easy to leverage, numerical feedback is generally a hard and ill-defined task for humans to provide, leading to a costly collection process and problems of subjectivity and variance. Extensively used for evaluation.

What is its objective?
The purpose of collecting feedback is to align the model's behavior with some (often ill-defined) goal behavior.

Ranking-based: Easier to collect. Tends to be collected to improve model behavior rather than just for evaluation.
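
As a sketch of how ranking feedback is typically turned into a reward model, the snippet below fits a scorer with a Bradley-Terry style pairwise loss; the random tensors and the tiny two-layer network are stand-ins for encoded (prompt, generation) pairs and a real LLM-based reward model:

```python
import torch
import torch.nn as nn

# Pairwise (Bradley-Terry style) reward-model training sketch: push the score
# of the human-preferred generation above the score of the rejected one.
chosen = torch.randn(32, 128)    # stand-in embeddings of preferred outputs
rejected = torch.randn(32, 128)  # stand-in embeddings of rejected outputs

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected.
    loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```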

Qualitative natural language explanations: Typically provide more detailed information, either highlighting the shortcomings of the current output or suggesting specific actions for improvement.

Helpfulness: A necessary (but not sufficient) condition for a helpful system is that it performs the task well, so feedback related to task performance generally falls under this umbrella.

- Machine translation: quality of the translation;

- Summarization: relevance, consistency, and accuracy;

- Ability to follow instructions.

Harmlessness: We want models not to produce certain types of output or violate certain norms. This feedback is typically collected by defining a set of rules and asking humans whether the system's outputs violate them.

When is it used?

Used during training to optimize the model parameters directly

Used at inference time to guide the decoding process

Feedback-based imitation learning: Supervised learning on a dataset composed of positively-labeled generations, i.e., maximizing the likelihood of the model's answers that humans labeled as correct.
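
A minimal sketch of this filtered fine-tuning, assuming a generic Hugging Face causal LM and a toy, made-up feedback dataset (in practice the prompt tokens are usually masked out of the loss):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Feedback-based imitation learning sketch: keep only generations labeled
# correct by humans, then maximize their likelihood with the standard LM loss.
model_name = "gpt2"  # assumption: any small causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

data = [  # hypothetical human-labeled generations
    {"prompt": "Translate to French: Hello", "output": " Bonjour", "label": "correct"},
    {"prompt": "Translate to French: Goodbye", "output": " Bonjour", "label": "incorrect"},
]
positive = [d for d in data if d["label"] == "correct"]  # discard negatives

model.train()
for example in positive:
    ids = tokenizer(example["prompt"] + example["output"], return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # negative log-likelihood of the sequence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```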

Joint-feedback modeling: Leverages all the collected information by using human feedback directly to optimize the model. Some works simply train the model to predict the feedback given to each generation; others train it to predict both the generations and the corresponding human feedback.
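
A sketch of the second variant, training the model to predict both the generation and its attached feedback; the data format and the [FEEDBACK] separator are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Joint-feedback modeling sketch: keep every generation and train the model to
# predict the generation together with the human feedback it received.
model_name = "gpt2"  # assumption: any small causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

data = [  # hypothetical (prompt, output, feedback) triples
    {"prompt": "Summarize: ...", "output": " A short summary.", "feedback": "accurate but misses the main point"},
    {"prompt": "Summarize: ...", "output": " Another summary.", "feedback": "faithful and concise"},
]

model.train()
for example in data:
    text = example["prompt"] + example["output"] + " [FEEDBACK] " + example["feedback"]
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # jointly model generation and feedback
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```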

Reinforcement learning: A more versatile approach, allowing for direct optimization of a model's parameters based on human feedback.
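
A bare-bones sketch of the RL step, assuming a REINFORCE-style update with a KL-style penalty toward a frozen reference model; dummy_reward, the checkpoint name, and the hyperparameters are illustrative stand-ins (real systems typically use PPO and a learned reward model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# REINFORCE-style sketch of KL-regularized policy optimization.
model_name = "gpt2"  # assumption: any small causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)  # frozen copy
reference.eval()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # strength of the KL-style penalty (made-up value)

def dummy_reward(text: str) -> float:
    return float(len(text.split()))  # stand-in for a reward model's scalar score

def completion_logprob(model, full_ids, n_prompt):
    # Sum of log-probabilities of the completion tokens under `model`.
    logits = model(full_ids).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, n_prompt - 1:].sum()

prompt_ids = tokenizer("Write a short greeting:", return_tensors="pt").input_ids
generated = policy.generate(prompt_ids, do_sample=True, max_new_tokens=20,
                            pad_token_id=tokenizer.eos_token_id)
completion = tokenizer.decode(generated[0, prompt_ids.shape[1]:])

policy_lp = completion_logprob(policy, generated, prompt_ids.shape[1])
with torch.no_grad():
    reference_lp = completion_logprob(reference, generated, prompt_ids.shape[1])

# Penalize drifting away from the reference model (per-sample KL estimate).
shaped_reward = dummy_reward(completion) - beta * (policy_lp.detach() - reference_lp)

loss = -shaped_reward * policy_lp  # REINFORCE: raise log-prob of high-reward samples
loss.backward()
optimizer.step()
optimizer.zero_grad()
```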

Feedback memory: Maintaining a repository of feedback from prior sessions. When processing new inputs, the system uses relevant feedback from similar inputs in its memory to guide the model toward generating more desirable outputs based on past experiences.
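
A toy sketch of such a feedback memory, assuming simple word-overlap retrieval and a made-up prompt format:

```python
# Feedback-memory sketch: store past (input, feedback) pairs and, for a new
# input, prepend the feedback attached to the most similar past inputs.

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))  # Jaccard word overlap

class FeedbackMemory:
    def __init__(self):
        self.entries = []  # list of (input_text, feedback_text)

    def add(self, input_text: str, feedback_text: str):
        self.entries.append((input_text, feedback_text))

    def retrieve(self, input_text: str, k: int = 2):
        ranked = sorted(self.entries, key=lambda e: similarity(e[0], input_text), reverse=True)
        return [fb for _, fb in ranked[:k]]

memory = FeedbackMemory()
memory.add("Summarize the meeting notes", "Too long; keep it under three sentences.")
memory.add("Translate this email to French", "Use a formal register.")

new_input = "Summarize the quarterly report"
hints = memory.retrieve(new_input)
prompt = "Past feedback on similar requests:\n- " + "\n- ".join(hints) + "\n\nTask: " + new_input
print(prompt)
```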

Iterative output refinement: Users can provide feedback on intermediate responses, enabling the model to adjust its outputs until they meet the user's satisfaction.

Feedback models: Sampling a large number of candidate generations and reranking them according to the feedback model.
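
A minimal best-of-n reranking sketch; generate_candidate and feedback_model_score are dummy stand-ins for the generator and the learned feedback model:

```python
import random

# Best-of-n reranking sketch: sample several candidates, keep the one the
# feedback model scores highest. Both functions below are dummy stand-ins.

def generate_candidate(prompt: str) -> str:
    return f"{prompt} -> candidate #{random.randint(0, 999)}"

def feedback_model_score(prompt: str, candidate: str) -> float:
    return random.random()  # stand-in for a learned feedback/reward model

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: feedback_model_score(prompt, c))

print(best_of_n("Write a polite reply to the customer"))
```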

How is it modeled?

Direct feedback from humans

Surrogate models that approximate human preferences: Learned models that predict or approximate human preferences, so they can stand in for direct human feedback.

Challenges

RL difficulties

Policy misgeneralization

Distributional challenges

Joint RM/Policy training challenges

Problem misspecification

Misgeneralization/Hacking

Evaluation difficulty
