Can adding a loss on rejected answers during SFT replace RLHF?
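
To make the question concrete, here is a minimal sketch of what "SFT plus a rejected-answer loss" could look like. It assumes an unlikelihood-style penalty on the rejected answer, which is only one possible choice and is not taken from the source. The function name `sft_with_rejected_loss`, the weight `alpha`, and the convention of passing per-token probabilities as plain lists are all hypothetical.

```python
import math

EPS = 1e-8  # keeps log() away from 0 (assumed numerical guard)


def _clamp(p):
    """Clamp a probability into the open interval (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def sft_with_rejected_loss(chosen_token_probs, rejected_token_probs, alpha=0.5):
    """Hypothetical combined loss for one (chosen, rejected) answer pair.

    chosen_token_probs:   probability the model assigns to each token of the chosen answer
    rejected_token_probs: probability the model assigns to each token of the rejected answer
    alpha:                assumed weight of the rejected-answer penalty
    """
    # Standard SFT term: negative log-likelihood of the chosen answer.
    nll_chosen = -sum(math.log(_clamp(p)) for p in chosen_token_probs)
    # Assumed penalty: unlikelihood term, -log(1 - p) per rejected token,
    # which grows as the model puts more mass on the rejected answer.
    penalty_rejected = -sum(math.log(1.0 - _clamp(p)) for p in rejected_token_probs)
    return nll_chosen + alpha * penalty_rejected


if __name__ == "__main__":
    loss = sft_with_rejected_loss([0.5, 0.5], [0.5], alpha=1.0)
    print(f"combined loss: {loss:.4f}")
```

Both terms in this sketch are computed on fixed, pre-collected answers; nothing here samples from the model being trained.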

1. Why reinforcement learning is needed

1.1 The purpose of each training stage

https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81

Pre-training:

image.png

Supervised training:

image.png

Reinforcement learning (RL):

image.png

RL is much harder than supervised training

image.png