Can adding a loss term for the rejected answer during SFT replace RLHF?
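A minimal sketch of what the question has in mind, assuming "rejected answer loss" means an unlikelihood-style penalty on the rejected response's tokens added to the usual SFT cross-entropy on the chosen response. The function name `sft_with_rejected_loss`, the `beta` weight, and the assumption that logits/labels are already shifted for causal LM training are all illustrative, not a recipe from the reference below.

```python
import torch
import torch.nn.functional as F

def sft_with_rejected_loss(logits_chosen, labels_chosen,
                           logits_rejected, labels_rejected,
                           beta=0.1, ignore_index=-100):
    """SFT cross-entropy on the chosen answer plus an unlikelihood-style
    penalty that pushes probability mass away from the rejected answer.
    Logits/labels are assumed already aligned (shifted) for a causal LM."""
    # Standard SFT term: maximize log p(chosen tokens)
    ce = F.cross_entropy(
        logits_chosen.view(-1, logits_chosen.size(-1)),
        labels_chosen.view(-1),
        ignore_index=ignore_index,
    )

    # Penalty term: minimize log p(rejected tokens), i.e. maximize log(1 - p)
    log_probs = F.log_softmax(logits_rejected, dim=-1)
    mask = labels_rejected != ignore_index
    tok_log_p = log_probs.gather(
        -1, labels_rejected.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    # -log(1 - p), computed from log p; high p on a rejected token -> high penalty
    unlikelihood = -torch.log1p(-tok_log_p.exp().clamp(max=1 - 1e-6))
    penalty = (unlikelihood * mask).sum() / mask.sum().clamp(min=1)

    return ce + beta * penalty
```

Pairwise preference methods such as DPO formalize this idea with a reference model and a contrastive objective over (chosen, rejected) pairs; the sketch above is the simplest version without one.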
Reference: Yoav Goldberg's write-up on RL for language models: https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81
pre-training:
supervised training:
Reinforcement Learning (RL):
RL is much harder than supervised training
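A rough illustration of that difficulty gap (an assumption-laden sketch, not the gist's argument): supervised training gets a dense per-token target, while RL only gets a scalar reward for a whole sampled response, so its gradient signal is sparser and noisier.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, target_ids):
    # Every position has a known correct token -> dense, low-variance gradient.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

def reinforce_loss(log_probs_of_sampled, reward):
    # One scalar reward per sampled sequence (REINFORCE-style estimator) ->
    # sparse, high-variance signal spread over the whole response.
    return -(reward * log_probs_of_sampled.sum(dim=-1)).mean()
```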