عجفت الغور

ml

Tags: Computers, math

Failures

ML Tools

Tree Based Models

Post Training Methods

Method  | Who supplies preferences?     | Reward model?                       | Optimizer / loss
RLHF    | Humans                        | Yes                                 | RL (policy-gradient) with KL to reference policy
RLAIF   | LLM (AI)                      | Yes                                 | RL (policy-gradient) with KL to reference policy
d-RLAIF | LLM (AI), on-policy scoring   | No (LLM provides the scalar reward) | RL (policy-gradient) with KL to reference policy
DPO     | Humans or AI (pairwise prefs) | No                                  | Direct Preference Optimization (classification-style)

DPO (direct preference optimization)

  • Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv:2305.18290. Preprint, arXiv, July 29, 2024. https://doi.org/10.48550/arXiv.2305.18290. - papers

  • Used widely now; Llama 4's post-training combined online RL (where ground truths are available) followed by DPO

  • DPO saves the effort of training a separate reward model

  • Core claim: the standard RLHF objective (maximize learned reward under a KL constraint to a reference policy) can be optimized exactly with a simple binary cross-entropy on preference pairs—no actor–critic loop, no on-policy sampling, minimal tuning. (Fig. 1, p.2)

  • Reparameterization (the “LM is a reward model” insight):

    • Optimal policy under KL control: \( \pi^\star(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp\!\left(\tfrac{1}{\beta}\, r(x,y)\right) \).
    • Invert to express reward via policy: \( r(x,y) = \beta \log \tfrac{\pi^\star(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \).
    • Bradley–Terry preference prob becomes a function of policy log-ratio only; the partition cancels. Resulting DPO loss: \[ L_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})= -\mathbb{E}_{(x,y_w,y_l)\sim D}\log\sigma\!\Big(\beta\big[\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}-\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\big]\Big). \]
  • Why this avoids degeneration: the gradient upweights examples where the current implicit reward \(\hat r_\theta(x,y)=\beta\log\tfrac{\pi_\theta}{\pi_{\text{ref}}}\) ranks the loser above the winner: \[ \nabla_\theta L_{\text{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma\!\left(\hat r_\theta(x,y_l)-\hat r_\theta(x,y_w)\right)\,\big(\nabla\log\pi(y_w\!\mid\!x)-\nabla\log\pi(y_l\!\mid\!x)\big)\right]. \] The per-example weight curbs the naive “push ratios” failure mode observed with unweighted objectives. (pp.4–5)
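
  The loss above is essentially a one-liner in practice. A minimal sketch (variable names are illustrative, not from the paper's code), taking summed per-token sequence log-probs as plain floats:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) preference pair.

    Inputs are whole-sequence log-probs, i.e. log pi(y|x) summed over tokens.
    L = -log sigma(beta * [log-ratio(y_w) - log-ratio(y_l)]).
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(x) = max(-x, 0) + log(1 + exp(-|x|)), the numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

  When the policy exactly matches the reference (all log-ratios zero), the loss sits at \(\log 2 \approx 0.693\); it falls as the implicit reward for \(y_w\) pulls ahead of that for \(y_l\).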

  • Theoretical framing:

    • Rewards differing by any \(f(x)\) are equivalent for both the preference likelihood and the induced KL-constrained optimal policy. DPO picks the unique representative in each equivalence class with \( \sum_y \pi_{\text{ref}}(y\mid x)\exp\!\big(\tfrac{1}{\beta}r(x,y)\big)=1 \), i.e., the one whose optimal policy is directly \(\pi_\theta\). (pp.5–6)
    • Actor–critic diagnosis: PPO needs a baseline/“soft value” (normalizer) for stability; DPO’s reparameterization bakes in that normalization, avoiding variance/baseline hacks. (p.6)
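
  The equivalence-class point checks out numerically (illustrative code, not from the paper): shifting both rewards by an arbitrary prompt-dependent constant \(f(x)\) cancels in the pairwise difference, so the Bradley–Terry probability is unchanged.

```python
import math

def bt_prob(r_w, r_l):
    """Bradley-Terry probability that y_w is preferred over y_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Any per-prompt shift f(x) cancels in r_w - r_l, so the preference
# likelihood cannot distinguish rewards within the same equivalence class.
p_base    = bt_prob(2.0, 1.0)
p_shifted = bt_prob(2.0 + 3.7, 1.0 + 3.7)  # same arbitrary f(x)=3.7 added to both
```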
  • Practical recipe:

    • Data: \((x,y_w,y_l)\) from human (or AI) preferences.
    • Reference: use \(\pi_{\text{SFT}}\) if available; otherwise fit \(\pi_{\text{ref}}\) by MLE on \(y_w\) only to reduce ref–data shift. (p.5)
    • One loss, no rollouts, no separate RM training, no KL penalties at sequence time—KL pressure is implicit via the log-ratios.
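
  The "implicit KL" point can be made concrete: the implicit reward is just \(\beta\) times the whole-sequence policy/reference log-ratio, so no explicit KL penalty term is added. A sketch, where `token_logps` stands in for per-token log-probs gathered from an LM's logits (hypothetical helpers, not the paper's code):

```python
def seq_logprob(token_logps):
    """Sequence log-prob = sum of per-token log-probs (teacher forcing)."""
    return sum(token_logps)

def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """r_hat(x, y) = beta * [log pi_theta(y|x) - log pi_ref(y|x)]."""
    return beta * (seq_logprob(policy_token_logps) - seq_logprob(ref_token_logps))

# If the policy drifts toward sequences the reference finds unlikely, the
# log-ratio (and hence the implicit reward) grows: the KL pressure is baked in.
```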
  • Empirical signals (paper-scale models up to ~6B):

    • Reward–KL frontier (IMDb sentiment): DPO strictly dominates PPO (even PPO with ground-truth reward). (Fig. 2 left, p.7)
    • Summarization (Reddit TL;DR): DPO reaches a GPT‑4 win rate of ≈61% against reference summaries at temperature 0.0, versus ≈57% for PPO at its best temperature, and is more robust to sampling temperature. (Fig. 2 right, p.7; p.9)
    • Dialogue (Anthropic HH, 1‑turn): DPO is the only efficient method that beats the dataset “chosen” responses; roughly matches or exceeds “best‑of‑128” sampling without its test‑time cost. (Fig. 3, p.8; p.9)
    • OOD generalization (CNN/DailyMail): DPO > PPO on GPT‑4 win rate when evaluated off‑distribution. (Table 1, p.9)
    • Best‑of‑N plateau: gains saturate around N≈64–128; DPO rivals this without expensive sampling. (Fig. 4, p.23)
  • Hyperparams the authors actually used (sanity defaults, not sacred):

    • \(\beta\) often \(0.1\); TL;DR used \(\beta=0.5\).
    • Batch 64; RMSProp; LR \(1\mathrm{e}{-6}\) with 150‑step linear warmup. (App. B, p.20)
  • Caveats / limits worth remembering:

    • OOD and self‑labeling dynamics need more study; reward over‑optimization can still bite (small late‑training dips). Scaling beyond 6B left for future work. (p.10)
    • Unlikelihood training is unstable on open‑ended tasks (degenerate text). (App. C/D; Table on p.22)
  • TL;DR of the TL;DR: Train one model; treat \(\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) as the implicit reward; fit it with BCE on pairwise prefs. You get PPO‑level (often better) alignment without PPO’s brittleness or cost. (Fig. 1–3, pp.2,7–8)

RL

  • Fisher information matrix?

Computer Vision