عجفت الغور

ml

Tags: Computers, math

Failures

ML Tools

Tree Based Models

Post Training Methods

Method  | Who supplies preferences?     | Reward model?                       | Optimizer / loss
RLHF    | Humans                        | Yes                                 | RL (policy-gradient) with KL to reference policy
RLAIF   | LLM (AI)                      | Yes                                 | RL (policy-gradient) with KL to reference policy
d-RLAIF | LLM (AI), on-policy scoring   | No (LLM provides the scalar reward) | RL (policy-gradient) with KL to reference policy
DPO     | Humans or AI (pairwise prefs) | No                                  | Direct Preference Optimization (classification-style)

DPO (direct preference optimization)

  • Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv:2305.18290. Preprint, arXiv, July 29, 2024. https://doi.org/10.48550/arXiv.2305.18290. - papers

  • Used widely now; Llama 4's post-training combined online RL (where ground truths are available) followed by DPO

  • DPO saves the effort of training a separate reward model

  • Core claim: the standard RLHF objective (maximize learned reward under a KL constraint to a reference policy) can be optimized exactly with a simple binary cross-entropy on preference pairs—no actor–critic loop, no on-policy sampling, minimal tuning. (Fig. 1, p.2)

  • Reparameterization (the “LM is a reward model” insight):

    • Optimal policy under KL control: \( \pi^\star(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp\!\left(\tfrac{1}{\beta}\, r(x,y)\right) \).
    • Invert to express reward via policy: \( r(x,y) = \beta \log \tfrac{\pi^\star(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \).
    • Bradley–Terry preference prob becomes a function of policy log-ratio only; the partition cancels. Resulting DPO loss: \[ L_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})= -\mathbb{E}_{(x,y_w,y_l)\sim D}\log\sigma\!\Big(\beta\big[\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}-\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\big]\Big). \]
  • Why this avoids degeneration: the gradient upweights examples where the current implicit reward \(\hat r_\theta(x,y)=\beta\log\tfrac{\pi_\theta}{\pi_{\text{ref}}}\) ranks the loser above the winner: \[ \nabla_\theta L_{\text{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma\!\left(\hat r_\theta(x,y_l)-\hat r_\theta(x,y_w)\right)\,\big(\nabla\log\pi(y_w\!\mid\!x)-\nabla\log\pi(y_l\!\mid\!x)\big)\right]. \] The per-example weight curbs the naive “push ratios” failure mode observed with unweighted objectives. (pp.4–5)
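
  The loss above is essentially a one-liner in practice. A minimal sketch (variable names are illustrative, not from the paper's code), taking summed per-token sequence log-probs as plain floats:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) preference pair.

    Inputs are whole-sequence log-probs, i.e. log pi(y|x) summed over tokens.
    L = -log sigma(beta * [log-ratio(y_w) - log-ratio(y_l)]).
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(x) = max(-x, 0) + log(1 + exp(-|x|)), the numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

  When the policy exactly matches the reference (all log-ratios zero), the loss sits at \(\log 2 \approx 0.693\); it falls as the implicit reward for \(y_w\) pulls ahead of that for \(y_l\).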

  • Theoretical framing:

    • Rewards differing by any \(f(x)\) are equivalent for both the preference likelihood and the induced KL-constrained optimal policy. DPO picks the unique representative in each equivalence class with \( \sum_y \pi_{\text{ref}}(y\mid x)\exp\!\big(\tfrac{1}{\beta}r(x,y)\big)=1 \), i.e., the one whose optimal policy is directly \(\pi_\theta\). (pp.5–6)
    • Actor–critic diagnosis: PPO needs a baseline/“soft value” (normalizer) for stability; DPO’s reparameterization bakes in that normalization, avoiding variance/baseline hacks. (p.6)
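
  The equivalence-class point checks out numerically (illustrative code, not from the paper): shifting both rewards by an arbitrary prompt-dependent constant \(f(x)\) cancels in the pairwise difference, so the Bradley–Terry probability is unchanged.

```python
import math

def bt_prob(r_w, r_l):
    """Bradley-Terry probability that y_w is preferred over y_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Any per-prompt shift f(x) cancels in r_w - r_l, so the preference
# likelihood cannot distinguish rewards within the same equivalence class.
p_base    = bt_prob(2.0, 1.0)
p_shifted = bt_prob(2.0 + 3.7, 1.0 + 3.7)  # same arbitrary f(x)=3.7 added to both
```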
  • Practical recipe:

    • Data: \((x,y_w,y_l)\) from human (or AI) preferences.
    • Reference: use \(\pi_{\text{SFT}}\) if available; otherwise fit \(\pi_{\text{ref}}\) by MLE on \(y_w\) only to reduce ref–data shift. (p.5)
    • One loss, no rollouts, no separate RM training, no KL penalties at sequence time—KL pressure is implicit via the log-ratios.
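
  The "implicit KL" point can be made concrete: the implicit reward is just \(\beta\) times the whole-sequence policy/reference log-ratio, so no explicit KL penalty term is added. A sketch, where `token_logps` stands in for per-token log-probs gathered from an LM's logits (hypothetical helpers, not the paper's code):

```python
def seq_logprob(token_logps):
    """Sequence log-prob = sum of per-token log-probs (teacher forcing)."""
    return sum(token_logps)

def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """r_hat(x, y) = beta * [log pi_theta(y|x) - log pi_ref(y|x)]."""
    return beta * (seq_logprob(policy_token_logps) - seq_logprob(ref_token_logps))

# If the policy drifts toward sequences the reference finds unlikely, the
# log-ratio (and hence the implicit reward) grows: the KL pressure is baked in.
```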
  • Empirical signals (paper-scale models up to ~6B):

    • Reward–KL frontier (IMDb sentiment): DPO strictly dominates PPO (even PPO with ground-truth reward). (Fig. 2 left, p.7)
    • Summarization (Reddit TL;DR): DPO reaches a GPT‑4 win rate of ≈61% against reference summaries at temperature 0.0, versus ≈57% for PPO at its best temperature, and is more robust to sampling temperature. (Fig. 2 right, p.7; p.9)
    • Dialogue (Anthropic HH, 1‑turn): DPO is the only efficient method that beats the dataset “chosen” responses; roughly matches or exceeds “best‑of‑128” sampling without its test‑time cost. (Fig. 3, p.8; p.9)
    • OOD generalization (CNN/DailyMail): DPO > PPO on GPT‑4 win rate when evaluated off‑distribution. (Table 1, p.9)
    • Best‑of‑N plateau: gains saturate around N≈64–128; DPO rivals this without expensive sampling. (Fig. 4, p.23)
  • Hyperparams the authors actually used (sanity defaults, not sacred):

    • \(\beta\) often \(0.1\); TL;DR used \(\beta=0.5\).
    • Batch 64; RMSProp; LR \(1\mathrm{e}{-6}\) with 150‑step linear warmup. (App. B, p.20)
  • Caveats / limits worth remembering:

    • OOD and self‑labeling dynamics need more study; reward over‑optimization can still bite (small late‑training dips). Scaling beyond 6B left for future work. (p.10)
    • Unlikelihood training is unstable on open‑ended tasks (degenerate text). (App. C/D; Table on p.22)
  • TL;DR of the TL;DR: Train one model; treat \(\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) as the implicit reward; fit it with BCE on pairwise prefs. You get PPO‑level (often better) alignment without PPO’s brittleness or cost. (Fig. 1–3, pp.2,7–8)

RL

  • Fisher information matrix?

Computer Vision