ml
Failures
- Zillow winds down its ML-driven house-flipping program (Zillow Offers): https://www.prnewswire.com/news-releases/zillow-group-reports-third-quarter-2021-financial-results--shares-plan-to-wind-down-zillow-offers-operations-301414460.html?tc=eml_cleartime
ML Tools
- visualizer: https://github.com/lutzroeder/netron
- voice cloning: https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Installation
- seamless expressive voice models from FB: https://seamless.metademolab.com/expressive/?utm_source=metaai&utm_medium=web&utm_campaign=seamless&utm_content=landing_page
Tree Based Models
Post Training Methods
Method | Who supplies preferences? | Reward model? | Optimizer / loss |
---|---|---|---|
RLHF | Humans | Yes | RL (policy‑gradient) with KL to reference policy |
RLAIF | LLM (AI) | Yes | RL (policy‑gradient) with KL to reference policy |
d‑RLAIF | LLM (AI), on‑policy scoring | No — LLM provides scalar reward | RL (policy‑gradient) with KL to reference policy |
DPO | Humans or AI (pairwise prefs) | No | Direct Preference Optimization (classification‑style) |
DPO (direct preference optimization)
- Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv:2305.18290. Preprint, arXiv, July 29, 2024. https://doi.org/10.48550/arXiv.2305.18290. - papers
- Used widely now; Llama 4 post-training combined online RL (where ground truths are available) followed by DPO.
- DPO saves you the effort of training a separate reward model.
- Core claim: the standard RLHF objective (maximize a learned reward under a KL constraint to a reference policy) can be optimized exactly with a simple binary cross-entropy on preference pairs—no actor–critic loop, no on-policy sampling, minimal tuning. (Fig. 1, p.2)
- Reparameterization (the “LM is a reward model” insight):
- Optimal policy under KL control: $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp\!\left(\tfrac{1}{\beta} r(x,y)\right)$.
- Invert to express reward via policy: $r(x,y) = \beta \log \tfrac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$.
- Bradley–Terry preference prob becomes a function of policy log-ratio only; the partition cancels. Resulting DPO loss: \[ L_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})= -\mathbb{E}_{(x,y_w,y_l)\sim D}\log\sigma\!\Big(\beta\big[\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}-\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\big]\Big). \]
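That DPO loss can be written directly from sequence log-probabilities. A minimal numpy sketch (function name and argument layout are mine, not from the paper):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) pair, given summed token log-probs
    under the current policy (logp_*) and the frozen reference (ref_logp_*)."""
    # beta times the difference of policy/reference log-ratios, winner minus loser
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), computed stably as log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)
```

When the policy still matches the reference exactly, the margin is 0 and the loss is log 2; it drops as the winner's log-ratio pulls ahead of the loser's.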
- Why this avoids degeneration: the gradient upweights examples where the current implicit reward \(\hat r_\theta(x,y)=\beta\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\) ranks the loser above the winner: \[ \nabla_\theta L_{\text{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma\!\left(\hat r_\theta(x,y_l)-\hat r_\theta(x,y_w)\right)\,\big(\nabla_\theta\log\pi_\theta(y_w\mid x)-\nabla_\theta\log\pi_\theta(y_l\mid x)\big)\right]. \] The per-example weight curbs the naive “push ratios” failure mode observed with unweighted objectives. (pp.4–5)
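A quick finite-difference sanity check of that gradient (pure numpy, all names mine): the partial derivative of the loss with respect to the winner's log-prob should equal \(-\beta\,\sigma(\hat r_l - \hat r_w)\), i.e. minus beta times the per-example weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return np.logaddexp(0.0, -margin)  # -log sigmoid(margin)

# arbitrary example log-probs
lp_w, lp_l, ref_w, ref_l, beta = -1.0, -3.0, -2.0, -2.5, 0.1

# implicit rewards r_hat = beta * log-ratio
r_w = beta * (lp_w - ref_w)
r_l = beta * (lp_l - ref_l)

# analytic: dL/d(lp_w) = -beta * sigma(r_hat_l - r_hat_w)
analytic = -beta * sigmoid(r_l - r_w)

# central finite difference on lp_w
eps = 1e-6
numeric = (loss(lp_w + eps, lp_l, ref_w, ref_l, beta)
           - loss(lp_w - eps, lp_l, ref_w, ref_l, beta)) / (2 * eps)
```

The two values agree to numerical precision, confirming the weight in the gradient expression above.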
- Theoretical framing:
- Rewards differing by any \(f(x)\) are equivalent for both the preference likelihood and the induced KL-constrained optimal policy. DPO picks the unique representative in each equivalence class with $\sum_y \pi_{\text{ref}}(y\mid x)\exp\!\left(\tfrac{1}{\beta}r(x,y)\right)=1$, i.e., one whose optimal policy is directly \(\pi_\theta\). (pp.5–6)
- Actor–critic diagnosis: PPO needs a baseline/“soft value” (normalizer) for stability; DPO’s reparameterization bakes in that normalization, avoiding variance/baseline hacks. (p.6)
- Practical recipe:
- Data: \((x,y_w,y_l)\) from human (or AI) preferences.
- Reference: use \(\pi_{\text{SFT}}\) if available; otherwise fit \(\pi_{\text{ref}}\) by MLE on \(y_w\) only to reduce ref–data shift. (p.5)
- One loss, no rollouts, no separate RM training, no KL penalties at sequence time—KL pressure is implicit via the log-ratios.
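The recipe above can be exercised end-to-end on a toy tabular "policy" (a softmax over a handful of candidate responses; everything here is an illustrative sketch, not the paper's setup). One analytic gradient step on the DPO loss should raise the implicit reward margin of the winner over the loser:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

beta = 0.1
rng = np.random.default_rng(0)
theta_ref = rng.normal(size=5)   # frozen reference logits
theta = theta_ref.copy()         # policy initialized at the reference
w, l = 0, 1                      # indices of preferred / dispreferred response

def log_probs(t):
    return np.log(softmax(t))

def margin(t):
    lp, ref = log_probs(t), log_probs(theta_ref)
    return beta * ((lp[w] - ref[w]) - (lp[l] - ref[l]))

def dpo_loss(t):
    return np.logaddexp(0.0, -margin(t))  # -log sigmoid(margin)

# analytic gradient: -beta * sigma(-margin) * (grad logpi(y_w) - grad logpi(y_l))
p = softmax(theta)
weight = 1.0 / (1.0 + np.exp(margin(theta)))  # sigma(-margin)
grad_w = np.eye(5)[w] - p                     # grad of log-softmax at y_w
grad_l = np.eye(5)[l] - p
grad = -beta * weight * (grad_w - grad_l)

theta_new = theta - 1.0 * grad                # one gradient-descent step
```

After the step the loss drops below its initial value of log 2 and the winner/loser margin turns positive, with no rollouts or reward model anywhere.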
- Empirical signals (paper-scale models up to ~6B):
- Reward–KL frontier (IMDb sentiment): DPO strictly dominates PPO (even PPO with ground-truth reward). (Fig. 2 left, p.7)
- Summarization (Reddit TL;DR): DPO ≈ 61% GPT‑4 win rate vs ref at temp 0.0 vs PPO ≈ 57% at its best temp; more robust to sampling temperature. (Fig. 2 right, p.7; p.9)
- Dialogue (Anthropic HH, 1‑turn): DPO is the only efficient method that beats the dataset “chosen” responses; roughly matches or exceeds “best‑of‑128” sampling without its test‑time cost. (Fig. 3, p.8; p.9)
- OOD generalization (CNN/DailyMail): DPO > PPO on GPT‑4 win rate when evaluated off‑distribution. (Table 1, p.9)
- Best‑of‑N plateau: gains saturate around N≈64–128; DPO rivals this without expensive sampling. (Fig. 4, p.23)
- Hyperparams the authors actually used (sanity defaults, not sacred):
- \(\beta\) often \(0.1\); TL;DR used \(\beta=0.5\).
- Batch 64; RMSProp; LR \(1\mathrm{e}{-6}\) with 150‑step linear warmup. (App. B, p.20)
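The stated schedule (LR 1e-6 with a 150-step linear warmup) as a one-liner (helper name mine, not from the paper's code):

```python
def warmup_lr(step, base_lr=1e-6, warmup_steps=150):
    """Linear warmup to base_lr over warmup_steps, then constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```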
- Caveats / limits worth remembering:
- OOD and self‑labeling dynamics need more study; reward over‑optimization can still bite (small late‑training dips). Scaling beyond 6B left for future work. (p.10)
- Unlikelihood training is unstable on open‑ended tasks (degenerate text). (App. C/D; Table on p.22)
- TL;DR of the TL;DR: Train one model; treat \(\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) as the implicit reward; fit it with BCE on pairwise prefs. You get PPO‑level (often better) alignment without PPO’s brittleness or cost. (Fig. 1–3, pp.2,7–8)
RL
- Fisher information matrix?
Computer Vision
Links to this note
- adam optimizer
- annotations
- approximate nearest neighbor
- atomic
- automatic differentiation
- backprop
- chatgpt
- clustering
- convolutional neural networks (cnn)
- convolutions
- cyc
- datasets
- deep learning
- embedding text into vector spaces
- energy based models
- ensemble learning
- ESIM model
- explaining away
- Fact Extraction and Verification (2018)
- fasttext
- federenko - EMNLP Keynote
- glue
- Goode: Artifical Intelligence and the Future of Nationalism
- gpt2
- gradient descent
- hierarchical navigable small-world graph (HNSW)
- higher (fb library)
- jupyter notebooks
- kubeflow
- llms
- loss function
- misinterpretations of what ai is
- ml conferences
- multi-level IR
- MultiRC
- nearest neighbor
- nlp
- OpenBook
- optuna
- pandas
- positive labeling
- product quantization
- pytorch
- recurrent neural networks (rnn)
- relu
- sacred
- softmax
- structured space modeling
- superintelligence as an infohazard
- Sutton: The Bitter Lesson
- tensorboard
- tensorflow
- training, validation, and tests sets
- visual question answering
- winograd schemas
- Zhang et al: Tropical Geometry of Deep Neural Networks