# scaling law seminar

fall 2021 classes, transformers

Seminar on scaling laws

## Scaling Laws for Acoustic Models

- papers:
- how do acoustic models scale with model size?
- maybe ask Haoran about this?
- https://assets.amazon.science/ef/a6/60f65ed543c9af2a519caba269bd/scaling-laws-for-acoustic-models.pdf
- what proposals are there? what does the current field look like for scaling laws?
- most models pair ASR with an LM now; how does the combined system scale?
- does wav2vec scale? what do the WER profiles look like as speech datasets get longer?

## Scaling laws for solution compressibility - aka Information Bottleneck

- are solutions, measured in bits, smaller as model size increases?
- how would you measure the bit size of a solution?
- intuitively, the solution size should just increase as the transformer gets bigger

- read
- information probing on description length of solutions
- do softmaxes become hardmaxes?
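A quick numpy sketch of the "do softmaxes become hardmaxes?" intuition (toy logits; all names here are my own, not from any paper): scaling a fixed logit vector by a growing norm drives the softmax toward a one-hot argmax.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])  # fixed logit direction
for scale in [1, 10, 100]:
    # growing the norm of the logits sharpens the distribution
    print(scale, np.round(softmax(scale * z), 4))
# as the scale grows, softmax(scale * z) approaches the
# one-hot "hardmax" at the argmax of z
```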

- people
- three approaches
- circuit complexity - establishing upper bounds on what the network can compute
- renormalization groups - is there something there? information bottleneck?
- probing - probing the description length?
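One concrete way to "probe the description length" is prequential (online) coding, in the spirit of MDL probing: pay the probe's log-loss on each data chunk before folding it into the training set. A minimal numpy sketch, assuming a logistic-regression probe trained by gradient descent (function name, chunking, and hyperparameters are my own, not any paper's exact recipe):

```python
import numpy as np

def prequential_codelength(X, y, chunks=8, steps=500, lr=0.5):
    """Online (prequential) code: description length of labels in bits.

    Train a probe on the data seen so far, pay the probe's log-loss
    (in bits) on the next chunk, then add that chunk to the training
    set. Smaller codelength = more easily extractable solution.
    """
    n = len(y)
    bounds = np.linspace(0, n, chunks + 1).astype(int)
    w = np.zeros(X.shape[1])
    total_bits = float(bounds[1])  # first chunk coded at 1 bit/label
    for i in range(1, chunks):
        seen = slice(0, bounds[i])
        nxt = slice(bounds[i], bounds[i + 1])
        # fit the logistic probe on everything transmitted so far
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(X[seen] @ w)))
            w -= lr * X[seen].T @ (p - y[seen]) / bounds[i]
        # pay the log-loss, in bits, on the next chunk
        p = np.clip(1.0 / (1.0 + np.exp(-(X[nxt] @ w))), 1e-12, 1 - 1e-12)
        total_bits += -(y[nxt] * np.log2(p)
                        + (1 - y[nxt]) * np.log2(1 - p)).sum()
    return total_bits
```

On a linearly separable toy task the codelength comes out well below the uniform-code baseline of one bit per label, which is the sense in which an "easy" solution is small in bits.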

- motivate the paper by saying, “we should want to predict what types of capabilities can emerge, and it’s important to take lessons from statistical physics”

### On the Information Bottleneck Theory of Deep Learning

- Cites
- Tishby and Zaslavsky 2015; Shwartz-Ziv & Tishby 2017
- Tishby, Pereira, and Bialek 1999, information bottleneck
- deep NNs have complex non-linear learning trajectories (Saxe et al. 2014)
- Baldi and Hornik 1989, the optimization problems remain non-convex

- Deep networks as representation learners
- information bottleneck deep learning, 3 main claims
- deep NNs undergo two distinct phases: an initial fitting phase, where mutual information between the representations and the data increases, and a compression phase, where mutual information with the input decreases
- unclear, because tanh exhibits compression while ReLU does not
- networks that do not compress still generalize
- SIDEBAR: does this mean that mesa optimizers require compression?
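For reference, the mutual-information estimates behind the fitting/compression claims come from binning hidden activations and counting. A crude numpy sketch in that spirit (function name and binning choices are my assumptions, not the paper's exact recipe):

```python
import numpy as np

def binned_mutual_information(x_ids, t, n_bins=30):
    """Crude I(T;X) estimate via activation binning.

    x_ids: integer id per input sample (X treated as discrete).
    t: (n_samples, n_units) hidden-layer activations.
    Each binned activation pattern counts as one value of T.
    """
    # discretize activations into n_bins equal-width bins
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    digitized = np.digitize(t, edges)
    # hash each row's bin pattern into a single discrete T value
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(ids):
        # empirical entropy in bits
        p = np.bincount(ids) / len(ids)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # I(T;X) = H(T) - H(T|X), both from empirical counts
    h_t = entropy(t_ids)
    h_t_given_x = sum((x_ids == x).mean() * entropy(t_ids[x_ids == x])
                      for x in np.unique(x_ids))
    return h_t - h_t_given_x
```

When the representation is a deterministic function of the input (as in these papers' settings), this reduces to H(T), which is why the binning resolution drives much of the debate.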

- learning trajectories are not easily predictable, mesa optimizers may arise
- SGD - two phases from Shwartz-Ziv and Tishby, 2017: a drift phase, where the mean of the gradients over training samples dominates, and a diffusion phase, where the mean becomes smaller than the standard deviation of the gradients
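The drift/diffusion distinction can be sketched as a gradient signal-to-noise ratio: compare the norm of the mean per-sample gradient to the spread of the per-sample gradients. A toy numpy version on logistic regression (setup and names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy logistic-regression setup for per-sample gradients
X = rng.normal(size=(256, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

def per_sample_grads(w):
    # gradient of the logistic loss, one row per training sample
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X  # shape (n_samples, n_features)

def snr(w):
    # phase diagnostic: norm of the mean gradient over samples,
    # relative to the standard deviation of the gradients
    g = per_sample_grads(w)
    mean_norm = np.linalg.norm(g.mean(axis=0))
    std_norm = np.linalg.norm(g.std(axis=0))
    return mean_norm / std_norm

# drift phase: mean dominates; diffusion phase: mean falls
# below the std; track snr(w) along the training trajectory
print(snr(rng.normal(size=5)))
```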

### Paper questions

- does information probing, circuit complexity, or information theory help us get anywhere?

### Effects of Parameter Norm Growth During Transformer Training

- suggests that transformers learn via inductive bias (the set of assumptions the architecture builds in)
- studies the growth of the L2 parameter norm during training
- norm growth occurs during training, driving the model toward a discretized network with saturated activations
- saturation - neuron predominantly outputs values close to the asymptotic ends of the bounded activation function
- softmaxes become hardmaxes
- measuring saturation in neural networks - https://ieeexplore.ieee.org/document/7376778
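A minimal sketch of one way to quantify saturation, assuming a tanh unit: measure the fraction of outputs within eps of the asymptotes ±1 (threshold and function name are my own, not the paper's metric):

```python
import numpy as np

def saturation_fraction(activations, eps=0.05):
    # fraction of outputs within eps of the asymptotic ends ±1,
    # a crude proxy for "predominantly outputs values close to
    # the ends of the bounded activation function"
    return np.mean(np.abs(activations) > 1.0 - eps)

# pre-activations with growing norm push tanh outputs to ±1
z = np.random.default_rng(0).normal(size=1000)
for scale in [1, 10, 100]:
    print(scale, saturation_fraction(np.tanh(scale * z)))
```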

- the L2 norm of the parameters grows during training, aka norm growth
- previous work on feedforward networks
- Li and Arora 2019; Ji and Telgarsky 2020

- as the norm diverges during training, the network approaches a saturated network
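A toy illustration of norm growth in the feedforward setting (not a transformer; the setup is my own): on linearly separable data, cross-entropy has no finite minimizer, so gradient descent keeps driving the parameter norm upward, and the sigmoid outputs saturate toward {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# linearly separable toy data: the logistic loss has no finite
# minimum here, so the optimal norm is infinite
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
norms = []
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / len(y)  # gradient descent step
    norms.append(np.linalg.norm(w))

# the L2 norm keeps growing; as it grows, the sigmoid
# outputs approach the saturated values 0 and 1
print(norms[0], norms[-1])
```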

- saturation allows for approximation via circuit complexity
- transformers can implement a counting mechanism (Bhattamishra et al. 2020)
- Bhattamishra et al. also find that trained networks learn to recognize counter languages that rely on computing means

## Discussions

- https://old.reddit.com/r/MachineLearning/comments/em3ynp/d_trying_to_wrap_my_head_around_the_information/
- https://old.reddit.com/r/MachineLearning/comments/elmgsz/r_on_the_information_bottleneck_theory_of_deep/

### Other papers

- survey - https://arxiv.org/pdf/1904.03743.pdf
- similar paper that tries to measure model complexity with curve activation functions https://arxiv.org/abs/2006.08962
- https://arxiv.org/pdf/1909.11396.pdf
- investigating dropout regularization and model complexity in NN: https://arxiv.org/abs/2108.06628

## Videos

## The Information Bottleneck Problem and Its Applications in Machine Learning

- survey - https://arxiv.org/pdf/2004.14941.pdf

## Deep variational information bottleneck

## Saxe et al Paper

- https://openreview.net/forum?id=ry_WPG-A
- https://old.reddit.com/r/MachineLearning/comments/79efus/r_on_the_information_bottleneck_theory_of_deep/

## Code

- https://github.com/ravidziv/IDNNs - code for the Shwartz-Ziv & Tishby information-plane experiments

## Scalable Mutual Information Using Dependence Graphs

- https://arxiv.org/abs/1801.09125
- claims to contradict the Saxe paper
- https://github.com/mrtnoshad/EDGE/blob/master/information_plane/main.py - code for 2018 paper on EDGE, the scalable linear measure of Mutual Information