# scaling law seminar

fall 2021 classes, transformers

Seminar on scaling laws

## Scaling Laws for Acoustic Models

- papers:
- how do acoustic models scale with model size?
- maybe ask Haoran about this?
- https://assets.amazon.science/ef/a6/60f65ed543c9af2a519caba269bd/scaling-laws-for-acoustic-models.pdf
- what proposals are there? what does the current field look like for scaling laws?
- most models pair ASR with an LM now; how does the combined system scale?
- does wav2vec scale? what do the WER profiles look like as speech datasets get longer?

## Scaling laws for solution compressibility - aka Information Bottleneck

- are solutions, measured in bits, smaller as model size increases?
- how would you measure the bit size of a solution?
- intuitively, the solution size should just increase as the transformer gets bigger

- read
- information probing on description length of solutions
- do softmaxes become hardmaxes?
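A quick numpy sketch of the "do softmaxes become hardmaxes?" intuition (toy logits; all names here are my own, not from any paper): scaling a fixed logit vector by a growing norm drives the softmax toward a one-hot argmax.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])  # fixed logit direction
for scale in [1, 10, 100]:
    # growing the norm of the logits sharpens the distribution
    print(scale, np.round(softmax(scale * z), 4))
# as the scale grows, softmax(scale * z) approaches the
# one-hot "hardmax" at the argmax of z
```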

- people
- three approaches
- circuit complexity - establishing upper bounds on what the network can compute
- renormalization groups - is there something there? information bottleneck?
- probing - probing the description length?
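One concrete way to "probe the description length" is prequential (online) coding, in the spirit of MDL probing: pay the probe's log-loss on each data chunk before folding it into the training set. A minimal numpy sketch, assuming a logistic-regression probe trained by gradient descent (function name, chunking, and hyperparameters are my own, not any paper's exact recipe):

```python
import numpy as np

def prequential_codelength(X, y, chunks=8, steps=500, lr=0.5):
    """Online (prequential) code: description length of labels in bits.

    Train a probe on the data seen so far, pay the probe's log-loss
    (in bits) on the next chunk, then add that chunk to the training
    set. Smaller codelength = more easily extractable solution.
    """
    n = len(y)
    bounds = np.linspace(0, n, chunks + 1).astype(int)
    w = np.zeros(X.shape[1])
    total_bits = float(bounds[1])  # first chunk coded at 1 bit/label
    for i in range(1, chunks):
        seen = slice(0, bounds[i])
        nxt = slice(bounds[i], bounds[i + 1])
        # fit the logistic probe on everything transmitted so far
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(X[seen] @ w)))
            w -= lr * X[seen].T @ (p - y[seen]) / bounds[i]
        # pay the log-loss, in bits, on the next chunk
        p = np.clip(1.0 / (1.0 + np.exp(-(X[nxt] @ w))), 1e-12, 1 - 1e-12)
        total_bits += -(y[nxt] * np.log2(p)
                        + (1 - y[nxt]) * np.log2(1 - p)).sum()
    return total_bits
```

On a linearly separable toy task the codelength comes out well below the uniform-code baseline of one bit per label, which is the sense in which an "easy" solution is small in bits.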

- motivate the paper by saying, “we should want to predict what types of capabilities can emerge, and it’s important to take lessons from statistical physics”

### On the Information Bottleneck Theory of Deep Learning

- Cites
- Tishby and Zaslavsky 2015; Shwartz-Ziv & Tishby 2017
- Tishby, Pereira, and Bialek 1999, information bottleneck
- deep NNs have complex non-linear learning trajectories (Saxe et al. 2014)
- Baldi and Hornik 1989, the optimization problems remain non-convex

- Deep networks as representation learners
- information bottleneck deep learning, 3 main claims
- deep NNs undergo two distinct phases: an initial fitting phase, where mutual information between the representations and the data increases, and a compression phase, where mutual information with the input decreases
- unclear, because tanh exhibits compression while ReLU does not
- networks that do not compress still generalize
- SIDEBAR: does this mean that mesa optimizers require compression?
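For reference, the mutual-information estimates behind the fitting/compression claims come from binning hidden activations and counting. A crude numpy sketch in that spirit (function name and binning choices are my assumptions, not the paper's exact recipe):

```python
import numpy as np

def binned_mutual_information(x_ids, t, n_bins=30):
    """Crude I(T;X) estimate via activation binning.

    x_ids: integer id per input sample (X treated as discrete).
    t: (n_samples, n_units) hidden-layer activations.
    Each binned activation pattern counts as one value of T.
    """
    # discretize activations into n_bins equal-width bins
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    digitized = np.digitize(t, edges)
    # hash each row's bin pattern into a single discrete T value
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(ids):
        # empirical entropy in bits
        p = np.bincount(ids) / len(ids)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # I(T;X) = H(T) - H(T|X), both from empirical counts
    h_t = entropy(t_ids)
    h_t_given_x = sum((x_ids == x).mean() * entropy(t_ids[x_ids == x])
                      for x in np.unique(x_ids))
    return h_t - h_t_given_x
```

When the representation is a deterministic function of the input (as in these papers' settings), this reduces to H(T), which is why the binning resolution drives much of the debate.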

- learning trajectories are not easily predictable, mesa optimizers may arise
- SGD - two phases from Shwartz-Ziv and Tishby, 2017: a drift phase, where the mean of the gradients over training samples dominates, and a diffusion phase, where the mean becomes smaller than the standard deviation of the gradients
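The drift/diffusion distinction can be sketched as a gradient signal-to-noise ratio: compare the norm of the mean per-sample gradient to the spread of the per-sample gradients. A toy numpy version on logistic regression (setup and names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy logistic-regression setup for per-sample gradients
X = rng.normal(size=(256, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

def per_sample_grads(w):
    # gradient of the logistic loss, one row per training sample
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X  # shape (n_samples, n_features)

def snr(w):
    # phase diagnostic: norm of the mean gradient over samples,
    # relative to the standard deviation of the gradients
    g = per_sample_grads(w)
    mean_norm = np.linalg.norm(g.mean(axis=0))
    std_norm = np.linalg.norm(g.std(axis=0))
    return mean_norm / std_norm

# drift phase: mean dominates; diffusion phase: mean falls
# below the std; track snr(w) along the training trajectory
print(snr(rng.normal(size=5)))
```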

### Paper questions

- does information probing, circuit complexity, or information theory help us get anywhere?

### Effects of Parameter Norm Growth During Transformer Training

- suggests that transformers learn via inductive bias (the set of assumptions the architecture builds in)
- studies the growth of the L2 parameter norm during training
- norm growth occurs during training, driving the model toward a discretized network with saturated activations
- saturation - neuron predominantly outputs values close to the asymptotic ends of the bounded activation function
- softmaxes become hardmaxes
- measuring saturation in neural networks - https://ieeexplore.ieee.org/document/7376778
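A minimal sketch of one way to quantify saturation, assuming a tanh unit: measure the fraction of outputs within eps of the asymptotes ±1 (threshold and function name are my own, not the paper's metric):

```python
import numpy as np

def saturation_fraction(activations, eps=0.05):
    # fraction of outputs within eps of the asymptotic ends ±1,
    # a crude proxy for "predominantly outputs values close to
    # the ends of the bounded activation function"
    return np.mean(np.abs(activations) > 1.0 - eps)

# pre-activations with growing norm push tanh outputs to ±1
z = np.random.default_rng(0).normal(size=1000)
for scale in [1, 10, 100]:
    print(scale, saturation_fraction(np.tanh(scale * z)))
```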

- the L2 norm of the parameters grows during training, aka norm growth
- previous work on feedforward networks
- Li and Arora 2019; Ji and Telgarsky 2020

- as the norm diverges during training, the network approaches a saturated network
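A toy illustration of norm growth in the feedforward setting (not a transformer; the setup is my own): on linearly separable data, cross-entropy has no finite minimizer, so gradient descent keeps driving the parameter norm upward, and the sigmoid outputs saturate toward {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# linearly separable toy data: the logistic loss has no finite
# minimum here, so the optimal norm is infinite
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
norms = []
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / len(y)  # gradient descent step
    norms.append(np.linalg.norm(w))

# the L2 norm keeps growing; as it grows, the sigmoid
# outputs approach the saturated values 0 and 1
print(norms[0], norms[-1])
```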

- saturation allows for approximation via circuit complexity
- transformers can implement a counting mechanism (Bhattamishra et al. 2020)
- Bhattamishra et al. also find that trained networks learn to recognize counter languages that rely on computing means

## Discussions

- https://old.reddit.com/r/MachineLearning/comments/em3ynp/d_trying_to_wrap_my_head_around_the_information/
- https://old.reddit.com/r/MachineLearning/comments/elmgsz/r_on_the_information_bottleneck_theory_of_deep/

### Other papers

- survey - https://arxiv.org/pdf/1904.03743.pdf
- similar paper that tries to measure model complexity with curve activation functions https://arxiv.org/abs/2006.08962
- https://arxiv.org/pdf/1909.11396.pdf
- investigating dropout regularization and model complexity in NN: https://arxiv.org/abs/2108.06628

## Videos

## The Information Bottleneck Problem and Its Applications in Machine Learning

- survey - https://arxiv.org/pdf/2004.14941.pdf

## Deep variational information bottleneck

## Saxe et al Paper

- https://openreview.net/forum?id=ry_WPG-A
- https://old.reddit.com/r/MachineLearning/comments/79efus/r_on_the_information_bottleneck_theory_of_deep/

## Code

- https://github.com/ravidziv/IDNNs - code for the Shwartz-Ziv & Tishby information-plane experiments

## Scalable Mutual Information Using Dependence Graphs

- https://arxiv.org/abs/1801.09125
- claims to contradict the Saxe paper
- https://github.com/mrtnoshad/EDGE/blob/master/information_plane/main.py - code for 2018 paper on EDGE, the scalable linear measure of Mutual Information