# scaling law seminar

fall 2021 classes, transformers

Seminar on scaling laws

## Scaling Laws for Acoustic Models

- papers:
- how do acoustic models scale with model size?
- maybe ask Haoran about this?
- https://assets.amazon.science/ef/a6/60f65ed543c9af2a519caba269bd/scaling-laws-for-acoustic-models.pdf
- what proposals are there? what does the current field look like for scaling laws?
- most models pair ASR with an LM now; how does the combined system scale?
- does wav2vec scale? what do the WER profiles look like as speech datasets get longer?

## Scaling laws for solution compressibility - aka Information Bottleneck

- are solutions, measured in bits, smaller as model size increases?
- how would you measure the bit size of a solution?
- intuitively, the solution size should just increase as the transformer gets bigger

- read
- information probing on description length of solutions
- do softmaxes become hardmaxes?
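A quick numpy sketch of the "do softmaxes become hardmaxes?" intuition (toy logits; all names here are my own, not from any paper): scaling a fixed logit vector by a growing norm drives the softmax toward a one-hot argmax.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])  # fixed logit direction
for scale in [1, 10, 100]:
    # growing the norm of the logits sharpens the distribution
    print(scale, np.round(softmax(scale * z), 4))
# as the scale grows, softmax(scale * z) approaches the
# one-hot "hardmax" at the argmax of z
```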

- people
- three approaches
- circuit complexity - establishing upper bounds on what the network can compute
- renormalization groups - is there something there? information bottleneck?
- probing - probing the description length?
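One concrete way to "probe the description length" is prequential (online) coding, in the spirit of MDL probing: pay the probe's log-loss on each data chunk before folding it into the training set. A minimal numpy sketch, assuming a logistic-regression probe trained by gradient descent (function name, chunking, and hyperparameters are my own, not any paper's exact recipe):

```python
import numpy as np

def prequential_codelength(X, y, chunks=8, steps=500, lr=0.5):
    """Online (prequential) code: description length of labels in bits.

    Train a probe on the data seen so far, pay the probe's log-loss
    (in bits) on the next chunk, then add that chunk to the training
    set. Smaller codelength = more easily extractable solution.
    """
    n = len(y)
    bounds = np.linspace(0, n, chunks + 1).astype(int)
    w = np.zeros(X.shape[1])
    total_bits = float(bounds[1])  # first chunk coded at 1 bit/label
    for i in range(1, chunks):
        seen = slice(0, bounds[i])
        nxt = slice(bounds[i], bounds[i + 1])
        # fit the logistic probe on everything transmitted so far
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(X[seen] @ w)))
            w -= lr * X[seen].T @ (p - y[seen]) / bounds[i]
        # pay the log-loss, in bits, on the next chunk
        p = np.clip(1.0 / (1.0 + np.exp(-(X[nxt] @ w))), 1e-12, 1 - 1e-12)
        total_bits += -(y[nxt] * np.log2(p)
                        + (1 - y[nxt]) * np.log2(1 - p)).sum()
    return total_bits
```

On a linearly separable toy task the codelength comes out well below the uniform-code baseline of one bit per label, which is the sense in which an "easy" solution is small in bits.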

- motivate the paper by saying, “we should want to predict what types of capabilities can emerge, and it’s important to take lessons from statistical physics”

### On the Information Bottleneck Theory of Deep Learning

- Cites
- Tishby and Zaslavsky 2015; Shwartz-Ziv & Tishby 2017
- Tishby, Pereira, and Bialek 1999, information bottleneck
- deep NNs have complex non-linear learning trajectories (Saxe et al. 2014)
- Baldi and Hornik 1989, the optimization problems remain non-convex

- Deep networks as representation learners
- information bottleneck deep learning, 3 main claims
- deep NNs undergo two distinct phases: an initial fitting phase, where mutual information between the representations and the data increases, and a compression phase, where mutual information with the input decreases
- unclear, because tanh exhibits compression while ReLU does not
- networks that do not compress still generalize
- SIDEBAR: does this mean that mesa optimizers require compression?
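For reference, the mutual-information estimates behind the fitting/compression claims come from binning hidden activations and counting. A crude numpy sketch in that spirit (function name and binning choices are my assumptions, not the paper's exact recipe):

```python
import numpy as np

def binned_mutual_information(x_ids, t, n_bins=30):
    """Crude I(T;X) estimate via activation binning.

    x_ids: integer id per input sample (X treated as discrete).
    t: (n_samples, n_units) hidden-layer activations.
    Each binned activation pattern counts as one value of T.
    """
    # discretize activations into n_bins equal-width bins
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    digitized = np.digitize(t, edges)
    # hash each row's bin pattern into a single discrete T value
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(ids):
        # empirical entropy in bits
        p = np.bincount(ids) / len(ids)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # I(T;X) = H(T) - H(T|X), both from empirical counts
    h_t = entropy(t_ids)
    h_t_given_x = sum((x_ids == x).mean() * entropy(t_ids[x_ids == x])
                      for x in np.unique(x_ids))
    return h_t - h_t_given_x
```

When the representation is a deterministic function of the input (as in these papers' settings), this reduces to H(T), which is why the binning resolution drives much of the debate.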

- learning trajectories are not easily predictable, mesa optimizers may arise
- SGD - two phases from Shwartz-Ziv and Tishby, 2017: a drift phase, where the mean of the gradients over training samples dominates, and a diffusion phase, where the mean becomes smaller than the standard deviation of the gradients
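The drift/diffusion distinction can be sketched as a gradient signal-to-noise ratio: compare the norm of the mean per-sample gradient to the spread of the per-sample gradients. A toy numpy version on logistic regression (setup and names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy logistic-regression setup for per-sample gradients
X = rng.normal(size=(256, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

def per_sample_grads(w):
    # gradient of the logistic loss, one row per training sample
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X  # shape (n_samples, n_features)

def snr(w):
    # phase diagnostic: norm of the mean gradient over samples,
    # relative to the standard deviation of the gradients
    g = per_sample_grads(w)
    mean_norm = np.linalg.norm(g.mean(axis=0))
    std_norm = np.linalg.norm(g.std(axis=0))
    return mean_norm / std_norm

# drift phase: mean dominates; diffusion phase: mean falls
# below the std; track snr(w) along the training trajectory
print(snr(rng.normal(size=5)))
```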

### Paper questions

- does information probing, circuit complexity, or information theory help us get anywhere?

### Effects of Parameter Norm Growth During Transformer Training

- suggests that transformers learn via inductive bias (the set of assumptions the architecture builds in)
- studies the growth of the L2 parameter norm during training
- norm growth occurs during training, driving the model toward a discretized network with saturated activations
- saturation - neuron predominantly outputs values close to the asymptotic ends of the bounded activation function
- softmaxes become hardmaxes
- measuring saturation in neural networks - https://ieeexplore.ieee.org/document/7376778
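A minimal sketch of one way to quantify saturation, assuming a tanh unit: measure the fraction of outputs within eps of the asymptotes ±1 (threshold and function name are my own, not the paper's metric):

```python
import numpy as np

def saturation_fraction(activations, eps=0.05):
    # fraction of outputs within eps of the asymptotic ends ±1,
    # a crude proxy for "predominantly outputs values close to
    # the ends of the bounded activation function"
    return np.mean(np.abs(activations) > 1.0 - eps)

# pre-activations with growing norm push tanh outputs to ±1
z = np.random.default_rng(0).normal(size=1000)
for scale in [1, 10, 100]:
    print(scale, saturation_fraction(np.tanh(scale * z)))
```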

- the L2 norm of the parameters grows during training, aka norm growth
- previous work on feedforward networks
- Li and Arora 2019; Ji and Telgarsky 2020

- as the norm diverges during training, the network approaches a saturated network
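A toy illustration of norm growth in the feedforward setting (not a transformer; the setup is my own): on linearly separable data, cross-entropy has no finite minimizer, so gradient descent keeps driving the parameter norm upward, and the sigmoid outputs saturate toward {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# linearly separable toy data: the logistic loss has no finite
# minimum here, so the optimal norm is infinite
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
norms = []
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / len(y)  # gradient descent step
    norms.append(np.linalg.norm(w))

# the L2 norm keeps growing; as it grows, the sigmoid
# outputs approach the saturated values 0 and 1
print(norms[0], norms[-1])
```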

- saturation allows for approximation via circuit complexity
- transformers can implement a counting mechanism (Bhattamishra et al. 2020)
- Bhattamishra et al. also find that trained networks learn to recognize counter languages that rely on computing means

## Discussions

- https://old.reddit.com/r/MachineLearning/comments/em3ynp/d_trying_to_wrap_my_head_around_the_information/
- https://old.reddit.com/r/MachineLearning/comments/elmgsz/r_on_the_information_bottleneck_theory_of_deep/

### Other papers

- survey - https://arxiv.org/pdf/1904.03743.pdf
- similar paper that tries to measure model complexity with curve activation functions https://arxiv.org/abs/2006.08962
- https://arxiv.org/pdf/1909.11396.pdf
- investigating dropout regularization and model complexity in NN: https://arxiv.org/abs/2108.06628

## Videos

## The Information Bottleneck Problem and Its Applications in Machine Learning

- survey - https://arxiv.org/pdf/2004.14941.pdf

## Deep variational information bottleneck

## Saxe et al Paper

- https://openreview.net/forum?id=ry_WPG-A
- https://old.reddit.com/r/MachineLearning/comments/79efus/r_on_the_information_bottleneck_theory_of_deep/

## Code

- https://github.com/ravidziv/IDNNs - code for the Shwartz-Ziv & Tishby information-plane experiments

## Scalable Mutual Information Using Dependence Graphs

- https://arxiv.org/abs/1801.09125
- claims to contradict the Saxe paper
- https://github.com/mrtnoshad/EDGE/blob/master/information_plane/main.py - code for 2018 paper on EDGE, the scalable linear measure of Mutual Information