
scaling laws survey paper (information bottleneck, minimum description length, etc)

scaling law seminar

Angle: all three measure generalization in some way

  • can we match these together?
  • MI is the most backed, but has issues with high dimensions and hasn’t been shown on transformers
    • (also IB may be just the existence of geometric compression)
  • norm growth may have X
  • “multiple descent”
  • do these all critically validate each other?
%
% File acl2015.tex
%
% Contact: car@ir.hit.edu.cn, gdzhou@suda.edu.cn
%%
%% Based on the style files for ACL-2014, which were, in turn,
%% Based on the style files for ACL-2013, which were, in turn,
%% Based on the style files for ACL-2012, which were, in turn,
%% based on the style files for ACL-2011, which were, in turn,
%% based on the style files for ACL-2010, which were, in turn,
%% based on the style files for ACL-IJCNLP-2009, which were, in turn,
%% based on the style files for EACL-2009 and IJCNLP-2008...

%% Based on the style files for EACL 2006 by
%%e.agirre@ehu.es or Sergi.Balari@uab.es
%% and that of ACL 08 by Joakim Nivre and Noah Smith

\documentclass[11pt]{article}
\usepackage{acl2015}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{pgfplots}
\usepackage{amsmath}
\usepackage{tabularx} % in the preamble
\usepackage[square,numbers]{natbib}
\bibliographystyle{abbrvnat}

%\setlength\titlebox{5cm}
% You can expand the titlebox if you need extra space
% to show all the authors. Please do not make the titlebox
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.

\title{Measuring Generalization}

\author{Peixian Wang \\
  Bloomberg L.P. \\
  {\tt pwang272@bloomberg.net}\\}
\date{}

\begin{document}
\maketitle
\begin{abstract}
  Recent years have seen an increase in theories on the role of information during learning, especially regarding generalization. One particular approach has been work on the information plane \cite{tishby_deep_2015}, which measures the mutual information between the input layers, output layers, and latent representations. However, as most of this work has analyzed only small neural networks, we analyze the results and speculate on how they may extend to larger, transformer-based models. In addition, we motivate the problem space by discussing the mesa optimizer: a problem within AI alignment theory that relies on sufficient generalization.\end{abstract}


\section{Introduction}

Deep neural networks have shown impressive results across a variety of tasks, with transformer-based models \cite{vaswani_attention_2017} such as BERT \cite{devlin_bert_2019}, RoBERTa \cite{liu_roberta_2019}, and T5 \cite{raffel_exploring_2020} setting the state of the art on NLP benchmarks. Research on these models has attempted to formalize their behavior, usually around the learning curve during training.

Typically, a model during training is considered to pass through two phases: one where the model has not yet sufficiently learned and a generalization gap remains, and one where the model begins to overfit and test error increases. A variety of studies have discussed this phase transition. \cite{merrill_effects_nodate} note that discrete structures appear within transformers during training, where the attention softmaxes increasingly approximate hardmaxes as training progresses. A line of information bottleneck (IB) work, first proposed by \cite{tishby_deep_2015}, notes that both generalization and overfitting can be observed by plotting mutual information \cite{shwartz-ziv_opening_2017}.


This paper seeks to consolidate and summarize the findings from these sometimes conflicting approaches to measuring compression on the information plane (IP). To do this, the paper is split into two parts. The first summarizes the work done on network saturation and the IB, with Table \ref{sources-table} listing the sources used for this survey. The second discusses how the work done on network saturation relates to the work done on the IB.

Our main conclusions are:
\begin{itemize}
%\item Multiple descent may only appear on datasets within a "critical regime", which means that sufficiently large datasets may not exhibit this.
\item The information bottleneck principle holds, with some limitations, for nearly every neural network tested so far; for example, compression sometimes appears on the information plane only for particular layers.
\item Research on the information bottleneck and mutual information measurements is largely focused on setting theoretical bounds and has rarely been applied outside of small models. Furthermore, information plane behavior under Adam-style optimizers remains comparatively unexplored. Calculating mutual information within a transformer may be a fruitful area of study.
\item While compression occurs, saturation is not guaranteed to occur in smaller settings. However, for transformer models, saturation \textit{is} observed for nearly all state-of-the-art models.

\end{itemize}

\subsection{Related Work}
We note that \cite{geiger_information_2021} covers much of the same ground in surveying the information bottleneck papers, and includes some work drawing attention to the relationship between geometric compression and the information-theoretic bounds within the IB. Our work differs from \cite{geiger_information_2021} in that we highlight the role of network saturation in relation to compression, and cite work on practical applications of the IB to models in active use.
Two other surveys of the IB should be mentioned: \cite{hafez-kolahi_information_2019}, which surveys the variety of information extraction methods, and \cite{goldfeld_information_2020}, which covers much of the same material as \cite{geiger_information_2021}. We contrast our work by discussing the particularities around network saturation and relating it to work done on transformers, which neither of the other two papers attempts to do.

\begin{table*}[htbp]
\begin{tabularx}{\textwidth}{X|X|X|X|X}
  \textbf{Reference} & \textbf{Obs. Approach} & \textbf{Architecture} & \textbf{Dataset} & \textbf{Train}\\
\hline
%\cite{nakkiran_deep_2019} & Descent & ResNets, CNN, transformer & CIFAR-? & ADAM, SGD \\
%\cite{dascoli_triple_2020} & Descent & MLP & CIFAR-? & SGD \\
\cite{chelombiev_adaptive_2019} & Saturation & MLP & Custom & ADAM \\
\cite{merrill_effects_nodate} & Saturation & transformer & Various & ADAM \\
\cite{wickstrom_information_2019} & IB & MLP, CNN, VGG16 & MNIST, CIFAR-10 & SGD \\
\cite{saxe_information_2019} & IB & MLP & SZT, MNIST* (augmented) & SGD, BGD \\
\cite{shwartz-ziv_opening_2017} & IB & MLP & SZT & SGD \\
\cite{noshad_scalable_2018} & IB & MLP, CNN & MNIST & ADAM \\
\cite{shen_information-theoretic_2020} & IB & CNN & MNIST, CIFAR-10 & SGD \\
\end{tabularx}
\caption{Sources surveyed, with the observation approach, architecture, dataset, and training method used in each.}
\label{sources-table}
\end{table*}

\section{Information Bottleneck}
% talk about shannon? \
\subsection{Mutual Information}
\label{mutual-information}
In information theory, Claude Shannon introduced the concept of entropy as a measure of the uncertainty in the outcome of a random variable, simultaneously quantifying information content and choice.

Many measures of entropy have been defined, from discrete entropy, to differential entropy for continuous random variables, to conditional entropy for conditioned random variables.

Discrete entropy can be defined as
\begin{equation} \label{discrete-entropy}
H(X) = -\sum_{i=1}^{N} p(x_i) \log p(x_i)
\end{equation}

Entropy in this case satisfies three main properties: it is non-negative, it is maximal for a uniformly distributed random variable, and deterministic distributions have an entropy of zero.
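As a quick worked example of Equation \ref{discrete-entropy} (our illustration), a fair coin has
\begin{equation*}
H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit},
\end{equation*}
while a biased coin with $p = 0.9$ has $H(X) = -(0.9\log_2 0.9 + 0.1\log_2 0.1) \approx 0.47$ bits, reflecting its lower uncertainty.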

Continuous (differential) entropy can be defined by replacing the sum with an integral:

\begin{equation} \label{cont-entropy}
H(X) = -\int p(x) \log p(x) \, dx
\end{equation}

Conditional entropy can be interpreted as the average amount of uncertainty remaining in the random variable $X$ when the outcome of $Y$ is known. Notably, it satisfies the following properties:
\begin{itemize}
  \item For any two variables $X$ and $Y$, the conditional entropy $H(X|Y)$ is at most $H(X)$: $H(X|Y) \leq H(X)$
  \item If $X$ and $Y$ are independent, then conditioning provides no information: $H(X|Y) = H(X)$
  \item If $X$ is completely determined by $Y$, then $H(X|Y) = 0$
\end{itemize}

  From these definitions it is possible to construct the concept of mutual information $I$:

  \begin{equation}
    \begin{gathered}
      I(X;Y) = H(X) - H(X|Y) \\
      = H(Y) - H(Y|X)
    \end{gathered}
  \end{equation}

  The definition of $I(X;Y)$ implies the following (a brief numerical sketch follows the list):
  \begin{itemize}
  \item If $X$ and $Y$ are independent, then $I(X;Y) = 0$
  \item If $X$ and $Y$ have some non-deterministic statistical relationship, then $I(X;Y)$ captures the portion of that relationship not attributable to random chance
    \item If $Y$ is a deterministic function of $X$, $Y = f(X)$, then $H(Y|X) = 0$ and $I(X;Y) = H(Y)$
    \end{itemize}
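To make these identities concrete, the following minimal Python sketch (our own illustration, not drawn from any of the surveyed papers) computes $I(X;Y)$ for a small discrete joint distribution, using the equivalent identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$:

\begin{verbatim}
import numpy as np

def entropy(p):
    # Discrete entropy in bits, skipping zero-probability cells.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.flatten())
print(mi)  # ~0.278 bits; independence would give 0
\end{verbatim}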


\subsection{Information Bottleneck Principle}
The information bottleneck (IB) was first described by \cite{tishby_deep_2015}: for random variables (RVs) $X$ and $Y$, there exists a maximally compressed representation $T$. For two RVs with a statistical dependency on each other, $T$ is always assumed to exist. A Markov chain $T$--$X$--$Y$ can then be drawn, with the joint probability distribution:

\[p(X,Y,T) = p(T|X)p(Y|X)p(X) \]

Assuming each hidden layer of the neural network is $T_{n}$, the combination of encoder layers can be modeled as predicting $P(T|X)$, whereas the decoder layers predict $P(Y|T)$ \cite{shwartz-ziv_opening_2017}. By reusing the mutual information defined in Section \ref{mutual-information}, a few properties of $T$ can be formalized (the resulting trade-off is stated below):
\begin{itemize}
    \item $I(Y;T)$ describes the \textit{accuracy} of a representation
    \item $I(T; X)$ provides the \textit{compression} or \textit{complexity} of a representation.
\end{itemize}
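The trade-off between these two quantities is made explicit in the IB objective \cite{tishby_deep_2015}, which seeks the most compressed $T$ that remains predictive of $Y$, with a Lagrange multiplier $\beta$ controlling the balance:
\begin{equation*}
\min_{p(t|x)} \; I(X;T) - \beta I(Y;T)
\end{equation*}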

Given these two measurements, it is possible to construct an information plane (IP) by plotting $I(X;T)$ against $I(Y;T)$, as in Figure \ref{fig:neuron-ip}.

\begin{figure}
    \begin{tikzpicture}
  \begin{axis}[
    xlabel={$I(X;T)$},
    ylabel={$I(Y;T)$},
    xmin=0, xmax=4, ymin=0, ymax=1.2
  ]
    % fitting phase: both mutual information terms increase
    \addplot[thick, domain=0:3.5, samples=50] {1 - exp(-x)};
    % compression phase: I(X;T) shrinks while I(Y;T) stays high
    \addplot[thick, dashed, domain=2:3.5, samples=2] {0.97};
  \end{axis}
\end{tikzpicture}
    \caption{Schematic information plane trajectory (illustrative, not measured data): during fitting, both $I(X;T)$ and $I(Y;T)$ increase (solid); during compression, $I(X;T)$ decreases while $I(Y;T)$ stays roughly constant (dashed).}
    \label{fig:neuron-ip}
\end{figure}

Using the training epoch parameter $t$, the IP can be used to examine two distinct phases: first, a \textit{fitting} phase, where $I(X;T)$ and $I(Y;T)$ both increase as the model learns to fit the target, and then a \textit{compression} phase, where $I(X;T)$ decreases while $I(Y;T)$ remains about the same, until reaching a point of \textit{overfitting}, where $I(Y;T)$ drops. In other words, the critical phase transition on the regular learning curve occurs when $I(Y;T)$ begins to decrease.

% discuss how only saxe found it, but actually this may be an incorrect measurement of mutual information
In \cite{shwartz-ziv_opening_2017}, the authors observed these two phases occurring across all of their experiments. Saxe et al. \cite{saxe_information_2019} found no compression on the IP for a ReLU MLP trained with either SGD or BGD. The authors of \cite{noshad_scalable_2018} note that the estimates of MI differed between \cite{saxe_information_2019} and \cite{shwartz-ziv_opening_2017}, and propose a new, linear-complexity estimator of MI, under which ReLU networks do show a compression phase.

The authors of \cite{wickstrom_information_2019} use a different, kernel-based estimator, with which they find compression within the softmax layer of a CNN with three convolutional layers and two fully connected layers of width $400-256$ trained on MNIST. \cite{wickstrom_information_2019} also show that compression occurs for all layers in a $1000-20-20-20$ MLP trained on MNIST. Given that all networks investigated in \cite{wickstrom_information_2019} showed some form of compression, and that the work in \cite{noshad_scalable_2018} refutes the only case where compression was \textit{not} observed, IB theory holds and may have further applications, albeit on particular layers.


% discuss how to actually calculate mutual information
As described above, measurements of mutual information within DNNs are not exact: they are the result of an estimation process and inherit the uncertainty of that process \cite{chelombiev_adaptive_2019}. In addition, mutual information can be estimated in a variety of ways \cite{noshad_scalable_2018, poole_variational_2019, davis_network_2020, gabrie_entropy_2019, shen_information-theoretic_2020}, which further increases the uncertainty around any given estimate.

Fundamentally, all estimators of mutual information confront the same problem. If the mutual information between $T$ and $X$ is defined as
\begin{equation}
  \begin{gathered}
    I(T;X) = H(T) - H(T|X)
  \end{gathered}
\end{equation}

and $T$ is a deterministic function of the continuous input $X$, then following \cite{chelombiev_adaptive_2019} and Equation \ref{cont-entropy}, the conditional entropy $H(T|X)$ degenerates:

\begin{equation}
  \begin{gathered}
    H(T|X) = - \int p(t|x) \log p(t|x) \, dt = -\infty
  \end{gathered}
\end{equation}

As a result, the mutual information becomes infinite. To resolve this problem, \cite{shwartz-ziv_opening_2017} note the importance of adding noise to the process, a point confirmed by \cite{saxe_information_2019}. Noise $Z$ can be added to $T$ either as a Gaussian mixture or by discretizing into bins, resulting in the noisy variable $\hat{T} = T + Z$. This transforms the conditional entropy to $H(\hat{T}|X) = H(Z)$, and MI becomes $I(\hat{T};X) = H(\hat{T}) - H(Z)$.

If the added noise is Gaussian, then a kernel density estimator can be used to estimate the mutual information. If discretized into bins, the noise arises as part of the discretization process itself. As \cite{chelombiev_adaptive_2019} specifically note, DNNs with double-sided saturating activation functions are not as sensitive to differences in MI estimation, but non-saturating activations may or may not show compression depending on how MI is estimated.
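As a concrete illustration of the binning approach, the following minimal Python sketch (our own hypothetical implementation in the spirit of \cite{shwartz-ziv_opening_2017}, not their released code) estimates $I(T;X)$ for a deterministic network; after binning, $H(\hat{T}|X) = 0$ over the sample, so $I(\hat{T};X)$ reduces to the entropy of the binned activations:

\begin{verbatim}
import numpy as np

def entropy_of_rows(rows):
    # Entropy (bits) of the empirical distribution over rows.
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def binned_mi_with_input(activations, n_bins=30):
    # activations: (n_samples, layer_width) array of hidden
    # activations. Discretize into equal-width bins; for a
    # deterministic network, I(T;X) = H(binned T).
    edges = np.linspace(activations.min(),
                        activations.max(), n_bins + 1)
    return entropy_of_rows(np.digitize(activations, edges))
\end{verbatim}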

\cite{shen_information-theoretic_2020} describe a framework for estimating entropy between layers. The authors formalize the entire data available in a CNN as a four-dimensional array $dCNN(X, L, C, T)$, where $X$ is the input data, $L$ is the layer dimension, $C$ is the convolutional filters, and $T$ is the epoch number. Using CNNs trained on MNIST and CIFAR-10, they use this four-dimensional representation to track information flow by estimating the change in entropy between layers. \cite{shen_information-theoretic_2020} sidestep the MI estimation debates by simply measuring entropy, which may prove far more tractable for larger networks. The authors also note that they wish to apply their methods to transformer-based models and RNNs in the future.
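A loose sketch of this idea in Python (our illustration, not the authors' implementation) would track binned entropy layer by layer and epoch by epoch:

\begin{verbatim}
import numpy as np

def binned_entropy(a, n_bins=30):
    # Entropy (bits) of activations after equal-width binning.
    hist, _ = np.histogram(a, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def entropy_flow(acts):
    # acts[l][t]: activations of layer l at epoch t, loosely
    # mirroring the (X, L, C, T) indexing described above.
    # Returns the entropy change between consecutive layers.
    n_layers, n_epochs = len(acts), len(acts[0])
    flow = np.zeros((n_layers - 1, n_epochs))
    for l in range(n_layers - 1):
        for t in range(n_epochs):
            flow[l, t] = (binned_entropy(acts[l + 1][t])
                          - binned_entropy(acts[l][t]))
    return flow
\end{verbatim}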
% Generally, when considering mutual information in DNNs, the analyzed values are technically the result of the estimation process and, therefore, are highly sensitive to it.- chelombiev_adaptive_2019
% mutual information is not easy to estimate

% discuss how none of these are very deep, and concerns with calculating mutual information beyond a certain amount
Of the five papers that discussed the IB, only two implemented something other than a multilayer perceptron (MLP) \cite{noshad_scalable_2018, shen_information-theoretic_2020}. Furthermore, for all the papers that implemented an MLP, the network itself was fairly shallow and the parameter counts very small, never going beyond five layers. As a result, it is unclear how well the IB bounds hold for larger neural networks. To the best of the authors' knowledge, \cite{shen_information-theoretic_2020} is the only paper which has attempted to formalize the IB for larger networks.
\subsection{Network Saturation}
\cite{chelombiev_adaptive_2019} use two types of adaptive estimation schemes for mutual information (MI). For both approaches, they find that compression is always present, even when saturation is not. The authors note that compression does not always happen at later stages of training, but often occurs from initialization, depending on network architecture. They also note that the activation function plays a role not only in whether compression occurs, but in how it occurs. As discussed above, while \cite{saxe_information_2019} note that single-sided saturating nonlinearities or linear activation functions do not yield compression, \cite{noshad_scalable_2018} argue that \cite{saxe_information_2019} contained a poor measurement of MI, and found that ReLU networks do show compression. Nevertheless, all authors agree that double-sided nonlinearities such as $\tanh$ or sigmoid exhibit compression phases.

\cite{merrill_effects_nodate} note that as parameter norms grow during training of a transformer network, the network approaches a discretized network with saturated activation functions. The authors find that the internal representations of pretrained transformers approximate saturated networks, while those of randomly initialized transformers do not. Specifically, for T5 \cite{raffel_exploring_2020}, they establish that the parameter $\ell_2$ norm grows during training. As the $\ell_2$ norm grows, the network approaches saturation, where the softmaxes of each layer begin to mimic hardmaxes.
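The diagnostics involved are straightforward to sketch. The following Python fragment (our hypothetical illustration, not the measurement protocol of \cite{merrill_effects_nodate}) tracks the global parameter norm across checkpoints and how close attention distributions are to one-hot hardmaxes:

\begin{verbatim}
import torch

def param_l2_norm(model):
    # Global l2 norm over all parameters; per the discussion
    # above, this is expected to grow across checkpoints.
    return torch.sqrt(sum((p ** 2).sum()
                          for p in model.parameters()))

def attention_peakedness(attn):
    # attn: attention distributions whose last dim sums to 1.
    # Mean max weight approaches 1.0 as softmax nears hardmax.
    return attn.max(dim=-1).values.mean()
\end{verbatim}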


\section{Discussion}
The information bottleneck (IB) principle provides a potentially fruitful area of research for the analysis of neural networks. One of the main difficulties is that information plane (IP) behaviors rely on estimators of mutual information (MI), which are subject to artifacts of the estimation process.
%Despite there being close relationships between the information plane (IP) and learning rates, no research exists where the effects are compared. Given the variety of methods that can be used to calculate mutual information (MI) $I$, it would seem important to validate MI with a learning curve on a model. For example, in the given case of the double descent small model \cite{nakkiran_deep_2019} figure 9, where the intermediate and large models exhibit double descent at higher epochs \textit{after} initially appearing to overfit, it would be expected that the IP plot would exhibit similar behaviors, where $I(Y;T)$ would decrease, then eventually increase again.


%In addition, \cite{merrill_effects_nodate} find that the parameter norm growth also occurs during the training process, before the model begins to overfit. If there were a measure of MI for transformers, where

\cite{chelombiev_adaptive_2019} develop a more robust form of estimating mutual information, showing that compression within the IP depends on the activation function used, and note that saturation of the activation function is not required for compression. Like all the IB papers, the networks analyzed in \cite{chelombiev_adaptive_2019} were very small multilayer perceptrons, for tractability. However, given that a weak relationship could be found between saturation and compression, combining this work with \cite{merrill_effects_nodate} may offer new insights into how transformers learn, even without a robust measurement of MI.

For a large DNN, \cite{shen_information-theoretic_2020} provides the most tractable approach so far, as it measures entropy directly rather than estimating mutual information.

Within learning theory, it is possible to estimate the generalization error bounds of a learning algorithm. Since the IB can be formalized as a learning algorithm, it is possible to construct the generalization error for a given neural network during training, which would prove useful for network analysis. In \cite{shamir_learning_nodate}, the authors describe mutual information as a measure of performance, providing the example of a document classifier. The authors state that the likelihood of misclassifying a particular document, using its words as a sample of points, has a strict upper bound, leading to a quantified measure of generalization error.


Given that it remains difficult and ambiguous to calculate MI in high dimensions \cite{poole_variational_2019}, if a correlation to the IB could be found or approximated via the measurement of norm growth, then it may be possible to approximate the generalization error. The relationship between saturated neural networks and the IB, alongside the observation that transformers are in some cases already saturated, may suggest a useful meeting point for future studies.


% include your own bib file like this:
\bibliography{refs}



\end{document}