2.2. Entropy

$\newcommand{\argmin}{\mathop{\mathrm{argmin}}\limits}$ $\newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits}$

While computing complexity of classes of functions by its cardinality is natural, it reports complexity of every infinite dimensional classes to be the same. We need tools that can differentiate complexity more meticulously. Such tools are entropies of various kinds.

For this, we denote the $L^p(Q)$-norm by

\[\|g\|_{p,Q}:=\left( \int g^p dQ \right)^{1/p}\]

and denote the supremum norm by

\[\|g\|_\infty := \sup_{x \in \mathcal X} |g(x)|.\]

Notations here are a little different than that of van de Geer (2000). I indicated $p$ and $Q$ as subscripts while van de Geer (2000) prefered to indicate them as arguments of a function.


Entropy for the $L^p$-norm

The first entropy is defined by the $\delta$-covering of the function class $\mathcal G.$1


(1) For $\delta>0,$ $\mathcal C = \{g_1,\cdots,g_N\} \subset \mathcal G$ is a $\delta$-covering of $\mathcal G,$ if for all $g \in \mathcal G,$ there exists $g_i\in\mathcal G$ such that $\| g - g_i \|_{p, Q} \le \delta.$ i.e. $\mathcal G \subset \cup_{i=1}^N \overline B(g_i, \delta) ,$ where $\overline B(g_i, \delta) := \{ f \in \mathcal G:~ \|f-g_i\|_{p,Q} \le \delta \}.$

(2) $$N_{p,Q}(\delta, \mathcal G) := \inf\{\text{card}(\mathcal C):~ \mathcal C \text{ is a $\delta$-covering of $\mathcal G$} \}$$ is the $\delta$-covering number of $\mathcal G.$

(3) $$H_{p,Q}(\delta,\mathcal G) := \log N_{p,Q}(\delta,\mathcal G)$$ is the $\delta$-entropy of $\mathcal G$ for the $L^p(Q)$-norm.

For simplicity, we will simply denote “$\delta$-entropy of $A$ for the $L^p$-norm” by “$\delta$-entropy of $A$”.


Entropy for general metric

We only defined the entropies for the class of Lebesgue-integrable functions. However, entropies for sets in general metric spaces can be defined similarly. For example, consider the real line with the standard metric $(\mathbb R, d).$ We can define the $\delta$-entropy of the subset $A\subset\mathbb R$ for the metric $d$ by

\[N(\delta, A, d) := \log \left( \inf\{ \text{card}(I):~ I\subset\mathbb R,~ I \text{ is the $\delta$-covering of $\mathbb R$}\} \right)\]

where the covering is also defined accordingly.

In fact, there is a known entropy inequality for $\mathbb R^d.$ It would be helpful to mention it before we see the entropy inequality of the entropy of a function class.

Let $A=[-1,1]^p \subset \mathbb R^p,$ $N=N(\delta,A,d).$ Then $$N \le \left( \frac{\sqrt p}{\delta} + 1 \right)^p.$$


The “change-of-metric” lemma2

The relationship between the entropies for the two different metrics follows the relationship between the metrics.

Let $d_1, d_2$ be metrics on a space $\mathcal X.$ Suppose there is a strictly increasing function $u:[0,\infty)\to[0,\infty)$ such that $u(0)=0.$ Then $$ d_1(x,y) \le u(d_2(x,y)),~ \forall x,y\in\mathcal X \implies N(\delta,A,d_1) \le N(u^{-1}(\delta),A,d_2). $$

Let $\varepsilon=u^{-1}(\delta).$ Given $C_2=\{x_1,\cdots,x_m\}$ be a $\varepsilon$-covering of $A$ with respect to $d_2.$ For any $x\in A,$ there exists $x_i\in C_2$ such that $d_2(x_i,x) \le \varepsilon.$ This implies $$ d_1(x_i, x) \le u(d_2(x_i, x)) \le u(\varepsilon) = \delta. $$ Hence $C_2$ is a $\delta$-covering of $A$ with respect to $d_1.$


(optional) Relationship between entropies with respect to different metrics

The following relationships between the metrics are direct from their definition.

Let $p, q$ be densities with respect to the measure $\mu.$

(1) $\|p-q\|_1 \le h(p,q) \cdot \sqrt{4-h^2(p,q)} \le 2h(p,q).$

(2) $h^2(p,q) \le \|p-q\|_1.$

(3) $\|p-q\|_1 \le 2K(p,q).$

(4) $h^2(p,q) \le \frac12 K(p,q).$

Applying the “change-of-metric” lemma with these facts gives us useful relationships.

Let $\mathcal F$ be a collection of densities.

(1) $N(\delta, \mathcal F, h) \le N(\delta^2, \mathcal F, \|\cdot\|_1).$

(2) $N(\delta, \mathcal F, \|\cdot\|_1) \le N(\delta/2,\mathcal F, h).$


Entropy for the supremum norm

The entropy for the sup-norm is defined by plugging in $p=\infty$ to the definition of the $L^p$-norm version. However, notice that the sup-norm version is independent to the measure. We can simply denote the sup-norm version of the entropy by

\[N_\infty(\delta, \mathcal G) \equiv N_{\infty,Q}(\delta,\mathcal G),\]

regardless of $Q.$


Entropy with bracketing for the $L^p$-norm

Another entropy of importance is the bracketing entropy, which is defined by bracket, an “interval” of functions.


(1) Let $\ell, u: \mathcal X \to \mathbb R$ satisfy $\ell \le u,$ $\|u-\ell\|_{p,Q} \le \delta.$ The set $[\ell, u]:=\{f:\mathcal X\to\mathbb R,~ \ell\le f\le u\}$ is a $\delta$-bracket for the $L^p(Q)$-norm.

(2) $d(\ell, u) = \sup_{s,t\in[\ell,u]}d(s,t)$ is the size of a bracket [\ell, u].

(3) $$N_{B,p,Q}(\delta, \mathcal G) := \inf \left\{ n\in\mathbb N \cup\{\infty\} :~ \{[\ell_i,u_i]\}_{i=1}^n \text{ are $\delta$-brackets such that } \mathcal G \subset \cup_{i=1}^n[\ell_i, u_i] \right\}$$ is the $\delta$-bracketing number of $\mathcal G$ for the $L^p(Q)$-norm.

(4) $$H_{B,p,Q}(\delta,\mathcal G) := \log N_{B,p,Q}(\delta,\mathcal G, Q)$$ is the $\delta$-bracketing entropy of $\mathcal G$ for the $L^p(Q)$-norm.

We will simply call it the $\delta$-bracketing entropy if $p$ and $Q$ are clear.

The lemma provides the relationship between bracketing entropy and the entropy.

(1) If $1\le p<\infty,$ then for all $\delta>0,$ $H_{p,Q}(\delta,\mathcal G) \le H_{B,p,Q}(\delta,\mathcal G).$

(1) If $1\le p<\infty$ and $Q$ is a probability measure, then for all $\delta>0,$ $H_{B,p,Q}(\delta,\mathcal G) \le H_{\infty}(\delta/2,\mathcal G).$


(optional) Entropy with packing for the $L^p$-norm

Still other entropy that uses the notion of packing exists.


(1) Let $\mathcal C := \{g_1, \cdots, g_m\} \subset \mathcal G.$ If $\|g_i-g_j\| > \delta$ for all $g_i,g_j\in\mathcal C,$ $i\ne j,$ then $\mathcal C$ is a $\delta$-packing for the $L^p(Q)$-norm.

(2) $$D_{p,Q}(\delta, \mathcal G) := \sup \left\{ \text{card}(\mathcal C):~ \mathcal C \text{ is a $\delta$-packing of }\mathcal G \right\}$$ is the $\delta$-packing number of $\mathcal G$ for the $L^p(Q)$-norm.

(3) $$\log D_{p,Q}(\delta,\mathcal G, Q)$$ is the $\delta$-packing entropy of $\mathcal G$ for the $L^p(Q)$-norm.

Packing entropy for general metrics can also be defined as the above. In fact, the packing entropy conveys the same information on complexity of sets.

$$ N(\delta,A,d) \le D(\delta,A,d) \le N(\delta/2,A,d). $$

(1) By definition, even the biggest $\delta$-packing of $A$ is a $\delta$-covering of $A.$ Therefore the first inequality follows.
(2) Now, let $\mathcal C = \{y_i\}_{i=1}^N$ be the biggest $\delta$-packing and $\mathcal D=\{x_i\}_{i=1}^M$ be the smallest $\delta/2$-covering. Supposer $\text{card}(\mathcal D) < \text{card}(\mathcal C).$ Since $A \subset \cup_{x_i \in \mathcal D} \overline B(x_i,\delta/2),$ $$ \exists y_i, y_j \in \mathcal C \text{ and } x_\ell \in \mathcal D \text{ such that } y_i, y_j \in \overline B(x_\ell, \delta/2). $$ Hence by the triangle inequality, $d(y_i, y_j) \le \delta,$ which is a contradiction.


Totally bounded classes

We say a class $\mathcal G$ is totally bounded with respect to $L^p(Q)$-norm if and only if for all $\delta>0,$ there exists a finite $\delta$-covering of $\mathcal G$ for $L^p(Q)$-norm. If the equipped metric is clear, we simply say that $\mathcal G$ is totally bounded. The definition can be simplified by using the notion of entropy:

\[\mathcal G \text{ is totally bounded.} \iff H_{p,Q}(\delta,\mathcal G) < \infty,~ \forall \delta>0.\]

Since the bracketing entropy bounds the entropy from above, if the $\delta$-bracketing entropy of $\mathcal G$ is finite for all $\delta>0,$ it implies that $\mathcal G$ is totally bounded.

By definition, totally bounded $\mathcal G$ is bounded. Hence we get for some $R \in\mathbb R$ that

\[\sup_{g\in\mathcal G} \|g\|_{p,Q} \le R < \infty.\]

Totally boundedness is important since it generalizes the concept of compactness. It also allows us to approximate such class by finite one, which makes possible to apply useful results from finite-dimensional classes into many of the infinite-dimensional ones.


References

  • van de Geer. 2000. Empirical Processes in M-estimation. Cambridge University Press.
  • Theory of Statistics II (Fall, 2020) @ Seoul National University, Republic of Korea (instructor: Prof. Jaeyong Lee).


  1. The definition in the textbook allows the center of a $\delta$-covering to not to be an element of $\mathcal G.$ In this sense, it is actually a external $\delta$-covering. If one defines the covering so that its center must be an element of $\mathcal G,$ we say it is an internal $\delta$-covering. Actually whether the covering is internal or external is not important since for all $p$ and $Q,$ the relationship holds: \(N^\text{ext}_{p,Q}(\delta,\mathcal G) \le N^\text{int}_{p,Q}(\delta,\mathcal G) \le N^\text{ext}_{p,Q}(\delta/2,\mathcal G).\) 

  2. It is not a standard name.