2.2. Entropy
While computing complexity of classes of functions by its cardinality is natural, it reports complexity of every infinite dimensional classes to be the same. We need tools that can differentiate complexity more meticulously. Such tools are entropies of various kinds.
For this, we denote the $L^p(Q)$norm by
\[\g\_{p,Q}:=\left( \int g^p dQ \right)^{1/p}\]and denote the supremum norm by
\[\g\_\infty := \sup_{x \in \mathcal X} g(x).\]Notations here are a little different than that of van de Geer (2000). I indicated $p$ and $Q$ as subscripts while van de Geer (2000) prefered to indicate them as arguments of a function.
 Entropy for the $L^p$norm
 Entropy for the supremum norm
 Entropy with bracketing for the $L^p$norm
 (optional) Entropy with packing for the $L^p$norm
 Totally bounded classes
Entropy for the $L^p$norm
The first entropy is defined by the $\delta$covering of the function class $\mathcal G.$^{1}
(1) For $\delta>0,$ $\mathcal C = \{g_1,\cdots,g_N\} \subset \mathcal G$ is a $\delta$covering of $\mathcal G,$ if for all $g \in \mathcal G,$ there exists $g_i\in\mathcal G$ such that $\ g  g_i \_{p, Q} \le \delta.$ i.e. $\mathcal G \subset \cup_{i=1}^N \overline B(g_i, \delta) ,$ where $\overline B(g_i, \delta) := \{ f \in \mathcal G:~ \fg_i\_{p,Q} \le \delta \}.$
(2) $$N_{p,Q}(\delta, \mathcal G) := \inf\{\text{card}(\mathcal C):~ \mathcal C \text{ is a $\delta$covering of $\mathcal G$} \}$$ is the $\delta$covering number of $\mathcal G.$
(3) $$H_{p,Q}(\delta,\mathcal G) := \log N_{p,Q}(\delta,\mathcal G)$$ is the $\delta$entropy of $\mathcal G$ for the $L^p(Q)$norm.
For simplicity, we will simply denote “$\delta$entropy of $A$ for the $L^p$norm” by “$\delta$entropy of $A$”.
Entropy for general metric
We only defined the entropies for the class of Lebesgueintegrable functions. However, entropies for sets in general metric spaces can be defined similarly. For example, consider the real line with the standard metric $(\mathbb R, d).$ We can define the $\delta$entropy of the subset $A\subset\mathbb R$ for the metric $d$ by
\[N(\delta, A, d) := \log \left( \inf\{ \text{card}(I):~ I\subset\mathbb R,~ I \text{ is the $\delta$covering of $\mathbb R$}\} \right)\]where the covering is also defined accordingly.
In fact, there is a known entropy inequality for $\mathbb R^d.$ It would be helpful to mention it before we see the entropy inequality of the entropy of a function class.
The “changeofmetric” lemma^{2}
The relationship between the entropies for the two different metrics follows the relationship between the metrics.
Let $\varepsilon=u^{1}(\delta).$ Given $C_2=\{x_1,\cdots,x_m\}$ be a $\varepsilon$covering of $A$ with respect to $d_2.$ For any $x\in A,$ there exists $x_i\in C_2$ such that $d_2(x_i,x) \le \varepsilon.$ This implies $$ d_1(x_i, x) \le u(d_2(x_i, x)) \le u(\varepsilon) = \delta. $$ Hence $C_2$ is a $\delta$covering of $A$ with respect to $d_1.$
(optional) Relationship between entropies with respect to different metrics
The following relationships between the metrics are direct from their definition.
(1) $\pq\_1 \le h(p,q) \cdot \sqrt{4h^2(p,q)} \le 2h(p,q).$
(2) $h^2(p,q) \le \pq\_1.$
(3) $\pq\_1 \le 2K(p,q).$
(4) $h^2(p,q) \le \frac12 K(p,q).$
Applying the “changeofmetric” lemma with these facts gives us useful relationships.
(1) $N(\delta, \mathcal F, h) \le N(\delta^2, \mathcal F, \\cdot\_1).$
(2) $N(\delta, \mathcal F, \\cdot\_1) \le N(\delta/2,\mathcal F, h).$
Entropy for the supremum norm
The entropy for the supnorm is defined by plugging in $p=\infty$ to the definition of the $L^p$norm version. However, notice that the supnorm version is independent to the measure. We can simply denote the supnorm version of the entropy by
\[N_\infty(\delta, \mathcal G) \equiv N_{\infty,Q}(\delta,\mathcal G),\]regardless of $Q.$
Entropy with bracketing for the $L^p$norm
Another entropy of importance is the bracketing entropy, which is defined by bracket, an “interval” of functions.
(1) Let $\ell, u: \mathcal X \to \mathbb R$ satisfy $\ell \le u,$ $\u\ell\_{p,Q} \le \delta.$ The set $[\ell, u]:=\{f:\mathcal X\to\mathbb R,~ \ell\le f\le u\}$ is a $\delta$bracket for the $L^p(Q)$norm.
(2) $d(\ell, u) = \sup_{s,t\in[\ell,u]}d(s,t)$ is the size of a bracket [\ell, u].
(3) $$N_{B,p,Q}(\delta, \mathcal G) := \inf \left\{ n\in\mathbb N \cup\{\infty\} :~ \{[\ell_i,u_i]\}_{i=1}^n \text{ are $\delta$brackets such that } \mathcal G \subset \cup_{i=1}^n[\ell_i, u_i] \right\}$$ is the $\delta$bracketing number of $\mathcal G$ for the $L^p(Q)$norm.
(4) $$H_{B,p,Q}(\delta,\mathcal G) := \log N_{B,p,Q}(\delta,\mathcal G, Q)$$ is the $\delta$bracketing entropy of $\mathcal G$ for the $L^p(Q)$norm.
We will simply call it the $\delta$bracketing entropy if $p$ and $Q$ are clear.
The lemma provides the relationship between bracketing entropy and the entropy.
(1) If $1\le p<\infty,$ then for all $\delta>0,$ $H_{p,Q}(\delta,\mathcal G) \le H_{B,p,Q}(\delta,\mathcal G).$
(1) If $1\le p<\infty$ and $Q$ is a probability measure, then for all $\delta>0,$ $H_{B,p,Q}(\delta,\mathcal G) \le H_{\infty}(\delta/2,\mathcal G).$
(optional) Entropy with packing for the $L^p$norm
Still other entropy that uses the notion of packing exists.
(1) Let $\mathcal C := \{g_1, \cdots, g_m\} \subset \mathcal G.$ If $\g_ig_j\ > \delta$ for all $g_i,g_j\in\mathcal C,$ $i\ne j,$ then $\mathcal C$ is a $\delta$packing for the $L^p(Q)$norm.
(2) $$D_{p,Q}(\delta, \mathcal G) := \sup \left\{ \text{card}(\mathcal C):~ \mathcal C \text{ is a $\delta$packing of }\mathcal G \right\}$$ is the $\delta$packing number of $\mathcal G$ for the $L^p(Q)$norm.
(3) $$\log D_{p,Q}(\delta,\mathcal G, Q)$$ is the $\delta$packing entropy of $\mathcal G$ for the $L^p(Q)$norm.
Packing entropy for general metrics can also be defined as the above. In fact, the packing entropy conveys the same information on complexity of sets.
(1) By definition, even the biggest $\delta$packing of $A$ is a $\delta$covering of $A.$ Therefore the first inequality follows.
(2) Now, let $\mathcal C = \{y_i\}_{i=1}^N$ be the biggest $\delta$packing and $\mathcal D=\{x_i\}_{i=1}^M$ be the smallest $\delta/2$covering. Supposer $\text{card}(\mathcal D) < \text{card}(\mathcal C).$ Since $A \subset \cup_{x_i \in \mathcal D} \overline B(x_i,\delta/2),$ $$ \exists y_i, y_j \in \mathcal C \text{ and } x_\ell \in \mathcal D \text{ such that } y_i, y_j \in \overline B(x_\ell, \delta/2). $$ Hence by the triangle inequality, $d(y_i, y_j) \le \delta,$ which is a contradiction.
Totally bounded classes
We say a class $\mathcal G$ is totally bounded with respect to $L^p(Q)$norm if and only if for all $\delta>0,$ there exists a finite $\delta$covering of $\mathcal G$ for $L^p(Q)$norm. If the equipped metric is clear, we simply say that $\mathcal G$ is totally bounded. The definition can be simplified by using the notion of entropy:
\[\mathcal G \text{ is totally bounded.} \iff H_{p,Q}(\delta,\mathcal G) < \infty,~ \forall \delta>0.\]Since the bracketing entropy bounds the entropy from above, if the $\delta$bracketing entropy of $\mathcal G$ is finite for all $\delta>0,$ it implies that $\mathcal G$ is totally bounded.
By definition, totally bounded $\mathcal G$ is bounded. Hence we get for some $R \in\mathbb R$ that
\[\sup_{g\in\mathcal G} \g\_{p,Q} \le R < \infty.\]Totally boundedness is important since it generalizes the concept of compactness. It also allows us to approximate such class by finite one, which makes possible to apply useful results from finitedimensional classes into many of the infinitedimensional ones.
References
 van de Geer. 2000. Empirical Processes in Mestimation. Cambridge University Press.
 Theory of Statistics II (Fall, 2020) @ Seoul National University, Republic of Korea (instructor: Prof. Jaeyong Lee).

The definition in the textbook allows the center of a $\delta$covering to not to be an element of $\mathcal G.$ In this sense, it is actually a external $\delta$covering. If one defines the covering so that its center must be an element of $\mathcal G,$ we say it is an internal $\delta$covering. Actually whether the covering is internal or external is not important since for all $p$ and $Q,$ the relationship holds: \(N^\text{ext}_{p,Q}(\delta,\mathcal G) \le N^\text{int}_{p,Q}(\delta,\mathcal G) \le N^\text{ext}_{p,Q}(\delta/2,\mathcal G).\) ↩

It is not a standard name. ↩