3.4.2. Uniform Law of Vapnik-Chervonenkis Subgraph Classes
In the previous article, results building on lemma 3.1 were covered. Here, the uniform law for Vapnik-Chervonenkis subgraph classes, a useful class of functions closely related to the classification problem, will be derived from theorem 3.7.
Vapnik-Chervonenkis class
Similar to the various kinds of entropies, the Vapnik-Chervonenkis index (VC index, for short) is frequently used to measure the complexity of a collection of sets of interest.
Let $\mathcal D$ be a collection of subsets of $\mathcal X.$ Let $\xi_1,\cdots,\xi_n \in \mathcal X$ be fixed points on $\mathcal X.$
(1) $$\Delta^\mathcal{D}(\xi_1,\cdots,\xi_n) := \text{Card}\left(\left\{D \cap \{\xi_1,\cdots,\xi_n\}:~ D\in\mathcal D\right\}\right)$$ is the number of subsets of $\{\xi_1,\cdots,\xi_n\}$ that can be selected using $\mathcal D.$
(2) $$m^\mathcal{D}(n) := \sup_{x_1,\cdots,x_n\in\mathcal X}\left\{ \Delta^{\mathcal D}(x_1,\cdots,x_n) \right\}$$ is the supremum of $\Delta^{\mathcal D}$ that can be achieved by all possible sets in $\mathcal X$ of cardinality $n.$ If $\mathcal D$ can select all subsets of $\{\xi_1,\cdots,\xi_n\},$ then we say $\mathcal D$ shatters $\{\xi_1,\cdots,\xi_n\}.$
(3) $$V(\mathcal D) := \inf\{n:~ m^\mathcal{D}(n) < 2^n\},$$ the minimal $n$ such that no set of cardinality $n$ is shattered by $\mathcal D$, is the Vapnik-Chervonenkis index of $\mathcal D.$
(4) $\mathcal D$ is a Vapnik-Chervonenkis class, if $V(\mathcal D) < \infty.$
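These definitions can be checked by brute force on a small finite example. The sketch below (function names are mine, not standard) computes $\Delta^{\mathcal D},$ $m^{\mathcal D}(n),$ and $V(\mathcal D)$ for $\mathcal D$ consisting of all subsets of cardinality at most $2$ of a five-point set:

```python
from itertools import combinations

def picks(D, pts):
    """Delta^D: distinct subsets of pts picked out by the collection D."""
    return {frozenset(S & set(pts)) for S in D}

def m(D, X, n):
    """m^D(n): the maximum of Delta^D over all n-point subsets of X."""
    return max(len(picks(D, pts)) for pts in combinations(X, n))

def vc_index(D, X):
    """V(D): smallest n with m^D(n) < 2^n (None if D shatters all of X)."""
    for n in range(1, len(X) + 1):
        if m(D, X, n) < 2 ** n:
            return n
    return None

X = range(1, 6)
D = [set(c) for k in (0, 1, 2) for c in combinations(X, k)]
print(vc_index(D, X))  # 3: every pair is shattered, but no triple is
```

Every 2-point set is shattered here ($\emptyset,$ the two singletons, and the pair are all picked out), while a 3-point set yields only $1+3+3=7<2^3$ subsets, so $V(\mathcal D)=3.$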
For intuition, consider a “classification” (or should I say “separation,” since there are only covariates) task. Suppose that $\mathcal X$ is a space of training data for a classification problem and $\mathcal D$ is a collection of subsets of $\mathcal X$ that consist of values of the same target class. Assume in addition that $\{\xi_1,\cdots,\xi_n\}$ is a set of unseen data points. Then the VC index of $\mathcal D$ indicates the plausibility or complexity of the task:
- If $V(\mathcal D) = \infty,$ the problem cannot be solved completely using the training data.
- If $V(\mathcal D)=k<\infty,$ the problem can be solved, and in the worst-case scenario it takes $k$ separations to completely separate the unseen data.
Since completely shattering a set of cardinality $n$ requires $2^n$ separations, if $m^\mathcal{D}(n)$ grows polynomially in $n$ (i.e., slower than any exponential rate), then $\mathcal D$ is a VC class.
As examples, one can show that the following classes are all VC classes.
- $\mathcal D_1 = \{(-\infty, t]:~ t\in\mathbb R\},$ a collection of subsets of $\mathbb R.$ $\implies m^{\mathcal D_1}(n)=n+1,$ hence $V(\mathcal D_1)=2.$
- $\mathcal D_2 = \{(-\infty, t]:~ t\in\mathbb R^d\},$ a collection of subsets of $\mathbb R^d.$ $\implies m^{\mathcal D_2}(n)\le(n+1)^d,$ polynomial in $n.$
- $\mathcal D_3 = \left\{ \{ x:~ \theta^\intercal x > y \}:~ \theta \in \mathbb R^d,~ y\in\mathbb R \right\},$ a collection of subsets of $\mathbb R^d.$ $\implies V(\mathcal D_3)=d+2.$
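The claims about $\mathcal D_1$ can be verified numerically. A half-line $(-\infty,t]$ picks out the same subset as a threshold placed at one of the points or below all of them, so the infinite class reduces to finitely many thresholds (a minimal sketch; helper names are mine):

```python
def picked_out(points):
    """Subsets of `points` picked out by D1 = {(-inf, t] : t in R}."""
    thresholds = list(points) + [min(points) - 1.0]
    return {frozenset(p for p in points if p <= t) for t in thresholds}

def shatters(points):
    return len(picked_out(points)) == 2 ** len(points)

pts = [0.3, 1.2, 2.7, 4.1]
print(len(picked_out(pts)))  # 5 = n + 1
print(shatters([0.3]))       # True:  every singleton is shattered
print(shatters([0.3, 1.2]))  # False: the larger point alone is never picked out
```

Every singleton is shattered while no pair is (a half-line cannot contain the larger point without the smaller), which is exactly $V(\mathcal D_1)=2.$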
Vapnik-Chervonenkis subgraph class
The VC index is defined for a collection of sets. To fit VC theory into the theory of empirical processes, we need to characterize functions by sets; the subgraph is one such choice.
(1) $$\text{sub.}f := \left\{ (x,y) \in {\mathcal X}\times{\mathbb R}:~ f(x) > y \right\}$$ is the subgraph of a function $f:\mathcal X\to\mathbb R.$
(2) $\mathcal G$ is a Vapnik-Chervonenkis subgraph class if $V(\mathcal G) < \infty,$ where $$V(\mathcal G) := V\left( \{\text{sub.}g:~ g\in\mathcal G\} \right)$$ is the VC index of $\mathcal G.$
As an example, consider again a class of functions with finite basis:
\[\mathcal G = \{\theta_1\psi_1+\cdots+\theta_d\psi_d:~ \boldsymbol\theta\in\mathbb R^d\}, \\ \psi_1,\cdots,\psi_d \text{ are fixed functions.} \tag{ex. 3.7.4d}\]Such $\mathcal G$ is a VC subgraph class with $V(\mathcal G) \le d+2$ (follows directly from the third example of VC class).
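One way to see this bound (the map $T$ below is notation introduced here, not from the text): writing $\Psi=(\psi_1,\cdots,\psi_d)^\intercal,$ the subgraph of $g=\boldsymbol\theta^\intercal\Psi$ is the preimage of a half-space in $\mathbb R^{d+1}$ under $T(x,y):=(\Psi(x),y)$:

\[\text{sub.}\left(\boldsymbol\theta^\intercal\Psi\right) = \left\{(x,y):~ \boldsymbol\theta^\intercal\Psi(x) - y > 0\right\} = T^{-1}\left(\left\{(u,y)\in\mathbb R^{d+1}:~ \boldsymbol\theta^\intercal u - y > 0\right\}\right).\]

Taking preimages under a fixed map cannot increase the number of subsets picked out, so the subgraph class shatters a set only if the half-spaces shatter its image; hence $V(\mathcal G)$ is at most the VC index of half-spaces through the origin in $\mathbb R^{d+1},$ which is $d+2.$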
ULLN of VC subgraph class
It is natural to relate VC theory to entropy: one may speculate that a VC subgraph class has well-controlled entropy. The next lemma confirms this.
Here, $\text{p.m.}$ stands for “probability measure”. The lemma asserts that there is a constant $C$ such that, for every probability measure $Q,$ every $0<\delta\le1$ and $p\ge1,$ $$ N_{p,Q}\left(\delta\cdot\|G\|_{p,Q}\right) \le C\frac{V(\mathcal G)\cdot(16e)^{V(\mathcal G)}}{\delta^{p(V(\mathcal G)-1)}}, $$ where $G$ is an envelope of $\mathcal G.$ Notice that the constant $C$ exists uniformly over all probability measures $Q.$ Through the vanishing random entropy condition, this, together with the envelope condition, implies the ULLN. Precisely, if

(a) $\mathcal G$ is a VC subgraph class,

(b) $G=\sup_{g\in\mathcal G}|g| \in L^1(P),$

then $\mathcal G$ follows the ULLN: taking $p=1$ and $Q=P_n$ in the lemma, $$ \frac1n H_{1,P_n}(\delta\cdot\|G\|_{1,P_n}) \le \frac1n \log\left( C\frac{V(\mathcal G)\cdot(16e)^{V(\mathcal G)}} {\delta^{V(\mathcal G)-1}} \right) \to 0. $$
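As a classical special case, take $\mathcal G = \{\mathbf 1_{(-\infty,t]}:~ t\in\mathbb R\}$: it is a VC subgraph class with envelope $G\equiv 1,$ and the ULLN reduces to the Glivenko-Cantelli theorem, $\sup_t|P_n(-\infty,t] - P(-\infty,t]|\to 0.$ A minimal Monte Carlo sketch with standard normal data (helper names are mine):

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_deviation(sample):
    """sup_t |F_n(t) - F(t)|, evaluated at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    dev = 0.0
    for i, x in enumerate(xs, start=1):
        F = normal_cdf(x)
        dev = max(dev, abs(i / n - F), abs((i - 1) / n - F))
    return dev

random.seed(0)
for n in (100, 10000):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    print(n, sup_deviation(sample))  # the supremum shrinks as n grows
```

The supremum over all $t$ is attained at the sample points, which is why checking the order statistics suffices.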
As an example, consider again example 3.7.4d. Because the envelope of the full class $\mathcal G$ is not integrable, for a probability measure $Q$ and $R>0,$ define
\[\mathcal G_{R} = \{g\in\mathcal G:~ \|g\|_{2,Q} \le R\}.\]Suppose in addition that the basis is orthonormal in $L^2(Q),$ i.e. $\Sigma_Q = \int \Psi\Psi^\intercal\, dQ = \mathbf I_d,$ where $\Psi=(\psi_1,\cdots,\psi_d)^\intercal$ and $\mathbf I_d \in \mathbb R^{d\times d}$ is the identity matrix. Then we can derive two entropy bounds:
- \[H_{2,Q}(\delta, \mathcal G_R) \le A_1d\log\left( \frac R\delta \right)\]
- \[H_{2,Q}(\delta, \mathcal G_R) \le A_2d\log\left( \frac {dR}\delta \right)\] (from corollary 3.12)

The former is better since the logarithm does not depend on the dimension.
To apply the VC bound, compute the envelope of $\mathcal G_R.$ By orthonormality, $\|g\|_{2,Q} = \|\boldsymbol\theta\|_2,$ so $\mathcal G_R$ corresponds to $\{\boldsymbol\theta:~\|\boldsymbol\theta\|_2\le R\}.$ By the Cauchy-Schwarz inequality, $$ G_R(x) = \sup_{g\in\mathcal G_R}|g(x)| = \sup_{\|\boldsymbol\theta\|_2\le R}\left|\boldsymbol\theta^\intercal\Psi(x)\right| = \|\Psi(x)\|_2 \cdot R. $$ Hence $$ \|G_R\|_{2,Q} = \sqrt{\text{tr}(\mathbf I_d)}\, R = \sqrt d\, R. $$ Now apply theorem 3.11.
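This envelope-norm computation can be sanity-checked by Monte Carlo. Assume for illustration that $Q$ is the uniform distribution on $[0,1]$ and the basis is the orthonormal cosine system $\psi_j(x)=\sqrt2\cos(\pi jx)$ (these choices are mine; any orthonormal basis works):

```python
import math
import random

def psi(x, d):
    """Orthonormal cosine basis on [0, 1] under the uniform measure."""
    return [math.sqrt(2.0) * math.cos(math.pi * j * x) for j in range(1, d + 1)]

def envelope_norm_mc(d, R, n=200_000, seed=1):
    """Monte Carlo estimate of the L2(Q)-norm of the envelope R * ||Psi(x)||_2."""
    random.seed(seed)
    total = sum(sum(v * v for v in psi(random.random(), d)) for _ in range(n))
    return R * math.sqrt(total / n)

d, R = 5, 2.0
print(envelope_norm_mc(d, R))  # ≈ sqrt(d) * R = sqrt(5) * 2 ≈ 4.47
```

The estimate concentrates tightly around $\sqrt d\, R$ because each $\psi_j^2$ integrates to $1$ under $Q.$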
References
- van de Geer, S. 2000. *Empirical Processes in M-Estimation*. Cambridge University Press.
- Theory of Statistics II (Fall, 2020) @ Seoul National University, Republic of Korea (instructor: Prof. Jaeyong Lee).