3. Bandwidth Selection

$\newcommand{\argmin}{\mathop{\mathrm{argmin}}\limits}$ $\newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits}$

Recall that the optimal bandwidth $h_\text{opt}$ derived from MISE was given as follows:

\[h_\text{opt} = \left( \frac{\int K^2}{\mu_2(K)^2 \int p''(x)^2 dx} \right)^{1/5} n^{-1/5}.\]

However, since $p$ is unknown, we cannot use this directly. Rather, we need an appropriate estimate of $\int p''(x)^2 dx$ to get a usable bandwidth.


Normal reference

If we assume that $p=\phi_\sigma,$ that is, the true density $p$ is that of a normal distribution with mean zero and variance $\sigma^2,$ then using the derivatives of the standard normal density $\phi,$

\[\phi'(x) = -x\phi(x), \\ \phi''(x) = (x^2-1)\phi(x),\]

we get

\[\begin{aligned} \int (p'')^2 &= \int \left( \frac{1}{\sigma^3}\phi''(x/\sigma) \right)^2 dx \\ &= \frac{1}{\sigma^5} \int \phi''(y)^2 dy \\ &= \frac{1}{\sigma^5} \frac{1}{\sqrt{4\pi}} \int (y^4 - 2y^2 + 1) \phi_\frac{1}{\sqrt{2}}(y) dy \\ &= \frac{1}{\sigma^5} \frac{1}{2\sqrt{\pi}} \int (z^4/4 - z^2 + 1) \phi(z) dz \\ &= \frac{1}{\sigma^5} \frac{1}{2\sqrt{\pi}} \left( \frac{1}{4} \cdot 3 \cancel{- 1} \cancel{+ 1} \right) \\ &= \frac{1}{\sigma^5} \frac{1}{2\sqrt{\pi}} \frac{3}{4} \\ &= \frac{3}{8\sqrt{\pi}}\frac{1}{\sigma^5}. \end{aligned}\]

By plugging this in, we get our normal reference bandwidth

\[\hat h_\text{NS} = \left( \frac{8\sqrt{\pi}}{3} \frac{\int K^2}{\mu_2(K)^2} \right)^{1/5}\hat\sigma\, n^{-1/5},\]

where $\hat\sigma$ can be obtained in two different ways: (1) use the sample standard deviation, or (2) use an estimate based on the interquartile range (IQR). The motivation behind the IQR-based estimate of $\sigma$ is that

\[\begin{aligned} \hat F^{-1}(3/4) - \hat F^{-1}(1/4) &\approx \Phi_\sigma^{-1}(3/4) - \Phi_\sigma^{-1}(1/4) \\ &= \sigma \left(\Phi^{-1}(3/4) - \Phi^{-1}(1/4)\right), \end{aligned}\]

so we estimate $\sigma$ with

\[\hat\sigma = \frac{\hat F^{-1}(3/4) - \hat F^{-1}(1/4)}{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)}.\]
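As an illustration, here is a minimal sketch in Python/NumPy of the normal reference selector, assuming a Gaussian kernel (for which $\int K^2 = 1/(2\sqrt{\pi})$ and $\mu_2(K)=1$, so the constant reduces to $(4/3)^{1/5}\approx 1.06$). The function name and the NumPy routines used are my own choices, not part of the notes.

```python
import numpy as np

def normal_reference_bandwidth(x, use_iqr=False):
    """Normal reference bandwidth, assuming a Gaussian kernel.

    With int K^2 = 1/(2*sqrt(pi)) and mu_2(K) = 1, the formula reduces to
    h_NS = (4/3)^(1/5) * sigma_hat * n^(-1/5).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if use_iqr:
        # IQR-based estimate: Phi^{-1}(3/4) - Phi^{-1}(1/4) is about 1.349
        iqr = np.percentile(x, 75) - np.percentile(x, 25)
        sigma_hat = iqr / 1.349
    else:
        sigma_hat = x.std(ddof=1)
    return (4.0 / 3.0) ** 0.2 * sigma_hat * n ** (-0.2)

# Example usage
rng = np.random.default_rng(0)
sample = rng.normal(size=500)
print(normal_reference_bandwidth(sample))                 # ~1.06 * sigma_hat * n^{-1/5}
print(normal_reference_bandwidth(sample, use_iqr=True))   # IQR-based variant
```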


Oversmoothing

It is known that

\[\sup_{p \in \mathcal{F}_\sigma} \frac{\int K^2}{\mu_2(K)^2 \int (p'')^2} = \frac{243}{35} \frac{\int K^2}{\mu_2(K)^2} \sigma^5,\]

where $\mathcal{F}_\sigma$ is the collection of densities with variance $\sigma^2.$ ($\int u^2p(u)du - (\int up(u)du)^2 = \sigma^2.$) We can simply plug this upper bound into the formula for $h_\text{opt}$ to obtain a bandwidth, with $\hat\sigma$ estimated as above.

\[\hat h_\text{OS} = \left( \frac{243}{35} \frac{\int K^2}{\mu_2(K)^2} \right)^{1/5} \hat\sigma\, n^{-1/5}.\]

We call this the oversmoothed bandwidth selector, since it is guaranteed to be at least as large as $h_\text{opt}$ whenever the variance assumption holds, and the larger the bandwidth, the smoother the density estimate we get.
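A corresponding sketch of the oversmoothed selector, again assuming a Gaussian kernel (the constant then works out to roughly $1.144$); as before, the helper name is mine.

```python
import numpy as np

def oversmoothed_bandwidth(x):
    """Oversmoothed bandwidth h_OS, assuming a Gaussian kernel.

    h_OS = (243/35 * int K^2 / mu_2(K)^2)^(1/5) * sigma_hat * n^(-1/5),
    which for the Gaussian kernel is about 1.144 * sigma_hat * n^(-1/5).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma_hat = x.std(ddof=1)
    const = (243.0 / (35.0 * 2.0 * np.sqrt(np.pi))) ** 0.2
    return const * sigma_hat * n ** (-0.2)
```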


Least squares cross-validation

Recall that $h_\text{opt}$ minimizes AMISE, and hopefully MISE.

\[\begin{aligned} \text{MISE}(\hat p) &= E\int \left( \hat p(x) - p(x) \right)^2 dx \\ &= E\int \hat p(x)^2 dx - 2E\int \hat p(x)p(x)dx + E\int p(x)^2dx. \end{aligned}\]

Observe that the last term is independent of $h,$ so we focus only on the first two terms.

\[\begin{aligned} \text{MISE}(\hat p) - E\int p(x)^2dx &= E\int \hat p(x)^2 dx - 2E\int \hat p(x)p(x)dx \\ &= E\left[ \int \hat p(x)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat p_{-i}(X_i) \right], \end{aligned}\]

where

\[\hat p_{-i}(X_i) := \frac{1}{n-1} \sum_{j\ne i} K_h(X_i-X_j).\]

$$ \begin{aligned} E\hat p_{-i}(X_i) &= E\left[ E(\hat p_{-i}(X_i)|X_i) \right] \\ &= E\left[ \frac{1}{\cancel{n-1}}\cancel{(n-1)} \int K_h(X_i-u)p(u)du \right] \\ &= E\left[ (K_h * p)(X_i) \right] \\ &= \int (K_h * p)(x)\, p(x)\, dx \\ &= \int E\hat p(x)\, p(x)\, dx \\ &= E\int \hat p(x)p(x) dx. \end{aligned} $$

The least squares cross-validation (LSCV) selector chooses $\hat h_\text{LSCV}$ as a minimizer of

\[\begin{aligned} \text{LSCV}(h) ~=& \int \hat p(x)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat p_{-i}(X_i) \\ =& \frac{1}{n^2} \sum_i\sum_j K_h * K_h(X_i-X_j) \\ & - \frac{2}{n(n-1)} \sum_{i} \sum_{j\ne i} K_h(X_i - X_j), \end{aligned}\]

which is the quantity inside the expectation above, so that $E[\text{LSCV}(h)] = \text{MISE}(\hat p) - \int p(x)^2 dx.$ The second equality holds if $K$ is symmetric.


$$ \begin{aligned} \int \hat p^2 &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \int K_h(x-X_i)K_h(x-X_j) dx \\ &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \int K_h(w + X_j - X_i) K_h(w) dw \\ &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \int K_h(X_i - X_j - w) K_h(w) dw \\ &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n K_h * K_h (X_i-X_j). \end{aligned} $$ In the second equality, we substituted $w = x - X_j,$ and in the third equality, the symmetry of $K$ was used.

We define the LSCV bandwidth selector as

\[\hat h_\text{LSCV} = \argmin_{h>0} \text{LSCV}(h).\]
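Below is a minimal sketch of LSCV for a Gaussian kernel, for which $K_h * K_h = K_{\sqrt{2}h}$; the grid-search minimization and all function names are my own choices, not part of the notes.

```python
import numpy as np

def gaussian_kernel(u, h):
    """K_h(u) = phi(u / h) / h for the Gaussian kernel."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def lscv(h, x):
    """LSCV(h) = (1/n^2) sum_ij K_h*K_h(X_i - X_j)
                 - 2/(n(n-1)) sum_{i != j} K_h(X_i - X_j)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = x[:, None] - x[None, :]                    # all pairwise differences
    term1 = gaussian_kernel(diff, np.sqrt(2.0) * h).sum() / n**2
    off_diag = gaussian_kernel(diff, h)
    np.fill_diagonal(off_diag, 0.0)                   # exclude the j == i terms
    term2 = 2.0 * off_diag.sum() / (n * (n - 1))
    return term1 - term2

def lscv_bandwidth(x, grid=None):
    """Grid-search minimizer of LSCV(h)."""
    if grid is None:
        grid = np.linspace(0.05, 2.0, 200) * np.std(x, ddof=1)
    values = [lscv(h, x) for h in grid]
    return grid[int(np.argmin(values))]
```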


$k$-fold LSCV

Although LSCV gives a better bandwidth than normal referencing or oversmoothing, it is computationally heavy since it requires on the order of $n^2$ operations. To relieve the computational burden, $k$-fold LSCV was devised.

Instead of using the “leave-one-out” density estimate $\hat p_{-i},$ $k$-fold LSCV uses a “leave-many-out” estimate $\hat p_{-A_l}.$ The formal definition is as follows: let $A_1,\cdots,A_k$ be a partition of the index set $\{1,2,\cdots,n\},$ and define

\[\hat p_{-A_l}(x) := \left( \sum_{i=1}^n \mathbf{1}(i \notin A_l) \right)^{-1} \sum_{i \notin A_l} K_h(x-X_i).\]

$k$-fold LSCV is defined accordingly.

\[\text{LSCV}^\text{$k$-fold}(h) = \int \hat p(x)^2 dx - \frac{2}{k} \sum_{l=1}^k \left( \left(\sum_{i=1}^n \mathbf{1}(i \in A_l)\right )^{-1} \sum_{i \in A_l} \hat p_{-A_l}(X_i) \right).\]

Notice that a linear order ($\sim n$) of computation now suffices since $k$ is prespecified (typically $k=5$ or $10$). $E \left[ \text{LSCV}^\text{$k$-fold}(h) \right]$ is essentially the same as in standard LSCV. Finally, the $k$-fold LSCV selector is

\[\hat h_\text{LSCV}^\text{$k$-fold} = \argmin_{h>0} \text{LSCV}^\text{$k$-fold}(h).\]
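A sketch of the $k$-fold criterion under the same Gaussian-kernel assumptions, reusing the hypothetical `gaussian_kernel` helper from the LSCV sketch above; the random partition into folds is my own choice.

```python
import numpy as np

def lscv_kfold(h, x, k=5, rng=None):
    """k-fold LSCV(h): replace the leave-one-out term by a leave-one-fold-out term."""
    x = np.asarray(x, dtype=float)
    n = x.size
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(n), k)     # partition A_1, ..., A_k

    # First term: int p_hat^2 = (1/n^2) sum_ij K_h*K_h(X_i - X_j)
    diff = x[:, None] - x[None, :]
    term1 = gaussian_kernel(diff, np.sqrt(2.0) * h).sum() / n**2

    # Second term: fold averages of p_hat_{-A_l}(X_i), averaged over folds
    cv = 0.0
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False                            # indices outside A_l
        d = x[fold][:, None] - x[mask][None, :]
        p_minus = gaussian_kernel(d, h).mean(axis=1)  # p_hat_{-A_l}(X_i) for i in A_l
        cv += p_minus.mean()
    return term1 - 2.0 * cv / k
```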


Other selectors

Biased cross-validation

Biased cross-validation (BCV) is a hybrid of the normal reference-type plug-in selector and the LSCV-type cross-validation selector. Recall that the bias expansion gives

\[\text{AMISE} = \frac{1}{4}h^4 \mu_2(K)^2 \int (p'')^2 + \frac{1}{nh} \int K^2.\]

The idea of BCV is to replace $\int (p'')^2$ by

\[\int (\hat p'')^2 = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n K_h'' * K_h'' (X_i-X_j)\]

to get

\[\text{BCV}(h) = \frac{1}{4}h^4 \mu_2(K)^2 \int (\hat p'')^2 + \frac{1}{nh} \int K^2.\]

(The derivation of this double-sum formula is the same as for $\int \hat p^2$ in LSCV.) $\hat h_\text{BCV}$ is defined as a minimizer of BCV.
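A sketch of $\text{BCV}(h)$ for the Gaussian kernel, using the identity $K_h'' * K_h'' = (K_h * K_h)^{(4)} = \phi_{\sqrt{2}h}^{(4)},$ which holds for the Gaussian kernel but is not derived in the notes; the function name is mine.

```python
import numpy as np

def bcv(h, x):
    """BCV(h) = (1/4) h^4 mu_2(K)^2 int (p_hat'')^2 + int K^2 / (n h),
    assuming a Gaussian kernel (mu_2(K) = 1, int K^2 = 1/(2 sqrt(pi))).

    Uses K_h'' * K_h'' (u) = phi_s^{(4)}(u) = (z^4 - 6 z^2 + 3) phi(z) / s^5,
    with s = sqrt(2) h and z = u / s.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    s = np.sqrt(2.0) * h
    z = (x[:, None] - x[None, :]) / s
    conv = (z**4 - 6.0 * z**2 + 3.0) * np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * s**5)
    int_p2dd = conv.sum() / n**2                      # estimate of int (p_hat'')^2
    return 0.25 * h**4 * int_p2dd + 1.0 / (2.0 * np.sqrt(np.pi) * n * h)
```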


Selectors that converge faster

One can show that the LSCV and BCV selectors both converge unbearably slowly to the true $h_\text{opt}$:

\[\left( \frac{\hat h_\text{LSCV}}{h_\text{opt}} - 1 \right) n^{1/10} \overset{\text{d}}{\to} \mathcal{N}(0, \sigma^2_\text{LSCV}).\]

That is a rate of order $n^{1/10}$: to make $n^{1/10}$ grow from 1 to 2, about a thousand samples are required! As remedies for this problem, the solve-the-equation rule ($\sim n^{4/13},$ later improved to $\sim n^{5/14}$) and smoothed CV ($\sim n^{1/2}$) selectors have been proposed.

