[Nonparametric] 4. Multivariate Kernel Density Estimation
Until now, I only covered univariate cases. From now on, our focus is on multivariate cases where samples $X_1,\cdots,X_n\in\mathbb{R}^d$ are independently drawn from the density $p:\mathbb{R}^d\to[0,\infty)$ with respect to the Lebesgue measure.
 Kernel and bandwidth settings
 Error analysis
 Optimal bandwidth
 Bandwith selection
 Curse of dimensionality
Kernel and bandwidth settings
Kernel function
We need a $d$variate kernel function $\kappa$ for this. There are several possible ways to define it.

Product kernel:
\[\kappa(x) := \prod_{j=1}^d K(x_j),\]where $K:\mathbb{R}\to\mathbb{R}$ is a univariate function with $\int K = 1.$

Radially symmetric kernel:
\[\kappa(x) := K(x^\intercal x),\]where $K:\mathbb{R}\to\mathbb{R}$ is a univariate function with $\int K(x^\intercal x)dx = 1.$
We will focus on the simple case of using product kernel.
Bandwidth
Now the bandwidth is also multivariate. Three choices of bandwidth are available.

Scalar bandwidth:
Use the same $h\in\mathbb{R}$ for all $d$ coordinates. i.e. \(\hat p(x) = \frac{1}{nh^d} \sum_{i=1}^n \kappa\left(\frac{xX_i}{h}\right).\)

Vector bandwidth (coordinatewise bandwidth):
If the density has different smoothness along different coordinates, then setting a scalar bandwidth might not provide good result. To account for the problem, vector bandwidth use $h_j\in\mathbb{R}$ for $j$th coordinates, $j=1,\cdots,d.$
\[\hat p(x) = \frac{1}{n\prod_{j=1}^d h_j} \sum_{i=1}^n \kappa\left(\frac{x_1X_{i1}}{h_1}, \cdots, \frac{x_dX_{id}}{h_d} \right).\] 
Bandwidth matrix:
In order to further account for possible rotation of the density contour, we can use a bandwidth matrix $H \in \mathbb{R}^{d\times d}$ so that our estimate be
\[\hat p(x) = \frac{1}{nH} \sum_{i=1}^n \kappa\left( H^{1} (xX_i) \right).\]
Throughout this post, we will only focus on multivariate KDE cases with product kernel function and coordinatewise bandwidth.
Error analysis
(A4): $u^3 K(u) \to 0$ as $u\to\infty,$ $\int u^2 K(u) du < \infty,$ $\int K = 1,$ $K$ is symmetric.
Then, $$ E\hat p(x)  p(x) = \frac{1}{2}\mu_2(K) \sum_{j=1}^d h_j^2 \frac{\partial^2}{\partial x_j^2}p(x) + o\left( \sum_{j=1}^d h_j \right). $$
(A3’) is a multivariate version of (A3) that was discussed earlier, and (A4) is the same as before.
$$ \begin{aligned} E\hat p(x) &= \kappa_h * p(x) \\ &= \int \frac{1}{h_1}K\left( \frac{x_1u_1}{h_1} \right) \cdots \frac{1}{h_1}K\left( \frac{x_du_d}{h_d} \right) p(u_1,\cdots,u_d) d\mathbf{u} \\ &= \int \prod_{j=1}^d K(w_j) p(x_1w_1h_1,\cdots,x_dw_dh_d)d\mathbf{w} \\ &= p(x) + \int \prod_{j=1}^d K(w_j) \sum_{j=1}^d\left[ \frac{\partial}{\partial x_j} p(x) \cdot (h_jw_j) \right] d\mathbf{w} + R\big(p(x)\big) + o\left( \sum_{j=1}^d h_j^2 \right) \\ &= p(x) + R\big(p(x)\big) + o\left( \sum_{j=1}^d h_j^2 \right) \end{aligned} \tag{1} $$ The second equality is by substitution $w_j=\frac{x_ju_j}{h_j},$ and the third follows from the first order Taylor expansion. Here, $R\big(p(x)\big)$ can be organized as follows, thus we get the desired result. $$ \begin{aligned} R\big(p(x)\big) &= \frac{1}{2} \sum_{j=1}^d \sum_{k=1}^d \left( \frac{\partial^2}{\partial x_j \partial x_k} p(x) h_jh_k\int w_jw_k\prod_{j=1}^d K(w_j) d\mathbf{w} \right) \\ &= \frac{1}{2} \sum_{j=1}^d \left( \frac{\partial^2}{\partial x_j^2 p(x)} h_j^2 \int w_j^2 K(w_j) dw_j \right) \\ &= \frac{1}{2} \mu_2(K) \sum_{j=1}^d h_j^2 \frac{\partial^2}{\partial x_j^2} p(x). \end{aligned} $$
(A5): $p$ is continuous at $x.$
(A6): $uK(u) \to 0$ as $u\to\infty,$ $\int K^2 <\infty.$
Then, $$ \text{Var}(\hat p(x)) = \frac{1}{n \prod_{j=1}^d h_j} \mu_0(K^2) p(x) + o\left(\frac{1}{n \prod_{j=1}^d h_j}\right). $$
Note that $\text{Var}(\hat p(x)) = \mathcal{O}\left( \frac{1}{n\prod_{j=1}^d h_j} \right).$ What it says, intuitively, is that the variance of multivariate KDE is reciprocal to the number of data inside the region around the point $x$ that the kernel function looks for (I would like to call it the receptive field^{1} of the kernel) when it estimates the density at $x.$
By similar arguments from univariate cases, we can derive the mean integrated squared error.
$K$: $\int u^2 K(u) du <\infty,$ $\int K = 1,$ $K$ is symmetric, $\int K^2 < \infty.$
Then, $$ \begin{aligned} \text{MISE}(\mathbf{h}) =~& \frac{1}{4} \mu_2(K)^2 \int \left( \sum_{j=1}^d h_j^2 \frac{\partial^2}{\partial x_j^2} p(x) \right)^2 dx \\ &+ \frac{1}{n \prod_{j=1}^d h_j} \mu_0(K^2) p(x) \\ &+ o\left( \sum_{j=1}^d h_j^2 + \frac{1}{n \prod_{j=1}^d h_j}\right). \end{aligned} $$
The optimal $h_\text{opt}$ is the one that makes the first and second terms of MISE to have the same order. Letting
\[n^{1} h_j^{d} \sim h_j^4,~ j=1,\cdots,d,\]yields
\[h_{j, \text{opt}} \sim n^{1/(d+4)}. \tag{2}\]Optimal bandwidth
Coordinatewise scaling
One method for finding optimal bandwidth is to use the result (2) and assume that
\[h_j = c_j \cdot n^{1/(d+4)}.\]That is, each coordinatewise bandwidth $h_j$ is a scalar multiple of another. As before, letting the first and second term of MISE to be the same, we get the system of $d$ equations:
\[\int \left( \sum_{k=1}^d c_k^2 \frac{\partial^2}{\partial x_k^2} p(x) \right) \left( c_j^2 \frac{\partial^2}{\partial x_j^2} p(x) \right) dx = \frac{\mu_0(K^2)}{\mu_2(K)} \left(\prod_{j=1}^d c_j\right)^{1},~ 1\le j\le d. \tag{3}\]Solve the equation and let $h_{j, \text{opt}} = c_{j, \text{opt}}\cdot n^{1/(d+4)}.$
Global scaling
However, the equation (3) cannot be explicitly solved. To get the idea of the constant $c,$ we fallback to the scalar bandwidth
\[h_\text{opt} = c_\text{opt} \cdot n^{1/(d+4)}\]where $c_\text{opt}$ is the solution of the equation (3) with $c=c_j$ for all $1\le j\le d.$ This time we can explicitly get the solution
\[c_\text{opt} = \left( \frac{d \cdot \mu_0(K^2)}{\mu_2(K)^2 \int \left( \sum_{j=1}^d \frac{\partial^2}{\partial x_j^2} p(x) \right)^2 dx} \right)^{1/(d+4)}.\]Bandwith selection
As in the univariate cases, we need to select $\hat h$ that best acts the optimal $h.$ Recall that the LSCV selector is defined as
\[\text{LSCV}(\mathbf{h}) = \int \hat p(x)^2 dx  \frac{2}{n} \sum_{i=1}^n \hat p_{i}(X_i).\]However, fulldimensional minimization of LSCV is intractable. We use coordinatewise approach to avoid the problem.
Bandwidth shrinkage
Bandwidth shrinkage scheme uses a minimizer of marginal $p_j(x_j)$ to infer $\hat h_j.$

Minimize marginal.
For each $j,$ let $\hat h_j^\text{mg} = \argmin_{h>0} \text{LSCV}_j(h_j),$ where $\text{LSCV}_j = \int \hat p_j(x_j)^2 dx_j  \frac{2}{n} \hat p_{i,j}(X_{ij}).$

Grid search a global scaler.
Let $\hat \alpha = \argmin_{\alpha \in \mathcal{G}} \text{LSCV}(\alpha \hat h_1^\text{mg}, \cdots, \alpha \hat h_d^\text{mg}),$ where $\mathcal{G} \subset \mathbb{R}$ is a finite grid.

Let $\hat h_j = \hat \alpha \cdot \hat h_j^\text{mg}$ be our bandwidth.
Coordinatewise bandwidth selection
Coordinatewise bandwidth selection (CBS) is simply a coordinate descent algorithm for LSCV. Instead of minimizing marginals, CBS minimizes conditional joints.

Initialize.
Choose $\left( \hat h_1^{(0)}, \cdots, \hat h_d^{(0)} \right)$ from a finite grid $\mathcal{G}_1 \times \cdots \times \mathcal{G}_d \subset \mathbb{R}^d.$

Coordinate descent until convergence.
\[\hat h_1^{(t+1)} = \argmin_{g_1 \in \mathcal{G}_1} \text{LSCV}\left( g_1, \hat h_2^{(t)}, \cdots, \hat h_d^{(t)} \right), \\ \hat h_2^{(t+1)} = \argmin_{g_2 \in \mathcal{G}_2} \text{LSCV}\left( \hat h_1^{(t+1)}, g_2, \cdots, \hat h_d^{(t)} \right), \\ \vdots \\ \hat h_d^{(t+1)} = \argmin_{g_d \in \mathcal{G}_d} \text{LSCV}\left( \hat h_1^{(t+1)}, \hat h_2^{(t+1)}, \cdots, g_d \right).\]
Curse of dimensionality
The optimal MISE under the assumption $h_j = c_j \cdot n^{1/(d+4)}$ is
\[\inf_{\mathbf{h}>0} \text{MISE}(\mathbf{h}) \sim n^{4/(d+4)}.\]As $d$ increases, the error deteriorate fast. This is because the receptive field $r_\kappa(x)$^{1} of $\kappa$ around $x$ is defined as
\[r_\kappa = \prod_{j=1}^d \left[ x_j  c_jh_j,~ x_j + c_jh_j \right],\]the volume of it is proportional to $\prod_{j=1}^d h_j,$ which goes to zero exponentially fast as $d \to \infty.$ Since using the the concept of receptive field as in univariate cases are useless in highdimensional setting, we shoud think of other methods.
Multiplicative model
Assume that the density is multiplicative.
\[\hat p(x) = \prod_{j=1}^d f_j(x_j),~ x \in S \subset \mathbb{R}^d.\]This can be viewed as assuming that $p$ is a product of marginals. If $S$ is rectangular, the problem becomes solving $d$ different univariate KDE problems.
Projection pursuit density estimation
PPDE estimates the density by projecting the data into lowerdimensional space, and feeding it to a multiplicativelike model. Friedman, Stuetzle and Schroeder (1984) described that “it attempts to overcome the curse of dimensionality by extending the classical univariate density estimation methods to higher dimensions in a manner that involves only univariate estimation.”
PPDE is defined as
\[\hat p_{d_1}(x) = \hat p_0(x) \prod_{j=1}^{d_1} f_j(\beta_j^\intercal x),\]where $\hat p_0$ is a known initial guess, $\beta_j \in \mathbb{R}^{d_1},$ $d_1 \le d$ and $f_j$ is a univariate function. Again quoting the authors, “PPDE is seen to approximate the multivariate density by an initially proposed density $\hat p_0,$ multiplied (augmented) by a product of univariate functions $f_m$ of linear combinations $\beta_j\cdot x$ of the coordinates.” $\hat p_0$ should be prespecified by the user, and PPDE’s goal is to choose the projections $\beta_j$’s and functions $f_j$’s. The model can be reorganized into recursive form:
\[\hat p_{d_1} = \hat p_{d_1  1} \cdot f_{d_1}(\beta_{d_1}^\intercal x).\]Thus PPDE can be seen as a process of adjusting the initial guess $\hat p_0$ with information from each coordinate $\beta_j^\intercal x$ using the augmenting function $f_j.$
References
 Nonparametric Function Estimation (Spring, 2021) @ Seoul National University, Republic of Korea (instructor: Prof. Byeong U. Park).
 Friedman, Stuetzle and Schroeder. 1984. Projection Pursuit Density Estimation. Journal of the American Statistical Association. 79: 387. pp599608.