### A metric derived from KL divergence

information theory

For probability densities $p_0$ and $p_1$, KL divergence of $p_1$ from $p_0$ is defined as $K(p_0, p_1) = -E_0(\log \frac{p_1(X)}{p_0(X)})$, where $E_0$ is an expectation with respect to $p_0$.

KL divergence is regarded as “a distance” between the two probability distributions. However, it is not a metric in mathematical sense, since $K(p_0, p_1) \neq K(p_1, p_0)$ in general.

### Determination of random variables and random number generation

probability

$F$ is a distribution of a random variable $X$, iff $F$ is (1) non decreasing, (2) right-continuous functions s.t. (3) $\lim\limits_{x\to -\infty}F(x) = 0$ and $\lim\limits_{x\to\infty}F(x) = 1$.

($\Rightarrow$) is trivial.
($\Leftarrow$) let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}\big( [0, 1] \big)$, $P=\lambda$ where $\lambda$ is a lebesgue measure. Define $X(\omega) := \sup\{y: F(y) < \omega\}$, then our claim is that X is a random variable that has $F$ as its distribution. To show the claim is true, we need to show $$P(X\leq x) = F(x) = P(\{\omega: 0\leq \omega \leq F(x)\})$$ which can be inferred by $\{\omega: X(\omega) \leq x\} = \{\omega: \omega \leq F(x)\}$, $\forall x$.

Given $x$, pick $\omega_0 \in \{\omega: \omega \leq F(x)\}$, since $F(x) \geq \omega_0$, $x \notin \{y: F(y) < \omega_0\}$. Therefore $X(\omega_0) \leq x$ and $\omega_0 \in \{\omega: X(\omega) \leq x\}$. $$$$\therefore \: \{\omega: X(\omega) \leq x\} \supset \{\omega: \omega \leq F(x)\}, \forall x$$$$ Given $x$, pick $\omega_0 \notin \{\omega: \omega \leq F(x)\}$, then $\omega_0 > F(x)$. Since $F$ is right-continuous, $\exists\epsilon > 0$ such that $F(x) \leq F(x+\epsilon) < \omega_0$. $x+\epsilon \leq X(\omega_0)$ because $X$ is defined as $\sup$, and this gives $x < X(\omega_0)$ and thus $\omega_0 \notin \{\omega: X(\omega) \leq x\}$. $$\therefore \: \{\omega: X(\omega) \leq x\} \subset \{\omega: \omega \leq F(x)\}, \forall x$$ Hence the claim is true, and \begin{align} F(x) &= \lambda\big( [0, F(x)] \big) = P(\{\omega: \omega \leq F(x)\})\\ &= P(\{\omega: X(\omega) \leq x\}) = P(X \leq x) \end{align}

### Borel-Cantelli lemmas are converses of each other

probability

(1) Let $A_n$ be a sequence of events. If $\sum\limits_{k=1}^\infty P(A_k) < \infty$, then $P(A_n \:\: i.o) = 0$.
(2) If $A_n$'s are independent, $\sum\limits_{k=1}^\infty P(A_k) = \infty$, then $P(A_n \:\: i.o) = 1$
(1) Let $N := \sum\limits_{k=1}^\infty \mathbf{1}_{A_k}$. The given condition implies $E(N)<\infty$, and thus $P(N<\infty)=1$. Hence at most finitely many $\mathbf{1}_{A_k}$'s should be $1$, and this is equivalent to $P(A_n \:\: i.o) = 0$
(2) \begin{align*} P(\bigcap\limits_{k \geq m}{A_k}^c) &= \prod\limits_{k \geq m}\big( 1-P({A_k}) \big) \\ &\leq \prod\limits_{k \geq m}e^{-P(A_k)} = e^{-\sum\limits_{k \geq m} P(A_k)} = 0, \:\: \forall m>0 \end{align*} $\therefore P(\bigcup\limits_{k \geq m}{A_k}) = 1$ and $P(\limsup\limits_n{A_n}) = P(A_n \:\: i.o.) = 1$.
If $X_1, X_2, \cdots$ are independent, $A \in \mathcal{T}$, where $\mathcal{T} := \bigcap\limits_{k\geq1} \sigma(X_k, X_{k+1}, \cdots)$, then $P(A)=0$ or $1$.
Let $A \in \sigma(X_1, \cdots, X_k)$ and $B \in \sigma(X_{k+1}, \cdots, X_{k+j})$ for some $j$. Because $X_i$'s are independent, $A \perp B$. Thus $\sigma(X_1, \cdots, X_k) \perp \bigcup_j \sigma(X_{k+1}, \cdots, X_{k+j})$. Since both are $\pi$-systems, by Dynkin's $\pi$-$\lambda$ theorem, $$\sigma(X_1, \cdots, X_k) \perp \sigma(X_{k+1}, \cdots)$$ Now, let $A \in \sigma(X_1, \cdots, X_j)$ for some $j$ and $B \in \mathcal{T} \subset \sigma(X_{j+1}, \cdots)$. By (1) and similar process, $$\sigma(X_1, \cdots) \perp \mathcal{T}$$ By (2), since $\mathcal{T} \subset \sigma(X_1, \cdots)$, $\mathcal{T}$ is independent of itself.
$\therefore A \in \mathcal{T} \to P(A) = P(A \cap A) = P(A)P(A)$.

Borel-Cantelli lemmas are widely used to prove almost sure convergence/existence of limit points of random variables. e.g. By showing that $P(|X_n - X|>\epsilon)$ is summable for any given $\epsilon > 0$, one can easily check almost sure convergence from convergence in probability.

### Limitation of $R^2$

linear model

For a linear regression $y_i = \beta_0 + \sum\limits_{j=1}^{p} \beta_j x_{ij}$, $1 \leq i \leq n$, suppose $x_{ij}$'s does not have any relationship with $y_i$'s. i.e. true model is $y_i = \beta_0 = \bar{y}$. Under this assumption, \begin{align*} \frac{\mathrm{SSE}}{\sigma^2} &\sim \chi^2(n-p-1) \overset{D}{=} \Gamma(\frac{n-p-1}{2}, 2)\\ \frac{\mathrm{SSR}}{\sigma^2} &\sim \chi^2(p) \overset{D}{=} \Gamma(\frac{p}{2}, 2)\\ \mathrm{SSE} &\perp \mathrm{SSR} \end{align*} Thus, $R^2 = \frac{\mathrm{SSR}}{\mathrm{SSR} + \mathrm{SSE}} \sim \mathcal{B}(\frac{p}{2}, \frac{n-p-1}{2})$, and $E(R^2) = \frac{p}{n-1}$.

Hence expectation of $R^2$ increases as the dimension of predictors increases, regardless of fit of the model.

### Irregularity of almost sure convergence

probability

Let $y_n$ be a sequence on topological space. If $\forall$ subsequence $y_{n_k}$, $\exists$ a further subsequence $y_{n(m_k)}$ s.t. $y_{n(m_k)} \to y$, then $y_n \to y$.
A sequence of random variables $X_n \to X$ in probability $\iff$ $\forall$ subsequence $X_{n_k}$, $\exists$ a further subsequence $X_{n(m_k)}$ s.t. $X_{n(m_k)} \to X$ a.s.

Theorem 1 and 2 combined implies that almost sure convergence does not come from topology. In fact, while convergence in probability forms convergence class, a.s. convergence does not. This shows that a.s. convergense is actually not a “convergence” concept that we generally think of.