### A metric derived from KL divergence

information theory

For probability densities $p_0$ and $p_1$, the KL divergence of $p_1$ from $p_0$ is defined as $K(p_0, p_1) = -E_0\big(\log \frac{p_1(X)}{p_0(X)}\big) = E_0\big(\log \frac{p_0(X)}{p_1(X)}\big)$, where $E_0$ denotes expectation with respect to $p_0$.

KL divergence is often regarded as “a distance” between two probability distributions. However, it is not a metric in the mathematical sense: $K(p_0, p_1) \neq K(p_1, p_0)$ in general, and it does not satisfy the triangle inequality.
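A quick numerical sketch of the asymmetry, using two arbitrary two-point distributions `p0` and `p1` (chosen only for illustration):

```python
import numpy as np

def kl(p, q):
    """K(p, q) = E_p[log(p(X)/q(X))] for discrete distributions on a common support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p0 = [0.5, 0.5]   # fair coin
p1 = [0.9, 0.1]   # biased coin

# The two directions disagree, so K cannot be a metric.
d01 = kl(p0, p1)  # ≈ 0.511
d10 = kl(p1, p0)  # ≈ 0.368
```

Note that $K(p, p) = 0$ still holds, as for a genuine distance.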

### Determination of random variables and random number generation

probability

$F$ is the distribution function of some random variable $X$ iff $F$ is a (1) non-decreasing, (2) right-continuous function s.t. (3) $\lim\limits_{x\to -\infty}F(x) = 0$ and $\lim\limits_{x\to\infty}F(x) = 1$.

($\Rightarrow$) is trivial.
($\Leftarrow$) Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}\big( [0, 1] \big)$, and $P=\lambda$, where $\lambda$ is the Lebesgue measure. Define $X(\omega) := \sup\{y: F(y) < \omega\}$; the claim is that $X$ is a random variable with distribution $F$. To prove the claim, it suffices to show $$P(X\leq x) = F(x) = P(\{\omega: 0\leq \omega \leq F(x)\})$$ which follows from $\{\omega: X(\omega) \leq x\} = \{\omega: \omega \leq F(x)\}$, $\forall x$.

Given $x$, pick $\omega_0 \in \{\omega: \omega \leq F(x)\}$. Since $F(x) \geq \omega_0$, we have $x \notin \{y: F(y) < \omega_0\}$, so $X(\omega_0) \leq x$ and $\omega_0 \in \{\omega: X(\omega) \leq x\}$. $$\therefore \: \{\omega: X(\omega) \leq x\} \supset \{\omega: \omega \leq F(x)\}, \:\: \forall x$$ Conversely, given $x$, pick $\omega_0 \notin \{\omega: \omega \leq F(x)\}$, so that $\omega_0 > F(x)$. Since $F$ is right-continuous, $\exists\epsilon > 0$ such that $F(x) \leq F(x+\epsilon) < \omega_0$. Then $x+\epsilon \leq X(\omega_0)$ because $X$ is defined as a supremum, which gives $x < X(\omega_0)$ and thus $\omega_0 \notin \{\omega: X(\omega) \leq x\}$. $$\therefore \: \{\omega: X(\omega) \leq x\} \subset \{\omega: \omega \leq F(x)\}, \:\: \forall x$$ Hence the claim is true, and \begin{align} F(x) &= \lambda\big( [0, F(x)] \big) = P(\{\omega: \omega \leq F(x)\})\\ &= P(\{\omega: X(\omega) \leq x\}) = P(X \leq x) \end{align}
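The proof is constructive: it is exactly inverse-transform sampling, the standard recipe for generating random numbers with a prescribed distribution $F$ from uniform draws. A minimal sketch using Exponential(1), where the generalized inverse has the closed form $F^{-1}(u) = -\log(1-u)$ (distribution and seed chosen only for illustration):

```python
import math
import random

def sample(F_inv, rng):
    """Draw omega ~ Uniform[0, 1] and return X(omega) = F^{-1}(omega)."""
    return F_inv(rng.random())

# Exponential(1): F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u).
exp_inv = lambda u: -math.log(1.0 - u)

rng = random.Random(0)
draws = [sample(exp_inv, rng) for _ in range(100_000)]

# The empirical CDF of the draws should approximate F,
# e.g. at x = 1: F(1) = 1 - e^{-1} ≈ 0.632.
ecdf_at_1 = sum(d <= 1.0 for d in draws) / len(draws)
```

For continuous, strictly increasing $F$ the generalized inverse coincides with the ordinary inverse; the $\sup$ definition is what extends the construction to flat pieces and jumps of $F$.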

### Borel-Cantelli lemmas are partial converses of each other

probability

(1) Let $(A_n)$ be a sequence of events. If $\sum\limits_{k=1}^\infty P(A_k) < \infty$, then $P(A_n \:\: i.o.) = 0$.
(2) If the $A_n$'s are independent and $\sum\limits_{k=1}^\infty P(A_k) = \infty$, then $P(A_n \:\: i.o.) = 1$.
(1) Let $N := \sum\limits_{k=1}^\infty \mathbf{1}_{A_k}$. The given condition implies $E(N)<\infty$, and thus $P(N<\infty)=1$. Hence, almost surely, only finitely many of the $\mathbf{1}_{A_k}$'s equal $1$, which is equivalent to $P(A_n \:\: i.o.) = 0$.
(2) Using independence and $1-x \leq e^{-x}$, \begin{align*} P(\bigcap\limits_{k \geq m}{A_k}^c) &= \prod\limits_{k \geq m}\big( 1-P({A_k}) \big) \\ &\leq \prod\limits_{k \geq m}e^{-P(A_k)} = e^{-\sum\limits_{k \geq m} P(A_k)} = 0, \:\: \forall m>0 \end{align*} $\therefore P(\bigcup\limits_{k \geq m}{A_k}) = 1$ for every $m$, and $P(\limsup\limits_n{A_n}) = P(A_n \:\: i.o.) = 1$.
(Kolmogorov's 0-1 law) If $X_1, X_2, \cdots$ are independent and $A \in \mathcal{T}$, where $\mathcal{T} := \bigcap\limits_{k\geq1} \sigma(X_k, X_{k+1}, \cdots)$ is the tail $\sigma$-field, then $P(A)=0$ or $1$. (When the $A_n$'s above are independent, $\limsup_n A_n$ is a tail event of the sequence $\mathbf{1}_{A_1}, \mathbf{1}_{A_2}, \cdots$, so this law already forces $P(A_n \:\: i.o.) \in \{0, 1\}$; the second Borel-Cantelli lemma tells us which one it is.)
Let $A \in \sigma(X_1, \cdots, X_k)$ and $B \in \sigma(X_{k+1}, \cdots, X_{k+j})$ for some $j$. Because the $X_i$'s are independent, $A \perp B$. Thus $\sigma(X_1, \cdots, X_k) \perp \bigcup_j \sigma(X_{k+1}, \cdots, X_{k+j})$. Since both are $\pi$-systems, by Dynkin's $\pi$-$\lambda$ theorem, $$\sigma(X_1, \cdots, X_k) \perp \sigma(X_{k+1}, \cdots) \tag{a}$$ Now, let $A \in \sigma(X_1, \cdots, X_j)$ for some $j$ and $B \in \mathcal{T} \subset \sigma(X_{j+1}, \cdots)$. By (a) and the same $\pi$-$\lambda$ argument, $$\sigma(X_1, \cdots) \perp \mathcal{T} \tag{b}$$ By (b), since $\mathcal{T} \subset \sigma(X_1, \cdots)$, $\mathcal{T}$ is independent of itself.
$\therefore \: A \in \mathcal{T} \implies P(A) = P(A \cap A) = P(A)P(A)$, so $P(A) \in \{0, 1\}$.

Borel-Cantelli lemmas are widely used to prove almost sure convergence or the existence of limit points of random variables. E.g., by showing that $P(|X_n - X|>\epsilon)$ is summable for every $\epsilon > 0$, one obtains almost sure convergence, strengthening mere convergence in probability.
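A small simulation sketch contrasting the two lemmas (seed and cutoff chosen arbitrarily): take independent events $A_k = \{U_k < p_k\}$ with $U_k \sim U[0,1]$; the summable choice $p_k = k^{-2}$ produces only a handful of occurrences, while the divergent choice $p_k = 1/k$ keeps producing them:

```python
import random

rng = random.Random(42)
N = 100_000  # finite truncation of the infinite sequence of events

# Lemma (1): sum P(A_k) = sum k^{-2} < inf, so a.s. only finitely many A_k occur.
occ_summable = [k for k in range(1, N + 1) if rng.random() < k ** -2]

# Lemma (2): sum P(A_k) = sum 1/k = inf with independent A_k,
# so a.s. A_k occurs infinitely often -- occurrences never die out.
occ_divergent = [k for k in range(1, N + 1) if rng.random() < 1 / k]
```

In a typical run `occ_summable` contains only a few small indices, while `occ_divergent` keeps accumulating hits (slowly, at a logarithmic rate) out to the cutoff.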

### Limitation of $R^2$

linear model

For a linear regression $y_i = \beta_0 + \sum\limits_{j=1}^{p} \beta_j x_{ij} + \epsilon_i$, $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, $1 \leq i \leq n$, suppose the $x_{ij}$'s have no relationship with the $y_i$'s, i.e. the true model is $y_i = \beta_0 + \epsilon_i$, whose fitted value is $\bar{y}$. Under this assumption, \begin{align*} \frac{\mathrm{SSE}}{\sigma^2} &\sim \chi^2(n-p-1) \overset{D}{=} \Gamma(\frac{n-p-1}{2}, 2)\\ \frac{\mathrm{SSR}}{\sigma^2} &\sim \chi^2(p) \overset{D}{=} \Gamma(\frac{p}{2}, 2)\\ \mathrm{SSE} &\perp \mathrm{SSR} \end{align*} Thus $R^2 = \frac{\mathrm{SSR}}{\mathrm{SSR} + \mathrm{SSE}} \sim \mathcal{B}(\frac{p}{2}, \frac{n-p-1}{2})$, since the ratio of independent Gamma variables with a common scale is Beta-distributed, and $E(R^2) = \frac{p/2}{p/2 + (n-p-1)/2} = \frac{p}{n-1}$.

Hence the expectation of $R^2$ increases with the number of predictors, regardless of how well the model actually fits.
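A simulation sketch of this inflation (sizes $n=50$, $p=10$, seed, and replication count are arbitrary choices): regress pure noise on irrelevant Gaussian predictors and compare the average $R^2$ with $p/(n-1)$:

```python
import numpy as np

def r_squared(n, p, rng):
    """R^2 of an OLS fit of pure-noise y on p irrelevant predictors plus intercept."""
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
    y = rng.standard_normal(n)  # true model: y = beta_0 + noise, all beta_j = 0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    return 1.0 - sse / sst

rng = np.random.default_rng(0)
n, p, reps = 50, 10, 2000
mean_r2 = float(np.mean([r_squared(n, p, rng) for _ in range(reps)]))
# Theory: E(R^2) = p / (n - 1) = 10 / 49 ≈ 0.204, even though y ignores X entirely.
```

This is one motivation for adjusted $R^2$, which penalizes the dimension of the predictor space.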

### Convergence in probability and subsequences

probability

Let $(y_n)$ be a sequence in a topological space. If every subsequence $y_{n_k}$ has a further subsequence $y_{n(m_k)}$ s.t. $y_{n(m_k)} \to y$, then $y_n \to y$.
A sequence of random variables satisfies $X_n \to X$ in probability $\iff$ every subsequence $X_{n_k}$ has a further subsequence $X_{n(m_k)}$ s.t. $X_{n(m_k)} \to X$ a.s. (For $\Rightarrow$: any subsequence still converges in probability, so along it one can pick indices with $P(|X_{n(m_k)} - X| > 2^{-k}) \leq 2^{-k}$; the first Borel-Cantelli lemma then gives $X_{n(m_k)} \to X$ a.s.)
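The classical "typewriter" sequence shows why the subsequence formulation is needed: indicators of dyadic intervals sweeping across $[0,1)$ converge to $0$ in probability (under Lebesgue measure) but at no point, while the subsequence $f_{2^j}$ converges a.e. A sketch evaluated at the arbitrarily chosen point $\omega = 0.3$:

```python
def typewriter(n, w):
    """f_n = indicator of the dyadic interval [k/2^j, (k+1)/2^j),
    where n = 2^j + k with 0 <= k < 2^j."""
    j = n.bit_length() - 1
    k = n - 2 ** j
    return 1 if k / 2 ** j <= w < (k + 1) / 2 ** j else 0

w = 0.3
# Full sequence: every dyadic block j contains exactly one interval covering w,
# so f_n(w) returns to 1 infinitely often and does not converge pointwise...
hits = sum(typewriter(n, w) for n in range(1, 2 ** 12))  # one hit per j = 0..11

# ...but P(f_n = 1) = 2^{-j} -> 0 (convergence in probability), and the
# subsequence f_{2^j} = indicator of [0, 2^{-j}) is eventually 0 at w.
sub = [typewriter(2 ** j, w) for j in range(1, 12)]
```

Here the a.s.-convergent subsequence can also be read off from the Borel-Cantelli argument above, since $\sum_j P(f_{2^j} = 1) = \sum_j 2^{-j} < \infty$.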