# 4.1. Conditional expectation


In this chapter we study the convergence of sequences of dependent random variables. Specifically, I will cover the theory of martingales. This first section is about conditional expectation, which is essential for defining martingales.

## Definition

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{F}_0 \subset \mathcal{F}$ be a sub $\sigma$-algebra. For a random variable $X \in \mathcal{F}$ with $E|X|<\infty,$ we say $Y$ is a version of $E(X|\mathcal{F}_0),$ the conditional expectation of $X$ given $\mathcal{F}_0,$ if (i) $Y \in \mathcal{F}_0$ and (ii) $\int_A X dP = \int_A Y dP$ for all $A \in \mathcal{F}_0.$

The term “version” reflects that any two such $Y$ are almost surely equal. So in the following sections, I will just call such $Y$ the conditional expectation instead of a version.

### Non-negative random variables

We need to know whether such a $Y$ exists and, if it does, whether it is unique (in the almost sure sense). For a non-negative $X,$ it can be constructed as a Radon-Nikodym derivative.

For measures $\mu,\nu$ on a measurable space $(\Omega,\mathcal{F}),$ we say $\nu$ is absolutely continuous with respect to $\mu$ and write $\nu \ll \mu$ if $\mu(A)=0$ implies $\nu(A)=0$ for all $A\in\mathcal{F}.$

The Radon-Nikodym theorem states the following. Let $(\Omega,\mathcal{F})$ be a measurable space and $\mu,\nu$ be $\sigma$-finite measures. If $\nu \ll \mu,$ then there exists $f=\frac{d\nu}{d\mu}\in\mathcal{F}$ such that $f \ge 0$ almost everywhere and $\nu(A) = \int_A{f}d\mu$ for all $A\in\mathcal{F}.$ The function $f=\frac{d\nu}{d\mu}$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu.$

Let $Q(A)=\int_A{X}dP$ for all $A \in \mathcal{F}_0.$ Then $Q$ is a $\sigma$-finite measure on $(\Omega,\mathcal{F}_0)$ such that $Q \ll P,$ where $P$ is restricted to $\mathcal{F}_0.$ Thus by the Radon-Nikodym theorem, there exists $\frac{dQ}{dP} \in \mathcal{F}_0$ such that $\int_A X dP = \int_A \frac{dQ}{dP}dP$ for all $A\in\mathcal{F}_0.$ By definition, $\frac{dQ}{dP}$ satisfies the conditions for being a conditional expectation of $X$ given $\mathcal{F}_0.$

Notice that for a non-negative random variable, the conditional expectation exists even if the variable is not integrable.

### General case

For a general $X,$ let $Y^+, Y^-$ be conditional expectations of $X^+,X^-$ respectively, where $X = X^+ - X^-.$ Let $Y = Y^+ - Y^-.$ Then clearly $Y \in \mathcal{F}_0,$ and for any $A \in \mathcal{F}_0,$ \begin{aligned} \int_A{X}dP &= \int_A{X^+}dP - \int_A{X^-}dP \\ &= \int_A{Y^+}dP - \int_A{Y^-}dP = \int_A{Y}dP. \end{aligned} Hence $Y$ is a version of $E(X|\mathcal{F}_0).$

### Uniqueness

Suppose $Y,Y'$ are both versions of $E(X|\mathcal{F}_0).$ Then $\int_A (Y-Y') dP = 0$ for all $A \in \mathcal{F}_0.$ Let $A_1 = \{Y-Y' \ge 0\}$ and $A_2 = \{Y-Y' \le 0\},$ so that $A_1,A_2 \in \mathcal{F}_0.$ Since $Y-Y'$ does not change sign on either set, $$\int_{A_1} (Y-Y') dP = 0 \implies Y-Y'=0 \text{ a.s. on } A_1,\\ \int_{A_2} (Y-Y') dP = 0 \implies Y-Y'=0 \text{ a.s. on } A_2.$$ Thus $Y=Y'$ almost surely.

Not only do we get $Y=Y' \text{ a.s.},$ but the same argument shows that any integrable $X_1, X_2 \in \mathcal{F}$ that satisfy $\int_A X_1 dP = \int_A X_2 dP$ for all $A \in \mathcal{F}$ must also satisfy $X_1=X_2 \text { a.s.}$

## Examples and insight

Think of $\mathcal{F}_0 \subset \mathcal{F}$ as the information we have at our disposal. For each $A \in \mathcal{F}_0,$ we know whether or not the event $A$ occurred. In this sense, $E(X|\mathcal{F}_0)$ is our best guess of $X$ given the information we have.

Let $X$ be a random variable such that $EX^2 < \infty.$ Let $\mathcal{C} = \{Y\in\mathcal{F}_0:~ EY^2 < \infty\} \subset L^2.$ Then $$E[X-E(X|\mathcal{F}_0)]^2 = \inf_{Y\in\mathcal{C}}E(X-Y)^2.$$

The proof requires a property yet to be mentioned, so I will leave it until the end of the section. The following examples help build intuition for conditional expectations. The proofs are straightforward, so I omit them.

$$X \in \mathcal{F}_0 \implies E(X|\mathcal{F}_0) = X \text{ a.s.}$$
$$X \perp \mathcal{F}_0 \implies E(X|\mathcal{F}_0) = EX \text{ a.s.}$$

Here $X \perp \mathcal{F}_0$ means $P((X\in B)\cap A) = P(X\in B)P(A),~ \forall B \in \mathcal{B}(\mathbb{R}), A\in\mathcal{F}_0.$
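The two facts above can be checked exactly on a small finite space. The following is only a sketch under assumed choices: two independent fair dice, with $\mathcal{F}_0$ generated by the first die. The names `cond_exp`, `first` and `second` are illustrative, and the conditional expectation is computed as a block average over the level sets of the first die.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: two independent fair dice; outcomes are pairs (d1, d2).
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

# F_0 = sigma(first die): its atoms are the level sets {d1 = k}.
partition = [{w for w in omega if w[0] == k} for k in range(1, 7)]

def cond_exp(f):
    """E(f | F_0) as a dict outcome -> value, computed as block averages."""
    out = {}
    for block in partition:
        p_block = sum(P[w] for w in block)
        avg = sum(f(w) * P[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

first = lambda w: w[0]    # F_0-measurable
second = lambda w: w[1]   # independent of F_0

E_first = cond_exp(first)
E_second = cond_exp(second)
mean_second = sum(second(w) * P[w] for w in omega)   # plain expectation of the second die

measurable_case = all(E_first[w] == first(w) for w in omega)
independent_case = all(E_second[w] == mean_second for w in omega)
print(measurable_case, independent_case)
```

Using `Fraction` keeps every comparison exact, so the equalities are checked without floating-point tolerance.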

As an extension of the undergraduate definition, we can define conditional probability.

(i) For $(\Omega, \mathcal{F}, P),$ suppose $\Omega = \cup_{i=1}^\infty \Omega_i,$ where the $\Omega_i$'s are disjoint and $P(\Omega_i)>0$ for all $i.$ Let $\mathcal{F}_0 = \sigma(\Omega_1,\Omega_2,\cdots),$ then $$E(X|\mathcal{F}_0) = \sum_{i=1}^\infty \frac{\int_{\Omega_i} X dP}{P(\Omega_i)}\mathbf{1}_{\Omega_i},$$ i.e. $$E(X|\mathcal{F}_0) = \frac{\int_{\Omega_i} X dP}{P(\Omega_i)} \text{ on } \Omega_i.$$
(ii) $$P(A|\mathcal{F}_0) := E(\mathbf{1}_A|\mathcal{F}_0),\\ P(A|B) := \frac{P(A\cap B)}{P(B)}.$$
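As a minimal sketch of the partition formula in (i), assume a fair die, the random variable $X(\omega)=\omega^2,$ and a three-block partition chosen only for illustration. The code builds $E(X|\mathcal{F}_0)$ block by block and then checks the defining property $\int_A X dP = \int_A Y dP$ on every $A \in \mathcal{F}_0,$ which here means every union of blocks.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite example: a fair die, Omega = {1,...,6}, P uniform.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

# Partition generating F_0 (an illustrative choice).
partition = [{1, 2}, {3, 4, 5}, {6}]

def X(w):
    return w * w  # the random variable X(w) = w^2

def cond_exp(X, partition, P):
    """Return E(X | F_0) as a dict outcome -> value, using the partition formula."""
    Y = {}
    for block in partition:
        p_block = sum(P[w] for w in block)                # P(Omega_i) > 0
        avg = sum(X(w) * P[w] for w in block) / p_block   # int_{Omega_i} X dP / P(Omega_i)
        for w in block:
            Y[w] = avg                                    # constant on Omega_i
    return Y

def integral(f, A):
    """int_A f dP on the finite space."""
    return sum(f(w) * P[w] for w in A)

Y = cond_exp(X, partition, P)

# Every A in F_0 is a union of blocks; check the defining property on each one.
for r in range(len(partition) + 1):
    for blocks in combinations(partition, r):
        A = set().union(*blocks)
        assert integral(X, A) == integral(lambda w: Y[w], A)

print(Y)
```

On each block the value of `Y` is just the average of $X$ weighted by $P,$ which is the undergraduate conditional mean $E(X|\Omega_i).$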

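(ii) follows naturally from (i): taking $\mathcal{F}_0 = \sigma(B)$ with $0<P(B)<1$ and $X=\mathbf{1}_A$ in (i) gives $P(A|\mathcal{F}_0) = P(A|B)$ on $B.$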

In undergraduate statistics, we conditioned on random variables rather than on a $\sigma$-field. This can be regarded as a special case of our definition.

$$E(Y|X) := E(Y|\sigma(X)).$$

Furthermore, we get some form of “conditional density”.<sup>1</sup>

(i) Suppose $X,Y$ have a joint density $f(x,y),$ i.e. $P((X,Y)\in B) = \int_B f(x,y)dxdy$ for all $B \in \mathcal{B}(\mathbb{R}^2).$ If $E|g(X)| < \infty,$ then $$E(g(X)|Y) = h(Y), \text{ where } h(y)\int f(x,y) dx = \int g(x)f(x,y) dx.$$ (ii) Suppose $X\perp Y$ and $\varphi:\mathbb{R}^2\to\mathbb{R}$ is a Borel function such that $E|\varphi(X,Y)|<\infty.$ Then $$E(\varphi(X,Y)|X) = h(X), \text{ where } h(x) = E\varphi(x,Y).$$

(i) Since $f,g$ are Borel, $h$ can be taken to be a Borel function by Fubini's theorem, so $h(Y) \in \sigma(Y).$ Given $A \in \sigma(Y),$ let $B\in\mathcal{B}(\mathbb{R})$ so that $A = Y^{-1}(B).$ \begin{aligned} \int_A g(X) dP &= \int g(X) \mathbf{1}_B(Y) dP \\ &= \int\int g(x) \mathbf{1}_B(y) f(x,y) dxdy \\ &= \int_B\int g(x) f(x,y) dxdy \\ &= \int_B h(y) \int f(x,y) dx dy \\ &= \int h(Y) \mathbf{1}_B(Y) dP = \int_A h(Y) dP. \end{aligned} The second equality and the first equality in the last line hold because $(X,Y)$ has joint density $f.$ The third equality is from Fubini's theorem, and the fourth is the definition of $h.$
(ii) By Fubini's theorem, $h$ is a Borel function, so $h(X) \in \sigma(X).$ Let $P_X, P_Y$ denote the distributions of $X, Y$ on $\mathbb{R}.$ By independence, the distribution of $(X,Y)$ is the product measure $P_X \times P_Y.$ Given $A \in \sigma(X),$ let $B\in\mathcal{B}(\mathbb{R})$ so that $A = X^{-1}(B).$ Similar to (i), we get \begin{aligned} \int_A h(X) dP &= \int \mathbf{1}_B(x) \int \varphi(x,y) dP_Y(y) dP_X(x) \\ &= \int \varphi(X,Y)\mathbf{1}_B(X) dP \\ &= \int_A \varphi(X,Y) dP. \end{aligned} The second equality is from Fubini's theorem.
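
As a worked instance of (i), take the joint density $f(x,y) = x+y$ on $[0,1]^2$ and $g(x)=x.$ Both are assumptions chosen purely for illustration. The two integrals in the formula for $h$ are $$\int_0^1 f(x,y) dx = \frac{1}{2}+y, \qquad \int_0^1 x f(x,y) dx = \frac{1}{3}+\frac{y}{2},$$ so that $$E(X|Y) = h(Y), \text{ where } h(y) = \frac{\frac{1}{3}+\frac{y}{2}}{\frac{1}{2}+y} = \frac{2+3y}{3+6y}.$$ For instance $h(0) = \frac{2}{3},$ which is the mean of the density $2x$ on $[0,1],$ the normalized slice of $f$ at $y=0.$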

## Properties

Next I would like to cover fundamental properties of conditional expectations. These will be used throughout this chapter.

Suppose $E|X|<\infty,$ $E|Y|<\infty.$
(i) $E(aX+bY|\mathcal{F}_0) = aE(X|\mathcal{F}_0) + bE(Y|\mathcal{F}_0).$
(ii) $X\ge 0$ a.s. $\implies E(X|\mathcal{F}_0) \ge 0$ a.s.

A notable consequence of (i) and (ii) is that $|E(X|\mathcal{F}_0)| \le E(|X| \,|\, \mathcal{F}_0).$

### Inequalities

These are conditional versions of some of the inequalities that we covered earlier in chapter 1.

Suppose $E|X|<\infty,$ $X \ge 0$ and $a>0.$ $$P(X \ge a | \mathcal{F}_0) \le \frac{1}{a} E(X|\mathcal{F}_0).$$

$$P(X \ge a | \mathcal{F}_0) \le E(\mathbf{1}_{X \ge a} \frac{X}{a} | \mathcal{F}_0) \le \frac{1}{a} E(X|\mathcal{F}_0).$$

Similarly, Chebyshev’s inequality also holds for conditional expectation.

$E|X|<\infty,$ $\varphi:\mathbb{R}\to\mathbb{R}$ is convex, $E|\varphi(X)|<\infty.$ Then $E(\varphi(X)|\mathcal{F}_0) \ge \varphi(E(X|\mathcal{F}_0)).$

Note that $\varphi(x) = \sup\{ax+b:~ (a,b)\in S\}$ where $S=\{(a,b)\in\mathbb{Q}^2:~ ax+b \le \varphi(x), \forall x\}.$ So $\varphi(X) \ge aX+b$ for all $(a,b) \in S.$ Since $S$ is countable, the first inequality below holds a.s. simultaneously for all $(a,b)\in S,$ which justifies taking the supremum. \begin{aligned} E(\varphi(X)|\mathcal{F}_0) &\ge aE(X|\mathcal{F}_0)+b,~ \forall(a,b)\in S.\\ E(\varphi(X)|\mathcal{F}_0) &\ge \sup\{aE(X|\mathcal{F}_0)+b:~ (a,b)\in S\} \\ &= \varphi(E(X|\mathcal{F}_0)). \end{aligned}
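The conditional Markov and Jensen inequalities can be checked outcome by outcome on a finite space. This is only a sketch under assumed choices: a fair die, $\mathcal{F}_0$ generated by the odd and even faces, $\varphi(x)=x^2$ and threshold $a=4.$ The helper `cond_exp` is an illustrative name, not part of any library.

```python
from fractions import Fraction

# Hypothetical setup: fair die, F_0 generated by {odd faces} and {even faces}.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
partition = [{1, 3, 5}, {2, 4, 6}]

def cond_exp(f, partition, P):
    """E(f | F_0) on a finite space: average f over the block containing each outcome."""
    out = {}
    for block in partition:
        p_block = sum(P[w] for w in block)
        avg = sum(f(w) * P[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

X = lambda w: Fraction(w)
phi = lambda x: x * x          # a convex function
a = 4                          # threshold for Markov's inequality

EX = cond_exp(X, partition, P)
E_phiX = cond_exp(lambda w: phi(X(w)), partition, P)
# P(X >= a | F_0) is the conditional expectation of the indicator of {X >= a}.
P_tail = cond_exp(lambda w: 1 if X(w) >= a else 0, partition, P)

jensen_ok = all(E_phiX[w] >= phi(EX[w]) for w in omega)
markov_ok = all(P_tail[w] <= EX[w] / a for w in omega)
print(jensen_ok, markov_ok)
```

Both inequalities are statements about random variables, so the check runs over every outcome rather than comparing two numbers.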

### Convergence theorems

If $X_n \ge 0$ a.s. and $X_n \uparrow X$ a.s. with $E|X| <\infty,$ then $E(X_n|\mathcal{F}_0) \uparrow E(X|\mathcal{F}_0)$ a.s.

In fact, the condition $E|X|<\infty$ is not required since we can always define conditional expectation for non-negative random variables as the Radon-Nikodym derivative. I wrote the condition only because Durrett did so.

Note that $E(X_n|\mathcal{F}_0) \le E(X_{n+1}|\mathcal{F}_0) \le E(X|\mathcal{F}_0)$ a.s. for all $n,$ so $\lim_n E(X_n|\mathcal{F}_0)$ exists a.s. and is $\mathcal{F}_0$-measurable. Given $A\in\mathcal{F}_0,$ by using MCT twice, \begin{aligned} \int_A \lim_n E(X_n|\mathcal{F}_0) dP &= \lim_n\int_A E(X_n|\mathcal{F}_0) dP \\ &= \lim_n \int_A X_n dP \\ &= \int_A X dP \\ &= \int_A E(X|\mathcal{F}_0) dP. \end{aligned} By uniqueness, $\lim_n E(X_n|\mathcal{F}_0) = E(X|\mathcal{F}_0)$ a.s.

Suppose $X_n\to X$ a.s. and $|X_n|\le Y$ for all $n,$ where $EY < \infty.$ Then $E(X_n|\mathcal{F}_0) \to E(X|\mathcal{F}_0)$ a.s.

The proof is similar to that of conditional MCT.

Suppose $X_n\ge0$ a.s. for all $n.$ Then $E(\liminf_n X_n|\mathcal{F}_0) \le \liminf_n E(X_n | \mathcal{F}_0).$

Let $Z_n = \inf_{m\ge n} X_m,$ so that $Z_n \ge 0$ and $Z_n \uparrow \liminf_n X_n.$ Since $Z_n \le X_n,$ monotonicity gives $E(Z_n|\mathcal{F}_0) \le E(X_n|\mathcal{F}_0)$ a.s. for all $n.$ By conditional MCT, \begin{aligned} E(\liminf_n X_n | \mathcal{F}_0) &= \lim_n E(Z_n | \mathcal{F}_0) \\ &\le \liminf_n E(X_n | \mathcal{F}_0). \end{aligned}

Two immediate consequences are $B_n \subset B_{n+1} \uparrow B,~ B=\cup_{n=1}^\infty B_n \implies P(B_n|\mathcal{F}_0) \uparrow P(B|\mathcal{F}_0)$ and $C_n\in\mathcal{F} \text{ are disjoint} \implies P(\cup_{n=1}^\infty C_n|\mathcal{F}_0) = \sum_{n=1}^\infty P(C_n|\mathcal{F}_0).$

### Smoothing property

(i) $X\in\mathcal{F}_0,$ $E|Y|<\infty,$ $E|XY|<\infty.$ Then $E(XY|\mathcal{F}_0) = XE(Y|\mathcal{F}_0).$
(ii) $\mathcal{F}_1\subset\mathcal{F}_2$ are sub $\sigma$-fields. $E|X|<\infty.$ Then \begin{aligned} &E[E(X|\mathcal{F}_1)|\mathcal{F}_2] = E(X|\mathcal{F}_1) \\ \text{and } &E[E(X|\mathcal{F}_2)|\mathcal{F}_1] = E(X|\mathcal{F}_1). \end{aligned}

(i) is clear by using the standard machine. (ii) is also clear by the definition of (nested) conditional expectations.
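Both smoothing properties can be verified exactly on a finite space. The sketch below assumes a fair die and two nested partitions chosen only for illustration, with `part2` refining `part1` so that $\mathcal{F}_1 \subset \mathcal{F}_2.$ The helper `cond_exp` and the variable `Z` are hypothetical names.

```python
from fractions import Fraction

# Hypothetical setup: fair die with nested partitions, F_1 coarser than F_2.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
part1 = [{1, 2, 3}, {4, 5, 6}]          # generates F_1
part2 = [{1}, {2, 3}, {4, 5}, {6}]      # generates F_2, refines part1

def cond_exp(f, partition):
    """E(f | sigma(partition)) as a dict outcome -> value (block averages)."""
    out = {}
    for block in partition:
        p_block = sum(P[w] for w in block)
        avg = sum(f(w) * P[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

X = lambda w: w * w
E1 = cond_exp(X, part1)
E2 = cond_exp(X, part2)

# (ii) Both orders of nested conditioning give E(X | F_1).
tower_a = cond_exp(lambda w: E1[w], part2) == E1
tower_b = cond_exp(lambda w: E2[w], part1) == E1

# (i) Z is F_1-measurable (constant on each block of part1), so it factors out.
Z = lambda w: 10 if w <= 3 else -1
pull_out = cond_exp(lambda w: Z(w) * X(w), part1) == {w: Z(w) * E1[w] for w in omega}
print(tower_a, tower_b, pull_out)
```

In both orders of (ii), the coarser $\sigma$-field determines the result.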

To finish the section, let me prove the $L^2$ minimization result stated under “Examples and insight”.

$\DeclareMathOperator*{\argmin}{arg\,min}$
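For any $Y \in \mathcal{C},$ \begin{aligned} E(X-Y)^2 &= E[X - E(X|\mathcal{F}_0) + E(X|\mathcal{F}_0) - Y]^2 \\ &= E[X-E(X|\mathcal{F}_0)]^2 + E[E(X|\mathcal{F}_0)-Y]^2 \\ &\;\;\;\;+ 2\cancel{E[(E(X|\mathcal{F}_0)-Y)E((X-E(X|\mathcal{F}_0))|\mathcal{F}_0)]} \end{aligned} The cross term in the second equality is written using the smoothing property, since $E(X|\mathcal{F}_0)-Y \in \mathcal{F}_0,$ and it vanishes because $E(X-E(X|\mathcal{F}_0)|\mathcal{F}_0)=0.$ Hence $E(X-Y)^2 \ge E[X-E(X|\mathcal{F}_0)]^2,$ with equality when $Y=E(X|\mathcal{F}_0),$ which belongs to $\mathcal{C}$ by the conditional Jensen's inequality. Thus $E(X|\mathcal{F}_0) = \argmin_{Y\in\mathcal{C}}E(X-Y)^2.$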
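
The decomposition in the proof can be checked exactly on a finite space. The following is only a sketch under assumed choices: a fair die, $X(\omega)=\omega^2,$ a three-block partition generating $\mathcal{F}_0,$ and a handful of arbitrary $\mathcal{F}_0$-measurable candidates $Y.$ The names `mse`, `gap` and `candidates` are illustrative.

```python
from fractions import Fraction

# Hypothetical setup: fair die, F_0 generated by the partition below.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
partition = [{1, 2}, {3, 4, 5}, {6}]
X = lambda w: Fraction(w * w)

def block_of(w):
    """Index of the partition block containing outcome w."""
    return next(i for i, b in enumerate(partition) if w in b)

def mse(values):
    """E(X - Y)^2 where Y equals values[i] on the i-th block, so Y is F_0-measurable."""
    return sum((X(w) - values[block_of(w)]) ** 2 * P[w] for w in omega)

def gap(values):
    """E[E(X|F_0) - Y]^2 for the same Y."""
    return sum((best[block_of(w)] - values[block_of(w)]) ** 2 * P[w] for w in omega)

# E(X | F_0): block averages of X.
best = [sum(X(w) * P[w] for w in b) / sum(P[w] for w in b) for b in partition]

# A few arbitrary F_0-measurable candidates Y (illustrative values only).
candidates = [(0, 0, 0), (1, 20, 30), (Fraction(5, 2), 17, 36), (3, 16, 40)]
decomposition_ok = all(mse(c) == mse(best) + gap(c) for c in candidates)
never_better = all(mse(c) >= mse(best) for c in candidates)
print(mse(best), decomposition_ok, never_better)
```

Since `gap` is non-negative and equals zero only when $Y$ agrees with the block averages, no candidate can beat `best`, which is the content of the theorem in this finite setting.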

Acknowledgement

This post series is based on the textbook Probability: Theory and Examples, 5th edition (Durrett, 2019) and the lecture at Seoul National University, Republic of Korea (instructor: Prof. Sangyeol Lee).

1. There is a formal notion of (regular) conditional distribution. The actual conditional distribution is a function defined on a product space of $\mathcal{B}(\mathbb{R})$ and $\Omega.$