4.1. Conditional expectation
In this chapter we study the convergence of sequences of dependent random variables. Specifically, I will cover the theory of martingales. The first subsection is about conditional expectation, which is essential for defining martingales.
Definition
The term “versions” means they are almost surely equal. So in the following sections, I will simply call such a $Y$ a conditional expectation instead of a version.
Nonnegative random variables
We need to establish that such a $Y$ exists and, if it does, that it is unique (in the almost sure sense). For a nonnegative $X,$ it can be constructed as a Radon-Nikodym derivative.
Let $Q(A)=\int_A{X}dP$ for all $A \in \mathcal{F}_0.$ Then $Q$ is a $\sigma$-finite measure such that $Q \ll P.$ Thus by the Radon-Nikodym theorem, there exists $\frac{dQ}{dP} \in \mathcal{F}_0$ such that $\int_A X dP = \int_A \frac{dQ}{dP}dP$ for all $A\in\mathcal{F}_0.$ By definition, $\frac{dQ}{dP}$ satisfies the conditions for being a conditional expectation of $X$ given $\mathcal{F}_0.$
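On a discrete toy space this construction is concrete: when $\mathcal{F}_0$ is generated by a finite partition, the Radon-Nikodym derivative $\frac{dQ}{dP}$ is just the cell-wise weighted average of $X.$ A minimal numerical sketch (the space, weights, and values are all illustrative assumptions, not from the text):

```python
import numpy as np

# Toy setup (assumed): Omega = {0,...,5} with uniform P, and F_0
# generated by the partition A1 = {0,1,2}, A2 = {3,4,5}.
p = np.full(6, 1/6)                          # P({omega})
X = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])  # nonnegative X
cells = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# dQ/dP is constant on each cell A, equal to Q(A)/P(A):
Y = np.empty_like(X)
for A in cells:
    Y[A] = (X[A] * p[A]).sum() / p[A].sum()

# Defining property: integrals of X and Y agree on every cell of F_0.
for A in cells:
    assert abs((X[A] * p[A]).sum() - (Y[A] * p[A]).sum()) < 1e-12

print(Y)  # constant 2 on {0,1,2} and 20 on {3,4,5}
```

The point of the sketch is that $Y$ is $\mathcal{F}_0$-measurable (constant on each cell) and matches the integrals of $X$ over every $A\in\mathcal{F}_0,$ exactly the two defining conditions.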
Notice that the conditional expectation of a nonnegative random variable exists even when the variable is not integrable.
General case
For a general integrable $X,$ let $Y^+, Y^-$ be conditional expectations of $X^+, X^-$ respectively. Let $Y = E(X|\mathcal{F}_0) = Y^+ - Y^-.$ Then clearly $Y \in \mathcal{F}_0$ and for a given $A \in \mathcal{F}_0,$
\(\begin{aligned}
\int_A{X}dP
&= \int_A{X^+}dP - \int_A{X^-}dP \\
&= \int_A{Y^+}dP - \int_A{Y^-}dP = \int_A{Y}dP.
\end{aligned}\)
Uniqueness
Suppose $Y,Y'$ are both versions of $E(X|\mathcal{F}_0).$ Then $\int_A (Y-Y') dP = 0$ for all $A \in \mathcal{F}_0.$ Let $A_1 = \{Y-Y' \ge 0\}$ and $A_2 = \{Y-Y' \le 0\};$ then $A_1,A_2 \in \mathcal{F}_0.$ \(\int_{A_1} (Y-Y') dP = 0 \implies Y-Y'=0 \text{ a.s. on } A_1.\\ \int_{A_2} (Y-Y') dP = 0 \implies Y-Y'=0 \text{ a.s. on } A_2.\\\) Thus $Y=Y'$ almost surely.
Not only do we get $Y=Y' \text{ a.s.},$ but we can also conclude that for any $X_1, X_2 \in \mathcal{F}$ satisfying $\int_A X_1 dP = \int_A X_2 dP$ for all $A \in \mathcal{F},$ it follows that $X_1=X_2 \text{ a.s.}$
Examples and insight
Think of $\mathcal{F}_0 \subset \mathcal{F}$ as the information we have at our disposal. We can interpret each $A \in \mathcal{F}_0$ as an event of which we know whether it occurred or not. In this sense, $E(X|\mathcal{F}_0)$ is our best guess of $X$ given the information we have.
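This "best guess" reading can be made precise in the $L^2$ sense: among $\mathcal{F}_0$-measurable predictors, the conditional expectation minimizes mean squared error. A Monte Carlo sketch under an assumed toy model ($X = Z + \text{noise}$ with $\mathcal{F}_0 = \sigma(Z)$; all names and values are illustrative):

```python
import numpy as np

# Assumed toy model: we observe Z, and X = Z + standard normal noise.
# Among sigma(Z)-measurable predictors, E(X|Z) = Z should have the
# smallest mean squared error.
rng = np.random.default_rng(42)
n = 100_000
Z = rng.integers(0, 4, size=n).astype(float)   # the available information
X = Z + rng.normal(0.0, 1.0, size=n)           # what we try to guess

best = Z          # E(X | sigma(Z)) = Z in this model
other = 2.0 * Z   # some other sigma(Z)-measurable predictor

mse_best = np.mean((X - best) ** 2)    # close to Var(noise) = 1
mse_other = np.mean((X - other) ** 2)
assert mse_best < mse_other            # the conditional expectation wins
```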
The proof requires a property yet to be mentioned, so I will leave it until the end of the section. The following examples will help in getting a grasp of the intuition behind conditional expectations. The proofs are straightforward, so I omit them.
Here $X \perp \mathcal{F}_0$ means \(P((X\in B)\cap A) = P(X\in B)P(A),~ \forall B \in \mathcal{B}(\mathbb{R}), A\in\mathcal{F}_0.\)
As an extension of the undergraduate definition, we can define conditional probability.
(ii) follows naturally from (i).
In undergraduate statistics, instead of conditioning on a $\sigma$-field, we conditioned on random variables. This can be regarded as a special case of our definition.
Furthermore, we get some form of “conditional density”.^{1}
(i) Since $f,g$ are Borel, $h$ is also a Borel function. Let $(X,Y)$ be a random vector on a product space $(\Omega, \mathcal{F}, P)$ of $(\Omega_X, \mathcal{F}_X, P_X)$ and $(\Omega_Y, \mathcal{F}_Y, P_Y)$. Given $A \in \sigma(Y),$ let $B\in\mathcal{B}(\mathbb{R})$ be such that $A = Y^{-1}(B).$ $$\begin{aligned} \int_A g(X) dP &= \int g(X) \mathbf{1}_A dP \\ &= \int g(X) \mathbf{1}_B(Y) dP \\ &= \int\int g(X) \mathbf{1}_B(Y) dP_X dP_Y \\ &= \int\int g(x) \mathbf{1}_B(y) f(x,y) dxdy \\ &= \int_B\int g(x) f(x,y) dxdy \\ &= \int_B h(y) \int f(x,y) dx dy \\ &= \int_A h(Y) dP. \end{aligned}$$ The third and the fifth equalities follow from Fubini's theorem.
(ii) By Fubini's theorem, $h \in \sigma(X).$ Given $A \in \sigma(X),$ let $B\in\mathcal{B}(\mathbb{R})$ be such that $A = X^{-1}(B).$ Similar to (i), we get $$\begin{aligned} \int_A h(X) dP &= \int\int \varphi(X,Y) dP_Y\, \mathbf{1}_B(X) dP_X \\ &= \int \varphi(X,Y)\mathbf{1}_B(X) dP \\ &= \int_A \varphi(X,Y) dP. \end{aligned}$$ The second equality follows from Fubini's theorem.
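As a sanity check of the conditional density formula in (i), one can evaluate $h(y) = \int g(x)f(x,y)\,dx \,/ \int f(x,y)\,dx$ numerically. The sketch below assumes a standard bivariate normal with correlation $\rho$ and $g(x)=x$ (a setup chosen for illustration, since the classical answer $E(X|Y=y)=\rho y$ is known):

```python
import numpy as np

# Assumed example: standard bivariate normal with correlation rho,
# g(x) = x.  Then h(y) = (int x f(x,y) dx) / (int f(x,y) dx)
# should reproduce E(X|Y=y) = rho * y.
rho = 0.6
xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

def f(x, y):
    # joint density of the standard bivariate normal
    q = (x**2 - 2*rho*x*y + y**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

for y in (-1.0, 0.5, 2.0):
    # numerator and denominator as Riemann sums over the x-grid
    h = (xs * f(xs, y) * dx).sum() / (f(xs, y) * dx).sum()
    assert abs(h - rho * y) < 1e-6   # matches the known conditional mean
```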
Properties
Next, I would like to cover the fundamental properties of conditional expectations. These will be used throughout this chapter.
(i) $E(aX+bY|\mathcal{F}_0) = aE(X|\mathcal{F}_0) + bE(Y|\mathcal{F}_0).$
(ii) $X\ge 0$ a.s. $\implies E(X|\mathcal{F}_0) \ge 0$ a.s.
A notable result from (ii) is that
\(\left|E(X|\mathcal{F}_0)\right| \le E\left(|X| \,\middle|\, \mathcal{F}_0\right).\)
Inequalities
These are conditional versions of some of the inequalities we covered earlier in Chapter 1.
$$P(X \ge a \mid \mathcal{F}_0) \le E\left(\mathbf{1}_{\{X \ge a\}} \tfrac{X}{a} \,\middle|\, \mathcal{F}_0\right) \le \frac{1}{a} E(X \mid \mathcal{F}_0).$$
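When $\mathcal{F}_0$ is generated by a finite partition, both sides of this conditional Markov inequality are plain cell-wise averages, so it can be checked directly. A sketch with assumed toy values:

```python
import numpy as np

# Toy check of conditional Markov's inequality (values assumed):
# F_0 is generated by the partition A1 = {0,1,2}, A2 = {3,4,5}.
p = np.full(6, 1/6)                                  # uniform P({omega})
X = np.array([0.5, 1.0, 4.0, 2.0, 3.0, 7.0])
cells = [np.array([0, 1, 2]), np.array([3, 4, 5])]
a = 2.0

for A in cells:
    w = p[A] / p[A].sum()               # P(. | A) restricted to the cell
    cond_prob = w[X[A] >= a].sum()      # value of P(X >= a | F_0) on A
    cond_exp = (w * X[A]).sum()         # value of E(X | F_0) on A
    assert cond_prob <= cond_exp / a    # the inequality holds cell-wise
```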
Similarly, Chebyshev’s inequality also holds for conditional expectation.
Note that $\varphi(x) = \sup\{ax+b:~ (a,b)\in S\}$ where $S=\{(a,b):~ ax+b \le \varphi(x), \forall x\}.$ So $\varphi(X) \ge aX+b$ for all $(a,b) \in S.$ $$\begin{aligned} E(\varphi(X)|\mathcal{F}_0) &\ge aE(X|\mathcal{F}_0)+b,~ \forall(a,b)\in S.\\ E(\varphi(X)|\mathcal{F}_0) &\ge \sup\{aE(X|\mathcal{F}_0)+b:~ (a,b)\in S\} \\ &= \varphi(E(X|\mathcal{F}_0)). \end{aligned}$$ (To combine the almost sure inequalities before taking the supremum, it suffices to run over a countable subset of $S,$ e.g. rational $(a,b),$ which still attains $\varphi.$)
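A quick numerical check of conditional Jensen with the convex function $\varphi(x)=x^2$ on an assumed two-cell partition:

```python
import numpy as np

# Toy check (values assumed): phi(E(X|F_0)) <= E(phi(X)|F_0)
# cell by cell, for the convex phi(x) = x^2.
p = np.full(4, 0.25)
X = np.array([-1.0, 3.0, 0.0, 2.0])
cells = [np.array([0, 1]), np.array([2, 3])]

for A in cells:
    w = p[A] / p[A].sum()
    lhs = ((w * X[A]).sum()) ** 2    # phi(E(X | F_0)) on the cell
    rhs = (w * X[A] ** 2).sum()      # E(phi(X) | F_0) on the cell
    assert lhs <= rhs                # Jensen, cell-wise
```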
Convergence theorems
In fact, the condition $EX<\infty$ is not required, since we can always define the conditional expectation of a nonnegative random variable as a Radon-Nikodym derivative. I wrote the condition only because Durrett did so.
Note that $E(X_n|\mathcal{F}_0) \le E(X_{n+1}|\mathcal{F}_0) \le E(X|\mathcal{F}_0)$ for all $n.$ Given $A\in\mathcal{F}_0,$ by using the MCT twice, $$\begin{aligned} \int_A \lim_n E(X_n|\mathcal{F}_0) dP &= \lim_n\int_A E(X_n|\mathcal{F}_0) dP \\ &= \lim_n \int_A X_n dP \\ &= \int_A X dP \\ &= \int_A E(X|\mathcal{F}_0) dP. \end{aligned}$$
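The conditional MCT can be illustrated on a finite space with the truncations $X_n = X \wedge n \uparrow X$ (a hypothetical toy setup, not from the text):

```python
import numpy as np

# Toy illustration (values assumed): X_n = min(X, n) increases to X,
# and the cell-wise conditional expectations increase with it.
p = np.full(4, 0.25)
X = np.array([1.0, 5.0, 2.0, 8.0])
cells = [np.array([0, 1]), np.array([2, 3])]

def cond_exp(Z):
    # E(Z | F_0): average of Z over each cell of the partition
    out = np.empty_like(Z)
    for A in cells:
        out[A] = (Z[A] * p[A]).sum() / p[A].sum()
    return out

prev = np.full(4, -np.inf)
for n in range(1, 10):
    cur = cond_exp(np.minimum(X, n))
    assert np.all(cur >= prev)          # E(X_n|F_0) is nondecreasing in n
    prev = cur
assert np.allclose(prev, cond_exp(X))   # the limit agrees with E(X|F_0)
```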
The proof is similar to that of conditional MCT.
Given $M>0,$ $X_n \wedge M$ is dominated by $M.$ There exists a subsequence $(X_{n_k})$ such that $X_{n_k} \to \liminf_n X_n.$ By the conditional DCT, $$\begin{aligned} E\left((\liminf_n X_n) \wedge M \,\middle|\, \mathcal{F}_0\right) &= \lim_k E(X_{n_k} \wedge M \,|\, \mathcal{F}_0) \\ &\le \liminf_n E(X_n \,|\, \mathcal{F}_0),~ \forall M>0. \end{aligned}$$ By the conditional MCT, letting $M\uparrow\infty$ gives the result.
The obvious consequences are
\(B_n \subset B_{n+1} \uparrow B,~ B=\cup_{n=1}^\infty B_n \implies P(B_n|\mathcal{F}_0) \uparrow P(B|\mathcal{F}_0).\)
and
\(C_n\in\mathcal{F}_0 \text{ are disjoint} \implies P(\cup_{n=1}^\infty C_n|\mathcal{F}_0) = \sum_{n=1}^\infty P(C_n|\mathcal{F}_0).\)
Smoothing property
(ii) $\mathcal{F}_1\subset\mathcal{F}_2$ are sub $\sigma$-fields. $EX<\infty.$ Then $$\begin{aligned} &E[E(X|\mathcal{F}_1)|\mathcal{F}_2] = E(X|\mathcal{F}_1) \\ \text{and } &E[E(X|\mathcal{F}_2)|\mathcal{F}_1] = E(X|\mathcal{F}_1). \end{aligned}$$
(i) is clear by using the standard machine. (ii) is also clear by the definition of (nested) conditional expectations.
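Both identities in (ii) can be verified numerically on nested partitions, where a refinement $\mathcal{F}_2$ of a coarse partition $\mathcal{F}_1$ plays the role of the larger $\sigma$-field (toy values assumed):

```python
import numpy as np

# Toy check of the smoothing property (values assumed):
# F1 is a coarse partition, F2 refines it.
p = np.full(8, 0.125)
X = np.arange(8, dtype=float)
F1 = [np.arange(0, 4), np.arange(4, 8)]          # coarse cells
F2 = [np.arange(0, 2), np.arange(2, 4),
      np.arange(4, 6), np.arange(6, 8)]          # finer cells

def cond_exp(Z, cells):
    # cell-wise average of Z with respect to the partition
    out = np.empty_like(Z)
    for A in cells:
        out[A] = (Z[A] * p[A]).sum() / p[A].sum()
    return out

direct = cond_exp(X, F1)                 # E(X|F_1)
tower = cond_exp(cond_exp(X, F2), F1)    # E[E(X|F_2)|F_1]
inner = cond_exp(cond_exp(X, F1), F2)    # E[E(X|F_1)|F_2]
assert np.allclose(tower, direct)        # the smaller sigma-field wins
assert np.allclose(inner, direct)        # E(X|F_1) is F_2-measurable
```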
Finishing the section, let me prove the second theorem of this section.
Acknowledgement
This post series is based on the textbook Probability: Theory and Examples, 5th edition (Durrett, 2019) and the lecture at Seoul National University, Republic of Korea (instructor: Prof. Sangyeol Lee).

There is a formal notion of (regular) conditional distribution. The actual conditional distribution is a function defined on the product space $\mathcal{B}(\mathbb{R}) \times \Omega.$ ↩