Understanding Latent Dirichlet Allocation (2) The Model
In the last article, I covered the topic models that were frequently used around the time LDA was developed. At the end of the post, I briefly introduced the rationale behind LDA. In this post, I would like to elaborate on the details of the model architecture.
This article is the second part of the series “Understanding Latent Dirichlet Allocation”.
Generative process
Suppose we have a corpus with a vocabulary of size $V$, and that every word $w$ and every topic $z$ is one-hot encoded (i.e. $w^i=1$ and $w^j=0$ for $j\ne i$ if $w$ is the $i$th term in the vocabulary). Define a document $\mathbf{w}$ as a finite sequence of $N$ words
\[\mathbf{w} = (w_1, \cdots, w_N)\]and denote a corpus $D$ of $M$ documents
\[D = \{\mathbf{w}_1, \cdots, \mathbf{w}_M\}.\]For a prespecified number of topics $k$, LDA assumes that the $d$th document $\mathbf{w}_d \in D$ with length $N_d$ is “generated” in the following steps^{1}:
1. $\theta_d \sim \mathcal{D}_k(\alpha)$. The multinomial parameter $\theta_d \in \mathbb{R}^k$ is drawn from a $k$-dimensional Dirichlet distribution.
2. For each word $w_{dn}$, $n=1,\cdots,N_d$, in the document:
   1. $z_{dn} \sim \mathcal{M}_k(1, \theta_d)$. A topic $z_{dn}$ for the word $w_{dn}$ is drawn from a $k$-dimensional multinomial distribution.
   2. $w_{dn} \mid z_{dn}, \beta \sim \mathcal{M}_V(1, \beta_{i:z_{dn}^i=1})$. The word $w_{dn}$, given the topic $z_{dn}$, is drawn from a $V$-dimensional multinomial distribution.
$\alpha \in \mathbb{R}^k$ and $\beta \in \mathbb{R}^{k \times V}$ are hyperparameters that are common to all of the documents in a corpus. $\beta$ is an unknown constant that satisfies $\beta_{ij} = P(w^j=1 \mid z^i=1)$. That is, given a topic $z$ such that $z^i = 1$, the probability of a word $w$ given $\beta$ follows a multinomial distribution with parameter $\beta_i$, the $i$th row of $\beta$.
$\theta = (\theta_1, \cdots, \theta_M) \in \mathbb{R}^{M \times k}$ can be seen as a document-specific random topic matrix, as $\theta_d$ generates the topic of each word in the $d$th document in the process above.
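To make the generative process concrete, here is a minimal NumPy sketch. The sizes $k$ and $V$, the symmetric $\alpha$, and all variable names are toy values of my own choosing, and words and topics are represented by integer indices rather than one-hot vectors for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 10                              # number of topics, vocabulary size (toy values)
alpha = np.ones(k)                        # symmetric Dirichlet hyperparameter
beta = rng.dirichlet(np.ones(V), size=k)  # k x V matrix; row i is the word distribution of topic i

def generate_document(n_words):
    """Generate one document by the LDA generative process."""
    theta_d = rng.dirichlet(alpha)        # step 1: theta_d ~ D_k(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta_d)      # step 2a: z_dn ~ M_k(1, theta_d)
        w = rng.choice(V, p=beta[z])      # step 2b: w_dn | z_dn ~ M_V(1, beta_z)
        words.append(w)
    return words

print(generate_document(8))               # word indices of one simulated document
```

Running `generate_document` once per document, each time with a fresh $\theta_d$, simulates a whole corpus $D$.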
Blei et al. presented a graphical representation of the model. This helps in understanding the structure, as it illustrates the “scope” of each parameter. Only the words $w_{dn}$, shaded in grey, are observable. Note that each word $w$ is generated by $z$, which is generated by $\theta$, which is in turn generated by $\alpha$; this makes LDA a three-level hierarchical model.
Mixture representations
In the previous article, I mentioned that under the LDA model both words and documents can be represented as mixture distributions by de Finetti’s theorem. This property takes full advantage of the bag-of-words assumption to reduce the number of parameters in a large corpus, and it makes LDA a proper document model. Here, I would like to elaborate on it.
Words
By marginalizing over topics $z$, the word distribution is given by
\[P(w_{dn} \mid \theta_d, \beta) = \sum_z P(w_{dn} \mid z, \beta) P(z \mid \theta_d),\]which is a mixture distribution with mixture weights $P(z \mid \theta_d)$ and components $P(w_{dn} \mid z, \beta)$.
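Written with vectors and matrices, this marginalization is just a product: the probability that $w_{dn}$ is the $j$th term is $\sum_i \theta_{di}\beta_{ij}$, i.e. the $j$th entry of $\theta_d^\top \beta$. A small sketch (toy sizes and illustrative names, as before):

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 10
theta_d = rng.dirichlet(np.ones(k))       # mixture weights P(z | theta_d)
beta = rng.dirichlet(np.ones(V), size=k)  # components P(w | z, beta), one row per topic

# Marginalizing over z collapses to a vector-matrix product:
# P(w_dn = j | theta_d, beta) = sum_i theta_d[i] * beta[i, j]
word_dist = theta_d @ beta
print(word_dist.shape, word_dist.sum())   # (10,) and ~1.0: a distribution over the vocabulary
```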
Documents
We assumed in LDA that words are generated by topics and that topics are exchangeable within a document. Thus the joint density of $\theta_d$, $\mathbf{z}_d$, and $\mathbf{w}_d$ is
\[p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta) = P(\theta_d \mid \alpha)\prod_{n=1}^{N_d} P(z_{dn} \mid \theta_d)p(w_{dn} \mid z_{dn}, \beta).\]Marginalize it over $z$ and use the mixture representation of the word distribution to get
\[p(\theta_d, \mathbf{w}_d \mid \alpha, \beta) = P(\theta_d \mid \alpha)\prod_{n=1}^{N_d} P(w_{dn} \mid \theta_d, \beta).\]Marginalize again with respect to $\theta_d$ to get the continuous mixture representation of the document $\mathbf{w}_d$:
\[p(\mathbf{w}_d \mid \alpha, \beta) = \int P(\theta_d \mid \alpha)\prod_{n=1}^{N_d} P(w_{dn} \mid \theta_d, \beta) ~d\theta_d,\]where $P(\theta_d \mid \alpha)$ are the weights and $\prod_{n=1}^{N_d} P(w_{dn} \mid \theta_d, \beta)$ are the components.
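The integral has no closed form, but the mixture reading suggests a simple Monte Carlo check: draw many $\theta_d$ from the weights $P(\theta_d \mid \alpha)$ and average the components evaluated at the document. The following sketch is only illustrative (toy sizes, a made-up document of word indices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 10
alpha = np.ones(k)
beta = rng.dirichlet(np.ones(V), size=k)      # k x V topic-word matrix
doc = [4, 1, 7, 2]                            # a document w_d as word indices

# p(w_d | alpha, beta) = E_theta[ prod_n P(w_dn | theta, beta) ],
# estimated by averaging over draws from the Dirichlet weights.
S = 10_000
thetas = rng.dirichlet(alpha, size=S)         # S x k samples of theta_d
word_probs = thetas @ beta                    # S x V: P(w | theta, beta) per sample
components = word_probs[:, doc].prod(axis=1)  # product over the N_d words of the document
print(components.mean())                      # Monte Carlo estimate of p(w_d | alpha, beta)
```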
References
 Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5): 993–1022.

The number of words in a document is determined by a Poisson distribution ($N \sim \mathcal{P}(\xi)$). I omitted this part because $N$ is an ancillary variable, and the model description does not change even if $N$ is treated as a constant. ↩