The idea is to redo the LDA lab we did in class (link here), but with Student's t classes (with different covariances and different numbers of degrees of freedom) instead of Gaussians.
We assume the following model, where \(x \in \mathbb{R}^D\) is a data point, and \(k \in \{1,...,K\}\) is a class:
\[p(x|Z=k) = \mathcal{S}(x;\mu_k,\Sigma_k, \nu)\]
and \(P(Z=k) = \pi_k\). The function \(\mathcal{S}\) corresponds to the denisty of a multivariate Student's t distribution (link here).
What is the maximum-likelihood estimate of the class proportions \(\pi_1,...,\pi_K\)?
Fit our model using maximum likelihood to the same data we used for the LDA lab in class.
To help you, here's a quick way to fit a multivariate Student's t distribution to some data using a package. First we load the data:
library("fitHeavyTail")
library("mvtnorm")
library("pgmm")
data(wine)
X = wine[,-1]
z = wine$Type
Then we use the function fit_mvt:
res = fit_mvt(X, nu = "iterative",ftol = 1e-9)
We can evaluate the log-likelihood of the data from the output of fit_mvt, or using dmvt from the mvtnorm package:
print(res$log_likelihood)
## [1] -11016.75
print(sum(dmvt(X,delta = res$mu, sigma = res$scatter,df = res$nu, log = TRUE)))
## [1] -11016.75
Let \(X \in \mathbb{R}^{n \times d}\) a dataset which contains missing values. The missing-data pattern \(M \in \{0,1\}^{n\times d}\) is a binary matrix which indicates where are the missing values in \(X\): \(m_{ij}=1\) is \(x_{ij}\) is missing, \(m_{ij}=0\) otherwise. Let us denote \(x_i^{\mathrm{obs}}\) (resp. \(x_i^{\mathrm{mis}}\)) the values of the observed variable(s) (resp. the values of the missing variable(s)) for the individual \(i\).
For example, if we observe \((\mathrm{NA},2,3,\mathrm{NA})\) and the true values are \((1,2,3,4)\), \(x_i^{\mathrm{obs}}=(2,3)\) and \(x_i^{\mathrm{mis}}=(1,4)\).
The aim of model-based clustering is to estimate an (unknown) partition of the data \(X\) in \(K\) groups. This partition is denoted as \(Z=(z_1|\dots|z_n)^T \in \{0,1\}^{n\times K}\) with \(z_i=(z_{i1},\dots,z_{iK})^T \in \{0,1\}^{K}\) and where \(z_{ik}=1\) if \(x_i\) belongs to the class \(k\), \(z_{ik}=0\) otherwise. In this context, note that both \(x_i^{\textrm{mis}}\) and \(z_i\) are missing. In the following, we assume that the data are missing completely at random (MCAR).
We consider i.i.d. data \(x_1,\dots,x_n\) (\(x_i \in \mathbb{R}^d\)) genereted under a Gaussian mixture model with \(K\) clusters and the following density: \[ p_\theta(x)=\sum_{k=1}^K \pi_k f_k(x;\mu_k,\Sigma_k),\]
with \(\theta=(\pi_1,\dots,\pi_K,\mu_1,\dots,\mu_K,\Sigma_1,\dots,\Sigma_K)\) and \(f_k\) is the Gaussian density function. \(\forall k \in \{1,\dots,K\}, \mu_k\in \mathbb{R}^d\) and \(\Sigma_k \in \mathbb{R}^{d\times d}\).
The goal of this exercise is to implement an EM algorithm for a Gaussian mixture model when the data are missing (MCAR).
The EM algorithm is an iterative algorithm that permits to maximize the likelihood function under missingness. Let be \(\theta^{[0]}\) the initialization point. The iteration \([r]\) of the algorithm consists in proceeding:
the E-step: computation of \(Q(\theta;\theta^{[r-1]})=\mathbb{E}[\ell(\theta;X,Z)|X^{\mathrm{obs}};\theta^{[r-1]}],\) where \(\ell(\theta;X,Z)\) is the complete log-likelihood.
the M-step: update of the parameters by maximizing the function \(Q(\theta;\theta^{[r-1]})\): \[\theta^{[r]}=\mathrm{argmax}_{\theta} Q(\theta;\theta^{[r-1]}).\]
Note that
\[\ell(\theta;X,Z)=\log\left(\prod_{i=1}^n p_\theta(x_i)\right)= \sum_{i=1}^n\sum_{k=1}^K z_{ik} \log(\pi_k f_k(x_i;\mu_k,\Sigma_k))\]
We can show that \[Q(\theta;\theta^{[r-1]}=\sum_{i=1}^n\sum_{k=1}^K t_{ik}(\theta^{[r-1]}) [\log(\pi_k) + \tau_x(\mu_k,\Sigma_k;x_i^\mathrm{obs},\theta^{[r-1]})]\]
Explicit the terms \(t_{ik}\) and \(\tau_x\) as conditional probability and conditional expectation.
Write the term \(t_{ik}\).
To compute \(\tau_x\), we have to make explicit the distribution \(f(x_i^\mathrm{mis}|x_i^\mathrm{obs},z_{ik}=1;\theta^{[r-1]})\).
Up to a reorganization of the variables, we have
\[f(x_i|z_{ik}=1;\theta^{[r-1]})=f\left(\begin{pmatrix} x_i^\mathrm{obs} \\ x_i^\mathrm{mis} \end{pmatrix}|z_{ik}=1;\theta^{[r-1]}\right)=\mathcal{N}\left(\begin{pmatrix} (\mu_{ik}^\mathrm{obs})^{[r-1]} \\ (\mu_{ik}^\mathrm{mis})^{[r-1]} \end{pmatrix}, \begin{pmatrix} (\Sigma_{ik}^{\mathrm{obs},\mathrm{obs}})^{[r-1]} & (\Sigma_{ik}^{\mathrm{obs},\mathrm{mis}})^{[r-1]} \\ (\Sigma_{ik}^{\mathrm{obs},\mathrm{mis}})^{[r-1]} & (\Sigma_{ik}^{\mathrm{mis},\mathrm{mis}})^{[r-1]} \end{pmatrix}\right),\]
where
\((\mu_{ik}^\mathrm{obs})^{[r-1]}\) (resp. \((\mu_{ik}^\mathrm{mis})^{[r-1]}\)) is the values of the vector \((\mu_{ik})^{[r-1]}\) for the observed variables (resp. for the missing variables),
\((\Sigma_{ik}^{\mathrm{obs},\mathrm{obs}})^{[r-1]}\) is sub-matrix of \(\Sigma_{ik}^{[r-1]}\) for the observed variables only, ...
Therefore, we have:
\[\left( x_i^{\mathrm{mis}} \mid x_i^{\mathrm{obs}},z_{ik}=1;\theta^{[r-1]} \right) \sim \mathcal{N}\left((\tilde{\mu}_{ik}^{\mathrm{mis}})^{[r-1]},(\tilde{\Sigma}_{ik}^{\mathrm{mis}})^{[r-1]}\right)\]
Hint: use the classic formulae for the conditional expectation of Gaussian variables https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions.
The term \(\tau_x\) is written as follows: \[\tau_x(\mu_k,\Sigma_k;x_i^\mathrm{obs},\theta^{[r-1]})=\frac{1}{2}\left[n\log(2\pi) + \log((\mid\Sigma_k\mid)) \right] -\frac{1}{2}\mathbb{E}\left[ (x_i - \mu_k)^T(\Sigma_k)^{-1}(x_i - \mu_k) \mid x_i^{\mathrm{obs}},z_{ik}=1;\theta^{[r-1]}\right].\]
This last term could be expressed using the commutativity and linearity of the trace function: \[\mathbb{E}\left[ (x_i - \mu_k)^T(\Sigma_k)^{-1}(x_i - \mu_k) \mid x_i^{\mathrm{obs}},z_{ik}=1;\theta^{[r-1]} \right] \\ = \mbox{tr}(\mathbb{E}\left[ (x_i - \mu_k)(x_i - \mu_k)^T \mid x_i^{\mathrm{obs}},z_{ik}=1,c_i;\theta^{[r-1]} \right](\Sigma_k)^{-1}).\]
Therefore, to compute \(\tau_x\), only \(\mathbb{E}\left[ (x_i - \mu_k)(x_i - \mu_k)^T \mid x_i^{\mathrm{obs}},z_{ik}=1;\theta^{[r-1]} \right]\) has to be calculated.
Hint: up to a reorganization of the variables, we have
\[(x_i - \mu_k)(x_i - \mu_k)^T = \begin{pmatrix} (x_i^{\mathrm{obs}} - \mu_{ik}^{\mathrm{obs}})^T(x_i^{\mathrm{obs}} - \mu_{ik}^{\mathrm{obs}}) &(x_i^{\mathrm{obs}} - \mu_{ik}^{\mathrm{obs}})^T(x_i^{\mathrm{mis}} - \mu_{ik}^{\mathrm{mis}})\\ (x_i^{\mathrm{mis}} - \mu_{ik}^{\mathrm{mis}})^T(x_i^{\mathrm{obs}} - \mu_{ik}^{\mathrm{obs}}) &(x_i^{\mathrm{mis}} - \mu_{ik}^{\mathrm{mis}})^T(x_i^{\mathrm{mis}} - \mu_{ik}^{\mathrm{mis}}) \end{pmatrix}.\]
You can compute the expected value for each block.
To conclude, the E-step consists of computing the following quantities: \((\tilde{\mu}_{ik}^{\mathrm{mis}})^{[r-1]}\), \((\tilde{\Sigma}_{ik}^{\mathrm{mis}})^{[r-1]}\) and \(t_{ik}(\theta^{[r-1]}\).
Let us denote \((\tilde{x}_{ik})^{[r-1]} = (x_i^{\mathrm{obs}},(\tilde{\mu}_{ik}^{\mathrm{mis}})^{[r-1]})\) and \(\tilde{\Sigma}_{ik}^{[r-1]} = \left( \begin{array}{cc} 0_i^{\mathrm{obs},\mathrm{obs}} & 0_i^{\mathrm{obs},\mathrm{mis}}\\ 0_i^{\mathrm{mis},\mathrm{obs}} & (\tilde{\Sigma}_{ik}^{\mathrm{mis}})^{[r-1]} \end{array}\right)\)
The maximization of \(Q(\theta;\theta^{[r-1]})\) over \((\pi,\mu,\Sigma)\) leads to, for \(k=1,\ldots,K\), \[\pi_k^{[r]} = \frac{1}{n} \sum_{i=1}^n t_{ik}(\theta^{[r-1]})\] \[\mu_k^{[r]} = \frac{\sum_{i=1}^n t_{ik}(\theta^{[r-1]}) (\tilde{x}_{ik})^{[r-1]}}{\sum_{i=1}^n t_{ik}(\theta^{[r-1]})}\] \[\Sigma_k^{[r]} = \frac{\sum_{i=1}^n \left[t_{ik}(\theta^{[r-1]}) \left((\tilde{x}_{ik})^{[r-1]} - \mu_k^{[r]})((\tilde{x}_{ik})^{[r-1]} - \mu_k^{[r]})^T+\tilde{\Sigma}_{ik}^{[r-1]}\right) \right]}{\sum_{i=1}^n t_{ik}(\theta^{[r-1]})}\]
We consider a bivariate Gaussian mixture (\(d=2\)) with 2 classes (\(K=2\)): \[X \sim \pi_1\mathcal{N}(\mu_1,\Sigma_1)+\pi_2\mathcal{N}(\mu_2,\Sigma_2),\] with \(\mu_1=(0,0)\), \(\mu_2=(3,3)\), \(\pi_1=0.3,\pi_2=0.7\) and \(\Sigma_1=I_{2\times 2}\), \(\Sigma_2=I_{2\times 2}\)
We then introduce MCAR values in \(X\) (both variables are missing).
library(MASS)
library(mvtnorm)
SimuZ <- function(n,pik){
#######Arguments
##n: number of observations
##pik: vector of size K, proportion of the K classes
#######Values
##Return a matrix of size (n,K)
K <- length(pik)
Z <- matrix(0,nrow=n,ncol=length(pik))
obs_pos <- 1:n
for (k in 1:(K-1)){
obs_k <- sample(obs_pos,pik[k]*n)
Z[obs_k,k] <- 1
obs_pos <- setdiff(obs_pos,obs_k)
}
Z[obs_pos,K] <- 1
return(Z)
}
set.seed(2)
n <- 100
d <- 2
mu.true <- list(rep(0,d),rep(3,d))
sigma.true <- list(diag(1,d),diag(1,d))
K <- 2
pik.true <- c(0.3,0.7)
## Generation of the true partition
Z.true <- SimuZ(n=n,pik=pik.true)
Partition_true <- apply(Z.true, 1, function(z) which(z==1))
## Generation of the complete dataset
X <- matrix(NA,nrow=n,ncol=d)
for (k in 1:K){
obs_k <- which(Z.true[,k]==1)
X[obs_k, ] <- mvrnorm(n*pik.true[k] , mu.true[[k]] , sigma.true[[k]])
}
## Introduction of missing values in X
prop.miss <- 0.4
nb.miss <- floor(n*d*prop.miss)
missing_idx.mcar <- sample(n*d, nb.miss, replace = FALSE)
XNA <- X
XNA[missing_idx.mcar] <- NA
for (i in 1:n){ #to avoid that one row contains only NA
if (sum(is.na(XNA[i,]))==d){
num <- sample(1:d,1)
XNA[i,num] <- X[i,num]
}
}
head(XNA)
## [,1] [,2]
## [1,] 4.5994441 NA
## [2,] 2.5131302 4.9682624
## [3,] -0.2427924 0.8234145
## [4,] NA 3.0975859
## [5,] 3.1487466 1.7245412
## [6,] 0.1171338 -0.3877827
sum(is.na(XNA))/(n*d)
## [1] 0.315
## Answer (code):
pi.init <- ############
mu.init <- ############
sigma.init <- ############
Compute the missing-data pattern \(M\) which indicates where are the missing values in \(X\).
Code the function which returns the observed log-likelihood and \(t_{ik}\).
Recall that the observed log-likelihood is written as \[\ell(\theta;X^\mathrm{obs})=\sum_{i=1}^n \log(f(x_i^{\mathrm{obs}};\theta))=\sum_{i=1}^n \log\left(\sum_{k=1}^K f_k(x_i^{\mathrm{obs}};\mu_k,\Sigma_k)\pi_k\right)\]
Use the following function logsumexp to write the function Loglikehood_obs_Gaussian which returns the observed log-likelihood and \(t_{ik}, \forall i, \forall k\).
Hint: you can use https://github.com/cbouveyron/MBSL-Course2021/blob/main/GMM_1d_notebook.ipynb
logsumexp <- function (x) {
y = max(x)
y + log(sum(exp(x - y)))
}
## Answer (code):
# Arguments:
## XNA: matrix containing missing values (data matrix of size n,d)
## mu: current value of the means (list of length K, with vectors of size d)
## sigma: current value of the covariance matrices (list of length K, with matrices of size d,d)
## prop.pi: current proportion per class (vector of length K)
Loglikelihood_obs_Gaussian <- function(XNA,mu,sigma,prop.pi){
n <- ############
d <- ############
M <- ############
K <- ############
log_tik_numerator <- matrix(0, n, K) #log_tik_numerator[i,k] = log(f(zik=1,y_i^obs))
Pattern <- matrix(M[!duplicated(M),],nrow=sum(!duplicated(M)))
for (i in 1:nrow(Pattern)){
wherePat <- which(sapply(1:nrow(M),function(x) prod(M[x,]==Pattern[i,])==1))
X_Pat <- rbind(XNA[wherePat,])
M_Pat <- rbind(M[wherePat,])
var_obs <- which(Pattern[i,]==0)
X_obs <- as.matrix(X_Pat[,var_obs])
mu_obs <- lapply(mu, function(x) x[var_obs])
sigma_obs <- lapply(sigma , function(x) matrix(x[var_obs,var_obs],ncol=length(var_obs),nrow=length(var_obs)))
for (k in 1:K){
log_tik_numerator[wherePat,k] = ############
}
}
log_tik = ############
tik = ############
loglik_obs = ############
return(list(loglik_obs=loglik_obs,tik=tik))
}
In the case where the covariance matrix are diagonal, note that the M-step can be written as follows:
\(\forall k \in \{1,\dots,K\}\):
\(\pi_k^{[r]} = \frac{1}{n} \sum_{i=1}^n t_{ik}(\theta^{[r-1]})\).
\(\forall i \in \{1,\dots,n\}, \quad \tilde{x}^{[r-1]}_{ik}=(\tilde{x}^{[r-1]}_{ik1},\dots,\tilde{x}^{[r-1]}_{ikd})\in \mathbb{R}^d\) such that \(\forall j \in \{1,\dots,d\}\): \[\tilde{x}^{[r-1]}_{ikj}=X_{ij} \mbox{ if $M_{ij}=0$ (observed) and } \tilde{x}^{[r-1]}_{ikj}=\mu_{kj}^{[r-1]} \mbox{ if $M_{ij}=1$ (missing)}.\]
\(\mu_{k}^{[r]}=(\mu_{k1}^{[r]},\dots,\mu_{kd}^{[r]})\in \mathbb{R}^d\) such that
\[\forall j \in \{1,\dots,d\}, \quad \mu_{kj}^{[r]}=\frac{\sum_{i=1}^n \tilde{x}^{[r-1]}_{ikj}t_{ik}(\theta^{[r-1]})}{\sum_{i=1}^n t_{ik}(\theta^{[r-1]})}\]
\(\forall i \in \{1,\dots,n\}, \quad \tilde{\Sigma}_{ik}^{[r-1]}=(\tilde{\Sigma}_{ikjj'}^{[r-1]})_{j\in\{1,\dots,d\},j'\in \{1,\dots,d\}} \in \mathbb{R}^{d\times d}\) such that: \[\forall j\neq j' \in \{1,\dots,d\}, \quad \tilde{\Sigma}_{ikjj'}^{[r-1]}=0\] and \[\forall j \in \{1,\dots,d\}, \quad \tilde{\Sigma}^{[r-1]}_{ikjj}=0 \mbox{ if $M_{ij}=0$ (observed) and } \tilde{\Sigma}^{[r-1]}_{ikjj}=\Sigma_{ikjj}^{[r-1]} \mbox{ if $M_{ij}=1$ (missing)}.\]
\(\Sigma_{k}^{[r]}=(\Sigma_{kjj'}^{[r]})_{j\in\{1,\dots,d\},j'\in \{1,\dots,d\}}\in \mathbb{R}^{d\times d}\) such that: \[\forall j\neq j' \in \{1,\dots,d\}, \quad \Sigma_{kjj'}^{[r]}=0\] and \[\forall j \in \{1,\dots,d\}, \quad \Sigma_{kjj'}^{[r]}=\frac{\sum_{i=1}^n (( \tilde{x}_{ikj}-\mu_{kj}^{[r]})( \tilde{x}_{ikj}-\mu_{kj}^{[r]}) + \tilde{\Sigma}_{ikjj}^{[r-1]}) t_{ik}(\theta^{[r-1]})}{\sum_{i=1}^n t_{ik}(\theta^{[r-1]})}\]
Describe a procedure to select the number of classes.
Write the complete log-likelihood if the mechanism is MNAR.