GANs for imputing matrices

General Idea

Given a feature matrix with missing values, we can use GAN-based ideas to impute them. Formally, we have a matrix $X$, generated by our model, with missing values. Assume we also have a binary mask matrix $M$ in which $M_{ij} = 1$ iff $X_{ij}$ is not missing. The missingness patterns are far from random: some features always exist, some always exist when others exist, and some have highly correlated mask bits.

Our goal is to train a GAN with a generator G and a discriminator D. The generator G is a function (which we will train as a deep network) that gets an input vector $x$ and a mask vector $m$ stating which channels are known, and generates $\hat{x}$, a complete imputed vector in which the known channels ($m_i = 1$) are exactly as in the input and the rest are imputed. So:

$\hat{x} = G(x, m)$, where $\hat{x}_i = x_i$ if $m_i = 1$
$\hat{x}_i$ is an imputed value if $m_i = 0$

The discriminator D is another function (also a deep network) that gets a vector $x$ and a mask $m$ and tries to estimate whether the vector, confined to the lit-up channels, is a real sample or one that went through imputation. The reason we do this with a mask vector is that we may not have enough examples in which $x$ is given on all channels; we may only have a large $X$ matrix with many missing values in it. Hence we don't have complete real examples to feed into the discriminator training, and we need to restrict D to judging real vs. fake on only some of the channels. This creates a problem of its own: making sure that the population of masks given with real samples is the same as the population of masks given with fake samples, since otherwise D could tell the two apart from the mask alone.

While training, we run competition rounds in which G is trained to generate values for channels that are hidden from it, and D tries to tell real from fake. The loss functions are arranged in a way that pushes G to generate examples that D finds hard to discriminate.
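
One competition round as described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the actual implementation: the one-layer `generator` and `discriminator`, the `hide_channels` helper, the 0.5 hiding probability, the toy data, and the binary cross-entropy losses are all hypothetical choices that the text does not specify. The sketch shows how known channels are copied through unchanged, how G is trained on channels deliberately hidden from it, and how D receives real and imputed rows under the same masks so the two mask populations match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 6 rows, 4 channels, NaN marks a missing entry.
X_raw = rng.normal(size=(6, 4))
X_raw[rng.random(X_raw.shape) < 0.3] = np.nan

M = (~np.isnan(X_raw)).astype(float)  # M[i, j] = 1 iff X[i, j] is not missing
X = np.nan_to_num(X_raw, nan=0.0)     # placeholder 0 in the missing slots


def hide_channels(m, p_hide, rng):
    """Sub-mask of m: each known channel is hidden from G with prob. p_hide."""
    return m * (rng.random(m.shape) >= p_hide)


def generator(x, m, W):
    """Toy one-layer G (assumption): sees only channels lit in m."""
    proposal = np.tanh(np.concatenate([m * x, m], axis=1) @ W)
    # Known channels are copied from the input, the rest are imputed.
    return m * x + (1.0 - m) * proposal


def discriminator(x, m, v):
    """Toy logistic D (assumption): scores x confined to channels lit in m."""
    logits = np.concatenate([m * x, m], axis=1) @ v
    return 1.0 / (1.0 + np.exp(-logits))


n_ch = X.shape[1]
W = rng.normal(scale=0.1, size=(2 * n_ch, n_ch))  # untrained G weights
v = rng.normal(scale=0.1, size=2 * n_ch)          # untrained D weights

M_g = hide_channels(M, p_hide=0.5, rng=rng)  # what G is allowed to see
X_hat = generator(X, M_g, W)                 # hidden channels get imputed

# D is given real and imputed rows under the SAME masks M.
d_real = discriminator(X, M, v)
d_fake = discriminator(X_hat, M, v)

eps = 1e-8  # numerical guard for the logs
d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
g_loss = -np.mean(np.log(d_fake + eps))  # G wants D to say "real"
print(f"d_loss={d_loss:.3f}  g_loss={g_loss:.3f}")
```

In a real setup both networks would be deep models updated by gradient steps on `d_loss` and `g_loss` in alternation; here the weights are left untrained and only the data flow is shown. Rows in which no channel happens to be hidden are identical in the real and fake batches, so a practical version would probably resample or skip them.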