Representative Sample Selection in an Implicit Mixture Model with the Approximation Maximization Algorithm
Technological innovations have made a profound impact on knowledge discovery. Extracting useful samples from massive datasets is essential in many modern scientific areas. In this project, an Approximation Maximization (AM) algorithm is developed to select representative samples in an implicit mixture model.
The statistical problem is as follows.
The detailed theoretical derivation of the AM algorithm is as follows.
FAM is a more general version of AM.
Data: Four classes of implicit mixture datasets, where the explicit models are normal distribution, linear regression, logistic regression, and Poisson regression.
Models: AM, FAM (a more general version of AM), EM, MLE, and K-means.
Covariance structures: Mutually independent (S1), auto-regressive correlated (S2), and compound symmetric (S3).
Metrics: Deviation from the true parameters (DEV), positive selection rate (PSR), false discovery rate (FDR), the final number of selected representative samples (FN), and computation time (Time).
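As a small illustration of the selection metrics, the following sketch computes PSR, FDR, and FN from boolean indicators of the selected set and the ground-truth representative labels; the function name and interface are illustrative assumptions, and DEV and Time are omitted because they depend on the fitted model.

```python
import numpy as np

def selection_metrics(selected, is_representative):
    """PSR, FDR, and FN for a selected sample set (illustrative sketch).
    PSR: fraction of the true representative samples that were selected.
    FDR: fraction of the selected samples that are actually noise."""
    selected = np.asarray(selected, dtype=bool)
    is_representative = np.asarray(is_representative, dtype=bool)
    fn = int(selected.sum())  # final number of selected samples
    psr = (selected & is_representative).sum() / is_representative.sum()
    fdr = (selected & ~is_representative).sum() / max(fn, 1)
    return {"PSR": psr, "FDR": fdr, "FN": fn}

# e.g. 3 true representative samples; 3 samples selected (2 correct, 1 noise)
m = selection_metrics([1, 1, 0, 1, 0], [1, 1, 1, 0, 0])
```

Here PSR = 2/3 (two of the three representative samples were found) and FDR = 1/3 (one of the three selections is noise).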
In this case, more and more red points were selected into the representative sample set as the iterations proceeded. Meanwhile, the blue noise samples were gradually ruled out by the AM algorithm, showing that AM can effectively select the representative samples.
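To make this iterative behavior concrete, here is a minimal sketch of the selection idea for the normal-distribution case, not the report's exact AM update: the function name, the 2-standard-deviation threshold, and the toy N(0,1)-versus-N(6,1) setup are all illustrative assumptions.

```python
import numpy as np

def am_normal(x, threshold=2.0, max_iter=20):
    """Sketch of iterative representative-sample selection for a normal
    model: fit the model on the current selected set (closed-form mean
    and standard deviation), then re-select the samples lying within
    `threshold` standard deviations of the fitted mean."""
    selected = np.ones(len(x), dtype=bool)  # start from all samples
    for _ in range(max_iter):
        mu, sigma = x[selected].mean(), x[selected].std()
        new_selected = np.abs(x - mu) <= threshold * sigma
        if np.array_equal(new_selected, selected):  # converged
            break
        selected = new_selected
    return selected, mu, sigma

# toy check: representative samples ~ N(0, 1) (the "red" points),
# noise samples ~ N(6, 1) (the "blue" points)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 40)])
selected, mu, sigma = am_normal(x)
```

Each pass re-fits the model on the currently selected set and tightens the selection, so samples far from the true model are ruled out gradually, mirroring the behavior described above.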
From the table above, it is noticeable that the more complex covariance structures in this experiment did not lead to an obvious reduction in accuracy for AM or FAM. Comparing the two, AM was computationally more efficient than FAM, owing to the closed-form expression for the true parameter in this model.
For this case, the AM algorithm successfully selected almost all of the red representative samples and only a few blue noise samples after 7 iterations.
In this experiment, the performance pattern of the 5 methods was similar to before. AM outperformed the other methods in terms of DEV and the selection indexes. For reasons similar to those in the previous experiments, MLE, EM, and K-means failed to obtain accurate estimates and representative sample sets. As for FAM, although it retained good selection performance, its estimation accuracy was lower than in the previous experiments.
In this scenario, the AM algorithm successfully selected a majority of the representative samples, with only a few noise samples, after 4 iterations.
EM, AM, and FAM all performed well at selecting representative samples, with relatively high PSR and low FDR. In terms of DEV, EM was more accurate than AM and FAM. EM's result differed from those in the previous models because its implicit model assumption matched the true model setup here.
In this case, AM selected most of the representative samples, with few noise samples, after 5 iterations.
In this experiment, AM and FAM performed better than the other methods when estimation and selection were considered together. It is noteworthy that, under the data-generation scheme above, some noise samples lay close to the true model, so a small number of noise samples were inevitably chosen as representative samples.
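For context on this overlap, here is a hypothetical data-generation sketch for the Poisson regression case; the coefficients, sample sizes, and noise mechanism below are illustrative assumptions, not the report's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n_noise, p = 200, 50, 3        # hypothetical sample sizes
beta = np.array([0.5, -0.3, 0.2])     # hypothetical true coefficients

X = rng.normal(size=(n_rep + n_noise, p))
# representative samples follow the explicit Poisson regression model
y_rep = rng.poisson(np.exp(X[:n_rep] @ beta))
# noise samples come from an unrelated count mechanism; by chance some
# of them land close to the true model, so a selector cannot be perfect
y_noise = rng.poisson(np.exp(rng.uniform(-1.0, 1.0, size=n_noise)))
y = np.concatenate([y_rep, y_noise])
is_representative = np.arange(n_rep + n_noise) < n_rep
```

Because the noise responses share the same count-valued range as the model's responses, a few noise points fall near the true regression surface, which is exactly why a small FDR remains even for a well-behaved selector.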
Across the experiments on the 4 types of mixed datasets, whose explicit models were the normal distribution, linear regression, logistic regression, and Poisson regression, respectively, as well as the 3 different correlation structures among the variables, the AM algorithm was robust, outperforming the other methods in estimation accuracy and selection consistency in most cases.
In future work, we will study the specific influence of the initial parameter input on the AM algorithm, design a general and data-driven threshold rule, and pursue in-depth theoretical guarantees for AM.