Representative Sample Selection in an Implicit Mixture Model with the Approximation Maximization Algorithm
Technological innovations have made a profound impact on knowledge discovery. Extracting useful samples from massive datasets is essential in many modern scientific areas. In this project, an Approximation Maximization (AM) algorithm is developed to select representative samples in an implicit mixture model.
The statistical problem is as follows.
The detailed theoretical derivation of the AM algorithm is as follows.
FAM is a more general version of AM.
Data: Four classes of implicit mixture datasets, where the explicit models are normal distribution, linear regression, logistic regression, and Poisson regression.
Models: AM, FAM (a more general version of AM), EM, MLE, and K-means.
Covariance structures: Mutually independent (S1), auto-regressive correlated (S2), and compound symmetric (S3).
Metrics: Deviation from the true parameters (DEV), positive selection rate (PSR), false discovery rate (FDR), the final number of selected representative samples (FN), and computation time (Time).
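As a small illustration of the selection metrics, the following sketch computes PSR, FDR, and FN from boolean indicators of the selected set and the ground-truth representative labels; the function name and interface are illustrative assumptions, and DEV and Time are omitted because they depend on the fitted model.

```python
import numpy as np

def selection_metrics(selected, is_representative):
    """PSR, FDR, and FN for a selected sample set (illustrative sketch).
    PSR: fraction of the true representative samples that were selected.
    FDR: fraction of the selected samples that are actually noise."""
    selected = np.asarray(selected, dtype=bool)
    is_representative = np.asarray(is_representative, dtype=bool)
    fn = int(selected.sum())  # final number of selected samples
    psr = (selected & is_representative).sum() / is_representative.sum()
    fdr = (selected & ~is_representative).sum() / max(fn, 1)
    return {"PSR": psr, "FDR": fdr, "FN": fn}

# e.g. 3 true representative samples; 3 samples selected (2 correct, 1 noise)
m = selection_metrics([1, 1, 0, 1, 0], [1, 1, 1, 0, 0])
```

Here PSR = 2/3 (two of the three representative samples were found) and FDR = 1/3 (one of the three selections is noise).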
In this case, more and more red points were selected into the representative sample set as the iterations proceeded. Meanwhile, the blue noise samples were gradually ruled out by the AM algorithm, showing that AM can effectively select the representative samples.
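To make this iterative behavior concrete, here is a minimal sketch of the selection idea for the normal-distribution case, not the report's exact AM update: the function name, the 2-standard-deviation threshold, and the toy N(0,1)-versus-N(6,1) setup are all illustrative assumptions.

```python
import numpy as np

def am_normal(x, threshold=2.0, max_iter=20):
    """Sketch of iterative representative-sample selection for a normal
    model: fit the model on the current selected set (closed-form mean
    and standard deviation), then re-select the samples lying within
    `threshold` standard deviations of the fitted mean."""
    selected = np.ones(len(x), dtype=bool)  # start from all samples
    for _ in range(max_iter):
        mu, sigma = x[selected].mean(), x[selected].std()
        new_selected = np.abs(x - mu) <= threshold * sigma
        if np.array_equal(new_selected, selected):  # converged
            break
        selected = new_selected
    return selected, mu, sigma

# toy check: representative samples ~ N(0, 1) (the "red" points),
# noise samples ~ N(6, 1) (the "blue" points)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 40)])
selected, mu, sigma = am_normal(x)
```

Each pass re-fits the model on the currently selected set and tightens the selection, so samples far from the true model are ruled out gradually, mirroring the behavior described above.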
From the table above, it is noticeable that the more complex covariance structures in this experiment did not lead to an obvious reduction in accuracy for AM or FAM. Comparing the two, AM was computationally more efficient than FAM, owing to the closed-form expression for the true parameter in this model.
For this case, the AM algorithm successfully selected almost all of the red representative samples and only a few blue noise samples after 7 iterations.
In this experiment, the performance pattern of the 5 methods was similar to before. AM outperformed the other methods in terms of DEV and the selection indexes. For reasons similar to those in the previous experiments, MLE, EM, and K-means failed to obtain accurate estimates and representative sample sets. As for FAM, although it retained good selection performance, its estimation accuracy was lower than in the previous experiments.
In this scenario, the AM algorithm successfully selected a majority of the representative samples, with only a few noise samples, after 4 iterations.
EM, AM, and FAM all performed well at selecting representative samples, with relatively high PSR and low FDR. In terms of DEV, EM was more accurate than AM and FAM. EM's result differed from those in the previous models because its implicit model assumption matched the true model setup here.
In this case, AM selected most of the representative samples, with few noise samples, after 5 iterations.
In this experiment, AM and FAM performed better than the other methods when estimation and selection were considered together. It is noteworthy that, under the data-generation scheme above, some noise samples lay close to the true model, so a small number of noise samples were inevitably chosen as representative samples.
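For context on this overlap, here is a hypothetical data-generation sketch for the Poisson regression case; the coefficients, sample sizes, and noise mechanism below are illustrative assumptions, not the report's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n_noise, p = 200, 50, 3        # hypothetical sample sizes
beta = np.array([0.5, -0.3, 0.2])     # hypothetical true coefficients

X = rng.normal(size=(n_rep + n_noise, p))
# representative samples follow the explicit Poisson regression model
y_rep = rng.poisson(np.exp(X[:n_rep] @ beta))
# noise samples come from an unrelated count mechanism; by chance some
# of them land close to the true model, so a selector cannot be perfect
y_noise = rng.poisson(np.exp(rng.uniform(-1.0, 1.0, size=n_noise)))
y = np.concatenate([y_rep, y_noise])
is_representative = np.arange(n_rep + n_noise) < n_rep
```

Because the noise responses share the same count-valued range as the model's responses, a few noise points fall near the true regression surface, which is exactly why a small FDR remains even for a well-behaved selector.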
Across the experiments on the 4 types of mixed datasets, whose explicit models were the normal distribution, linear regression, logistic regression, and Poisson regression, respectively, as well as the 3 different correlation structures among the variables, the AM algorithm was robust, outperforming the other methods in estimation accuracy and selection consistency in most cases.
In future work, we will study the specific influence of the initial parameter input on the AM algorithm, design a general and data-driven threshold rule, and pursue in-depth theoretical guarantees for AM.