SSPCA-EXP is an algorithm for the sparse PCA problem (sparsity in the ℓ0-norm), designed to handle high-dimensional, ill-posed datasets.
Standard PCA (eigendecomposition) tends to fit noise in such high-dimensional settings, a phenomenon known as the curse of dimensionality. To overcome this problem, sparsity constraints are added, which unfortunately render the problem computationally hard. SSPCA-EXP is an approximation algorithm whose run time can be calibrated to the compute resources at hand. The parameter k* (see description below) governs the run time.
The full paper can be found at http://arxiv.org/abs/1910.06846
The taxing computational effort involved in solving some high-dimensional statistical problems, in particular problems involving non-convex optimization, has popularized the development and analysis of algorithms that run efficiently (in polynomial time) but with no general guarantee of statistical consistency. In light of ever-increasing compute power and decreasing costs, perhaps a more useful characterization of algorithms is by their ability to calibrate the invested computational effort to the statistical features of the input at hand: for example, an algorithm that always guarantees consistency by increasing its run time as the SNR weakens. We exemplify this principle in the ℓ0-sparse PCA problem and propose a new greedy algorithm to solve sparse PCA that supports such a calibration.
Parameters:
- fpath: The file path and name, either an absolute path or a path relative to the run file. The expected file is a numeric matrix of dimensions n×p, where n is the number of rows and p is the number of columns. Default value: "data.csv"
- k: Enter the desired sparsity level, the number of non-zero entries in the resulting sparse PC. Default value: "20".
- k_star: The algorithm exhaustively searches all subsets of size k* (each such subset is termed a "seed"), completing each one greedily to a candidate set of size k. It then returns the candidate set that maximizes the explained variation (see the sketch after this list). The run time of the algorithm is thus exponential in k*. Default value: "1".
- batch: Since the algorithm is built to run in parallel, each CPU handles |batch| seeds at a time. By tuning this parameter, task-management overhead can be decreased, improving the run time of the algorithm. Note: the algorithm does not tune this automatically; the optimization is up to the user. Default value: "0".
- cpus: The total number of CPUs to be used by the algorithm. Default value: "1".
- newrun: The state of the algorithm is constantly backed up in case of connection loss, crashes, etc. When starting a new execution with the same parameters as the last saved state, that state is loaded automatically and the execution continues from the checkpoint. If you wish to start a new run and ignore old checkpoints, set this parameter to 1. Default value: "0".
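To make the seed-and-complete scheme behind k_star concrete, here is a minimal, illustrative sketch in plain NumPy. It is not the packaged implementation (which is parallelized, batched, and checkpointed); the helper names and the use of the leading eigenvalue of the restricted covariance as the "explained variation" are assumptions made for illustration.

```python
import itertools
import numpy as np

def explained_trace(cov, support):
    """Largest eigenvalue of the covariance restricted to `support`."""
    sub = cov[np.ix_(support, support)]
    return np.linalg.eigvalsh(sub)[-1]

def sspca_exp_sketch(X, k, k_star):
    """Seed-and-complete search: exponential in k_star, greedy up to k."""
    cov = np.cov(X, rowvar=False)
    p = cov.shape[0]
    best_support, best_val = None, -np.inf
    # Exhaustively enumerate all C(p, k_star) seeds.
    for seed in itertools.combinations(range(p), k_star):
        support = list(seed)
        # Greedily complete the seed to a candidate set of size k.
        while len(support) < k:
            rest = (j for j in range(p) if j not in support)
            best_j = max(rest, key=lambda j: explained_trace(cov, support + [j]))
            support.append(best_j)
        val = explained_trace(cov, support)
        if val > best_val:
            best_support, best_val = support, val
    return sorted(best_support), best_val
```

Because each seed is processed independently, distributing batches of seeds across CPUs (the batch and cpus parameters above) is natural.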
To run SSPCA-EXP, you first need to install a few basic packages (which you probably already have). These packages are included by default in the Anaconda environment.
Assuming "sspca_exp.py" is in the same folder as your data, there are two ways to run the algorithm: the first is via the command line, the second is via your favorite Python IDE (TBD)...
Find the best 20 features of "mydata", using seeds of size 1, with multiprocessing on 2 CPUs and a batch size of 210:
python3 sspca_exp.py --k 20 --k_star 1 --path "./mydata.csv" --cpus 2 --batch 210
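Similarly, to ignore any saved checkpoint and start a fresh run, the newrun parameter should presumably be passed like the other flags (the exact flag name is an assumption based on the pattern above):

```
python3 sspca_exp.py --k 20 --k_star 1 --path "./mydata.csv" --cpus 2 --batch 210 --newrun 1
```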
The output is written to "out/[today's date]/sspcaExp_kstar[kstar]_k[k].csv". The CSV file has four columns: algorithm name, explained trace, runtime, and k_entries. The k_entries column is the actual output, i.e., the features selected by SSPCA-EXP.
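As a minimal sketch of consuming the result (the date-stamp format in the path is an assumption; substitute the directory your run actually creates):

```python
import pandas as pd

# Hypothetical output path: replace the date directory with the one created by your run.
result = pd.read_csv("out/2021-06-01/sspcaExp_kstar1_k20.csv")
print(result["k_entries"])  # indices of the features selected by SSPCA-EXP
```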
- Dan Vilenchik
- Adam Soffer
- Guy Holtzman
This project is licensed under the MIT License.
Contact Adam Soffer at soffer@post.bgu.ac.il.