pca_module: A Python repository from kshmirko

PCA Module for Python
=====================


Introduction
------------
Implementation of a PCA module in python using numpy, scipy and
python extensions (here, in C). The module carries out Principal
Component Analysis (PCA) using either Singular Value Decomposition
(SVD) or the NIPALS algorithm.

I chose to implement the NIPALS algorithm in C, because it is
supposed to be faster on larger data sets. The user can choose
the number of PCs that are to be calculated. And the scipy package
already comes with a SVD method.


Documentation
-------------
PCA can be used to reduce multidimensional data to fewer dimensions,
while preserving the most important information during the process.
After that it can be used for exploratory data analysis or to make
predictive models.

See pca_nipals.pdf in the doc folder for more information about PCA,
Nipals and Correlation Loadings.

Go to the functions list for details about each function in the PCA Module:
pca_module_functions.html in the doc folder.


Installation
------------
Standard distutils build and install:
$ python setup.py install

There are two variables that can be adjusted in setup.py. The first "add_ext"
sets extension to be compiled and included (include: add_ext = True). If you set it to
False, C python extension will not be included and you cannot access c_nipals.
PCA can be calculated without the C python extension.
The other variable "old_numeric" sets which version to use. Either
old numeric version or the numpy version (use numpy version:
old_numeric = False).

The old numeric version is more limited when it comes to functions. If possible
the PCA module for scipy and numpy should be used.

To test that installation did not fail, try:
$ python
...
>>> import pca_module
>>>

No errors or exceptions should appear.


Usage
-----
The most difficult parts about PCA is using it well. This involves: formulating the problem well,
choosing the best possible variables, and finally, after calculation, explore the data and analyse the
plots you can make after PCA calculation. You should know a little about PCA
before you start using this module. I will not go far into these areas of PCA here,
but show how you can get the calculated data.

Assume you have a 2-dimensional data matrix ( X ) that holds your data
you want to analyze. This matrix is of size n x p with n = number of objects and
p = number of variables. The first matrix, T, are the so called PCA Scores,
the second matrix, P, are the PCA Loadings. Each row holds the values of an
object and each column holds the value for a variable.  The collection is
p-dimensional in variable space. You can read more about this and what
the PCA calculation returns under: <a href="#doc">3. Documentation</a>.

Examples usage (with explained variance for each PC):
>>> from pca_module import *
>>> T, P, explained_var = PCA_svd(X, standardize=True)
>>>

Examples usage 2 (with E-matrix for each PC):
>>> from pca_module import *
>>> T, P, E = PCA_nipals2(X, standardize=True, E_matrices=True)
>>>


Now both mean centering (always done) and standardization (standardize=True) of X
has been done before PCA calculation. Two matrices T and P are returned, and also
an array of explained variance. The first matrix, T, are the so called Scores,
the second matrix, P, are the Loadings, and the third and last element is an array
of explained variance for each PC.

Here I have made some plots of the results, and short explanations about the content:
plot_example.html in the doc folder.


Testing
-------
After installation is complete, you can unit-test the module.
With a python testing script called testing.py. There is also a
method for time measurements in the testing script.

Running the testing script:
$ python testing.py

Or for Numeric install:
$ python testing_numeric.py

errors   -  problems with module (e.g. import error)
failures -  functions return wrong results





Updates
-------

PCA Module 1.1.01 - february 2008
=================================
- Changed to using numpy.linalg instead of scipy.linalg.

- Added test in nipals. Before t was just set to first column of E. Now t will be set to a non-zero vector if possible.

- And some other minor fixes.


PCA Module 1.1 - oktober 2007
=============================
- You can now get all E-matrices after a PCA. See 'Usage' on how to do it. 
   Updated for all nipals methods including python c extension.

- PCA_svd still only returns the explained variances. 


PCA Module 1.0 - may 2007
=========================
- PCA using different packages and methods.



Made by
-------
Henning Risvik, risvik@gmail.com, may 2007, University of Oslo
kshmirko/pca_module