dami

Scalable algorithms for data mining. (I am moving this project to feluca, where it will be refactored, so this project is deprecated.)

dami is written in Java. Our goal is to build algorithms that can handle hundreds of millions of records on a PC with limited memory.

Currently we have:

utility: buffered vector pool for dataset IO; simple, high-performance text parser (more tests needed)
classification: SGD for logistic regression
recommendation: SlopeOne, SVD, RSVD, itemneighborhood-SVD (see movielens_converter.py)
significance test: swap randomization
graph: PageRank

Future:

similarity: simhash

2012/10/22 Release Notes:

L1 & L2 logistic regression

memory cost estimation

simple commandline integration for LR

2012/7/22 Release Notes:

Asynchronous vector buffer for dataset IO

Simple, high-performance text parser (only for digit-related characters)

small refactoring.

2012/7/12 Release Notes:

code refactoring for recommendation and IO

To compute RMSE for the recommendation algorithms, first see movielens_convert.py for converting and/or splitting the MovieLens data, then see CFDataConverter and TestSVD.

To achieve computational efficiency and good memory utilization, we have adopted two techniques:

1: Use the "id" as an array index for fetching data.

2: Keep only the model in memory, and save the data as converted bytes for IO.

So it is highly recommended that you use contiguous ids for the algorithms :)
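
The two techniques above can be illustrated with a minimal sketch. This is not dami's actual code: the class IdIndexedSketch, its methods, and the 12-byte record layout (user id, item id, rating) are all hypothetical, chosen only for illustration. An in-memory byte array stands in for the converted data file. The point is that a single pass over the bytes needs only arrays sized by the number of ids, and that every id maps directly to an array slot with no hash lookup.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; names and record layout are assumptions,
// not taken from the dami code base.
public class IdIndexedSketch {

    // Assumed fixed-width record: int userId, int itemId, float rating.
    static final int RECORD_BYTES = 4 + 4 + 4;

    /** Converts rating triples to a compact byte array (stand-in for a file on disk). */
    public static byte[] encode(int[] users, int[] items, float[] ratings) {
        ByteBuffer buf = ByteBuffer.allocate(users.length * RECORD_BYTES);
        for (int i = 0; i < users.length; i++) {
            buf.putInt(users[i]);
            buf.putInt(items[i]);
            buf.putFloat(ratings[i]);
        }
        return buf.array();
    }

    /**
     * Scans the records once and returns the mean rating per item.
     * Only the per-item arrays (the "model") are held in memory, and the
     * item id is used directly as the array index. Items with no ratings
     * keep a mean of 0.
     */
    public static double[] itemMeans(byte[] data, int numItems) {
        double[] sum = new double[numItems];
        int[] count = new int[numItems];
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.remaining() >= RECORD_BYTES) {
            buf.getInt();                 // user id, not needed here
            int item = buf.getInt();      // id doubles as array index
            float rating = buf.getFloat();
            sum[item] += rating;
            count[item]++;
        }
        for (int i = 0; i < numItems; i++) {
            if (count[i] > 0) {
                sum[i] /= count[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = encode(new int[]{0, 1, 2}, new int[]{0, 0, 1}, new float[]{4f, 2f, 5f});
        double[] means = itemMeans(data, 3);
        for (int i = 0; i < means.length; i++) {
            System.out.println("item " + i + " mean = " + means[i]);
        }
    }
}
```

With this layout the arrays must be as long as the largest id plus one, so sparse ids waste memory. That is why contiguous ids are recommended.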

My Chinese blog : http://blog.csdn.net/lgnlgn
E-mail : gnliang10 [at] 126.com
