/dami

Data mining algorithms. See the feluca project instead.

Primary language: Java

dami

Scalable algorithms for data mining. (I am moving this project to feluca and will refactor it there, so this project is deprecated.)

dami is written in Java. Our goal is to build algorithms that can handle hundreds of millions of records on a PC with limited memory.

Currently we have:

  • utility: buffered vector pool for dataset IO; a simple, high-performance text parser. (More tests are needed.)

  • classification: SGD for logistic regression

  • recommendation: SlopeOne, SVD, RSVD, itemneighborhood-SVD (see movielens_converter.py)

  • significance test: swap randomization

  • graph: PageRank
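
To give a feel for the classification entry above, here is a minimal sketch of SGD for logistic regression over sparse samples. It is not dami's actual code: the class name `SgdLogisticSketch`, the method names, and the plain-array data layout are assumptions made for illustration, and regularization is left out.

```java
public class SgdLogisticSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Predicted probability of label 1 for one sparse sample. */
    static double predict(double[] w, int[] ids, double[] vals) {
        double z = 0.0;
        for (int k = 0; k < ids.length; k++) {
            z += w[ids[k]] * vals[k]; // feature id is used directly as the weight index
        }
        return sigmoid(z);
    }

    /** One SGD pass: update the weights after every single sample. */
    static void epoch(double[] w, int[][] ids, double[][] vals, int[] labels, double learningRate) {
        for (int n = 0; n < labels.length; n++) {
            double error = labels[n] - predict(w, ids[n], vals[n]);
            for (int k = 0; k < ids[n].length; k++) {
                w[ids[n][k]] += learningRate * error * vals[n][k];
            }
        }
    }

    public static void main(String[] args) {
        // Feature 0 is a constant bias term, feature 1 is the input x.
        int[][] ids = {{0, 1}, {0, 1}, {0, 1}, {0, 1}};
        double[][] vals = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};
        int[] labels = {0, 0, 1, 1};
        double[] w = new double[2];
        for (int i = 0; i < 100; i++) {
            epoch(w, ids, vals, labels, 0.1);
        }
        System.out.println(predict(w, new int[] {0, 1}, new double[] {1, 2}));
    }
}
```

Only the weight array is touched per sample, so the training data could just as well be read from disk one record at a time.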

Future:

  • similarity: simhash

2012/10/22 Release Notes:

  • L1 & L2 logistic regression
  • memory cost estimation
  • simple commandline integration for LR

2012/7/22 Release Notes:

  • Asynchronous vector buffer for dataset IO
  • Simple, high-performance text parser (handles digit-related characters only)
  • small refactoring.

2012/7/12 Release Notes:

  • code refactoring for recommendation and IO
  • To compute RMSE for recommendation, first see movielens_convert.py for converting and/or splitting the MovieLens data, then see CFDataConverter and TestSVD

To achieve computational efficiency and low memory usage, we have adopted two techniques:

1: Using each "id" directly as an array index when fetching data.

2: Keeping only the model in memory and saving the data as converted bytes for IO.

So it's highly recommended that you use continuous ids for the algorithms :)
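
The two techniques can be sketched roughly as follows. This is an illustration only, not dami's real classes: `DenseIdSketch`, `encode`, `itemMeans`, and the (userId, itemId, rating) record layout are all assumed names and formats. The data is converted to fixed-width binary records once, then streamed record by record while only id-indexed arrays stay in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class DenseIdSketch {

    /** Converts (userId, itemId) pairs and ratings into fixed-width binary records. */
    static byte[] encode(int[][] pairs, float[] ratings) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < ratings.length; i++) {
                out.writeInt(pairs[i][0]);
                out.writeInt(pairs[i][1]);
                out.writeFloat(ratings[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Streams the records once; only the per-item arrays (the "model") are kept in memory. */
    static float[] itemMeans(byte[] data, int numItems) {
        float[] sum = new float[numItems];
        int[] count = new int[numItems];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                try {
                    in.readInt(); // userId, not needed for item means
                } catch (EOFException endOfData) {
                    break;
                }
                int itemId = in.readInt();
                float rating = in.readFloat();
                sum[itemId] += rating; // the id is the array index: no hash lookup
                count[itemId]++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int i = 0; i < numItems; i++) {
            if (count[i] > 0) {
                sum[i] /= count[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] pairs = {{0, 0}, {1, 0}, {1, 2}};
        float[] ratings = {4f, 2f, 5f};
        // Item id 1 never occurs, so its slot is allocated but unused.
        System.out.println(Arrays.toString(itemMeans(encode(pairs, ratings), 3)));
    }
}
```

In this toy data, item id 1 is skipped, so one array slot is wasted. With large, sparse id ranges that waste grows quickly, which is why continuous ids matter for this approach. A real dataset would be read from a file stream rather than a byte array.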

My Chinese blog : http://blog.csdn.net/lgnlgn
E-mail : gnliang10 [at] 126.com