/idmlib

Data mining libraries for all iZENECloud projects


Data mining libraries

A general data mining C++ library

Features

  • Keyphrase Extraction. We've implemented two keyphrase extraction approaches. One follows the translation model from the thesis work of Zhiyuan Liu; the other is our own approach, which uses Wiki data as the semantic knowledge base.

  • Taxonomy Generation.

  • Duplicate Detection. The algorithm is described in the paper Detecting Near-Duplicates for Web Crawling; read it first to understand the approach. We use Charikar's simhash fingerprint generation and set the number of fingerprint dimensions (f) to 64.

  • CTR Prediction. We've implemented both AdPredictor and FTRL.

  • Chinese Query Correction.

  • Collaborative Filtering. An item-based, incremental collaborative filtering implementation.

  • Others.
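
To illustrate the Duplicate Detection feature, here is a minimal sketch of Charikar-style simhash with f = 64. It is not idmlib's actual code or API. The function names (fnv1a64, simhash64, hammingDistance, isNearDuplicate), the whitespace tokenization, the unweighted votes, and the FNV-1a token hash are all assumptions made for the example. The default threshold k = 3 is the value discussed in the paper for 64-bit fingerprints and may differ from what the library uses.

```cpp
#include <bitset>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: 64-bit FNV-1a hash of a token.
// Any well-mixed 64-bit hash would serve; this one is an assumption.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Simhash with f = 64: every token votes +1 or -1 on each bit position,
// and the fingerprint keeps the bits whose vote total is positive.
// Tokens are whitespace-separated and all carry weight 1 (an assumption).
std::uint64_t simhash64(const std::string& text) {
    int votes[64] = {0};
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::uint64_t h = fnv1a64(token);
        for (int i = 0; i < 64; ++i) {
            votes[i] += ((h >> i) & 1) ? 1 : -1;
        }
    }
    std::uint64_t fingerprint = 0;
    for (int i = 0; i < 64; ++i) {
        if (votes[i] > 0) {
            fingerprint |= (std::uint64_t{1} << i);
        }
    }
    return fingerprint;
}

// Number of bit positions in which two fingerprints differ.
int hammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

// Two documents count as near-duplicates when their fingerprints
// differ in at most k bits.
bool isNearDuplicate(std::uint64_t a, std::uint64_t b, int k = 3) {
    return hammingDistance(a, b) <= k;
}
```

A caller would compute simhash64 once per document and compare fingerprints with isNearDuplicate. Because the votes are summed per bit, the fingerprint does not depend on token order, and similar documents tend to produce fingerprints with a small Hamming distance.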

Dependencies

We recently switched to C++11 for SF1R, so GCC 4.8 is required to build it. We do not recommend building the project on Ubuntu because of the nested references among the many libraries; CentOS / Redhat / Gentoo / CoreOS are the preferred platforms. You also need CMake and Boost 1.56 to build the repository. The dependent repositories are:

  • cmake: The cmake modules required to build all iZENECloud C++ projects.

  • izenelib: The general purpose C++ libraries.

  • icma: The Chinese morphological analyzer library.

  • ijma: The Japanese morphological analyzer library.

  • ilplib: The language processing libraries.

License

The project is published under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0