FSFC is a library of feature selection algorithms for clustering.
It is based on the article "Feature Selection for Clustering: A Review" by S. Alelyani, J. Tang and H. Liu.
The algorithms are covered by tests that check their correctness and compute clustering metrics. The tests use open datasets:
- Generic data - high-dimensional point datasets
- Text data - the SMS Spam Collection
Project documentation is available on Read the Docs.
Implemented algorithms:
- Generic data:
  - SPEC family - NormalizedCut, ArbitraryClustering, FixedClustering
  - Sparse clustering - Lasso
  - Localised feature selection - LFSBSS algorithm
  - Multi-Cluster Feature Selection
  - Weighted K-means
- Text data:
  - Text clustering - Chi-R algorithm, Frequent Term-based Clustering (FTC)
  - Frequent itemset extraction - Apriori
Dependencies:
- numpy
- scikit-learn
- scipy
The project is currently in an early alpha stage, so it isn't published on PyPI yet. Because of this, installation is a bit more involved. To use FSFC you should:
- Clone the repository to your computer.
- Run `make init` to install dependencies.
- Copy the contents of the folder `fsfc` to the source root of your project (a quick import check is sketched below).
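Since the package isn't installed through pip, a simple way to confirm the copy worked is to import it from your project. This is a minimal sanity check, assuming the copied `fsfc` folder ends up on your Python path next to your own sources:

```python
# Minimal sanity check: if the fsfc folder was copied into the source root
# correctly, this import succeeds (NormalizedCut is also used in the usage
# example below).
from fsfc.generic import NormalizedCut

print(NormalizedCut)
```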
After that, you can use the feature selectors as follows:
```python
import numpy as np
from fsfc.generic import NormalizedCut
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

data = np.array([...])

pipeline = Pipeline([
    ('select', NormalizedCut(3)),
    ('cluster', KMeans())
])

pipeline.fit_predict(data)
```
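Because the selector is used as an intermediate step of a scikit-learn Pipeline, it is expected to follow the usual transformer interface, so you can presumably also apply it on its own to get the reduced feature matrix. A minimal sketch under that assumption:

```python
import numpy as np
from fsfc.generic import NormalizedCut

data = np.array([...])  # your dataset, as in the pipeline example above

# Select the 3 best-scoring features and return the reduced matrix.
# fit_transform is assumed from the scikit-learn transformer interface,
# which Pipeline relies on for its intermediate steps.
selector = NormalizedCut(3)
reduced = selector.fit_transform(data)
```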
You can support development by testing the library and reporting bugs, or by opening pull requests.
The project has tests; they can be run with the command `make test`.
There is also Sphinx documentation for the code; it can be built with the command `make html`.
The documentation uses numpydoc, so it should be installed on the system: run `pip install numpydoc`.
References:
- Alelyani, Salem, Jiliang Tang, and Huan Liu. "Feature Selection for Clustering: A Review." Data Clustering: Algorithms and Applications 29 (2013): 110-121.
- Zhao, Zheng, and Huan Liu. "Spectral feature selection for supervised and unsupervised learning." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
- Witten, Daniela M., and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association 105.490 (2010): 713-726.
- Li, Yuanhong, Ming Dong, and Jing Hua. "Localized feature selection for clustering." Pattern Recognition Letters 29.1 (2008): 10-18.
- Cai, Deng, Chiyuan Zhang, and Xiaofei He. "Unsupervised feature selection for multi-cluster data." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
- Huang, Joshua Zhexue, et al. "Automated variable weighting in k-means type clustering." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.5 (2005): 657-668.
- Li, Yanjun, Congnan Luo, and Soon M. Chung. "Text clustering with feature selection by using statistical data." IEEE Transactions on Knowledge and Data Engineering 20.5 (2008): 641-652.
- Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). Vol. 1215. 1994.
- Beil, Florian, Martin Ester, and Xiaowei Xu. "Frequent term-based text clustering." Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002.