Students.Filters is a package that provides unsupervised learning filters for the WEKA machine learning toolkit version >3.7. Development will prioritize filters that are useful to students taking machine learning at Georgia Tech; initially only an Independent Component Analysis filter using the FastICA algorithm has been implemented.
The preferred installation method is to use the WEKA package manager. The git repository contains additional files for an Eclipse project with Maven dependencies for the EJML package, and Ant build files for the jar.
See instructions on the WEKA homepage. If the package is not available from the official package page, it can be installed directly from:
https://github.com/cgearhart/students-filters/raw/master/StudentFilters.zip
The source code and package file can be obtained from git:
git clone https://github.com/cgearhart/students-filters.git
The filter can be used like other WEKA filters from the command line, from the WEKA GUI, or directly within your own Java code. The specific options for each filter can be found in the source code, in the documentation, or from the command line with the -h flag.
Note that the filter will automatically apply ReplaceMissingValues, NominalToBinary, and Remove filters to the input data; attributes that have only one distinct value or all missing values are automatically removed. The class attribute (if set) is passed through the ICA filter untouched.
ICA is not typically used directly for attribute selection. One common technique for attribute selection is to first run PCA to determine the number of attributes, then run ICA with that number of attributes specified.
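The "run PCA first" step usually comes down to counting how many principal components are needed to cover a chosen share of the total variance. The following is a minimal sketch of that counting step only, assuming you already have the PCA eigenvalues as a plain array. The class ComponentCount, the method componentsForVariance, and the 0.85 threshold are hypothetical names and values chosen for illustration; they are not part of this package or of WEKA.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Students.Filters or WEKA. */
final class ComponentCount {

    /**
     * Returns the smallest number of leading components whose eigenvalues
     * account for at least the given fraction of the total variance.
     */
    static int componentsForVariance(double[] eigenvalues, double threshold) {
        if (eigenvalues.length == 0) {
            throw new IllegalArgumentException("no eigenvalues supplied");
        }
        if (threshold <= 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in (0, 1]");
        }
        // Sort a copy ascending so the caller's array is left untouched.
        double[] sorted = eigenvalues.clone();
        Arrays.sort(sorted);

        double total = 0.0;
        for (double value : sorted) {
            total += value;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("total variance must be positive");
        }

        // Walk from the largest eigenvalue down, accumulating variance.
        double cumulative = 0.0;
        int count = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            cumulative += sorted[i];
            count++;
            if (cumulative / total >= threshold) {
                break;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up eigenvalues, used only to exercise the method.
        double[] eigenvalues = {4.0, 3.0, 2.0, 0.5, 0.5};
        int n = componentsForVariance(eigenvalues, 0.85);
        System.out.println("components to keep: " + n);
    }
}
```

The resulting count is the value you would then pass to setOutputNumAtts on the ICA filter.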
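import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.IndependentComponents;

// Load the data; "data.arff" is a placeholder path, substitute your own source.
Instances instances = DataSource.read("data.arff");
// Declare the concrete type so the filter-specific setter is accessible.
IndependentComponents filter = new IndependentComponents();
filter.setOutputNumAtts(10); // optionally set the number of attributes (for dimensionality reduction)
filter.setInputFormat(instances); // WEKA convention: call after all options are set
// Feed every instance to the filter, then signal the end of the batch.
for (int i = 0; i < instances.numInstances(); i++) {
    filter.input(instances.instance(i));
}
filter.batchFinished();
// Collect the transformed instances.
Instances newData = filter.getOutputFormat();
Instance processed;
while ((processed = filter.output()) != null) {
    newData.add(processed);
}
// ..do something with newData..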
See also: weka.filters.Filter
Read the instructions first. Make sure that the weka.jar and StudentFilters.jar files are on the classpath, in that order. Options for each filter can be listed with the -h argument. The filter can then be invoked directly (or chained like other WEKA filters), e.g.:
java -cp <weka_path>/weka.jar:<weka_packages>/studentfilters.jar weka.filters.unsupervised.attribute.IndependentComponents -i <infile.arff> -o <outfile.arff> -W -A -1 -N 200 -T 1E-4
The FastICA algorithm is implemented independently of WEKA, so it can be used without adding WEKA to your project by including the StudentFilters.jar file and importing filters.FastICA. However, using the WEKA-compatible IndependentComponents filter requires weka.jar on the classpath; the filter can be imported as weka.filters.unsupervised.attribute.IndependentComponents. See the WEKA documentation for more details.
The pom.xml file can be used with Apache Maven to rebuild filters-2.0.0-SNAPSHOT.jar by running:
mvn clean install -Dmaven.test.skip=true
NOTE: dependencies will be handled automatically by Maven.
The WEKA GUI can then be launched with:
java -Xmx1g -classpath <maven_path>/.m2/repository/com/googlecode/efficient-java-matrix-library/ejml/0.25/ejml-0.25.jar:<maven_path>/.m2/repository/nz/ac/waikato/cms/weka/weka-dev/3.7.10/weka-dev-3.7.10.jar:<maven_path>/.m2/repository/net/sf/squirrel-sql/thirdparty-non-maven/java-cup/0.11a/java-cup-0.11a.jar:<maven_path>/.m2/repository/org/pentaho/pentaho-commons/pentaho-package-manager/1.0.8/pentaho-package-manager-1.0.8.jar:<maven_path>/.m2/repository/junit/junit/4.11/junit-4.11.jar:<maven_path>/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar weka.gui.Main
NOTE: the EJML library needs to be installed on your system in the expected location; follow the instructions to install it with Maven.
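The launch command above is long and uses the Unix classpath separator (:); on Windows the separator is a semicolon. As a convenience, the classpath can be assembled programmatically. The following is a minimal sketch, assuming the jars sit in the local Maven repository at the same relative paths shown in the command. The class WekaClasspath and its build method are hypothetical and are not shipped with this package.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Students.Filters. */
final class WekaClasspath {

    // Jar paths relative to the Maven repository root, copied from the launch command.
    static final String[] JARS = {
        "com/googlecode/efficient-java-matrix-library/ejml/0.25/ejml-0.25.jar",
        "nz/ac/waikato/cms/weka/weka-dev/3.7.10/weka-dev-3.7.10.jar",
        "net/sf/squirrel-sql/thirdparty-non-maven/java-cup/0.11a/java-cup-0.11a.jar",
        "org/pentaho/pentaho-commons/pentaho-package-manager/1.0.8/pentaho-package-manager-1.0.8.jar",
        "junit/junit/4.11/junit-4.11.jar",
        "org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar"
    };

    /** Joins the jar paths under repoRoot using the given classpath separator. */
    static String build(String repoRoot, String separator) {
        List<String> entries = new ArrayList<>();
        for (String jar : JARS) {
            entries.add(repoRoot + "/" + jar);
        }
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // Default to ~/.m2/repository unless a repository root is passed as the first argument.
        String repo = args.length > 0
                ? args[0]
                : System.getProperty("user.home") + "/.m2/repository";
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        String classpath = build(repo, File.pathSeparator);
        System.out.println("java -Xmx1g -classpath " + classpath + " weka.gui.Main");
    }
}
```

Running it prints the full launch command for the current platform, which can then be pasted into a shell.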
Once the filter has been installed with the package manager, or simply unzipped into the package folder on the WEKA path, it will automatically appear in the WEKA GUI. (The GUI must usually be restarted after new packages are added.) See the WEKA documentation for more details.
The filters are dependent on WEKA (licensed under GPL) and the Efficient Java Matrix Library (EJML) (licensed under Apache License 2.0). The FastICA algorithm is released under the GPL. The implementation in this package is based on the scikit-learn implementation which is released under BSD. To the extent that there may be any original copyright, it is licensed under the Unlicense - i.e., it is released to the Public Domain.