This implementation is based on `sklearn.linear_model.SGDClassifier`.
We modify the underlying stochastic gradient descent algorithm to support (a sketch of the resulting objective follows this list):
- Per-feature L1 and L2 regularization
- A target vector toward which the algorithm regularizes the weight vector
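As a rough sketch of what these two modifications mean together (our reading; the exact objective is defined in the papers cited at the end of this README), the algorithm approximately minimizes:

```latex
\min_{w} \; \sum_{i=1}^{n} L(y_i, x_i \cdot w)
  \; + \; \sum_{j=1}^{d} \frac{\alpha_j}{2} (w_j - m_j)^2
  \; + \; \sum_{j=1}^{d} \beta_j \, \lvert w_j - m_j \rvert
```

where `\alpha_j` and `\beta_j` are the per-feature L2 and L1 penalties and `m` is the prior (modal) vector. Whether the L1 term is also centered at `m` is our assumption here; see the cited papers for the precise formulation.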
Part of the package is written in Cython, a mixture of Python and C, which requires compilation before use. The package still loosely depends on the original scikit-learn package.
We recommend using Anaconda to manage packages and the external environment (including gcc and CBLAS). Please use the Python 3.6 version of Anaconda: https://www.anaconda.com/download/#macos

The package requires:
- Python 3.6
- cython >= 0.27.3
- numpy >= 1.14.0
- scikit-learn = 0.19.1
- scipy >= 1.1.0
Please make sure numpy is already installed, then simply run `pip install ./`.
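As a quick sanity check that the Cython extensions compiled and installed correctly (module paths assume the scikit-learn 0.19.1 layout this package is based on):

```python
# Verify that the modified classifier and the compiled extension import cleanly.
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import sgd_fast  # compiled from sgd_fast.pyx

print(SGDClassifier.__module__)
```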
The scripts are modified from the original scikit-learn library. The detailed modifications can be found by running `git diff` against commit `5fcf6f486fb0134c91f5a8741fe7e450539883f0` of the original scikit-learn repository.
- (From `sklearn/linear_model/stochastic_gradient.py`) `stochastic_gradient.py` provides the Python interface. The modifications are the additional arguments to `SGDClassifier.fit()`; a hedged usage sketch follows this list. The added arguments are listed below, and a detailed explanation of each argument is in its documentation in the code:
  - `per_feature_alpha` (per-feature L2 penalty)
  - `per_feature_beta` (per-feature L1 penalty)
  - `modal_vector` (prior vector)
- (From `sklearn/linear_model/sgd_fast.pyx`) The main algorithm is in `sgd_fast.pyx`, which Cython further compiles to a C binary. The `plain_sgd` function provides the interface between Python and C for the entire algorithm.
- (From `sklearn/util/weight_vector.pyx`) The application of the regularization is hidden in the method `apply_penalty` in `weight_vector.pyx`. This method applies both the L1 and L2 penalties without any efficiency tricks, such as just-in-time or delayed updates; a sketch of this update also appears after this list.
- (From `sklearn/util/seq_dataset.pyx`) `seq_dataset.pyx` is added to the directory for compilation purposes. This script is not modified.
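For concreteness, here is a hedged usage sketch of the extended `fit()` call. The argument names come from the code described above; the array shapes (one value per feature) are our assumption, so consult the docstrings in `stochastic_gradient.py` for the authoritative details:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier  # this package's modified class

X = np.random.rand(100, 5)           # 100 examples, 5 features
y = np.random.randint(2, size=100)   # binary labels

clf = SGDClassifier(loss="log", penalty="l2")
clf.fit(
    X, y,
    per_feature_alpha=0.0001 * np.ones(5),  # per-feature L2 penalties
    per_feature_beta=np.zeros(5),           # per-feature L1 penalties
    modal_vector=np.zeros(5),               # prior the weights are pulled toward
)
```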
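And here is a minimal pure-Python sketch of the penalty step that `apply_penalty` performs, as we read it. The real implementation is the Cython method in `weight_vector.pyx`; the function below is a hypothetical paraphrase, including the assumption that the L1 penalty is also centered at the prior:

```python
import numpy as np

def apply_penalty_sketch(w, eta, alpha, beta, m):
    """Hypothetical paraphrase of one per-feature penalty step.

    w: weight vector, eta: current learning rate,
    alpha/beta: per-feature L2/L1 penalties, m: modal (prior) vector.
    """
    # L2 penalty: shrink each weight toward its prior value m_j.
    w = w - eta * alpha * (w - m)
    # L1 penalty: soft-threshold each weight toward m_j
    # (assumed centered at the prior; see the cited papers).
    diff = w - m
    w = m + np.sign(diff) * np.maximum(np.abs(diff) - eta * beta, 0.0)
    return w
```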
Please kindly cite our papers if you are using this package. If you are using non-zero priors (`modal_vector`), please cite:
```bibtex
@inproceedings{yang_regularization_2019,
  address = {Montréal, Canada},
  title = {A {Regularization} {Approach} to {Combining} {Keywords} and {Training} {Data} in {Technology}-{Assisted} {Review}},
  booktitle = {Proceedings of the 17th edition of the {International} {Conference} on {Artificial} {Intelligence} and {Law} ({ICAIL})},
  author = {Yang, Eugene and Lewis, David D. and Frieder, Ophir},
  month = jun,
  year = {2019}
}
```
If you are also using the per-feature regularization penalties (`per_feature_alpha` or `per_feature_beta`), please cite:
```bibtex
@inproceedings{yang_text_2019,
  address = {Paris, France},
  title = {Text {Retrieval} {Priors} for {Bayesian} {Logistic} {Regression}},
  booktitle = {Proceedings of the 41st {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}},
  publisher = {ACM},
  author = {Yang, Eugene and Lewis, David D. and Frieder, Ophir},
  month = jul,
  year = {2019}
}
```