Preface for this revision
=========================

LibShortText is an open source library for short-text classification
(http://www.csie.ntu.edu.tw/~cjlin/libshorttext). Please read the COPYRIGHT
file before using LibShortText.

LibShortText is built on top of LIBLINEAR
(see http://www.csie.ntu.edu.tw/~cjlin/liblinear/), which supports both
Windows and Linux. LibShortText itself, however, does not support Windows,
so this project adds that capability:

- support for building and running on the Windows platform

Building Windows Binaries
=========================

Windows binaries are available in the directory `windows'. To re-build them
via Visual C++, use the following steps:

1. Open the "x64 Native Tools Command Prompt for VS 2017". Alternatively,
   open a DOS command window and set the VC++ environment variables by
   typing

   "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

   You may have to modify the above command according to which version of
   VC++/VS you have and where it is installed.

2. Change to the project directory and type

   nmake -f Makefile.win clean all

3. (Optional) To build 32-bit Windows binaries, you must
   (1) run "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\vcvars32.bat"
       instead of vcvars64.bat, and
   (2) change CFLAGS in every Makefile.win from /D _WIN64 to /D _WIN32,
       then run

   nmake -f Makefile.win clean all

4. Go to ../demo, copy the commands in demo.sh, and paste them into the
   command line to run:

   python ../text-train.py -f -A train_feats1 -A train_feats2 train_file
   python ../text-predict.py -f -A test_feats1 -A test_feats2 test_file train_file.model predict_result
   python demo.py

Enjoy!

Author: Justin (GitHub: https://github.com/cosmichut) @2017/10/11

================= Following is the original README =================

LibShortText is an open source library for short-text classification
(http://www.csie.ntu.edu.tw/~cjlin/libshorttext). Please read the COPYRIGHT
file before using LibShortText.

To get started, please read the ``Quick Start'' section first. For
developers, please check our document at
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/doc/ for integrating
LibShortText in your software.

Table of Contents
=================

- Installation and Data Format
- Quick Start
- Command-line Usage
- More Examples about Command-line Usage
- Interactive Error Analysis
- Additional Information

Installation and Data Format
============================

LibShortText requires UNIX systems with Python 2.6 or newer versions. The
latest version (Python 2.7) is recommended for better efficiency. On Unix
systems, type

$ make

to install the package.

For training and test data, every line in the file contains a label and a
short text in the following format:

<label><TAB><text>

A TAB character is between <label> and <text>. Both the label and the text
can contain space characters. Here are some examples.

Jewelry & Watches	handcrafted two strand multi color bead necklace
Books	big bike magazine february 1973

Two sample sets included in this package are `train_file' and `test_file'.

Quick Start
===========

You can run

$ cd demo
$ ./demo.sh

to run a demonstration.

LibShortText provides a simple training-prediction workflow:

short texts ============> model ==============> predictions
           text-train.py       text-predict.py

The command `text-train.py' trains a text set to obtain a model. For
example, the following command generates `train_file.model' for the given
`train_file'.

$ python text-train.py train_file
[output skipped]
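As a minimal, self-contained sketch of the steps so far, the following
Python snippet writes a toy data set in the `<label><TAB><text>' format
described above and invokes `text-train.py' on it. The file name
`toy_file' and its two rows are made up for illustration, and the snippet
assumes it is run from the top-level LibShortText directory:

# Illustrative only: build a tiny training file and train on it.
import subprocess

# Each line is "<label><TAB><text>"; labels and texts may contain spaces.
rows = [
    ("Books", "big bike magazine february 1973"),
    ("Jewelry & Watches", "handcrafted two strand multi color bead necklace"),
]

with open("toy_file", "w") as f:
    for label, text in rows:
        f.write("%s\t%s\n" % (label, text))

# Equivalent to: $ python text-train.py toy_file
subprocess.call(["python", "text-train.py", "toy_file"])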
`text-predict.py' predicts a test file using the trained model. For
example, the following command predicts `test_file' with
`train_file.model' and stores the results in `predict_result'.

$ python text-predict.py test_file train_file.model predict_result
Accuracy = 87.1800% (4359/5000)

Once `predict_result' is obtained, LibShortText provides several handy
utilities to conduct error analysis in the Python interactive shell.
Please see the section `Interactive Error Analysis' for more details.

Command-line Usage
==================

- `text-train.py' Usage

`text-train.py' obtains a model by training either a short-text dataset or
a LIBSVM-format data set generated by `text2svm.py'.

Usage: text-train.py [options] training_file [model]

options:
    -P {0|1|2|3|4|5|6|7|converter_directory}
       Preprocessor options. The options include stopword removal,
       stemming, and bigram. (default 1)
       0   no stopword removal, no stemming, unigram
       1   no stopword removal, no stemming, bigram
       2   no stopword removal, stemming, unigram
       3   no stopword removal, stemming, bigram
       4   stopword removal, no stemming, unigram
       5   stopword removal, no stemming, bigram
       6   stopword removal, stemming, unigram
       7   stopword removal, stemming, bigram
       If a preprocessor directory is given instead, then it is assumed
       that the training data is already in LIBSVM format. The
       preprocessor will be included in the model for testing.
    -G {0|1}
       Grid search for the parameter C in linear classifiers. (default 0)
       0   disable grid search (faster)
       1   enable grid search (slightly better results)
    -F {0|1|2|3}
       Feature representation. (default 0)
       0   binary feature
       1   word count
       2   term frequency
       3   TF-IDF (term frequency + IDF)
    -N {0|1}
       Instance-wise normalization before training/test.
       (default 1 to conduct normalization)
    -A extra_svm_file
       Append extra LIBSVM-format data. This parameter can be applied many
       times if more than one extra svm-format data set needs to be
       appended.
    -L {0|1|2|3}
       Classifier. (default 0)
       0   support vector classification by Crammer and Singer
       1   L1-loss support vector classification
       2   L2-loss support vector classification
       3   logistic regression
    -f Overwrite the existing model file.

Examples:
    text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
    text-train.py -P text2svm_converter -L 1 converted_svm_file

- `text-predict.py' Usage

`text-predict.py' predicts labels for a test dataset with a trained model.

Usage: text-predict.py [options] test_file model output

options:
    -f Overwrite the existing output file.
    -a {0|1}
       Output options. (default 1)
       0   Store only predicted labels. The information is NOT sufficient
           for interactive analysis. Use this option if you would like to
           get only accuracy.
       1   More information is stored. The output provides information for
           interactive analysis, but the size of the output can become
           much larger.
    -A extra_svm_file
       Append extra LIBSVM-format data. This parameter can be applied many
       times if more than one extra svm-format data set needs to be
       appended.
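The eight numeric levels of `-P' (shared by `text-train.py' above and
`text2svm.py' below) are simply a 3-bit encoding of the three
preprocessing switches, as the table shows. The helper below is purely
illustrative and not part of LibShortText; it derives the level from the
three flags:

# Illustrative helper (not part of LibShortText): derive the -P level.
# Per the table: stopword removal contributes 4, stemming 2, bigram 1.
def preprocessor_level(stopword_removal, stemming, bigram):
    return 4 * bool(stopword_removal) + 2 * bool(stemming) + bool(bigram)

# stopword removal + stemming + bigram -> -P 7
assert preprocessor_level(True, True, True) == 7
# the defaults (no stopword removal, no stemming, bigram) -> -P 1
assert preprocessor_level(False, False, True) == 1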
- `text2svm.py' Usage

`text2svm.py' generates a directory containing the information needed to
convert short texts to LIBSVM format. An output file in LIBSVM format is
also generated.

Usage: text2svm.py [options] text_src [output]

options:
    -P {0|1|2|3|4|5|6|7}
       Preprocessor options. The options include stopword removal,
       stemming, and bigram. (default 1)
       0   no stopword removal, no stemming, unigram
       1   no stopword removal, no stemming, bigram
       2   no stopword removal, stemming, unigram
       3   no stopword removal, stemming, bigram
       4   stopword removal, no stemming, unigram
       5   stopword removal, no stemming, bigram
       6   stopword removal, stemming, unigram
       7   stopword removal, stemming, bigram

The default output is a file "text_src.svm" and a directory
"text_src.text_converter". If "output" is specified, the output will be
"output" and "output.text_converter".

More Examples about Command-line Usage
======================================

We use the following questions/answers to demonstrate some examples.

Q: Given the many parameters provided by `text-train.py', how do I choose
   parameters for a first trial?

A: Although `text-train.py' has several parameters to tune, we carefully
   chose the default parameters based on a study of short-text
   classification [2]. Running `text-train.py' without parameters delivers
   good classification accuracy in general. It is equivalent to the
   following command, in which the default parameters are explicitly
   specified.

   $ python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file

   Meaning of each parameter:

   -P 1: no stemming, no stopword removal, bigram features
   -G 0: no LIBLINEAR parameter selection
   -F 0: binary feature representation
   -N 1: each instance is normalized to unit length
   -L 0: use Crammer and Singer's multi-class method

Q: How do I select the parameter C in LIBLINEAR automatically?

A: By default, LIBLINEAR (and `text-train.py') sets the parameter C to 1.
   You can automatically select the best parameter C by using `-G 1'.

Q: How do I generate different models using the same training data?

A: Internally, `text-train.py' converts data to LIBSVM format and applies
   LIBLINEAR for training. To reuse the pre-processed data, LibShortText
   provides another workflow:

   short texts ==========> LIBSVM-format data ==========> model ==========> result
              text2svm.py                   text-train.py     text-predict.py

   The following command generates a LIBSVM-format file `train_file.svm'
   and a directory `train_file.text_converter' containing information for
   the conversion.

   $ python text2svm.py train_file
   [`train_file.text_converter' and `train_file.svm' are generated.]

   We then generate two models using the same LIBSVM-format file.

   $ python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model
   [A logistic regression model, `lr.model', is generated.]
   $ python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model
   [An L2-loss linear SVM model, `l2svm.model', is generated.]

Q: How do I overwrite existing models or prediction results?

A: If the specified model or output file exists, by default neither
   `text-train.py' nor `text-predict.py' overwrites it. You can generate
   new models/prediction outputs with `-f'.

   $ python text-train.py -f train_file
   $ python text-predict.py -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?

A: By default, some additional information for analysis is stored. If you
   need only the classification accuracy, you can specify `-a 0' to save
   disk space. For example,

   $ python text-predict.py -a 0 test_file train_file.model predict_result
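If you want to recompute the reported accuracy from a `-a 0' result
yourself, a sketch follows. It assumes (this is an assumption about the
output layout, not documented above) that the output stores exactly one
predicted label per line, in the same order as the test file:

# Illustrative sketch: recompute accuracy from a `-a 0' output.
# ASSUMPTION: predict_result holds one predicted label per line, in
# test-file order.
correct = total = 0
with open("test_file") as tests, open("predict_result") as preds:
    for test_line, predicted in zip(tests, preds):
        true_label = test_line.split("\t", 1)[0]  # <label><TAB><text>
        correct += (predicted.rstrip("\n") == true_label)
        total += 1
print("Accuracy = %.4f%% (%d/%d)" % (100.0 * correct / total, correct, total))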
Q: I am an experienced LIBLINEAR user. How should I specify options for
   LIBLINEAR and `grid.py'?

A: For LIBLINEAR, you can pass LIBLINEAR parameters in a double-quoted
   string after `-L' with the special character `@'. For example, if you
   want to use L2-regularized logistic regression as the classifier, set
   the parameter C to 0.5, and append a bias term to each instance, you
   can type

   $ python text-train.py -L @"-s 3 -c 0.5 -B 1" train_file

   To show the parameters provided by LIBLINEAR/grid, use

   $ python text-train.py -x liblinear
   $ python text-train.py -x grid

   For `grid.py', to specify the range of C, use
   `-G @"-log2c begin,end,step"'. For example, the following command
   selects the best C among [2^-2, 2^-1, 2^0, 2^1] in terms of
   cross-validation rates.

   $ python text-train.py -G @"-log2c -2,1,1" train_file

Q: I have more features for my texts. How can I add them in LibShortText?

A: You can use the `-A' option in `text2svm.py', `text-train.py', and
   `text-predict.py' to append feature files. Note that you can use
   multiple feature files. If we have 20 features, and these features are
   included in two files, `train_feats1' and `train_feats2', then we can
   use these files in the training stage by

   $ python text-train.py -A train_feats1 -A train_feats2 train_file

   The features you use in the training stage must be identical to those
   in the prediction stage. Assume that `test_feats1' and `test_feats2'
   are feature files corresponding to `train_feats1' and `train_feats2',
   respectively. To predict a test file you should use

   $ python text-predict.py -A test_feats1 -A test_feats2 test_file train_file.model predict_result

   The usage of the analyzer is the same as before. The features will be
   represented in the following format.

   <feat_filename>:<feat_idx>

Q: I already have some LIBSVM-format features. How can I include these
   features when training the model?

A: You can use the `-A' option in command-line mode. For example, if you
   have two extra LIBSVM-format files, `extra_train_1' and
   `extra_train_2', then use:

   $ python text-train.py train_file -A extra_train_1 -A extra_train_2

   Note that `train_file', `extra_train_1', and `extra_train_2' should
   have the same number of instances. Then use the following command to
   predict:

   $ python text-predict.py test_file -A extra_test_1 -A extra_test_2 train_file.model predict_result
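For concreteness, the sketch below writes such an extra feature file. It
assumes the standard LIBSVM feature encoding of space-separated
<index>:<value> pairs, one line per instance of `train_file'; the file
name and feature values are made up, and whether a leading label field is
also accepted is not covered here:

# Illustrative sketch (see assumptions above): write an extra
# LIBSVM-format feature file, one line per training instance.
extra_features = [
    {1: 0.5, 3: 1.0},  # extra features for instance 1 of train_file
    {2: 2.0},          # extra features for instance 2
]

with open("extra_train_1", "w") as f:
    for feats in extra_features:
        # Space-separated <index>:<value> pairs with increasing indices.
        pairs = ["%d:%g" % (idx, val) for idx, val in sorted(feats.items())]
        f.write(" ".join(pairs) + "\n")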
Interactive Error Analysis
==========================

We provide interactive tools to analyze prediction results. First,
generate a file of prediction results with the commands introduced in the
section `Quick Start'. Note that you CANNOT specify `-a 0' to
`text-predict.py', or the prediction results will not be analyzable.

You then enter Python, import the module, load the prediction results, and
create an `Analyzer' object by reading a model.

$ python
>>> from libshorttext.analyzer import *
>>> predict_result = InstanceSet('predict_result')
>>> analyzer = Analyzer('train_file.model')

You can select a subset of the test data for analysis using the following
options.

`wrong'
    Select wrongly predicted instances.
`with_labels(labels, target)'
    If `target' is `true', then instances with true labels in the set
    `labels' are selected. If `target' is `predict', those predicted to be
    in `labels' are chosen. `target' can also be `both' or `or'. `both'
    and `or' find the intersection and the union of `true' and `predict',
    respectively. The default value of `target' is `both'.
`sort_by_dec'
    Sort instances by decision values.
`subset(amount, method)'
    Get a specific amount of data by the method `top' or `random'. The
    default value of `method' is `top'.

For example, among wrongly predicted instances with labels 'Books',
'Music', 'Art', and 'Baby', to get those having the highest 100 decision
values, you can use

>>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))

You can run the following operations to see details of the selected
instances.

>>> analyzer.info(insts)
Number of instances: 100
Accuracy: 0.0 (0/100)
True labels: "Baby" "Art" "Books" "Music"
Predicted labels: "Baby" "Music" "Books" "Art"
Text source: /home/user/libshorttext-1.0/test_file
Selectors:
-> Select wrongly predicted instances
-> labels: "Books", "Music", "Art", "Baby"
-> Sort by maximum decision values.
-> Select 100 instances in top.

The following command generates a confusion table on the selected
instances:

>>> analyzer.gen_confusion_table(insts)
       Art  Books  Music  Baby
Art      0     15      4     5
Books   10      0     17     3
Music   10     21      0     3
Baby     1      7      4     0

To analyze a single short text, first load the texts by

>>> insts.load_text()

Then you can print the information for each short text in `insts'.

>>> print(insts[61])
text = avengers assemble 4 panini uk collector s edition nm 2012
true label = Books
predicted label = Music

You can print the model weights corresponding to the tokens of a short
text. The following operation prints the weights of the three classes with
the highest decision values. (To print the weights of all classes, change
3 to 0.)

>>> analyzer.analyze_single(insts[61], 3)
                   Music       Books    Antiques
edition       -5.232e-02   8.869e-01  -1.303e-01
s edition     -2.219e-02   1.527e-01  -4.077e-02
nm             7.269e-01   6.048e-02  -1.495e-01
collector     -5.253e-02  -5.208e-02   8.804e-02
uk             9.466e-01  -2.089e-01   2.683e-02
collector s   -3.174e-02   6.389e-02   9.963e-02
4             -2.011e-01  -2.062e-01   1.526e-01
2012          -1.173e-01   2.663e-01  -1.369e-01
s             -5.142e-02   1.485e-01   1.757e-01
**decval**     3.816e-01   3.705e-01   2.842e-02
True label: Books

You can also analyze an arbitrary short text.

>>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                 Music      Crafts      Travel
sealed       4.828e-01   1.050e-03  -5.383e-02
cd           2.872e+00  -1.032e-01  -1.723e-01
cd single    1.663e-01  -5.181e-03  -6.558e-03
single       4.375e-01  -6.953e-02  -9.960e-02
usa          2.247e-01   3.530e-02   2.657e-02
beatles      5.050e-01  -5.710e-02  -6.933e-02
3 cd         1.320e-02  -3.837e-02  -7.793e-20
3            3.057e-02   4.712e-02   1.402e-01
**decval**   1.673e+00  -6.716e-02  -8.299e-02

Additional Information
======================

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library
    for Short-text Classification.

[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product
    title classification versus text classification.

For any questions and comments, please email cjlin@csie.ntu.edu.tw