MELT is a machine learning package. Its main strengths are rich support for experimentation and a focus on training-speed optimization; its design goal is to make it easy to experiment with all kinds of machine learning algorithms across different data platforms (single machine, MPI, Hadoop, GPU).
- Rich experiment support and machine-learning data-analysis support (built into MELT itself, plus several companion scripts)
- Dynamic feature selection at training time via -incl / -excl
- Multiple normalizers and calibrators to choose from (a speed-optimized minmax normalizer, plus gauss and bin normalizers)
- Python interface support
- Feature-importance printing, supported by e.g. GBDT and LinearSVM
- Per-prediction feature-importance display for GBDT models, rendering of GBDT trees as images, and model export to C++/Python/PHP code
- Validation and early stopping during training for iterative models such as GBDT and LinearSVM
- The Predictor integrates the normalizer and calibrator internally for easier use
- Good extensibility: other open-source learners can be plugged in (sofia, liblinear, libsvm, lightgbm), or you can add your own model
Scenarios and features
- Run single-machine experiments with the melt command on the command line
- Standard setups: cross validation, train + test, train + test + predict (online), feature selection (use only some features, or drop some features)
- Tasks: binary classification, multi-class classification, regression, ranking
- Learners: boosted trees, SVMs, logistic regression, stochastic gradient descent, perceptron, VW, bagging, etc.
- MPI: parallel algorithms for large-scale data
- Parameter sweeps: on a single machine, or Hadoop/MPI versions
- Data-analysis and comparison support
- The recommended input format is melt's default dense or sparse representation; LibSVM-format input is also accepted directly
Training and deploying LightGBM with melt
Usage examples
A complete environment lives under the lightgbm directory; to train with lightgbm directly, see lightgbm-train-test.sh.
To train lightgbm through melt, see run-rank.sh and run-regression.sh.
You can also run sh run.sh, which executes three training experiments plus a demo of prediction / debug-info display through the Python interface.
- melt train_data -c tt -test test_data -cl lightgbm -cls lightgbm-rank.conf
-cl specifies the classifier; -cl light | lgbm | gbm | lg all select lightgbm.
-cls means classifierSetting: every third-party (non-built-in) learner receives the third-party library's own command-line parameters through -cls.
- melt train_data -c tt -test test_data -cl lightgbm -cls lightgbm-rank.conf,num_trees=200
- melt train_data -c tt -test test_data -cl lightgbm -cls lightgbm-rank.conf,num_trees=200 -incl i:33-55
These correspond to invoking lightgbm config=lightgbm-rank.conf and lightgbm config=lightgbm-rank.conf num_trees=200.
- melt train_data -c tt -test test_data -cl lightgbm -cls lightgbm-regression.conf,num_trees=200 -incl i:33-55 --metric rank
This uses the regression objective but still evaluates as ranking with NDCG. Note that labels need to be integers in 0-4; if any label is > 4 you must set -gains manually (see the sketch after these examples).
If labels are floating-point, the rank evaluation computes top-1 accuracy instead.
- melt train_data -cl lightgbm -cls lightgbm-regression.conf
The default command is cross validation, so this runs 5-fold cross validation.
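To make the label/-gains point concrete, here is a minimal NDCG sketch in Python, assuming the conventional lambdarank gain 2^label - 1 (which is why integer labels 0-4 work out of the box, while larger labels need explicit gains):

import math

def dcg_at_k(labels, k, gains=None):
    # labels: relevance labels ordered by predicted score, descending.
    # gains: optional label -> gain mapping, mirroring a manual -gains
    # override; by default use the conventional 2^label - 1.
    gain = (lambda l: gains[l]) if gains else (lambda l: 2 ** l - 1)
    return sum(gain(l) / math.log2(i + 2) for i, l in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1], 3))  # about 0.9595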
For how to write the conf file, see https://github.com/Microsoft/LightGBM/blob/master/examples/lambdarank/train.conf
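As a reference, a minimal lambdarank conf in the spirit of that example might look like the following sketch (the values are illustrative placeholders, not tuned recommendations):

task = train
objective = lambdarank
metric = ndcg
ndcg_at = 1,3,5
num_trees = 100
num_leaves = 31
learning_rate = 0.1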
Differences between using melt and using LightGBM directly
- melt wraps LightGBM; that is, melt reads the text file into its internal data structures and then converts them for LightGBM training.
- Compared with using LightGBM directly:
- Performance cost: negligible unless the data is extremely large; melt's reading is also very fast, and the conversion cost can basically be ignored.
- Large-file handling: melt currently reads everything into memory, so data larger than memory cannot go through melt;
in that case train with lightgbm directly and set in its config file:
two_round, default=false, type=bool, alias=two_round_loading, use_two_round_loading
by default, LightGBM will map the data file to memory and load features from memory, which gives faster data loading, but it may run out of memory when the data file is very big.
set this to true if the data file is too big to fit in memory.
- Result diffs: diffs mainly come from LightGBM's binning, which uses a random subset of the data (50,000 samples by default); this randomness differs slightly from melt's instance-shuffling logic.
It does not affect training quality; the diff stems purely from randomness.
If melt's random seed -rs is set to a fixed value, and lightgbm keeps its config-file default random settings with no feature or bagging subsampling, every training run yields identical results.
- Limitations: melt does not support multi-class yet; lightgbm itself already supports multi-class and initial-score input.
Benefits of using melt
The main benefit of training with melt is ease of use, reflected in the following.
Support for multiple input formats
- Like lightgbm, melt supports the libsvm, tsv, and csv formats; melt additionally parses malloc rank format samples directly, with no extra settings.
- lightgbm's ranking requires a separate query file for the samples, and the data file must be sorted by query.
- In melt's format, --name 0,1 -group 0 marks the first two columns as names with the first column as the groupKey/query; --rank=1 is equivalent (-name 0,1 -group 0).
- melt supports its own default dense and sparse data formats; the sparse format is libsvm-like, except that the first column after the label is the feature count (see the example below).
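A hypothetical sparse-format line under that description (whitespace separation is an assumption here; name/group columns, when configured via --name/-group, precede the label):

1 559 21:12.5 33:22.3

i.e. label 1, 559 features in total, with features 21 and 33 set and all others zero.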
Better feature-naming support
melt can obtain feature-name information in any of the following ways:
- --header=1 declares that the input has a header line, e.g. query,doc,label,name1,name2,name3....
- A self-describing header: the first line of the data file starts with #, e.g. #query,doc,label,name1,name2..
- A feature_name.txt file in the current directory, with the format:
name1 #comment
name2 #comment
...
Feature-matching support
lightgbm can list features to drop in its conf, but its matching is not as complete as melt's.
melt provides more convenient feature matching via -incl and -excl:
- -incl i:0-44 uses only features 0-44 (45 features)
- -excl i:3-22,50-64 drops features 3-22 and 50-64
- -incl ^idf uses only features whose names start with idf
- -excl new,old drops features whose names match new or old
- -incl ^idf -excl new,old uses only features starting with idf, then additionally drops those among them that match new or old
melt's mechanism sets every value of an unused feature to 0, whereas lightgbm, after conversion, by default excludes from training any feature whose value is identical across all samples (while the feature index keeps its position). A sketch of the matching rules follows.
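A minimal Python sketch of the matching rules as described above (the helper is hypothetical illustration, not melt's actual code):

def feature_mask(names, incl=None, excl=None):
    # Return a boolean keep-mask over features, mimicking -incl/-excl:
    # 'i:a-b,c-d' selects index ranges; '^x' matches a name prefix;
    # plain tokens match as substrings.
    def matches(spec, idx, name):
        if spec.startswith('i:'):
            return any(int(lo) <= idx <= int(hi)
                       for lo, hi in (r.split('-') for r in spec[2:].split(',')))
        return any(name.startswith(t[1:]) if t.startswith('^') else t in name
                   for t in spec.split(','))
    keep = [matches(incl, i, n) if incl else True for i, n in enumerate(names)]
    return [k and not (excl and matches(excl, i, n))
            for i, (k, n) in enumerate(zip(keep, names))]

names = ['idf_a', 'new_idf_b', 'old_c', 'tf_d']
print(feature_mask(names, incl='^idf', excl='new,old'))  # [True, False, False, False]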
Reusing melt's processing framework
melt supports train, test, train-test, cross validation, and so on, all directly reusable; lightgbm does not support cross validation yet.
It is also convenient for testing different algorithms in the same environment: melt wraps the third-party libraries liblinear, libsvm, vw, and sofia-ml, and ships its own LinearSVM and GBDT binary-classification/regression learners.
melt additionally supports some rather useful modes:
- melt data_file -c si  # show data-file info
mlt train -c tt --rank=1 -c si
I1118 07:49:16.736133 7969 Melt.h:59] _cmd.randSeed --- [1649160588]
I1118 07:49:16.736424 7969 Melt.h:1360] omp_get_num_procs() --- [12]
I1118 07:49:16.736796 7969 Melt.h:1362] gezi::get_num_threads() --- [12]
I1118 07:49:16.736873 7969 Melt.h:1365] commandStr --- [si]
I1118 07:49:16.736915 7969 InstanceParser.cpp:63] Using rank format data and not set --group by default will use first column as groupKey
I1118 07:49:16.736991 7969 time_util.h:107] ParseInputDataFile started
I1118 07:49:16.986577 7969 InstanceParser.h:541] _labelIdx --- [2]
I1118 07:49:16.986609 7969 InstanceParser.h:600] Try load feature names from feature_name.txt
I1118 07:49:16.987864 7969 InstanceParser.h:790] CreateInstancesFromSparseFormat
I1118 07:49:17.792853 7969 InstanceParser.h:1091] kFormatNames[_fileFormat] --- [sparse]
I1118 07:49:17.792888 7969 InstanceParser.h:1092] _instances.NumFeatures() --- [559]
I1118 07:49:17.792896 7969 InstanceParser.h:1093] _instances.Count() --- [306886]
I1118 07:49:17.796424 7969 InstanceParser.h:1095] positiveCount --- [284624]
I1118 07:49:17.796463 7969 InstanceParser.h:1097] positiveRatio --- [0.9274584047496465]
I1118 07:49:17.796489 7969 InstanceParser.h:1103] numLabels --- [5]
I1118 07:49:17.802233 7969 InstanceParser.h:1111] Label:0 Count:22262
I1118 07:49:17.802260 7969 InstanceParser.h:1111] Label:1 Count:164836
I1118 07:49:17.802266 7969 InstanceParser.h:1111] Label:2 Count:38070
I1118 07:49:17.802273 7969 InstanceParser.h:1111] Label:3 Count:52331
I1118 07:49:17.802278 7969 InstanceParser.h:1111] Label:4 Count:29387
I1118 07:49:17.802284 7969 InstanceParser.h:1118] _instances.NumGroups() --- [43150]
I1118 07:49:17.806154 7969 InstanceParser.h:1123] denseCount -- [0] denseRatio -- [0]
I1118 07:49:17.806185 7969 InstanceParser.h:1124] IsDenseFormat() --- [0]
I1118 07:49:17.806197 7969 InstanceParser.h:1125] _instances.schema.tagNames:0 1
I1118 07:49:17.806205 7969 InstanceParser.h:1126] _instances.schema.attributeNames:
I1118 07:49:17.806212 7969 InstanceParser.h:1127] _instances.schema.groupKeys:0
I1118 07:49:17.806218 7969 InstanceParser.h:1128] _instances.schema.featureNames --- 559
I1118 07:49:17.806227 7969 InstanceParser.h:1128] 549 f549:new_idf_ranktext_vip
I1118 07:49:17.806234 7969 InstanceParser.h:1128] 550 f550:new_idf_ranktext_simid
I1118 07:49:17.806241 7969 InstanceParser.h:1128] 551 f551:new_idf_query
I1118 07:49:17.806246 7969 InstanceParser.h:1128] 552 f552:new_idf_query_simid
I1118 07:49:17.806279 7969 InstanceParser.h:1128] 553 f553:new_idf_query_ematch
I1118 07:49:17.806285 7969 InstanceParser.h:1128] 554 f554:new_idf_query_top1ematch
I1118 07:49:17.806290 7969 InstanceParser.h:1128] 555 f555:new_idf_query_top1match
I1118 07:49:17.806296 7969 InstanceParser.h:1128] 556 f556:new_tfidf_tf
I1118 07:49:17.806301 7969 InstanceParser.h:1128] 557 f557:new_tfidf_ranktext_all_simid
I1118 07:49:17.806306 7969 InstanceParser.h:1128] 558 f558:new_tfidf_ranktext_vip
I1118 07:49:17.806313 7969 InstanceParser.h:1129] _instances.schema.featureNames.NumFeatures() --- [559]
I1118 07:49:17.806319 7969 InstanceParser.h:1130] _instances.schema.featureNames.NumFeatureNames() --- [559]
I1118 07:49:17.806332 7969 time_util.h:118] ParseInputDataFile finished using: [1069.31 ms] (1.06931 s)
- melt data_file -c fss  # fss stands for featureStatusShow
mlt train -c tt --rank=1 -c fss
I1118 07:54:50.291501 8197 Melt.h:59] _cmd.randSeed --- [3262480143]
I1118 07:54:50.291765 8197 Melt.h:1360] omp_get_num_procs() --- [12]
I1118 07:54:50.292135 8197 Melt.h:1362] gezi::get_num_threads() --- [12]
I1118 07:54:50.292209 8197 Melt.h:1365] commandStr --- [fss]
I1118 07:54:50.292246 8197 time_util.h:107] FeatureStatus! -vl >= 0 to print warning of possible no use features started
I1118 07:54:50.292309 8197 InstanceParser.cpp:63] Using rank format data and not set --group by default will use first column as groupKey
I1118 07:54:50.292419 8197 time_util.h:107] ParseInputDataFile started
I1118 07:54:50.540861 8197 InstanceParser.h:541] _labelIdx --- [2]
I1118 07:54:50.540894 8197 InstanceParser.h:600] Try load feature names from feature_name.txt
I1118 07:54:50.542520 8197 InstanceParser.h:790] CreateInstancesFromSparseFormat
I1118 07:54:51.349625 8197 time_util.h:118] ParseInputDataFile finished using: [1057.17 ms] (1.05716 s)
Finished [ 306886 ] (0.186903 s)100% |*******************************************|
I1118 07:54:51.536661 8197 FeatureStatus.h:159] Write result to train.featurestatus
I1118 07:54:51.536783 8197 FeatureStatus.h:161] Write csv result to train.featurestatus.csv
65:f65:title_ner_MDL is always taking value: [0:0] |
66:f66:title_ner_PDT is always taking value: [0:0]
68:f68:title_pre_is_。 is always taking value: [0:0] |
69:f69:title_avg_pre_is_。 is always taking value: [0:0]
70:f70:title_pre_is_、 is always taking value: [0:0]
71:f71:title_avg_pre_is_、 is always taking value: [0:0]
78:f78:title_pre_is_” is always taking value: [0:0] |
79:f79:title_avg_pre_is_” is always taking value: [0:0] |
142:f142:title_post_is_。 is always taking value: [0:0] |
143:f143:title_avg_post_is_。 is always taking value: [0:0]
144:f144:title_post_is_、 is always taking value: [0:0]
145:f145:title_avg_post_is_、 is always taking value: [0:0]
152:f152:title_post_is_” is always taking value: [0:0] |
153:f153:title_avg_post_is_” is always taking value: [0:0]
281:f281:desc_ner_MDL is always taking value: [0:0]***** |
282:f282:desc_ner_PDT is always taking value: [0:0]
284:f284:desc_pre_is_。 is always taking value: [0:0]
285:f285:desc_avg_pre_is_。 is always taking value: [0:0]
286:f286:desc_pre_is_、 is always taking value: [0:0]**** |
287:f287:desc_avg_pre_is_、 is always taking value: [0:0]
294:f294:desc_pre_is_” is always taking value: [0:0]**** |
295:f295:desc_avg_pre_is_” is always taking value: [0:0]
358:f358:desc_post_is_。 is always taking value: [0:0]******** |
359:f359:desc_avg_post_is_。 is always taking value: [0:0]
360:f360:desc_post_is_、 is always taking value: [0:0]
361:f361:desc_avg_post_is_、 is always taking value: [0:0]
368:f368:desc_post_is_” is always taking value: [0:0]********* |
369:f369:desc_avg_post_is_” is always taking value: [0:0]***** |
465:f465:click_ner_MDL is always taking value: [0:0]****************** |
466:f466:click_ner_PDT is always taking value: [0:0]
Finished [ 559 ] (0.008675 s)100% |*******************************************|
I1118 07:54:51.681234 8197 time_util.h:118] FeatureStatus! -vl >= 0 to print warning of possible no use features finished using: [1388.95 ms] (1.38895 s)
View the generated feature-status file:
more train.featurestatus
FeatureName Mean Min Max PosMean NegMean PosMin NegMin PosMax NegMax Var PosVar NegVar
f0:title_ok mean:0.705236 min:0 max:1 posMean:0.708858 negMean:0.658926 posMin:0 negMin:0 posMax:1 negMax:1 var:0.207879 posVar:0.206379 negVar:0.224753
f1:title_num_matches mean:1.1761 min:0 max:54 posMean:1.19418 negMean:0.945018 posMin:0 negMin:0 posMax:54 negMax:17 var:2.09061 posVar:2.15983 negVar:1.14805
f2:title_num_distinct_matches mean:1.0452 min:0 max:47 posMean:1.06156 negMean:0.835999 posMin:0 negMin:0 posMax:47 negMax:11 var:1.34379 posVar:1.38599 negVar:0.757029
f3:title_book_count mean:0.0111703 min:0 max:10 posMean:0.011654 negMean:0.00498607 posMin:0 negMin:0 posMax:10 negMax:6 var:0.0266475 posVar:0.0280664 negVar:0.00846532
f4:title_num_distinct_matches_weight mean:2.38964 min:0 max:99.7132 posMean:2.43418 negMean:1.82024 posMin:0 negMin:0 posMax:99.7132 negMax:25.77 var:8.10253 posVar:8.35914 negVar:4.47227
f5:title_num_strict_exact_matches mean:0.0215194 min:0 max:6 posMean:0.0221942 negMean:0.0128919 posMin:0 negMin:0 posMax:6 negMax:2 var:0.0252729 posVar:0.0261918 negVar:0.013445
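These statistics are straightforward to reproduce; a sketch of the same per-feature summary with numpy (the column names follow the file above; this is not melt's code):

import numpy as np

def feature_status(x, y):
    # Per-feature summary split by positive/negative label,
    # mirroring the columns of train.featurestatus.
    pos, neg = x[y > 0], x[y <= 0]
    return dict(
        mean=x.mean(), min=x.min(), max=x.max(),
        posMean=pos.mean(), negMean=neg.mean(),
        posMin=pos.min(), negMin=neg.min(),
        posMax=pos.max(), negMax=neg.max(),
        var=x.var(), posVar=pos.var(), negVar=neg.var(),
    )

x = np.array([0.0, 1.0, 1.0, 0.5])   # one feature column
y = np.array([1, 1, 0, 0])           # binary labels
print(feature_status(x, y))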
Better feature-importance support
lightgbm shows feature importances at the end of the generated model file; importance there is measured as the number of times a feature appears across all trees of the model.
feature importances:
f482:old_examplar=302
f480:old_Term_freq_weight=232
f471:total_wordrank=208
f468:keyword_length=156
f481:old_Keyword_length=146
f548:new_idf_ranktext_all_simid=133
f535:old_ner_PER=132
f547:new_tuwen_showandtell_char=132
f546:new_tuwen_showandtell=102
f550:new_idf_ranktext_simid=92
f475:old_Title_freq=69
f542:new_tuwen_bow=63
f453:click_startswith_wordrank=61
f543:new_tuwen_rnn=49
f479:old_Clickquery_freq=43
f271:desc_ner_PER=38
f55:title_ner_PER=38
f53:title_sum_texts_weight=37
f538:old_ner_BRD=37
f16:title_startswith_wordrank=36
f48:title_avg_offset=29
Training with melt additionally produces model.featureGain.txt under the output model directory (./model by default), which gives feature importance computed as the accumulated split gain per feature, normalized.
more ./model/model.featureGain.txt
0:f471:total_wordrank 1
1:f546:new_tuwen_showandtell 0.718685
2:f480:old_Term_freq_weight 0.619463
3:f482:old_examplar 0.561995
4:f481:old_Keyword_length 0.524525
5:f548:new_idf_ranktext_all_simid 0.443716
6:f468:keyword_length 0.427941
7:f542:new_tuwen_bow 0.404851
8:f535:old_ner_PER 0.364947
9:f543:new_tuwen_rnn 0.356774
10:f550:new_idf_ranktext_simid 0.227524
11:f547:new_tuwen_showandtell_char 0.196125
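The listing suggests the gains are normalized by the maximum (the top feature is exactly 1); a short sketch of that computation under this assumption, with made-up gain values:

import numpy as np

gains = np.array([120.5, 86.6, 74.7, 67.7])   # accumulated split gain per feature (hypothetical)
normalized = gains / gains.max()               # top feature maps to 1, as in model.featureGain.txt
for rank, n in enumerate(normalized):
    print(rank, round(float(n), 6))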
melt supports displaying per-prediction feature importance.
During melt training, the parameter --wr=1 writes the prediction results to ./result/0.inst.txt (configurable to a.txt via --rf a.txt).
--writeDebug=1 is set by default, which shows the debug info, i.e. the per-prediction feature importances; the top 5 are shown by default, changeable to the top 10 with -numDebugs=10.
mlt train -c tt -t test --rank=1 -cl lgb -cls lightgbm-rank.conf --wr=1 --rf result.txt
gezi@localhost:~/work/keywords/train/v2/zhongce/text$ more ./result.txt
Instance GroupKey True Predicted NDCG@1 NDCG@2 NDCG@3 DCG@1 DCG@2 DCG@3 MaxDCG@1 MaxDCG@2 MaxDCG@3 Debug
3448460368,430059535|总经理 3448460368,430059535 1 -0.654047 100 100 92.8479 21.6404 35.294 37.4581 21.6404 35.294 40.3434 f482:old_examplar:-2.6698:-2,f480:old_Term_freq_weight:-2.40761:1,f547:new_tuwen_showandtell_char:0.889376:0.521313,f546:new_tuwen_showandtell:0.574118:0.50291,f543:new_tuwen_rnn:0.567754:0.696085
3448460368,430059535|江南汽车 3448460368,430059535 2 -1.13479 100 100 92.8479 21.6404 35.294 37.4581 21.6404 35.294 40.3434 f482:old_examplar:-2.28336:-2,f480:old_Term_freq_weight:-1.95214:1,f546:new_tuwen_showandtell:-1.71251:0.267359,f548:new_idf_ranktext_all_simid:1.61797:15512,f543:new_tuwen_rnn:-1.02027:-0.0176351
3448460368,430059535|众泰 3448460368,430059535 2 -1.47698 100 100 92.8479 21.6404 35.294 37.4581 21.6404 35.294 40.3434 f480:old_Term_freq_weight:-1.4014:1,f546:new_tuwen_showandtell:-1.38943:0.279826,f543:new_tuwen_rnn:-1.09378:-0.0204482,f482:old_examplar:0.693243:-2,f542:new_tuwen_bow:-0.493574:0.275686
3448460368,430059535|吴建中 3448460368,430059535 3 -0.150239 100 100 92.8479 21.6404 35.294 37.4581 21.6404 35.294 40.3434 f482:old_examplar:-3.1043:-2,f480:old_Term_freq_weight:-2.60871:1,f548:new_idf_ranktext_all_simid:1.05964:9867,f535:old_ner_PER:0.649271:1,f546:new_tuwen_showandtell:0.609997:0.56614
3448460368,430059535|应建仁 3448460368,430059535 3 -0.352779 100 100 92.8479 21.6404 35.294 37.4581 21.6404 35.294 40.3434 f548:new_idf_ranktext_all_simid:2.65718:197,f482:old_examplar:-2.30219:-2,f480:
Per-prediction feature importance is computed by accumulating, along the prediction path, the change in prediction value before and after each split, per feature.
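A toy Python sketch of that idea on a hand-rolled tree (an illustration, not melt's implementation): walk the prediction path and credit each split's feature with the change in the subtree's expected value.

class Node:
    def __init__(self, value, feature=None, threshold=None, left=None, right=None):
        self.value = value          # expected prediction of this subtree
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def per_prediction_gain(node, x, gains):
    # Accumulate, per feature, the change in expected value at each
    # split taken along x's prediction path; return the leaf prediction.
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        gains[node.feature] = gains.get(node.feature, 0.0) + (child.value - node.value)
        node = child
    return node.value

leaf = lambda v: Node(v)
tree = Node(0.0, feature=0, threshold=0.5,
            left=Node(-0.3, feature=1, threshold=2.0, left=leaf(-0.5), right=leaf(-0.1)),
            right=leaf(0.4))
gains = {}
pred = per_prediction_gain(tree, {0: 0.2, 1: 3.0}, gains)
print(pred, gains)   # -0.1 {0: -0.3, 1: 0.2}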
melt supports dependency-free C++ deployment for prediction
In melt's -c train mode, setting --mcustom=1 makes the generated model.bin file (under ./model by default, changeable with -m) loadable for prediction using only the compile-dependency-free GbdtPredictor.h header; see the inference directory of the toolkit for details.
melt provides Python interfaces for prediction and per-prediction feature importance
melt's Python prediction interface
The unified Python interface (libmelt)
This is not limited to gbdt; the other models melt supports, such as the liblinear linear SVM, can use the same unified Python prediction interface.
Note: for a model generated with --mcustom=1, LoadPredictor(string path, string modelName = "", bool isText = false, bool useCustomModel = false)
must be called with useCustomModel=True.
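For example (assuming the Python binding exposes the same positional arguments as the C++ signature above):

from libmelt import PredictorFactory
# model trained with --mcustom=1: the last argument is useCustomModel
predictor = PredictorFactory.LoadPredictor('./model/', '', False, True)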
In [1]: import gezi.nowarning
In [2]: from libmelt import PredictorFactory, Vector
In [3]: predictor = PredictorFactory.LoadPredictor('./model/')
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1118 09:41:41.317528 25865 LoadSave.h:32] gbdtRanking try load from ./model/
In [4]: predictor.Predict(Vector('21:12.5,33:22.3'))
Out[4]: -1.207006727
In [5]: predictor = PredictorFactory.LoadPredictor('./LightGBM_model.txt')
W1118 09:42:03.366194 25865 PredictorFactory.cpp:181] Can not get model info, so judge it as predictor of default type: lambdarank
I1118 09:42:03.366287 25865 LoadSave.h:55] gbdtRanking try load as text from ./LightGBM_model.txt
I1118 09:42:03.366322 25865 GbdtPredictor.h:348] Loading from text file ./LightGBM_model.txt of LightGBM format
I1118 09:42:03.367594 25865 GbdtPredictor.h:240] numFeatures --- [559]
I1118 09:42:03.381628 25865 GbdtPredictor.h:310] _trees.size() --- [100]
I1118 09:42:03.382040 25865 GbdtPredictor.h:332] Per_feature_gain print top 10 -vl 0 will print only top 10, -vl 1 will print top 50, -vl 2 will print all
I1118 09:42:03.382071 25865 Predictor.h:806] Try load feature names from ./feature_name.txt
0:f471:total_wordrank 1
1:f546:new_tuwen_showandtell 0.712731
2:f480:old_Term_freq_weight 0.613357
3:f482:old_examplar 0.557893
4:f481:old_Keyword_length 0.516144
5:f548:new_idf_ranktext_all_simid 0.421158
6:f468:keyword_length 0.39352
7:f542:new_tuwen_bow 0.389431
8:f535:old_ner_PER 0.370457
9:f543:new_tuwen_rnn 0.361138
In [6]: predictor.Predict(Vector('21:12.5,33:22.3'))
Out[6]: -1.207006727
The gbdt-specific Python interface (libgbdt)
Usage is similar to the unified interface; see the per-prediction feature-importance example below.
melt's Python interface for per-prediction feature importance
In [2]: from libgbdt import GbdtPredictor, Vector
In [3]: predictor = GbdtPredictor('./LightGBM_model.txt')
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1118 10:42:01.142372 3170 LoadSave.h:55] gbdt try load as text from ./LightGBM_model.txt
I1118 10:42:01.142452 3170 GbdtPredictor.h:360] Loading from text file ./LightGBM_model.txt of LightGBM format
I1118 10:42:01.143805 3170 GbdtPredictor.h:245] numFeatures --- [559]
I1118 10:42:01.157310 3170 GbdtPredictor.h:322] _trees.size() --- [100]
I1118 10:42:01.157353 3170 Predictor.h:806] Try load feature names from ./feature_name.txt
I1118 10:42:01.158872 3170 GbdtPredictor.h:344] Per_feature_gain print top 10 -vl 0 will print only top 10, -vl 1 will print top 50, -vl 2 will print all
0:f471:total_wordrank 1
1:f546:new_tuwen_showandtell 0.712731
2:f480:old_Term_freq_weight 0.613357
3:f482:old_examplar 0.557893
4:f481:old_Keyword_length 0.516144
5:f548:new_idf_ranktext_all_simid 0.421158
6:f468:keyword_length 0.39352
7:f542:new_tuwen_bow 0.389431
8:f535:old_ner_PER 0.370457
9:f543:new_tuwen_rnn 0.361138
In [4]: fe = Vector('0:1,1:1,2:1,4:1.77017,11:1,18:0.122528,19:1.34781,20:1,21:0.122528,22:1.34781,23:1,24:1,31:0.122528,32:1.34781,33:1,34:0.122528,35:1.34781,36:1.77017,43:0.216896,44:2.38586,45:1,46:0.122528,47:1.34781,48:13,50:1,51:1,52:1,53:1.77017,128:1,129:1,200:1,201:1,468:6,469:3,470:1,471:0.122528,472:1.77017,473:1,474:1,475:1,480:1,481:3,482:-2,521:1.77017,542:0.597565,543:0.696085,544:0.609126,545:0.702186,546:0.50291,547:0.521313,548:3.7336e+06,549:699743,550:511960,551:4,552:4,555:2,556:2,557:30.2658,558:26.9169')
In [5]: predictor.Predict(fe)
Out[5]: -0.6264205382000001
In [6]: print '\n'.join(predictor.SortedGainStrVec(fe, topNum=5))
f480:old_Term_freq_weight:-2.70486:1
f482:old_examplar:-2.59336:-2
f547:new_tuwen_showandtell_char:0.7709:0.521313
f546:new_tuwen_showandtell:0.520602:0.50291
f468:keyword_length:0.431322:6
C++ binary debug tool: gbdt_predict
It lives under the tools directory.
gbdt_predict -m LightGBM_model.txt -f '0:1,1:1,2:1,4:1.77017,11:1,18:0.122528,19:1.34781,20:1,21:0.122528,22:1.34781,23:1,24:1,31:0.122528,32:1.34781,33:1,34:0.122528,35:1.34781,36:1.77017,43:0.216896,44:2.38586,45:1,46:0.122528,47:1.34781,48:13,50:1,51:1,52:1,53:1.77017,128:1,129:1,200:1,201:1,468:6,469:3,470:1,471:0.122528,472:1.77017,473:1,474:1,475:1,480:1,481:3,482:-2,521:1.77017,542:0.597565,543:0.696085,544:0.609126,545:0.702186,546:0.50291,547:0.521313,548:3.7336e+06,549:699743,550:511960,551:4,552:4,555:2,556:2,557:30.2658,558:26.916'
I1118 10:00:56.809183 28321 LoadSave.h:55] gbdt try load as text from LightGBM_model.txt
I1118 10:00:56.809365 28321 GbdtPredictor.h:348] Loading from text file LightGBM_model.txt of LightGBM format
I1118 10:00:56.810501 28321 GbdtPredictor.h:240] numFeatures --- [559]
I1118 10:00:56.823609 28321 GbdtPredictor.h:310] _trees.size() --- [100]
I1118 10:00:56.823645 28321 Predictor.h:806] Try load feature names from ./feature_name.txt
I1118 10:00:56.824920 28321 GbdtPredictor.h:332] Per_feature_gain print top 10 -vl 0 will print only top 10, -vl 1 will print top 50, -vl 2 will print all
0:f471:total_wordrank 1
1:f546:new_tuwen_showandtell 0.712731
2:f480:old_Term_freq_weight 0.613357
3:f482:old_examplar 0.557893
4:f481:old_Keyword_length 0.516144
5:f548:new_idf_ranktext_all_simid 0.421158
6:f468:keyword_length 0.39352
7:f542:new_tuwen_bow 0.389431
8:f535:old_ner_PER 0.370457
9:f543:new_tuwen_rnn 0.361138
I1118 10:00:56.825709 28321 gbdt_predict.cc:47] predictorName: -- Notice for non gbdt binary classification model, proababilty is meaning less
I1118 10:00:56.825742 28321 gbdt_predict.cc:52] predict -- [-0.626421] probablity -- [-0.626421]
I1118 10:00:56.826164 28321 gbdt_predict.cc:78] Per feature gain for this predict, sortByGain: 1
0:f480:old_Term_freq_weight -2.70486 480:1
1:f482:old_examplar -2.59336 482:-2
2:f547:new_tuwen_showandtell_char 0.7709 547:0.521313
3:f546:new_tuwen_showandtell 0.520602 546:0.50291
4:f468:keyword_length 0.431322 468:6
5:f545:new_tuwen_rnn_bi 0.385684 545:0.702186
6:f543:new_tuwen_rnn 0.382433 543:0.696085
7:f542:new_tuwen_bow 0.349407 542:0.597565
8:f555:new_idf_query_top1match 0.314287 555:2
9:f548:new_idf_ranktext_all_simid -0.282919 548:3.7336e+06
10:f47:title_avg_normed_wordrank_weight 0.271767 47:1.34781
11:f550:new_idf_ranktext_simid -0.204529 550:511960
12:f45:title_avg_matches_weight 0.194325 45:1
13:f34:title_avg_wordrank_real 0.177239 34:0.122528
14:f257:desc_num_start_matches_weight 0.176843 257:0
15:f554:new_idf_query_top1ematch 0.161948 554:0
16:f43:title_wordrank_weight 0.160418 43:0.216896
17:f46:title_avg_wordrank_weight 0.134556 46:0.122528
18:f469:keyword_unicode_length -0.132079 469:3
Show the prediction path through the first tree:
gbdt_predict -m LightGBM_model.txt -t 0 -f '0:1,1:1,2:1,4:1.77017,11:1,18:0.122528,19:1.34781,20:1,21:0.122528,22:1.34781,23:1,24:1,31:0.122528,32:1.34781,33:1,34:0.122528,35:1.34781,36:1.77017,43:0.216896,44:2.38586,45:1,46:0.122528,47:1.34781,48:13,50:1,51:1,52:1,53:1.77017,128:1,129:1,200:1,201:1,468:6,469:3,470:1,471:0.122528,472:1.77017,473:1,474:1,475:1,480:1,481:3,482:-2,521:1.77017,542:0.597565,543:0.696085,544:0.609126,545:0.702186,546:0.50291,547:0.521313,548:3.7336e+06,549:699743,550:511960,551:4,552:4,555:2,556:2,557:30.2658,558:26.916'
I1118 10:03:04.561481 28535 LoadSave.h:55] gbdt try load as text from LightGBM_model.txt
I1118 10:03:04.561647 28535 GbdtPredictor.h:348] Loading from text file LightGBM_model.txt of LightGBM format
I1118 10:03:04.562968 28535 GbdtPredictor.h:240] numFeatures --- [559]
I1118 10:03:04.576159 28535 GbdtPredictor.h:310] _trees.size() --- [100]
I1118 10:03:04.576195 28535 Predictor.h:806] Try load feature names from ./feature_name.txt
I1118 10:03:04.577467 28535 GbdtPredictor.h:332] Per_feature_gain print top 10 -vl 0 will print only top 10, -vl 1 will print top 50, -vl 2 will print all
0:f471:total_wordrank 1
1:f546:new_tuwen_showandtell 0.712731
2:f480:old_Term_freq_weight 0.613357
3:f482:old_examplar 0.557893
4:f481:old_Keyword_length 0.516144
5:f548:new_idf_ranktext_all_simid 0.421158
6:f468:keyword_length 0.39352
7:f542:new_tuwen_bow 0.389431
8:f535:old_ner_PER 0.370457
9:f543:new_tuwen_rnn 0.361138
I1118 10:03:04.578189 28535 gbdt_predict.cc:47] predictorName: -- Notice for non gbdt binary classification model, proababilty is meaning less
I1118 10:03:04.578217 28535 gbdt_predict.cc:52] predict -- [-0.626421] probablity -- [-0.626421]
I1118 10:03:04.578239 28535 gbdt_predict.cc:71] Print tree: 0
$[471] f471:total_wordrank 0.12253 <= 0.19798 ?
| $[546] f546:new_tuwen_showandtell 0.50291 <= 0.50020 ?
| | [548] f548:new_idf_ranktext_all_simid 3733600.00000 <= 256634.00000 ?
| | | [469] f469:keyword_unicode_length 3.00000 <= 4.50000 ?
| | | | [482] f482:old_examplar -2.00000 <= 0.51275 ?
| | | | | -0.0298153
| | | | | [471] f471:total_wordrank 0.12253 <= 0.08275 ?
| | | | | | -0.0233632
| | | | | | -0.00122999
| | | | [482] f482:old_examplar -2.00000 <= 0.52598 ?
| | | | | -0.0126085
| | | | | 0.0137746
| | | [482] f482:old_examplar -2.00000 <= 0.47890 ?
| | | | -0.0409305
| | | | [471] f471:total_wordrank 0.12253 <= 0.07121 ?
| | | | | -0.0360865
| | | | | [469] f469:keyword_unicode_length 3.00000 <= 2.50000 ?
| | | | | | -0.0301188
| | | | | | -0.0082769
| | $[469] f469:keyword_unicode_length 3.00000 <= 4.50000 ?
| | | $[471] f471:total_wordrank 0.12253 <= 0.07836 ?
| | | | [481] f481:old_Keyword_length 3.00000 <= 3.50000 ?
| | | | | -0.0277535
| | | | | [479] f479:old_Clickquery_freq 0.00000 <= 0.50000 ?
| | | | | | -0.0156884
| | | | | | 0.0041274
| | | | $[547] f547:new_tuwen_showandtell_char 0.52131 <= 0.57042 ?
| | | | | $[-0.0042237]
| | | | | 0.0127872
| | | [471] f471:total_wordrank 0.12253 <= 0.06837 ?
| | | | [482] f482:old_examplar -2.00000 <= 0.54051 ?
| | | | | -0.00946544
| | | | | 0.0109279
| | | | [542] f542:new_tuwen_bow 0.59757 <= 0.64964 ?
| | | | | 0.0104062
| | | | | 0.0288594
| [468] f468:keyword_length 6.00000 <= 4.50000 ?
| | [55] f55:title_ner_PER 0.00000 <= 0.50000 ?
| | | [546] f546:new_tuwen_showandtell 0.50291 <= 0.36023 ?
| | | | -0.0262891
| | | | [471] f471:total_wordrank 0.12253 <= 0.41764 ?
| | | | | -0.00869735
| | | | | 0.00875897
| | | 0.0362065
| | [543] f543:new_tuwen_rnn 0.69608 <= 0.56176 ?
| | | [471] f471:total_wordrank 0.12253 <= 0.52321 ?
| | | | [482] f482:old_examplar -2.00000 <= 0.44422 ?
| | | | | [480] f480:old_Term_freq_weight 1.00000 <= 1.50000 ?
| | | | | | -0.0196232
| | | | | | 0.00265214
| | | | | [481] f481:old_Keyword_length 3.00000 <= 4.50000 ?
| | | | | | 0.0100783
| | | | | | 0.032422
| | | | [480] f480:old_Term_freq_weight 1.00000 <= 1.50000 ?
| | | | | 0.010391
| | | | | 0.0357248
| | | [471] f471:total_wordrank 0.12253 <= 0.50000 ?
| | | | [481] f481:old_Keyword_length 3.00000 <= 4.50000 ?
| | | | | 0.0239055
| | | | | 0.0384573
| | | | [475] f475:old_Title_freq 1.00000 <= 0.50000 ?
| | | | | 0.0306819
| | | | | 0.0429537