FFMFormat

FM和FFM模型是最近几年提出的模型,凭借其在数据量比较大并且特征稀疏的情况下,仍然能够得到优秀的性能和效果的特性,常用于计算广告中的CTR,CVR预估。美团点评技术团队写过一篇《深入FFM原理与实践》博客,链接地址:https://tech.meituan.com/deep_understanding_of_ffm_principles_and_practices.html ,写的很详细。然后目前FFM常用的python库有libffm,xlearn等。

pandas到libffm的转化可参考kaggle的TalkingData AdTracking Fraud Detection Challenge竞赛:https://www.kaggle.com/mpearmain/pandas-to-libffm

Data Format(libffm格式)

为了使用FFM方法,所有特征必须转化成

label field_id:feature_id:value field_id:feature_id:value field_id:feature_id:value ...

field_id表示每一个特征域的id号

feature_id表示所有特征值的id号(可采用连续编码以及hash编码)

value:当特征域不是连续特征时,value=1,若为连续特征,value=该特征的值

libffm格式

有必要理解什么是field,feature,value。举个例子:

对于pandas DataFrame格式数据来说:

label  category_feature  continuous_feature  vector_feature
=====  ================  ==================  ==============
0           x               1.1               1 2
1           y               1.2               3 4 5   
0           x               2.2               6 7 8 9

首先给各个特征域编号:

dict_field[category_feature] = 0

dict_field[continuous_feature] = 1

dict_field[vector_feature] = 2

然后给每个特征值编号:

dict_feature[category_feature-x] = 0

dict_feature[continuous_feature-1.1] = 1

dict_feature[vector_feature-1] = 2

dict_feature[vector_feature-2] = 3

dict_feature[category_feature-y] = 4

dict_feature[continuous_feature-1.2] = 5

dict_feature[vector_feature-3] = 6

dict_feature[vector_feature-4] = 7

dict_feature[vector_feature-5] = 8

dict_feature[category_feature-x] = 0 # category_feature重复项编号不变

dict_feature[continuous_feature-2.2] = 9

dict_feature[vector_feature-6] = 10

dict_feature[vector_feature-7] = 11

dict_feature[vector_feature-8] = 12

dict_feature[vector_feature-9] = 13

最后设置value的值:

dict_value[category_feature-x] = 1

dict_value[continuous_feature-1.1] = 1

dict_value[vector_feature-1] = 1

dict_value[vector_feature-2] = 1

dict_value[category_feature-y] = 1

dict_value[continuous_feature-1.2] = 1.2

dict_value[vector_feature-3] = 1

dict_value[vector_feature-4] = 1

dict_value[vector_feature-5] = 1

dict_value[category_feature-x] = 1

dict_value[continuous_feature-2.2] = 2.2

dict_value[vector_feature-6] = 1

dict_value[vector_feature-7] = 1

dict_value[vector_feature-8] = 1

dict_value[vector_feature-9] = 1

综上,我们可得到FFM Format data:

0 0:0:1 1:1:1.1 2:2:1 2:3:1

1 0:4:1 1:5:1.2 2:6:1 2:7:1 2:8:1

0 0:0:1 1:9:2.2 2:10:1 2:11:1 2:12:1 2:13:1

本文仅有category_feature,continuous_feature,vector_feature。若还有其他特征可自行修改添加。在格式化FFM之前,连续型数据最好归一化,保证能够收敛

libffm库参考:https://github.com/guestwalk/libffm

xlearn使用方法参考:http://xlearn-doc.readthedocs.io/en/latest/start.html