logpai/loglizer

SVM: evaluate on same train set

ai2010 opened this issue · 2 comments

Hello,

I am using the structured HDFS file provided in the repository, data/HDFS/HDFS_100k.log_structured.csv. I split it into train and test sets and train an SVM:

(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(struct_log,
                                                            label_file=label_file,
                                                            window='session',
                                                            train_ratio=0.5,
                                                            split_type='uniform')

feature_extractor = preprocessing.FeatureExtractor()
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
model = SVM()
model.fit(x_train,y_train)

To check that everything works, I then evaluate on the same x_train data:
precision, recall, f1 = model.evaluate(x_train, y_train)

However, the metrics are:
'Precision: 1.000, recall: 0.365, F1-measure: 0.535'
I would expect near-perfect metrics, since I am evaluating on the same data the model was trained on. Do you know what the issue is?

Hi. I ran into the same problem a few days ago. It is caused by the HDFS_100k data. If you use the full HDFS data set, the performance will be similar to that reported in the paper.

Yes, the benchmarks were run on the full data. The code for them is available in benchmarks/, not demo/.
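
For anyone reading the numbers in this thread, the sketch below shows how precision, recall, and F1 relate, and confirms that the reported values are consistent with each other. It does not use loglizer. The helper names `f1_from` and `metrics_from_counts` and the example counts (365 detected anomalies, 635 missed, 0 false alarms) are hypothetical, chosen only to reproduce the reported ratios. They are not taken from the HDFS_100k data.

```python
def f1_from(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn):
    """Compute (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# The metrics reported in the issue are consistent with each other.
reported_f1 = f1_from(1.000, 0.365)
print(f"F1 from reported precision/recall: {reported_f1:.3f}")  # 0.535

# Hypothetical counts (assumption, for illustration only): no false alarms,
# but 635 of 1000 anomalies are missed.
precision, recall, f1 = metrics_from_counts(tp=365, fp=0, fn=635)
print(f"Precision: {precision:.3f}, recall: {recall:.3f}, F1-measure: {f1:.3f}")
```

A precision of 1.000 with a recall of 0.365 means that every session flagged as anomalous really was anomalous, while roughly two thirds of the anomalous sessions were predicted as normal.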