seclab-fudan/APIGraph

Re-implementation of Drebin

Closed this issue · 4 comments

E0HYL commented

Hi, thanks for the contribution.
For the implementation of Drebin, may I ask whether the feature set of strings (e.g., component names, network addresses) is extracted from your 322K samples?
In my opinion, the large number of unique strings can lead to a huge feature dimension.
Am I wrong? If not, could you explain the size of the final feature vector that is fed to the ML classifier?

You're right. We ran experiments and found that, with 322K samples, the resulting feature set is too large to train and test the classifier.
As a result, we followed related work [1], whose implementation does not use constant strings.

[1] Experimental Study with Real-world Data for Android App Security Analysis using Machine Learning - ACSAC15
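
To make the dimensionality point concrete, here is a minimal sketch of building a 0-1 feature vocabulary with and without the string-valued features. It is not the APIGraph or [1] code: the `family::value` naming, the `STRING_PREFIXES` list, and the three toy apps are assumptions made up for illustration.

```python
# Hypothetical prefixes for the free-form string feature families
# (component names and network addresses); not taken from the APIGraph code.
STRING_PREFIXES = ("activity::", "service::", "receiver::", "provider::", "url::")


def build_vocabulary(apps, drop_strings):
    """Map each distinct feature to a column index, optionally skipping string features."""
    vocab = {}
    for features in apps:
        for feat in sorted(features):
            if drop_strings and feat.startswith(STRING_PREFIXES):
                continue
            vocab.setdefault(feat, len(vocab))
    return vocab


def vectorize(features, vocab):
    """Encode one app as a 0-1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for feat in features:
        if feat in vocab:
            vec[vocab[feat]] = 1
    return vec


# Toy apps: permissions and API calls repeat across apps,
# while component names and URLs are unique to each app.
apps = [
    {"permission::SEND_SMS", "api_call::sendTextMessage",
     "activity::com.a.Main", "url::a.example"},
    {"permission::SEND_SMS", "api_call::getDeviceId",
     "activity::com.b.Start", "url::b.example"},
    {"permission::INTERNET", "api_call::getDeviceId",
     "activity::com.c.Home", "url::c.example"},
]

full_vocab = build_vocabulary(apps, drop_strings=False)
reduced_vocab = build_vocabulary(apps, drop_strings=True)
print(len(full_vocab), len(reduced_vocab))  # prints "10 4"
```

In this toy data every app adds two new string columns, so the full vocabulary grows with the number of apps, while the permission and API columns are shared and stay small.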

E0HYL commented

Thanks for the kind reply.

The paper you mentioned does not seem to come with an open-source re-implementation of Drebin. Would it be convenient for you to open-source the Drebin code enhanced with APIGraph, or to describe roughly how it was implemented? Drebin seems to encode the extracted features according to word frequency, which is unrelated to the vectorized representation of the APIs themselves.

Same question. The released code does not mention how to integrate APIGraph with Drebin. I think a possible way is to replace the 0-1 API vectors in Drebin with the corresponding APIGraph embeddings, though I am not sure about the performance :)
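
A rough sketch of that idea, in case it helps the discussion. This is only one possible reading of the suggestion, not the authors' implementation: the helper names, the mean-pooling step, and the toy 2-dimensional embeddings are all assumptions, and real APIGraph embeddings would have to be loaded from the released artifacts.

```python
def embed_app(api_calls, api_embeddings, dim):
    """Mean-pool the embeddings of the APIs an app uses (zeros if none are known)."""
    known = [api_embeddings[api] for api in api_calls if api in api_embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def build_feature_vector(other_features, api_calls, api_embeddings, dim):
    """Keep the non-API 0-1 features and append the pooled API embedding."""
    return list(other_features) + embed_app(api_calls, api_embeddings, dim)


# Toy embeddings (hypothetical values, 2 dimensions for readability).
api_embeddings = {
    "sendTextMessage": [1.0, 0.0],
    "getDeviceId": [0.0, 1.0],
}

# Non-API part of the vector, e.g. two permission bits.
permission_bits = [1, 0]
app_api_calls = ["sendTextMessage", "getDeviceId", "unknownApi"]

vector = build_feature_vector(permission_bits, app_api_calls, api_embeddings, dim=2)
print(vector)  # prints "[1, 0, 0.5, 0.5]"
```

Mean pooling keeps the vector length fixed no matter how many APIs an app calls, but it discards which individual APIs were present, so whether it helps the classifier would need to be checked experimentally.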