- Build an experimental prototype of an ML-based malicious file detection engine. Malicious files = [Windows files, PDFs, Android apps]
- Basic
- CPU Consumption: (to be filled in)
- Memory Consumption: (to be filled in)
- Size of the dataset: needs to be at the 5M-sample scale
- ML Model Training Time using the whole dataset: within 12 hours
- Average prediction time of the model per sample in the firewall deployment scenario: needs to be at the 1 ms level (see the latency-measurement sketch after this list)
- Maximum prediction time of the model per sample in the firewall deployment scenario: needs to be at the 10 ms level
- Model False Positive Rate: needs to be < 0.0001%
- Model Update Frequency: needs to be once every day
- Model accuracy: needs to be 99.999%
- Evidence of effectiveness: effective discovery of unknown threats (e.g., cross-validation with the VT service)
- Medium
- Extend supported file types from Windows files (PEs, DLLs) to MS-Office docs, PDFs, Android & iOS apps, Linux binaries, etc.
- Cascaded Malware Prediction Engine: from coarse-grained good/bad file classification to further malware family prediction (see the cascaded-engine sketch after this list)
- Advanced
- Construction of the cloud-based backend management system
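
A minimal sketch of how the per-sample latency targets above (≈1 ms average, <10 ms maximum) could be checked. The scikit-learn classifier, the 64-dimensional synthetic feature vectors, and the dataset sizes are illustrative assumptions standing in for real extracted file features, not the engine's actual design.

```python
# Hypothetical latency check against the ~1 ms average / <10 ms maximum targets.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: random vectors standing in for extracted file features.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 64))
y_train = rng.integers(0, 2, 1000)          # 0 = benign, 1 = malicious
X_test = rng.random((500, 64))

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Time each single-sample prediction, as the firewall would issue them.
latencies_ms = []
for sample in X_test:
    start = time.perf_counter()
    model.predict(sample.reshape(1, -1))
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

print(f"average latency: {np.mean(latencies_ms):.3f} ms (target: ~1 ms)")
print(f"maximum latency: {np.max(latencies_ms):.3f} ms (target: <10 ms)")
```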
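A hypothetical sketch of the cascaded Malware Prediction Engine idea listed under Medium: a stage-1 good/bad classifier, followed by a stage-2 malware-family classifier that runs only on samples flagged as malicious. The model types, family count, and synthetic features are placeholders.

```python
# Sketch of a two-stage (cascaded) prediction pipeline on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(1)

# Placeholder feature vectors; in practice these would come from static
# analysis of PE, Office, PDF, APK, etc. files.
X = rng.random((2000, 64))
is_malicious = rng.integers(0, 2, 2000)     # stage-1 labels: 0 = benign, 1 = malicious
family = rng.integers(0, 5, 2000)           # stage-2 labels: e.g. 5 malware families

# Stage 1: coarse-grained good/bad classifier.
stage1 = GradientBoostingClassifier(random_state=0).fit(X, is_malicious)

# Stage 2: family classifier trained only on the malicious samples.
mal_idx = is_malicious == 1
stage2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X[mal_idx], family[mal_idx]
)

def predict_cascaded(sample: np.ndarray) -> str:
    """Return 'benign' or a predicted malware-family id for one sample."""
    sample = sample.reshape(1, -1)
    if stage1.predict(sample)[0] == 0:
        return "benign"
    return f"malware_family_{stage2.predict(sample)[0]}"

print(predict_cascaded(rng.random(64)))
```

The design choice here is that the cheaper stage-1 decision gates the more specific stage-2 prediction, so most (benign) traffic never pays the cost of family classification.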
- Upon how much data does the machine learning solution base its decisions? Is it enough?
- From where does the data come? Is there a wide variety of sources, or are they dependent on third-party threat aggregator sites?
- How often is the data collected?
- How often are new models trained and propagated to the customer?
- How is the system trained? Is it trained through a constant supply of rich data sets, so properties discovered can be used in future machine learning decisions?
- How does the vendor handle false positives?
- How does the vendor handle false negatives that the vendor later discovers (after the customer has run the malware)?
- How quickly can the solution make a determination that leads to action?
- How quickly can it obtain enough relevant new data to influence the decisions it makes?
- Where and how quickly does the analysis take place?
- What is the impact on the end-user system?
- What type of analysis is done on incoming files? On endpoints only, on cloud only, or a combination?
- Does it rely on post-event analysis (detecting rather than preventing)?
- Ongoing training of the model similarly relies on continuous access to large amounts of new data.
- The rate of false positives can be extremely high if the data set is not robust.
- It is important to note that the machine must have access to both benign and malicious data in order to accurately distinguish between the two.
- Training a model based solely on bad data increases the chance of high false-positive rates.
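
A minimal sketch of the last two points, using synthetic data and a scikit-learn LogisticRegression as a stand-in model: the training set contains both benign and malicious samples, and the false-positive rate is measured on held-out benign files. The feature distributions and sizes are assumptions for illustration only.

```python
# Sketch: train on both benign and malicious samples, then measure the
# false-positive rate on held-out benign samples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic stand-ins: benign and malicious feature vectors drawn from
# slightly different distributions.
benign = rng.normal(0.0, 1.0, size=(5000, 32))
malicious = rng.normal(0.7, 1.0, size=(5000, 32))
X = np.vstack([benign, malicious])
y = np.concatenate([np.zeros(5000), np.ones(5000)])   # 0 = benign, 1 = malicious

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# False-positive rate: benign test files predicted as malicious.
pred = clf.predict(X_te)
benign_mask = y_te == 0
fpr = np.mean(pred[benign_mask] == 1)
print(f"false-positive rate on held-out benign samples: {fpr:.4%}")
```

Without the benign half of the dataset, the model has nothing to contrast malware against, and the false-positive rate cannot even be estimated, let alone driven toward the < 0.0001% target above.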