- Build an experimental prototype of an ML-based malicious file detection engine. Malicious files = [Windows Files, PDF, Android APPs]
- Basic
- CPU Consumption: (to be filled in)
- Memory Consumption: (to be filled in)
- Size of the dataset: needs to be at the 5M scale
- ML Model Training Time using the whole dataset: Within 12 hours
- Average model prediction time in the firewall scenario (per single sample): needs to be at the 1 ms level
- Maximum model prediction time in the firewall scenario (per single sample): needs to be at the 10 ms level
- Model False Positive Rate: needs to be < 0.0001%
- Model Update Frequency: needs to be once every day
- Model accuracy: needs to be 99.999%
- Self-evidence of effectiveness: effective discovery of unknown threats (e.g., cross-validation against the VT service)
- Medium
- File type support extension from Windows Files (PEs, DLLs) to MS-Office docs, PDFs, Android & iOS, Linux etc.
- Cascaded Malware Prediction Engine: from coarse-grained good/bad file classification to finer-grained malware family prediction
- Advanced
- Construction of the cloud-based backend management system
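The two per-sample prediction-time targets in the Basic tier can be checked with a small timing harness. The following is a minimal sketch, not the prototype's actual benchmark: it assumes that "1 ms level" and "10 ms level" mean at most 1 ms on average and at most 10 ms in the worst case, and `dummy_predict` is a hypothetical stand-in for the real model.

```python
import time
from statistics import mean

# Budgets from the Basic requirements; reading "level" as an upper bound is an assumption.
AVG_BUDGET_MS = 1.0
MAX_BUDGET_MS = 10.0


def measure_latency_ms(predict, samples):
    """Return the wall-clock prediction time of each sample, in milliseconds."""
    timings = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def check_latency(timings_ms, avg_budget_ms=AVG_BUDGET_MS, max_budget_ms=MAX_BUDGET_MS):
    """Compare the average and worst-case timings with their budgets."""
    avg_ms = mean(timings_ms)
    worst_ms = max(timings_ms)
    return {
        "avg_ms": avg_ms,
        "max_ms": worst_ms,
        "avg_ok": avg_ms <= avg_budget_ms,
        "max_ok": worst_ms <= max_budget_ms,
    }


def dummy_predict(sample):
    # Hypothetical placeholder model: flags files that start with the "MZ" magic bytes.
    return sample[:2] == b"MZ"


if __name__ == "__main__":
    samples = [b"MZ\x90\x00" * 64, b"%PDF-1.7" * 32] * 1000
    print(check_latency(measure_latency_ms(dummy_predict, samples)))
```

In a real firewall deployment the timing would also have to include feature extraction, which this sketch leaves out.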
Upon how much data does the machine learning solution base its decisions? Is it enough?
From where does the data come? Is there a wide variety of sources, or are they dependent on third-party threat aggregator sites?
How often is the data collected?
How often are new models trained and propagated to the customer?
How is the system trained? Is it trained through a constant supply of rich data sets, so properties discovered can be used in future machine learning decisions?
How does the vendor handle false positives?
How does the vendor handle false negatives that the vendor later discovers (after the customer has run the malware)?
How quickly can the solution make a determination that leads to action?
How quickly can it obtain enough relevant new data to influence the decisions it makes?
Where and how quickly does the analysis take place?
What is the impact on the end-user system?
What type of analysis is done on incoming files? On endpoints only, on cloud only, or a combination?
Does it rely on post-event analysis (detecting rather than preventing)?
Ongoing training of the model similarly relies on continuous access to large amounts of new data.
The rate of false positives can be extremely high if the data set is not robust.
It is important to note that the machine must have access to both benign and malicious data in order to accurately distinguish between the two.
Training a model based solely on bad data increases the chance of high false-positive rates.
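The dependence on benign data can be made concrete: the false-positive rate is computed over benign samples only, so a target such as FPR < 0.0001% cannot even be measured without a large benign set. The sketch below is illustrative and rests on stated assumptions: benign test samples are treated as independent draws, the confidence level of 95% is chosen here for illustration, and the function names are hypothetical.

```python
import math


def false_positive_rate(false_positives, true_negatives):
    """FPR = FP / (FP + TN); only benign samples enter the denominator."""
    benign_total = false_positives + true_negatives
    if benign_total == 0:
        raise ValueError("FPR is undefined without benign samples")
    return false_positives / benign_total


def min_benign_samples(target_fpr, confidence=0.95):
    """Smallest n for which zero false positives among n independent benign
    samples supports FPR < target_fpr at the given confidence.

    Derived from (1 - target_fpr) ** n <= 1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-target_fpr))


TARGET_FPR = 0.0001 / 100  # 0.0001% written as a fraction, i.e. 1e-6

if __name__ == "__main__":
    print(min_benign_samples(TARGET_FPR))
```

Under these assumptions the required benign test set is close to three million samples with no false positives at all, which is the same order of magnitude as the 5M dataset target.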