The training dataset can be found here.
Note: We trained on only 10% of the data and obtained better results than all Kaggle notebooks.
We computed an embedding for each website using the ada embeddings from OpenAI, tried multiple ML algorithms on those embeddings, and settled on the Random Forest Classifier.
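The embed-then-classify approach described above can be sketched roughly as follows. This is a minimal illustration and not the actual training code: `train_site_classifier`, `toy_embed`, and the sample data are hypothetical names introduced here, and `toy_embed` is only a stand-in for a call to the OpenAI ada embedding model. The hyperparameters are assumptions as well.

```python
from typing import Callable, Sequence

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_site_classifier(
    sites: Sequence[str],
    labels: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    seed: int = 0,
) -> RandomForestClassifier:
    """Embed every site and fit a random forest on the resulting vectors."""
    X = np.asarray([embed(s) for s in sites], dtype=float)
    # n_estimators and random_state are assumed values, not the ones used here.
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X, list(labels))
    return clf


def toy_embed(site: str) -> list:
    """Hypothetical stand-in for an ada embedding: a few character counts."""
    return [
        float(len(site)),
        float(site.count(".")),
        float(site.count("/")),
        float(sum(c.isdigit() for c in site)),
    ]


# Made-up sample data, for illustration only.
sample_sites = [
    "example.org/index.html",
    "docs.example.org/guide",
    "192.0.2.10/login/verify/account/123456",
    "198.51.100.7/secure/update/987654",
]
sample_labels = ["benign", "benign", "phishing", "phishing"]

model = train_site_classifier(sample_sites, sample_labels, toy_embed)
predictions = model.predict(np.asarray([toy_embed(s) for s in sample_sites]))
```

In a real run, `embed` would be replaced by a function that returns the ada embedding vector for each website; the rest of the sketch stays the same.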
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
benign | 0.85 | 0.99 | 0.91 | 3018 |
defacement | 0.91 | 0.90 | 0.91 | 2047 |
malware | 1.00 | 0.68 | 0.81 | 266 |
phishing | 0.99 | 0.44 | 0.61 | 669 |
accuracy | | | 0.88 | 6000 |
macro avg | 0.94 | 0.75 | 0.81 | 6000 |
weighted avg | 0.89 | 0.88 | 0.87 | 6000 |
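As a sanity check on how the two average rows relate to the per-class rows, the short sketch below recomputes them from the rounded values in the table. The helper names `macro_avg` and `weighted_avg` are introduced here for illustration. Because the inputs are already rounded to two decimals, the results only match the table to within rounding.

```python
# Per-class rows copied from the table: (precision, recall, f1, support)
report = {
    "benign": (0.85, 0.99, 0.91, 3018),
    "defacement": (0.91, 0.90, 0.91, 2047),
    "malware": (1.00, 0.68, 0.81, 266),
    "phishing": (0.99, 0.44, 0.61, 669),
}

SUPPORT = 3  # index of the support column in each row


def macro_avg(rows, col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean over classes weighted by support (number of samples per class)."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


for name, col in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(
        f"{name}: macro={macro_avg(report, col):.2f} "
        f"weighted={weighted_avg(report, col):.2f}"
    )
```

The macro average is pulled down by the small malware and phishing classes, while the weighted average is dominated by benign and defacement. The support-weighted recall is also the overall accuracy, which is why both are 0.88 in the table.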
As the table shows, the results are weak for malware and phishing: precision is high for both, but recall is only 0.68 and 0.44, so many of these sites go undetected. The application can be tried on Hugging Face.
Attention: the model is prone to errors.
License for the dataset: CC0 Public Domain