A research project that aims to build a new detection tool using machine learning. A machine translation of the research thesis can be found here: https://drive.google.com/file/d/1BrK4KZSfYeVjzzUKvmr7bjtyXWeUOw8P/view?usp=sharing
Quick guide

```sh
apt install npm nodejs python3 python3-pip
npm install
pip install -r requirements.txt
node archive_page.js
python tf_caculation.py
python model_and_measurement.py
```
After all the reasoning, definitions, and theory, the main implementation starts at CHƯƠNG (CHAPTER) 3: INSTALLATION AND TESTING. Below I explain how the source code in this repo maps to the detection model proposed in the thesis. The main part of the thesis is building the machine learning classifier:
Web scraping automation using JavaScript and puppeteer:
- The URLs that need to be monitored are automatically processed by the web crawler, and the returned results include the resources and the HTML source code;
- The website resources are extracted as the detection signature set.

Source: archive_Script-hashlog.js

Input: a URL list file; you can use the provided 200-safe-vn.txt, 200-deface.csv, or your own list (edit the path directly in the source code):
```js
async function main(){
    let all_urls;
    all_urls = (await fse.readFile('200-safe-vn.txt', 'utf-8')).split('\r\n');
```
Output (run data):
- `fail-log.csv`: list of page URLs that errored during crawling
- Page data: every page's data is stored in its own `saved/<url_sanitize>` directory (all characters blocked in Windows file names are changed to `_`; the naming scheme is sketched after this list)
- `saved_responsed_file/`: every resource loaded by the pages is pulled and stored in this directory under its corresponding hash name
- `archived_log.csv`: logs every request with a timestamp; format `timestamp,type,url,status`, e.g. `"910","Request","https://www.moha.gov.vn/","Sent"`
- `title_log.csv`: the real title of the page
- `url_hash_log.csv`: contains all signatures that have been collected
- `index.html`, `index_loaded.html`, `text_only_index_loaded.html.txt`: the page HTML data in 3 versions:
  - `text_only_index_loaded.html.txt`: the page is loaded dynamically for 30s, but only the `html.innerText` data is kept
  - `index_loaded.html`: the page is loaded dynamically for 30s, then the fully loaded HTML is saved
  - `index.html`: static HTML file (the page is not loaded)
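The exact directory-naming and hashing logic lives in archive_Script-hashlog.js; the following is only a rough Python sketch of the idea, assuming Windows-illegal file-name characters are replaced by `_` and resources are named by a SHA-256 hash of their content (the actual hash used by the script may differ):

```python
import hashlib
import re

def sanitize_url(url: str) -> str:
    """Turn a URL into a directory name by replacing characters
    that Windows does not allow in file names with '_'."""
    return re.sub(r'[<>:"/\\|?*]', '_', url)

def resource_file_name(content: bytes) -> str:
    """Name a downloaded resource by the hash of its content
    (SHA-256 here; the real script may use a different hash)."""
    return hashlib.sha256(content).hexdigest()

# Example: where a crawled page and one of its resources would land
print(sanitize_url("https://www.moha.gov.vn/"))   # https___www.moha.gov.vn_
print(resource_file_name(b"<html>...</html>"))    # 64-hex-character file name
```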
I don't have and will not provide any of the data that was crawled and used to build the model. It has been lost and can't be recovered; the files in the saved directory were crawled recently just for example purposes. Repeated crawling and bypassing zone-h.org's protection mechanisms could lead to restrictions and an IP ban, so please use the crawler with caution.
n-gram frequency extraction
- The HTML source code data standardized by the HTML crawler is converted through a preprocessing step into feature vector form.
- After a large enough amount is collected as a profile, it is used as training data for a machine learning algorithm to build a classifier.

Source: tf_caculation.py

Input: the `saved` directory
- Count all 2-grams and 3-grams in each of the 3 html-data files [text_only_index_loaded.html.txt, index_loaded.html, index.html] and choose the top 300 n-grams with the highest appearance count
- Vectorize every page by counting how often each of those n-grams appears in the page's text file (a minimal sketch of this step follows the list)

Output: example output is provided
- `n_gram_count`: contains the counting results for every combination of input and config: `<n-gram><html-data>_count.csv`
- `tf_gram_count`: contains the vectorized page data using the top 300 n-grams of the corresponding html-data file
  - Each row in the CSV represents a page
  - The i'th column is the frequency of the i'th n-gram (among the top 300) in that page
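This is not the exact tf_caculation.py implementation; it is only a minimal sketch of the counting and vectorization step, assuming word-level n-grams and hypothetical file paths (the real script may tokenize the HTML differently):

```python
from collections import Counter
from pathlib import Path

def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(files, n):
    """Count every n-gram across all html-data files of one kind."""
    counts = Counter()
    for path in files:
        tokens = Path(path).read_text(encoding='utf-8', errors='ignore').split()
        counts.update(ngrams(tokens, n))
    return counts

def vectorize(path, top_ngrams, n):
    """Represent one page as the frequency of each top n-gram in it."""
    tokens = Path(path).read_text(encoding='utf-8', errors='ignore').split()
    page_counts = Counter(ngrams(tokens, n))
    return [page_counts[g] for g in top_ngrams]

# Hypothetical usage: one row per crawled page, 300 columns per row
pages = list(Path('saved').glob('*/index.html'))
top_300 = [g for g, _ in count_ngrams(pages, 2).most_common(300)]
rows = [vectorize(p, top_300, 2) for p in pages]
```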
Python machine learning framework (sklearn, matplotlib, numpy): RandomForest and NaiveBayes classifiers

Source: model_and_measurement.py

Input: vectorized data - you can use the example files from `tf_gram_count`

Output: result metrics only - examples: `0.25_final.csv`, `0.5_final.csv`
- RandomForest and NaiveBayes classifiers are used to model the input data (see the sketch after this list)
- The model isn't saved by default: a new model is trained on the input on each run and only the final result metrics are written out
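model_and_measurement.py itself isn't reproduced here; the following is only a hedged sketch of the approach using scikit-learn, with the CSV layout, label column, and classifier parameters assumed rather than taken from the repo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Assumed layout: one row per page, 300 n-gram frequency columns,
# plus a final label column (1 = defaced, 0 = safe).
data = np.loadtxt('tf_gram_count/example.csv', delimiter=',')  # hypothetical file
X, y = data[:, :-1], data[:, -1]

# 25% of the data held out for testing, matching 0.25_final.csv
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for name, clf in [('RandomForest', RandomForestClassifier()),
                  ('NaiveBayes', GaussianNB())]:
    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    print(name, tp, fp, tn, fn)
```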
Using the confusion matrix, I draw the conclusion based on 4 metrics (a small numeric example follows the list):
- Accuracy: overall accuracy of the model. (TP+TN) / (TP+FP+TN+FN) × 100%
- Precision: accuracy of predicting the attacked samples. TP / (TP+FP) × 100%
- Recall: ability of the model to find all attacked samples. TP / (TP+FN) × 100%
- Fail Alarm: false alarm rate. FP / (FP+TN) × 100%
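As a quick illustration of the formulas (the confusion-matrix counts below are made up, not results from the thesis):

```python
# Hypothetical confusion-matrix counts, only to illustrate the formulas
tp, fp, tn, fn = 90, 2, 98, 10

accuracy   = (tp + tn) / (tp + fp + tn + fn) * 100  # (90+98)/200 = 94.0%
precision  = tp / (tp + fp) * 100                   # 90/92  ≈ 97.83%
recall     = tp / (tp + fn) * 100                   # 90/100 = 90.0%
fail_alarm = fp / (fp + tn) * 100                   # 2/100  = 2.0%
```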
This repo only contains the results for 400 records (200 safe Vietnamese pages - 200 defaced pages) from the provided URL list files.
- `0.25_final.csv` contains the results of the model built with 75% of the data used for training and 25% used for testing
- `0.5_final.csv` contains the results of the model built with 50% of the data used for training and 50% used for testing

However, the largest run of the thesis has 1600 records (800 safe mixed-language pages - 800 defaced pages) with 50% training - 50% testing, which leads to the conclusion that the thesis's proposed model gives good performance and a very high accuracy rate.
Classification Algorithm | File | n-gram | Precision | Recall | Accuracy | Fail Alarm |
---|---|---|---|---|---|---|
RandomForest | index.html | 2 | 97.87% | 97.87% | 97.98% | 1.92% |
NaiveBayes | index.html | 2 | 99.40% | 88.27% | 94.18% | 0.48% |
RandomForest | index.html | 3 | 98.93% | 98.40% | 98.74% | 0.96% |
NaiveBayes | index.html | 3 | 98.82% | 89.60% | 94.56% | 0.96% |
RandomForest | index_loaded.html | 2 | 96.40% | 97.66% | 97.13% | 3.37% |
NaiveBayes | index_loaded.html | 2 | 99.40% | 86.98% | 93.50% | 0.48% |
RandomForest | index_loaded.html | 3 | 97.90% | 97.14% | 97.63% | 1.92% |
NaiveBayes | index_loaded.html | 3 | 99.41% | 87.24% | 93.63% | 0.48% |
RandomForest | text_only.txt | 2 | 95.94% | 98.95% | 97.49% | 3.86% |
NaiveBayes | text_only.txt | 2 | 99.15% | 91.88% | 95.73% | 0.72% |
RandomForest | text_only.txt | 3 | 96.92% | 98.95% | 97.99% | 2.89% |
NaiveBayes | text_only.txt | 3 | 97.55% | 93.98% | 95.98% | 2.17% |