web_page_classification

Deep learning algorithms for web page classification, written in TensorFlow (Python).

Requirements

mkvirtualenv --python=`which python3` --system-site-packages wpc
pip install -r requirements.txt
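
A quick check that TensorFlow imports inside the new environment:

python -c "import tensorflow as tf; print(tf.__version__)"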

Running

Collect raw pages (DMOZ or UKWA):

python -m collect --data_dir ~/Downloads/wpc --dataset_type dmoz --max_file_num 5
python -m collect --data_dir ~/Downloads/wpc --pages_per_file 100 --max_file_num 5 --dataset_type dmoz --cat_num 5
python -m collect --data_dir ~/Downloads/wpc --pages_per_file 2000 --max_file_num 5 --dataset_type dmoz --cat_num 5
python -m collect --data_dir ~/Downloads/wpc --pages_per_file 1000 --max_file_num 5 --dataset_type dmoz --cat_num 10
python -m collect --data_dir ~/Downloads/wpc --pages_per_file 100 --max_file_num 5 --dataset_type ukwa
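
The collector batches pages into files (--pages_per_file) and caps the file count (--max_file_num). To eyeball the output, list the collected folder; the name html_5 here is an assumption inferred from the --html_folder values passed to convert below:

ls ~/Downloads/wpc/html_5 | head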

Convert collected pages into TFRecord or fastText input files:

python -m convert --data_dir /media/yuhui/linux/wpc_data/ --dataset_type dmoz --num_cats 5 --html_folder html_5
python -m convert --data_dir ~/Downloads/wpc --dataset_type dmoz --num_cats 5 --html_folder html_5-2500
python -m fastText_convert --data_dir ~/Downloads/wpc --dataset_type dmoz --num_cats 10 --html_folder html_10
python -m fastText_convert --data_dir ~/Downloads/wpc --dataset_type dmoz --num_cats 10 --html_folder html_10 --model svm
python -m convert --data_dir ~/Downloads/wpc --dataset_type dmoz --num_cats 10 --html_folder html_10
python -m convert --dataset_type dmoz --num_cats 10 --html_folder html_10
python -m convert --data_dir ~/Downloads/wpc --dataset_type dmoz --num_cats 5 --html_folder html_test --verbose False
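
To sanity-check a conversion, count the records in a shard. A minimal sketch assuming a TF1-style setup; the path is illustrative, not a name the module guarantees:

import os
import tensorflow as tf

# Illustrative path; substitute an actual TFRecord shard written by convert.
path = os.path.expanduser('~/Downloads/wpc/TFR_10/examples.train')
count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
print('records in %s: %d' % (path, count))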

Train a model (e.g. resnn or cnn):

python -m train --data_dir ~/Downloads/wpc/ --dataset dmoz-10 --model_type resnn --print_step 3 --summary_step 40 --checkpoint_step 300
python -m train --data_dir /media/yuhui/linux/wpc_data/ --dataset dmoz-10 --model_type resnn  --print_step 3 --summary_step 50 --checkpoint_step 300
python -m train --data_dir ~/Downloads/wpc/ --dataset dmoz-5-2500 --model_type resnn --if_eval False 
python -m train --data_dir ~/Downloads/wpc/ --dataset dmoz-10 --model_type resnn --print_step 2 --summary_step 50 --checkpoint_step 1000

python -m train --data_dir ~/Downloads/wpc/ --num_cats 5 --model_type cnn --tfr_folder TFR_5-2500
python -m train --data_dir ~/Downloads/wpc/ --num_cats 5 --model_type cnn --tfr_folder TFR_5-2500 --if_eval True
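
Training writes summaries every --summary_step steps, so progress can be watched in TensorBoard. The log directory below follows the outputs/ pattern used by eval and is an assumption:

tensorboard --logdir ~/Downloads/wpc/cnn/outputs/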

Evaluate a trained model from its checkpoints:

python -m eval --data_dir ~/Downloads/wpc/ --model_type cnn --train_dir ~/Downloads/wpc/cnn/outputs/

Train and test the fastText baselines:

./fasttext supervised -input data/TFR_10-fast/train -output model10
./fasttext test model10.bin data/TFR_10-fast/test 1

./fasttext supervised -input data/TFR_5-fast/train -output model5
./fasttext test model5.bin data/TFR_5-fast/test 1

./fasttext supervised -input data/TFR_ukwa-fast/train -output model-ukwa
./fasttext test model-ukwa.bin data/TFR_ukwa-fast/test 1
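
fastText's supervised mode expects one example per line with the label carried in a __label__ prefix, which the files emitted by fastText_convert presumably follow. Illustrative lines (the category names are made up):

__label__Arts lowercased tokenized text of one page ...
__label__Sports lowercased tokenized text of another page ...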

Modules

convert: to shuffle all category samples thoroughly, it first creates only two files, examples.train and examples.test. These two are then split into several TFRecord files; since those shards can still exceed 1 GB, a new argument caps the maximum number of samples per TFRecord file. A sketch of the sharding step follows.
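
A minimal sketch of that sharding step, assuming TF1-style TFRecord writing; the function name, shard naming, and max_per_tfr default are illustrative, not the module's actual code:

import tensorflow as tf

def split_into_shards(serialized_examples, out_prefix, max_per_tfr=1000):
    """Write pre-serialized tf.train.Example protos into shards holding
    at most max_per_tfr records each (out_prefix-0.tfr, out_prefix-1.tfr, ...)."""
    writer, shard, n = None, 0, 0
    for ex in serialized_examples:
        if writer is None or n >= max_per_tfr:
            if writer is not None:
                writer.close()
            writer = tf.python_io.TFRecordWriter('%s-%d.tfr' % (out_prefix, shard))
            shard += 1
            n = 0
        writer.write(ex)
        n += 1
    if writer is not None:
        writer.close()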

License

MIT

Note

googlescraper

Put chromedriver under ~/work/py-envs/wpc/bin/ (the bin/ directory of the wpc virtualenv) so it is on the PATH whenever the environment is active.
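
If the scraping is driven through Selenium (an assumption), the driver directory can also be added to the PATH explicitly:

import os
from selenium import webdriver

# Assumes the layout above; adjust if chromedriver lives elsewhere.
os.environ['PATH'] += os.pathsep + os.path.expanduser('~/work/py-envs/wpc/bin')
driver = webdriver.Chrome()  # finds chromedriver via PATH
driver.get('https://example.com')
print(driver.title)
driver.quit()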