Data preprocess

Crawl Safebooru

eg: http://safebooru.org/index.php?page=post&s=view&id=2844551

Crawl Image from id=0 to current max id and collect tag meta information which appears in the html code, such as class="tag-type-copyright" , tag-type-general or tag-type-character, write them in a ori_tags.csv whose table head includes: id,img_src,tags,types

Find most popular 512 attribute tags

Scan over the whole original csv and encode each tag into a tag index, forming functions tag2index index2tag
Count all tags using Counter, as our task only focus on background removal, just find the most popular 512 general(which I call it attribute) tags
Filter out those images which don't include these 512 tags
Reindex these tags to 0~511
Construct a dict which maps image_id to attr_index. And we should be aware of those pictures which are not successful downloaded, they cound cause pytorch dataloader exception, so we need to clean them by using os.path.exists.Finally, we cache it to img_id2attr.pkl, so that the training set is produced

Dataset

Read the mentioned img_id2attr.pkl,cast the index into one-hot encoding and start training.

seekerzz/Pytorch-Illustration2Vec

Data preprocess

Crawl Safebooru

Find most popular 512 attribute tags

Dataset