For example:
http://safebooru.org/index.php?page=post&s=view&id=2844551
Crawl images from id=0 up to the current max id and collect the tag metadata that appears in the HTML, such as `class="tag-type-copyright"`, `tag-type-general`, or `tag-type-character`. Write it to `ori_tags.csv`, whose header row is:
id,img_src,tags,types
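The crawl step can be sketched roughly as below, using only the standard library and an inline HTML sample instead of a live request. The page markup (an `<li>` per tag with a link inside, an `<img id="image">` for the picture), the `PostParser` and `post_to_row` names, and the choice to store `tags` and `types` as space-separated strings are all assumptions for illustration, not confirmed details of the site or of this project.

```python
import csv
import io
from html.parser import HTMLParser

TAG_TYPES = ("copyright", "general", "character")


class PostParser(HTMLParser):
    """Collects tag names, tag types and the image URL from one post page."""

    def __init__(self):
        super().__init__()
        self.img_src = ""
        self.tags = []          # tag names, in page order
        self.types = []         # tag type for each entry in self.tags
        self._li_type = None    # type of the <li> currently open, if any
        self._in_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li":
            for tag_type in TAG_TYPES:
                if "tag-type-" + tag_type in classes:
                    self._li_type = tag_type
        elif tag == "a" and self._li_type is not None:
            self._in_link = True
            self._text = []
        elif tag == "img" and attrs.get("id") == "image":
            # Assumed markup: the main picture carries id="image".
            self.img_src = attrs.get("src") or ""

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Assumption: spaces become underscores so tags can be space-joined.
            name = "".join(self._text).strip().replace(" ", "_")
            if name and name != "?":  # skip "?" helper links, if present
                self.tags.append(name)
                self.types.append(self._li_type)
            self._in_link = False
        elif tag == "li":
            self._li_type = None


def post_to_row(post_id, html):
    """Turns one page of HTML into a row for ori_tags.csv."""
    parser = PostParser()
    parser.feed(html)
    return {
        "id": post_id,
        "img_src": parser.img_src,
        "tags": " ".join(parser.tags),
        "types": " ".join(parser.types),
    }


# Made-up sample page; a real crawler would fetch each id in turn.
sample_html = """
<img id="image" src="//example.invalid/images/1/sample.jpg"/>
<ul>
<li class="tag-type-copyright"><a href="#">some_series</a></li>
<li class="tag-type-character"><a href="#">some_character</a></li>
<li class="tag-type-general"><a href="#">long hair</a></li>
</ul>
"""

row = post_to_row(1, sample_html)
buf = io.StringIO()  # stands in for the ori_tags.csv file handle
writer = csv.DictWriter(buf, fieldnames=["id", "img_src", "tags", "types"])
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

In a real run the `StringIO` buffer would be replaced by an open `ori_tags.csv` file, and ids that return no image would simply be skipped.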
- Scan the whole original CSV and encode each tag as a tag index, building the lookup functions `tag2index` and `index2tag`.
- Count all tags using `Counter`. As our task only focuses on background removal, just find the 512 most popular `general` tags (which I call `attribute` tags).
- Filter out images that do not include any of these 512 tags.
- Reindex these tags to 0~511
- Construct a dict that maps `image_id` to `attr_index`. Be aware that pictures that were not downloaded successfully could cause a PyTorch DataLoader exception, so we need to clean them out using `os.path.exists`. Finally, we cache the dict to `img_id2attr.pkl`, which gives us the training set.
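The preprocessing steps above could look something like the following minimal sketch. It runs on a few made-up rows rather than the real `ori_tags.csv`, uses `k=2` in place of 512 so the result is easy to read, and assumes that `tags` and `types` are space-separated and that images are saved as `<id>.jpg`. The helper names `build_vocab`, `top_general_tags` and `build_img_id2attr` are hypothetical, and "filter out" is read here as dropping images that carry none of the selected tags.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_vocab(rows):
    """Assigns every distinct tag an index: tag2index (dict), index2tag (list)."""
    tag2index, index2tag = {}, []
    for row in rows:
        for tag in row["tags"].split():
            if tag not in tag2index:
                tag2index[tag] = len(index2tag)
                index2tag.append(tag)
    return tag2index, index2tag


def top_general_tags(rows, k=512):
    """Counts 'general' tags and reindexes the k most common to 0..k-1."""
    counts = Counter()
    for row in rows:
        for tag, tag_type in zip(row["tags"].split(), row["types"].split()):
            if tag_type == "general":
                counts[tag] += 1
    return {tag: i for i, (tag, _) in enumerate(counts.most_common(k))}


def build_img_id2attr(rows, attr2index, image_path):
    """Maps image id -> sorted attribute indices, skipping unusable images."""
    img_id2attr = {}
    for row in rows:
        attrs = sorted({attr2index[t] for t in row["tags"].split() if t in attr2index})
        # Drop images with no selected tag or with a missing file on disk.
        if attrs and os.path.exists(image_path(row["id"])):
            img_id2attr[row["id"]] = attrs
    return img_id2attr


# Made-up rows standing in for ori_tags.csv.
rows = [
    {"id": "1", "tags": "long_hair smile some_series", "types": "general general copyright"},
    {"id": "2", "tags": "long_hair", "types": "general"},
    {"id": "3", "tags": "some_character", "types": "character"},
    {"id": "4", "tags": "smile long_hair", "types": "general general"},
]

tag2index, index2tag = build_vocab(rows)
attr2index = top_general_tags(rows, k=2)  # the document uses 512

with tempfile.TemporaryDirectory() as image_dir:
    # Pretend images 1-3 downloaded successfully and image 4 did not.
    for img_id in ("1", "2", "3"):
        open(os.path.join(image_dir, img_id + ".jpg"), "wb").close()

    img_id2attr = build_img_id2attr(
        rows, attr2index, lambda i: os.path.join(image_dir, i + ".jpg")
    )

    # Cache the mapping, then read it back to check the round trip.
    cache_path = os.path.join(image_dir, "img_id2attr.pkl")
    with open(cache_path, "wb") as f:
        pickle.dump(img_id2attr, f)
    with open(cache_path, "rb") as f:
        reloaded = pickle.load(f)

print(img_id2attr)  # {'1': [0, 1], '2': [0]}
```

Image 3 is dropped because it has no selected attribute, and image 4 because its file is missing, which is the case the `os.path.exists` check is meant to catch before the DataLoader sees it.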
Read the `img_id2attr.pkl` mentioned above, cast the indices into one-hot encoding, and start training.
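As a rough illustration of the encoding step, the sketch below turns each image's attribute indices into a fixed-length 0/1 vector. It assumes each `attr_index` value is a list of indices in 0~511, so an image with several attributes gets several ones (strictly a multi-hot vector). The `to_multi_hot` helper is a hypothetical name, and an in-memory pickle stands in for reading `img_id2attr.pkl` from disk.

```python
import pickle

NUM_ATTRS = 512


def to_multi_hot(attr_indices, num_attrs=NUM_ATTRS):
    """Returns a list of num_attrs floats with 1.0 at each attribute index."""
    vec = [0.0] * num_attrs
    for idx in attr_indices:
        vec[idx] = 1.0
    return vec


# Stand-in for: pickle.load(open("img_id2attr.pkl", "rb")), with made-up data.
cached = pickle.dumps({"1": [0, 3], "2": [511]})
img_id2attr = pickle.loads(cached)

targets = {img_id: to_multi_hot(attrs) for img_id, attrs in img_id2attr.items()}
print(len(targets["1"]), sum(targets["1"]))  # 512 2.0
```

In a PyTorch `Dataset`, the returned list could be wrapped with `torch.tensor(...)` inside `__getitem__` so that the DataLoader batches the targets alongside the images.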