itkach/slob

[request] script to remove images.

Closed this issue · 3 comments

Dmole commented

Something like

slob trim_img in.slob out.slob

would be useful for making smaller files for Aard 2.
Or a more generic regex filter to accomplish the same thing:

slob trim --rewrite_txt 's/<img[^<>]+src="[^"]+\.(jpg|png|gif)"\/?>/[image placeholder]/g' --exclude_files '.*\.(jpg|png|gif)' in.slob out.slob

or whatever expressions best suit the format it's stored in.
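Roughly this kind of substitution, in Python (just a sketch; the function name is made up):

import re

IMG_RE = re.compile(r'<img[^<>]+src="[^"]+\.(?:jpg|png|gif)"[^<>]*/?>')

def trim_img(html):
    # replace each image tag with a placeholder, as proposed above
    return IMG_RE.sub('[image placeholder]', html)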

itkach commented

I don't think this library is the appropriate place to manipulate content like this. Using regular expressions to manipulate XML/HTML is also error-prone and cumbersome, and I'm not sure removing image tags from Wikipedia articles actually saves much. In any case, mwscrape2slob already implements the ability to filter content using CSS selectors, and includes an image thumbnail filter example.
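The general idea of selector-based filtering looks like this (a sketch only, not mwscrape2slob's actual code; assumes lxml with cssselect installed):

from lxml import html

def strip_by_selector(doc_text, selector='img'):
    # parse the HTML and drop every element matching the CSS selector;
    # this is far more robust than running a regex over the markup
    doc = html.fromstring(doc_text)
    for el in doc.cssselect(selector):
        el.drop_tree()
    return html.tostring(doc, encoding='unicode')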

I'm not convinced that manipulating content inside a slob is really needed, but perhaps the convert command implementation should be expanded to allow plugging in external converter functions, to make it easier to write such scripts: currently one would need to repeat the mapping of content blobs to the set of keys pointing to them, to avoid duplicating content, since many keys may point to the same blob. I will think about it.
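For concreteness, here is roughly what such a script has to do today (a sketch, assuming the reader/writer API shown in the README, slob.open, slob.create and Writer.add, plus Blob's id, key, content_type and content attributes; key fragments are ignored):

import slob

def copy_with_converter(in_path, out_path, convert):
    with slob.open(in_path) as r:
        # group keys by blob id so each blob is written exactly once,
        # since many keys may point to the same blob
        keys_by_blob = {}
        for blob in r:
            keys_by_blob.setdefault(blob.id, []).append(blob.key)
        with slob.create(out_path) as w:
            seen = set()
            for blob in r:
                if blob.id in seen:
                    continue
                seen.add(blob.id)
                content = convert(blob.content, blob.content_type)
                w.add(content, *keys_by_blob[blob.id],
                      content_type=blob.content_type)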

Dmole commented

So I was starting to write my own script to do this, but in the process I noticed there were no images in the slob at all: only the output HTML is stored, so images are loaded from the web when you are online (this can be disabled in the Aard 2 settings).

The other thing I noticed is that the output HTML is about 14 times larger than the plain text (but only 2 times larger when compressed). So if one wanted a 50% smaller Wikipedia dump, storing the wiki source instead would be ideally small yet still functional (linked and formatted); storing one link per article to the online references, instead of storing them all offline, gets the compressed size down by a factor of about 3.
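(Working from the measurements in the test below: 1392390 / 95839 ≈ 14.5 uncompressed, 62195 / 27306 ≈ 2.3 compressed, and 62195 / 21883 ≈ 2.8 with the references section stripped.)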

I guess that size difference is not going to be worth adding a feature to Aard 2 for, as microSDXC capacity is only growing (not that I would expect my phone to work with a 200GB SanDisk beast).

Dmole commented

Related code:

import slob

s = slob.open('enwiki-20150406.slob')

#
# total number of blobs
print(s.blob_count)
#### 4988503

#
# count blobs by content type (walks the private _store index)
current = 0
countHTML = 0
countJS = 0
countCSS = 0
countJSON = 0
countSVG = 0
for bin_index, store_item in enumerate(s._store):
    for item_index in range(len(store_item.content_type_ids)):
        if current % 10000 == 0:
            print('\nc = {} HTML = {} JS = {} CSS = {} JSON = {} SVG = {}'.format(current, countHTML, countJS, countCSS, countJSON, countSVG))
        current += 1
        content_type, content = s._store.get(bin_index, item_index)
        if content_type == "text/html;charset=utf-8":
            countHTML += 1
            continue
        if content_type == "application/javascript":
            countJS += 1
            continue
        if content_type == "text/css":
            countCSS += 1
            continue
        if content_type == "application/json":
            countJSON += 1
            continue
        if content_type == "image/svg+xml":
            countSVG += 1
            continue
        print('\nunexpected content type = {} '.format(content_type))
#### c = 250000 HTML = 249912 JS = 82 CSS = 4 JSON = 1 SVG = 1

#
# find largest blob (max_size avoids shadowing the built-in max)
max_size = 0
for bin_index, store_item in enumerate(s._store):
    for item_index in range(len(store_item.content_type_ids)):
        content_type, content = s._store.get(bin_index, item_index)
        if max_size < len(content):
            max_size = len(content)
            print('\nmax = {} for {} {} '.format(max_size, bin_index, item_index))
#### 1392390 for 1062 16

#
# export largest blob for inspection
bin_index = 1062
item_index = 16
content_type, content = s._store.get(bin_index, item_index)
with open('out.html', 'w') as text_file:
    text_file.write(content.decode('utf-8'))
#
# size reduction test (shell from here on; stat -f is the BSD/macOS form)
perl -pe 's/<[^<>]+>//g' out.html > out.txt           # strip tags
grep -B 9999 -m 1 References out.txt > out_noref.txt  # cut at first "References"
7z a -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on out.html.7z out.html
7z a -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on out.txt.7z out.txt
7z a -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on out_noref.txt.7z out_noref.txt
stat -f "%z %N" out*
#1392390 out.html
#62195 out.html.7z
#95839 out.txt
#27306 out.txt.7z
#72991 out_noref.txt
#21883 out_noref.txt.7z
echo "scale=3;62195/21883"|bc
#### 2.842
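# i.e. compressed HTML is about 2.8 times the size of compressed plain text with references stripped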