Read and write Spage.
Spage is a loosely structured data format that describes a fetched record. Generally speaking, it contains four sub-blocks: url, inner_header, http_header, and data.
Spage:
- url: the URL.
- inner_header: key-values that record fetch/process info, such as fetch-time, data-digest, record-type, etc.
- http_header: key-values from the HTTP header of the server's response.
- data: the fetched data, either plain or compressed HTML.
Spage is implemented as a dict. A predefined schema can be used for validation.
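As a rough illustration of the layout described above, a Spage record can be pictured as a plain dict holding the four sub-blocks. The `validate_spage` helper below is hypothetical and is not the predefined schema shipped with os-spage; it only checks the shape described in this section.

```python
# Hypothetical sketch: a Spage-like record as a plain dict.
# The key names follow the four sub-blocks described above; the checks
# are an assumption for illustration, not the library's predefined schema.

def validate_spage(record):
    """Return a list of problems found in a Spage-like dict (empty if none)."""
    problems = []
    if not isinstance(record.get("url"), str) or not record.get("url"):
        problems.append("url must be a non-empty string")
    for block in ("inner_header", "http_header"):
        if not isinstance(record.get(block, {}), dict):
            problems.append(block + " must be a dict of key-values")
    if not isinstance(record.get("data", b""), (bytes, bytearray)):
        problems.append("data must be bytes")
    return problems


record = {
    "url": "http://www.google.com/",
    "inner_header": {"batchID": "test"},
    "http_header": {"Content-Type": "text/html"},
    "data": b"Hello world!",
}

print(validate_spage(record))       # no problems expected for this record
print(validate_spage({"url": ""}))  # reports the empty url
```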
Spage is commonly written to size-rotated files; os-rotatefile is the default back-end.
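For readers unfamiliar with size rotation, the idea can be sketched with the standard library alone. The class name, the `base.N` file naming, and the roll-over rule below are assumptions made for illustration; they do not describe how os-rotatefile actually names or rolls its files.

```python
import os
import tempfile


class SizeRotatingWriter:
    """Hypothetical sketch: start a new numbered file once roll_size is exceeded."""

    def __init__(self, base_path, roll_size):
        self.base_path = base_path
        self.roll_size = roll_size
        self.index = 0
        self._fp = open(self._path(), "wb")

    def _path(self):
        # File naming (base.0, base.1, ...) is an assumption for this sketch.
        return "{}.{}".format(self.base_path, self.index)

    def write(self, chunk):
        # Roll over before a write that would push a non-empty file past the limit.
        if self._fp.tell() > 0 and self._fp.tell() + len(chunk) > self.roll_size:
            self._fp.close()
            self.index += 1
            self._fp = open(self._path(), "wb")
        self._fp.write(chunk)

    def close(self):
        self._fp.close()


tmp_dir = tempfile.mkdtemp()
writer = SizeRotatingWriter(os.path.join(tmp_dir, "file"), roll_size=10)
for chunk in (b"aaaaaa", b"bbbbbb", b"cccccc"):
    writer.write(chunk)
writer.close()

# Each 6-byte chunk lands in its own file because two chunks exceed 10 bytes.
print(sorted(os.listdir(tmp_dir)))
```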
Notice:
- os-spage should not be used for strict serialization/deserialization. Type information is lost on write, so all data is read back as strings (unicode in Python 2).
- The data is usually stored in compressed form. Use zlib.decompress to decompress it.
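A minimal sketch of the decompression step, using only the standard library. The `stored` bytes here stand in for the data block of a record read back from a file; how the library compresses on write is assumed to be plain zlib, as the note above suggests.

```python
import zlib

html = b"<html><body>Hello world!</body></html>"

# Stand-in for a stored record whose data block was compressed on write.
stored = zlib.compress(html)

# Recover the original bytes from the compressed data block.
restored = zlib.decompress(stored)
print(restored == html)
```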
Offpage:
From v0.4, this library supports reading from offpage. Offpage is another data storage format that includes the url, headers, and series data. Use the read/open_file methods with page_type="offpage" to read from offpage.
From v0.5, spage records can be transformed into offpage. Use the read/open_file methods with page_type="s2o" to read from spage and transform each record into offpage format. (Not fully tested yet.)
Example:
from os_spage import read

# Open the spage file in binary mode; the with-block closes it afterwards.
with open('your_spage', 'rb') as f:
    for offpage in read(f, page_type='s2o'):
        print(offpage)
pip install os-spage
- Write to size-rotate-file
from os_spage import open_file
url = 'http://www.google.com/'
inner_header = {'User-Agent': 'Mozilla/5.0', 'batchID': 'test'}
http_header = {'Content-Type': 'text/html'}
data = b"Hello world!"
f = open_file('file', 'w', roll_size='1G', compress=True)
f.write(url, inner_header=inner_header, http_header=http_header, data=data, flush=True)
f.close()
- Read from size-rotate-file
from os_spage import open_file
f = open_file('file', 'r')
for record in f.read():
    print(record)
f.close()
- R/W with other file-like object
from io import BytesIO
from os_spage import read, write
s = BytesIO()
write(s, "http://www.google.com/")
s.seek(0)
for record in read(s):
    print(record)
$ tox
MIT licensed.