os-spage

Read and write Spage.

Spage is a loosely packed data structure that describes a fetched record. It generally contains four sub-blocks: url, inner_header, http_header, and data.

Spage:

url: the URL.
inner_header: key-value pairs that record fetch/process info, such as fetch-time, data-digest, record-type, etc.
http_header: key-value pairs taken from the server's HTTP response headers.
data: the fetched data, either plain or compressed HTML.

Spage is implemented as a dict. A predefined schema can be used for validation.
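
As a rough illustration of the layout described above, a record can be pictured as a plain dict with the four sub-blocks as keys. The key names follow the list above; the sample values and the missing_blocks helper are hypothetical and are not part of os-spage, whose real schema may be stricter.

```python
# Hypothetical sketch of a Spage record as a plain dict.
# The sample values and the helper below are illustrative assumptions,
# not the library's actual schema or API.
REQUIRED_BLOCKS = ("url", "inner_header", "http_header", "data")

record = {
    "url": "http://www.example.com/",
    "inner_header": {"fetch-time": "2020-01-01 00:00:00", "record-type": "flat"},
    "http_header": {"Content-Type": "text/html"},
    "data": b"<html>Hello world!</html>",
}


def missing_blocks(rec):
    """Return the names of the sub-blocks that are absent from rec."""
    return [name for name in REQUIRED_BLOCKS if name not in rec]


print(missing_blocks(record))  # []
print(missing_blocks({"url": "http://www.example.com/"}))
```

A check like this only confirms that the four sub-blocks are present; it says nothing about the types or contents of the header values.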

It is common to write Spage to size-rotated files, so we chose os-rotatefile as the default back end.

Notice:

os-spage should not be used for strict serialization/deserialization: type information is lost on write, and all values are read back as strings (unicode in Python 2).
Usually, the data is stored in compressed form. You can use the zlib.decompress method to decompress it.
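
The two notices above can be sketched with the standard library alone. The record below is a hypothetical stand-in for what a compressed Spage might look like after reading; it assumes that header values arrive as strings and that the payload is zlib-compressed bytes. It does not call os-spage itself.

```python
import zlib

# Hypothetical record, shaped the way a compressed Spage might look after reading.
# Assumptions: header values arrive as strings and the payload is
# zlib-compressed bytes.
original_html = b"<html>Hello world!</html>"
record = {
    "url": "http://www.example.com/",
    "inner_header": {"fetch-time": "1500000000"},
    "data": zlib.compress(original_html),
}

# Type info is not preserved, so cast header values back explicitly.
fetch_time = int(record["inner_header"]["fetch-time"])

# Restore the original payload from its compressed form.
html = zlib.decompress(record["data"])

print(fetch_time)
print(html)
```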

Offpage:

Since v0.4, this library supports reading from offpage. Offpage is another data storage format that includes url, headers, and series data. You can use the read/open_file methods with page_type="offpage" to read from offpage.

Since v0.5, transforming spage into offpage is supported. You can use the read/open_file methods with page_type="s2o" to read from spage and transform each record into offpage format. (Not fully tested yet.)

Example:

from os_spage import read

# Open in binary mode and close the file automatically when done.
with open('your_spage', 'rb') as f:
    for offpage in read(f, page_type='s2o'):
        print(offpage)

Install

pip install os-spage

Usage

Write to size-rotate-file

  from os_spage import open_file

  url = 'http://www.google.com/'
  inner_header = {'User-Agent': 'Mozilla/5.0', 'batchID': 'test'}
  http_header = {'Content-Type': 'text/html'}
  data = b"Hello world!"

  f = open_file('file', 'w', roll_size='1G', compress=True)
  f.write(url, inner_header=inner_header, http_header=http_header, data=data, flush=True)
  f.close()

Read from size-rotate-file

  from os_spage import open_file

  f = open_file('file', 'r')

  for record in f.read():
      print(record)
  f.close()

Read/write with other file-like objects

  from io import BytesIO
  from os_spage import read, write

  s = BytesIO()
  write(s, "http://www.google.com/")

  s.seek(0)
  for record in read(s):
      print(record)

Unit Tests

$ tox

License

MIT licensed.