istresearch/scrapy-cluster

Why is the body converted to bytes before encoding?

YanzhongSu opened this issue · 2 comments

In pipelines.py, why does the body first need to be converted to bytes and then base64-encoded?

Can we not store the body (by default it is ) itself directly? What happens if we just leave it as it is?
My understanding is that if we transmit the body itself, the data might be corrupted during transmission.

if self.use_base64:
    datum['body'] = base64.b64encode(bytes(datum['body'], 'utf-8'))
    message = ujson.dumps(datum, sort_keys=True)
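
For reference, here is a minimal sketch of the round trip the snippet above implies, under a few assumptions: the standard-library `json` stands in for `ujson`, the helper names `encode_body` and `decode_body` are hypothetical, and the extra `.decode('ascii')` step is only there because stdlib `json` cannot serialize `bytes`. It is not the project's actual code.

```python
import base64
import json  # stand-in for ujson in this sketch (assumption)


def encode_body(body: str) -> str:
    """Hypothetical helper: str -> UTF-8 bytes -> base64 -> ASCII str."""
    raw = body.encode('utf-8')          # same result as bytes(body, 'utf-8')
    encoded = base64.b64encode(raw)     # b64encode returns bytes
    return encoded.decode('ascii')      # base64 output is always ASCII


def decode_body(encoded: str) -> str:
    """Hypothetical helper for a consumer: reverse the steps above."""
    return base64.b64decode(encoded).decode('utf-8')


original = '<html>café ☕</html>'
datum = {'body': encode_body(original)}
message = json.dumps(datum, sort_keys=True)

# A consumer reading the message can recover the exact original body.
restored = decode_body(json.loads(message)['body'])
```

The encoded body contains only ASCII characters, so it passes unchanged through systems that may not handle arbitrary bytes or non-ASCII text. The cost is that the payload grows by roughly a third.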

The Python 3 docs have the following:
https://docs.python.org/3/library/base64.html#base64.b64encode

base64.b64encode(s, altchars=None)
Encode the bytes-like object s using Base64 and return the encoded bytes.

A bytes-like object is
https://docs.python.org/3/glossary.html#term-bytes-like-object

bytes-like object
An object that supports the Buffer Protocol and can export a C-contiguous buffer. This includes all bytes, bytearray, and array.array objects, as well as many common memoryview objects.

I think the docs explain why we encode the value into bytes before passing it into the function: b64encode accepts only bytes-like objects, and a Python 3 str is not one. This may also have been a crossover from Python 2 and Python 3 string compatibility, where it is just easier to say "everything is always bytes."
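
A quick way to see this in Python 3 is the short sketch below. The variable names are illustrative only and the example is separate from the scrapy-cluster code.

```python
import base64

body = 'hello'  # a Python 3 str, which is not a bytes-like object

# Passing the str directly is rejected by b64encode.
try:
    base64.b64encode(body)
    rejected = False
except TypeError:
    rejected = True

# Converting to bytes first works, as in pipelines.py.
encoded = base64.b64encode(bytes(body, 'utf-8'))

# Other bytes-like objects, such as bytearray, are accepted too.
encoded_from_bytearray = base64.b64encode(bytearray(b'hello'))
```

Here `encoded` is `b'aGVsbG8='`, and decoding it with `base64.b64decode` gives back `b'hello'`.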

If this answers your question, please close the ticket.

@madisonb Thank you for your answer.