/steganos

This is a library to encode bits into text.... steganography in text!

Primary LanguagePythonGNU Lesser General Public License v3.0LGPL-3.0

steganos

This is a library to encode bits into text.

Installation

You can install from source by doing,

$ git clone git@github.com:fastforwardlabs/steganos.git
$ cd steganos
$ python setup.py install

or simply,

$ pip install git+https://github.com/fastforwardlabs/steganos.git

Encoding

To find out how many bits can be encoded into a string:

import steganos

original_text = '"Hello," he said.\n\t"I am 9 years old"'
capacity = steganos.bit_capacity(original_text)

To encode bits into a string:

import steganos

bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)

Decoding

Retrieving the bits from a string requires the original text into which the bits were encoded.

If you have the complete encoded text, use the decode_full_text function:

import steganos

bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
recovered_bits = steganos.decode_full_text(encoded_text, original_text)
# recovered_bits.startswith('101') == True

If you have on part of the encoded text, you can use the decode_partial_text function. If you know the indices of the original text that the partial encoded text corresponds to, you can pass those in as a tuple (start_index, end_index) as the final parameter. Otherwise, they will be inferred.

import steganos

bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
partial_text = encoded_text[:8]
recovered_bits = steganos.decode_partial_text(partial_text, original_text)
# recovered_bits.startswith('1?1') == True

Sending messages

In order to help send encoded messages as opposed to just storing bytes, we provide bytes_to_binary and binary_to_bytes in order to encode/decode a message to and from steganos' binary format.

import steganos

message = b'Hello World!'
original_text = open('text.txt').read()

bits = steganos.bytes_to_binary(message)
encoded_text = steganos.encode(bits, original_text)

recovered_bits = steganos.decode_full_text(encoded_text, original_text)
recovered_msg = steganos.binary_to_bytes(recovered_bits)

# recovered_msg.startswith(b'Hello World!') == True

A note on message length

By default, and decoded message will be the maximum length encodable within the source document. That is to say, if you have a document that can store 8 bits and your message is just two bits, the decoded result will be your two bits repeated four times. This can be solved by providing the message_bits parameter to the decode function. In addition to returning with the proper number of bits, this also will give possible increased accuracy for partial decodings.

bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
partial_text = encoded_text[14:26]
recovered_bits = steganos.decode_partial_text(partial_text, original_text)
recovered_bits_limit = steganos.decode_partial_text(partial_text, original_text, message_bits=3)
# recovered_bits == '1??101'
# recovered_bits_limit = '101'

Extending Steganos

Steganos encoding works by generating 'branchpoints' for a given original text. Each branchpoint represents a change to the text that does not change the meaning of the text. Each branchpoint is 'executed', which means that the change it defines is made, according to the bits we are trying to encode. For example, if we want to encode '10' in a text for which we can generate two branchpoints, the first of those is executed and the second is not. Note that if there are more branchpoints available than there are bits to encode, the bits are repeated to make use of the spare capacity. For example, if we want to encode '10' in a text with 4 branchpoints, steganos.encode automatically encodes '1010', improving our ability to retrieve the encoded information from an incomplete encoded text.

Steganos decoding works by figuring out which branchpoints were executed on a given text. It does this by comparing the encoded text to the original.

The Data Model

Each branchpoint is represented as a list of changes. Each change is a tuple of length three. The first two elements are the start and end indices of the chunk to be removed from the text, and the third element is the text with which it is to be replaced. The end index is non-inclusive. Branchpoints are represented in this way so that they can be easily interleaved.

Adding Branchpoints

Adding a new type of branchpoint should only entail changes to src/branchpoints.py and test/branchpoints_test.py. Simply add a function that accepts a string and returns a list of branchpoints represented in the manner described above.

Note that there are functions called unicode_branchpoints, ascii_branchpoints and global_branchpointsin the branchpoints module. Functions that add branchpoints that take advantage of unicode codepoints should be called from the unicode_branchpoints function. Other local branchpoints should be called from the ascii_branchpoints function.

Some changes to the text only make sense when applied universally (e.g. using oxford commas). These can be represented as a single branchopint with many changes. Functions that find global branchpoints should be called from the global_branchpoints function.

The get_all_branchpoints function in that module will then integrate the new branchpoints appropriately, and no further changes will have to be made.

Please note that adding new branchpoints will make it impossible to decode text that had been encoded before those branchpoints were added. As such, we should bump the version every time new branchpoints are added and keep track of which texts were encoded with which version.

An arbitrary example to demonstrate a function that finds branchpoints with multiple changes each is below. This will generate branchoints that every time the letter 'a' appears will change it to 'x' and will change the letter two before to 'y'. This is of course not a legitimate branchpoint because it alters the semantics of the text.

def example_branchpoints(text: str):
    a_indices = [index for index, char in enumerate(text) if char == 'a']
    return [[(index - 2, index - 1, 'y'), (index, index + 1, 'x')] for index in a_indices]

Running Tests

Get pytest with pip install pytest, then run py.test test/. There are no production dependencies.

TODO

  • The code contains only sample global, ascii, and unicode branchpoints.
  • Enable flag for 'ascii-only' branchpoints.