/akhmetov-2016

Programs for encoding digital information as DNA.

DNA information storage

The main goal of the research was to design and implement a scheme for encoding digital information into DNA. This repository includes:

  • Codec for reading and writing DNA-encoded data.
  • Code for partitioning a large sequence of encoded DNA into smaller pieces.
  • Code for using error-correction capacity of the encoding scheme to correct mutations.
  • Code for simulating mutations.

Instructions for running code

All of the code was written in Python 3, using PyCharm Community Edition as the IDE, on Windows 7. In principle, Python is portable and all of this should work under any Python 3 environment (provided that module dependencies are satisfied), but I have not tested that.

It is strongly recommended that, before running any file, you read through all the code that will be executed. In many cases you are expected to specify parameters and input/output files in the section at the top of each file, or in constants.py, before you run it.

Note on performance

As written, some of the code may take more than a minute to run, depending on hardware and parameters. In most cases the code will at least print some progress information when this happens.

Dependencies

Various Python packages are used at several points in the project, as can be seen from the import statements. These were installed into the environment using the PyCharm "Project Interpreter" window, but they should also be installable with pip since that is what PyCharm uses.

Included data

The inputs and outputs from the paper are included as samples in the repository.

Codebook generator

make_codewords.py

Given a block length, this program will construct a list of DNA blocks which satisfy certain defined parameters. The resulting list of DNA blocks (i.e. short sequences) can be used to generate a codebook for a DNA codec (construct_codebooks.py) by simply assigning a number to each sequence.
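As a rough illustration of what "blocks which satisfy certain defined parameters" can look like, here is a minimal sketch. It is not the algorithm in make_codewords.py: the two constraints (a cap on repeated bases and a minimum pairwise Hamming distance), the parameter names and the greedy selection are all assumptions made for the example. Unlike the real script, this sketch is deterministic because it walks the candidates in a fixed order.

```python
from itertools import product


def has_homopolymer(seq, max_run=2):
    """Return True if any base repeats more than max_run times in a row."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def make_codewords(block_length, min_distance=2, max_run=2):
    """Greedily keep blocks that pass the filter and are far from all kept blocks.

    min_distance and max_run are hypothetical parameters, not the paper's.
    """
    chosen = []
    for bases in product("ACGT", repeat=block_length):
        seq = "".join(bases)
        if has_homopolymer(seq, max_run):
            continue
        if all(hamming(seq, other) >= min_distance for other in chosen):
            chosen.append(seq)
    return chosen


codewords = make_codewords(block_length=4)
print(len(codewords), "codewords, first few:", codewords[:5])
```

A minimum distance of 2 between codewords is what would let a single substitution be detected; correcting it would need a larger distance.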

To reiterate the internal terminology of this project: A codebook is a list of short DNA sequences mapped to numbers. Codewords are these DNA sequences per se, that is to say without necessarily being mapped to anything.

The algorithm is somewhat non-deterministic. Part of the reason is that it depends on the iteration order of Python language constructs such as dictionaries, which have non-deterministic order by design and in practice. If run repeatedly with the same parameters, the results will often vary trivially, and may sometimes vary non-trivially. It is expected that this code will be run a handful of times until a satisfactory codeword list is generated, and that this codeword list is then used for all subsequent work.
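If repeatable output is ever wanted, the usual remedy is to impose an explicit order before iterating and to seed any random choices. The sketch below shows the idea only; the candidate set and the seed value are made up, and make_codewords.py itself does not do this. Note that in current Python versions dictionaries preserve insertion order, whereas sets of strings can still iterate in a different order from one interpreter run to the next.

```python
import random

# Hypothetical candidate blocks held in a set.
candidates = {"ACGT", "TGCA", "GATC", "CTAG"}

# Iteration order of a set of strings can differ between interpreter runs,
# because string hashing is randomised per process unless PYTHONHASHSEED is set.
unordered = list(candidates)

# Sorting first gives the same order on every run.
ordered = sorted(candidates)

# If random choices are involved, a seeded generator makes them repeatable.
rng = random.Random(2016)
shuffled = ordered[:]
rng.shuffle(shuffled)

print(ordered)
print(shuffled)
```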

The results are saved in codewords.txt by default. The codewords.txt that was actually used for the paper (and sample encoded DNA) is included under this path already -- if this file is changed, the sample data we provide can no longer be decoded with it, so be careful when running this script!

construct_codebooks.py

This file maps codewords to integers (0, 1, 2, ...) to create a codebook which can be used by the DNA codec to encode/decode digital information to/from DNA.

  • A list of codewords, generated by make_codewords.py, must be given. Input/codewords.txt is the list used in the paper.
  • Output/Codebook contains the forward and reverse codebooks. We have provided the ones used in the paper.

Two codebooks are generated: the codebook itself, which is a Python dictionary from string (DNA sequence) to integer, and a reverse codebook, which is the same dictionary but with keys and values reversed. The reverse codebook is generated because it simplifies decoding logic, but it is theoretically redundant, since it can be trivially generated from the codebook.
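The two dictionaries described above can be sketched in a few lines. The codeword list here is a four-entry placeholder, not the contents of codewords.txt, and the variable names are illustrative.

```python
# Placeholder codewords; the real list comes from make_codewords.py.
codewords = ["ACGT", "CATG", "GTAC", "TGCA"]

# Codebook: DNA sequence -> integer, numbered in list order (0, 1, 2, ...).
codebook = {seq: number for number, seq in enumerate(codewords)}

# Reverse codebook: the same pairs with keys and values swapped.
reverse_codebook = {number: seq for seq, number in codebook.items()}

print(codebook)
print(reverse_codebook)
```

Swapping keys and values is only lossless because every codeword is unique, so each integer maps back to exactly one sequence.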

In intended usage, the codebook pair is generated once and then used for all subsequent encoding/decoding operations. The codebooks must be stored securely for a sufficiently long time, since they will be required for eventual decoding (decoding with an unknown codebook may be possible, but is not supported). If multiple users will be encoding/decoding information from each other, they must ensure that they all use the same codebooks.
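One way for several users to check that they hold the same codebook is to compare a fingerprint of its contents. The following is a sketch under assumptions: the codebook is a small placeholder, the fingerprint function is hypothetical and not part of this repository, and a temporary folder stands in for wherever the pickle file is really kept.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path

# Placeholder codebook (DNA sequence -> integer).
codebook = {"ACGT": 0, "CATG": 1, "GTAC": 2, "TGCA": 3}


def fingerprint(book):
    """Hash the codebook entries in sorted order, so dict order does not matter."""
    canonical = json.dumps(sorted(book.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "codebook.pickle"
    # Save the codebook once ...
    with open(path, "wb") as handle:
        pickle.dump(codebook, handle)
    # ... and load it again for later encoding or decoding.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(fingerprint(restored) == fingerprint(codebook))
```

Users can exchange the hexadecimal fingerprint by any channel; if the values differ, the codebooks differ.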

DNA codec

dna_write.py

Digital file goes in, DNA comes out. Requires a codebook (Input/codebook.pickle).

  • Input/tars/ folder contains the four tar archives containing the test data. The archives can be extracted to obtain the original files, but it was the tar archive that was used as input in every case.
  • Intermediate data/encoded bytes.txt is the byte stream of the digital data just before encoding, dumped here for testing/debug.
  • Output/Encoded DNA/ contains the files resulting from encoding the tars.
    • encoded DNA is a single DNA sequence. This is the file of interest.
    • encoded blocks is the same sequence formatted as a list of blocks, for testing/debug.
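To make the relationship between the byte stream, the block list and the single DNA sequence concrete, here is a minimal encoding sketch. It assumes one codeword per byte and uses a placeholder codebook of all 256 four-base blocks; the actual scheme in dna_write.py, its codebook and its block length may differ, and the dictionary names are chosen for the example only.

```python
from itertools import product

# Placeholder codebook: every 4-base block, numbered 0..255 in lexicographic order.
block_to_number = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=4))}
number_to_block = {number: block for block, number in block_to_number.items()}


def encode(data: bytes):
    """Map each byte to one DNA block; return the joined sequence and the block list."""
    blocks = [number_to_block[byte] for byte in data]
    return "".join(blocks), blocks


dna, blocks = encode(b"Hi")
print(dna)
print(blocks)
```

The joined string corresponds to the "encoded DNA" output and the list to the "encoded blocks" output described above.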

dna_read.py

Converts DNA (which must not contain errors) into a digital file. Requires the "reverse codebook" (reverse codebook.pickle) corresponding to the codebook used for encoding. The reverse codebook is simply the ordinary codebook (which itself is a map from DNA blocks to integers) with the keys and values reversed, to simplify program logic.

  • Input/Encoded DNA/ folder contains the DNA sequences representing encoded test data. This is a copy of the output from dna_write.py.
  • Intermediate data/decoded bytes.txt is the byte stream of the digital data just after decoding, dumped here for testing/debug.
  • Output/tars/ contains the files recovered from the decoding, saved with the .tar extension (note that our scheme specifies that all data will be wrapped in a tar archive when being encoded). This can be extracted to observe the recovered original data.
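The round trip from tar archive to DNA and back can be sketched as follows. As before, this assumes one four-base block per byte and a placeholder codebook, so it illustrates the data flow rather than the real dna_read.py; the helper names and the file name hello.txt are invented for the example.

```python
import io
import tarfile
from itertools import product

BLOCK_LENGTH = 4
# Placeholder codebook: every 4-base block, numbered 0..255.
block_to_number = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=BLOCK_LENGTH))}
number_to_block = {number: block for block, number in block_to_number.items()}


def make_tar(name: str, payload: bytes) -> bytes:
    """Wrap a single in-memory file in an uncompressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def decode(dna: str) -> bytes:
    """Cut the sequence into blocks and look each block up in the codebook."""
    if len(dna) % BLOCK_LENGTH:
        raise ValueError("DNA length is not a whole number of blocks")
    blocks = (dna[i:i + BLOCK_LENGTH] for i in range(0, len(dna), BLOCK_LENGTH))
    return bytes(block_to_number[block] for block in blocks)


def read_tar_member(tar_bytes: bytes, name: str) -> bytes:
    """Extract one named file from a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        return archive.extractfile(name).read()


tar_bytes = make_tar("hello.txt", b"hello DNA")
dna = "".join(number_to_block[byte] for byte in tar_bytes)
recovered = read_tar_member(decode(dna), "hello.txt")
print(recovered)
```

A block that is missing from the codebook raises KeyError here, which reflects the requirement that the input DNA contains no errors.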

split_into_packets.py

Breaks down a single, long DNA sequence into short overlapping fragments, in preparation for oligo array synthesis.

  • Input/Encoded DNA/ should contain the encoded DNA files (from dna_write.py) that will be packaged.
  • The resulting packets will be saved as a collection of FASTA-formatted sequences in a single text file under Output/Packets. The pieces are numbered to help with debugging, but their names are not important.
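Splitting into overlapping fragments and writing them as FASTA can be sketched like this. The packet length, overlap and naming scheme are illustrative assumptions, not the values used by split_into_packets.py.

```python
def split_into_packets(dna, packet_length=100, overlap=25):
    """Cut dna into fragments where each fragment repeats the last `overlap`
    bases of the previous one. The final fragment may be shorter."""
    if not 0 <= overlap < packet_length:
        raise ValueError("overlap must be non-negative and smaller than packet_length")
    step = packet_length - overlap
    packets = []
    start = 0
    while True:
        packets.append(dna[start:start + packet_length])
        if start + packet_length >= len(dna):
            break
        start += step
    return packets


def to_fasta(packets, prefix="packet"):
    """Format fragments as FASTA records with numbered header lines."""
    lines = []
    for index, packet in enumerate(packets):
        lines.append(f">{prefix}_{index}")
        lines.append(packet)
    return "\n".join(lines) + "\n"


packets = split_into_packets("ACGTACGTAC", packet_length=4, overlap=2)
fasta_text = to_fasta(packets)
print(fasta_text)
```

Because consecutive fragments share their overlapping bases, the original sequence can be rebuilt by taking the first fragment whole and appending the non-overlapping tail of each later one.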