humburg/pirates

Ensure read names in output file are compatible with downstream tools


We currently use the header line of the FASTQ records to store a lot of information (at least if there are several sequence differences in a cluster). To avoid problems with the downstream processing, we should ensure that

  • Each sequence in the output file has a unique name
  • The header line is no longer than 254 characters
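The two requirements above could be checked on an output file with something like the following. This is only a sketch: `check_headers` is a hypothetical helper, not part of the current code, and it assumes that the read name is the text between the leading `@` and the first whitespace, and that the 254 character limit applies to the whole header line including the `@`.

```python
def check_headers(headers, max_len=254):
    """Return (header, problem) pairs for headers that break either rule.

    Assumptions (not taken from the issue): the read name is the text after
    '@' up to the first whitespace, and max_len counts the whole line.
    """
    seen = set()
    problems = []
    for header in headers:
        # Split off the read name; anything after the first whitespace is
        # treated as a free-form description.
        parts = header[1:].split(None, 1)
        name = parts[0] if parts else ""
        if len(header) > max_len:
            problems.append((header, "too long"))
        if name in seen:
            problems.append((header, "duplicate name"))
        seen.add(name)
    return problems
```

An empty result means every header passed both checks.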

As far as I can tell, the requirements on the line separating the sequence and the qualities are less strict. We could use that line to store the diff, since diffs can get quite long. For the first line of the record, I would suggest generating a unique name, possibly consisting of a (user-provided?) prefix, either an index or the label, the cluster size, and a (short) summary of the number of differences observed between sequences in the cluster (with details stored in line 3).
The format would then look something like this:
@[char name prefix]_[char unique suffix] [int cluster size] [int number of mismatch positions]
[16 char label][char sequence]
+[int char position]A[int count]T[int count]C[int count]G[int count]N[int count] ...
[char qualities matching label and read above]

The diff could be compressed by listing only letters with non-zero counts.
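To make the proposed layout concrete, here is a rough sketch of how a record with a compressed diff might be written. The function names and the `diffs` structure (position mapped to per-base counts) are made up for illustration. The sketch also assumes that mismatch positions on line 3 are separated by a single space and that bases are listed in the order A, T, C, G, N, as in the template above. The position convention (0-based or 1-based, relative to the label or to the read) is left to the caller.

```python
def format_diff(position, counts):
    """Encode one mismatch position, listing only bases with non-zero counts."""
    bases = "".join(
        f"{base}{counts[base]}" for base in "ATCGN" if counts.get(base, 0) > 0
    )
    return f"{position}{bases}"


def format_record(prefix, suffix, cluster_size, label, sequence, qualities, diffs):
    """Build the four lines of one record in the proposed format.

    diffs is a hypothetical structure: {position: {base: count}}.
    """
    # Line 1: unique name, cluster size, number of mismatch positions.
    header = f"@{prefix}_{suffix} {cluster_size} {len(diffs)}"
    # Line 3: '+' followed by the per-position details, sorted by position.
    # The single-space separator is an assumption.
    diff_line = "+" + " ".join(format_diff(pos, diffs[pos]) for pos in sorted(diffs))
    return "\n".join([header, label + sequence, diff_line, qualities])


record = format_record(
    "run1", "000042", 3,
    "ACGTACGTACGTACGT", "TTGA", "I" * 20,
    {17: {"T": 2, "C": 1, "A": 0}},
)
print(record)
```

In this example the zero count for `A` is dropped, so line 3 becomes `+17T2C1`.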