Question about tagging sam files with transcript ID
Closed this issue · 2 comments
Hello,
Thank you for developing pbsim!
I am using pbsim to simulate transcripts and I want to assess if they map back to the right genes
and which transcripts map to other matching positions.
For this it would be very helpful if I could tag the output sam with the transcript id from which the read was simulated.
Could you please provide recommendations on how I can achieve this?
One approach I'm considering is simulating each transcript individually, adding the corresponding transcript ID tag to the output SAM files, and subsequently merging all transcripts. Do you have any suggestions or alternative methods for achieving this goal?
Thank you for your assistance!
Best,
Nadja
Thank you for your using PBSIM3.
The maf file contains alignments between transcripts and reads, and you can create a correspondence list of transcripts and read IDs from the maf file.
Your approach is also good. Change --seed each time you run it. Otherwise, the distribution of read start positions on the template and the read length distribution may be biased.
Thanks a lot @yukiteruono for the quick response! I didn't check the maf before.
If anyone stumbles on this issue and wants to do the same:
- Here is how I extracted the read ID, transcript pairs from the maf file.
# Get a list of read and transcript ids
awk '/^a\$/ { getline; transcript_id=\$2; getline; read_id=\$2; print read_id, transcript_id }' *maf > read_gene_mapping
# Get readname prefix
prefix=\$(awk '{print \$1; exit}' read_gene_mapping | sed 's/\\/[0-9]*\\/[0-9]*\$//')
# Replace pass number with "ccs" to match read names after running ccs
sed -i "s/\$prefix\\/\\([0-9]*\\)\\/[0-9]* /\$prefix\\/\\1\\/ccs /g" read_gene_mapping
# Only keep unique lines
uniq read_gene_mapping > read_gene_mapping
- After mapping to reference:
To tag the aligned sam file with the corresponding transcript id, I ran this:
while read -r read_name rg_tag; do
samtools view -h $aligned_sam | grep \$read_name | awk -v read_name="\$read_name" -v rg_tag="\$rg_tag" 'BEGIN {OFS="\\t"} \$1 == read_name { \$0 = \$0"\\tRT:Z:"rg_tag } { print \$0 }' >> tagged.sam
done < read_gene_mapping_filter