faa file of individual Cas proteins?

Question

faa file of individual Cas proteins?

Closed this issue 2 years ago · 7 comments

Dear all

I notice that the typical output of cctyper involves a file of cas_operons.txt, with each row containing potentially an operon with multiple Cas protein hits. However, the proteins.faa file contains all proteins used for cctyper (potentially even those not identified in the cas_operons.txt file.

Is there a way to extract the information in the "Genes" column of cas_operons.txt, and extract those amino acid sequences from proteins.faa to only include protein sequences of Cas proteins?

Thanks

Marcus

Answer 1 · 2022-01-20T10:30:57.000Z

Hi Marcus,

Here is a script to do it:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' <(tail -n+2 cas_operons.tab | cut -f 1,11 | tr -d '[,]' | awk -F'\t' 'gsub(" ","\n"$1"_",$2)') proteins.faa

It's a little convoluted, but it gets the job done.

Cheers,
Jakob

Answer 2 · 2022-01-20T10:56:17.000Z

Wow Jakob that worked beautifully thank you so much! :)

Answer 3 · 2022-01-20T13:03:25.000Z

Sorry Jakob. I notice that for every row of the cas_operon.tab file, the first gene of the "Genes" column is missing from the output faa file. That is because I notice that the start position of that operon in that particular contig is never found in the protein sequence ID in the .faa file. The end position, on the other and, can be found in the seqID of the sequences.

Is there a way we can adjust the code slightly to also incorporate the first gene of each row of cas results please? Thank you so much!

Marcus

Answer 4 · 2022-01-20T13:48:22.000Z

ah yes, sorry about that:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' <(tail -n+2 cas_operons.tab | cut -f 1,11 | tr -d '[,]' | awk -F'\t' 'gsub(" |^","\n"$1"_",$2)') proteins.faa

Answer 5 · 2022-01-24T03:29:09.000Z

Thank you very much!

Answer 6 · 2022-04-28T15:40:18.000Z

Dear all sorry for being annoying, but will it also be possible now to incorporate the protein name somewhere on the sequence ID in the output .faa file of command above? It would be nice to have the protein name for each of these amino acid sequences, based on the "Gene" column in the cctyper .tab outputs.

It would really helpful for those who want to, say, use seqkit or something to extract out only amino acid sequences that belong to a particular Cas type.

Thank you so very much!

Regards

Marcus

Answer 7 · 2022-05-02T08:13:35.000Z

Hi Marcus

It's becoming a bit convoluted for a one-liner, but here you go:
join -1 1 -2 2 <(perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' <(tail -n+2 cas_operons.tab | cut -f 1,11 | tr -d '[,]' | awk -F'\t' 'gsub(" |^","\n"$1"_",$2)') proteins.faa | awk '{print $1}' | awk -v RS=">" -v ORS="\n" -v OFS="" '{$1=$1"\t"}1' | sort -k1,1) <(tail -n+2 hmmer.tab | sort -k2,2 -k6,6nr | sort -k2,2 -u | cut -f1,2) | awk '{print ">"$1,$3"\n"$2}'

So it's the same script as above, but then the output is flattened to a tab file, merged with the cas name from the hmmer.tab file, and then expanded to a fasta file again.

Best,
Jakob