How do you show a more complete description of the aligned entries?
thomashjordan opened this issue · 4 comments
Hello,
First of all I want to say that I really like this python package. I think it is really neat!
I am currently having the problem that the labels for the various sequences are cut off.
According to how I understand the code, if I have a fasta file containing aligned sequence records with the format
>Genus1 species1 strain1 | accession_number1
----M----AD---A
>Genus2 species2 strain2 | accession_number2
----M----SD---A
etc.
Then when I run the package on that particular fasta file, only the Genus is shown on the left of each sequence.
Since I have files with many species of the same genus, I also cannot use the sorted=True setting when creating the MsaViz object, as I then get a "Duplicate values" error.
If I switch the format of my alignment files around in order to have the first thing be the accession_number, the sorted=True setting functions as intended but I am left with a figure showing only the accession_numbers as a description, which makes interpreting the results hard.
Current format:
>accession_number1 | Genus1 species1 strain1
----M----AD---A
>accession_number2 | Genus2 species2 strain2
----M----SD---A
etc.
Am I doing something wrong currently, or is there some setting to ensure that the full title of the sequence record is shown?
Minimal working example assuming switched fasta files:
def make_pymsaviz_plot(path, name, outname, min_gap_length = 5, gap_fraction=0.05, gap_char="-", variable_consensus=0.4, variable_char="x", show_count=True, show_consensus=True, color_scheme="Clustal", sorted = False):
# create the input and output file names
infile = os.path.join(path, name)
outfile = os.path.join(path, outname)
# parse the input fasta file into an array of sequences
sequences = []
for record in SeqIO.parse(infile, "fasta"):
sequences.append(record.seq)
# for every position in the alignment, count the number of gaps and variable characters
gap_count = np.zeros(len(sequences[0]))
for sequence in sequences:
for i, aa in enumerate(sequence):
if aa == gap_char:
gap_count[i] += 1
# get the continuous stretches of gaps
gap_stretches = []
start = 0
end = 0
for i, count in enumerate(gap_count):
if(count/len(sequences) > (1-gap_fraction)):
end = i
else:
if end > start:
gap_stretches.append((start+2, end+1))
start = i
end = i
if end > start:
gap_stretches.append((start+2, end+1))
# remove the stretches of gaps that are too short
gap_stretches = [stretch for stretch in gap_stretches if stretch[1] - stretch[0] > min_gap_length]
print(gap_stretches)
# create a new fasta file with the gap_stretches removed
protein_accessions = []
genus_species = []
with open(outfile + "_gap_trimmed.fasta", "w") as f:
for record in SeqIO.parse(infile, "fasta"):
# get the protein accession and genus species
protein_accessions.append(record.description.split("|")[0].strip())
genus_species.append(record.description.split("|")[1])
new_seq = ""
for i, aa in enumerate(record.seq):
if not any([i >= stretch[0]-1 and i <= stretch[1]-1 for stretch in gap_stretches]):
new_seq += aa
f.write(">" + record.description + "\n")
f.write(new_seq + "\n")
# create a pymsaviz object from the both the non-trimmed and trimmed fasta file
msa = pymsaviz.MsaViz(infile, show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)
msa_trimmed = pymsaviz.MsaViz(outfile + "_gap_trimmed.fasta", show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)
# add annotations to the pymsaviz object
for gap in gap_stretches:
msa.add_text_annotation(gap, "Gap Region", text_color="black", range_color="red")
msa.savefig(outfile + "_gap.png")
# add variable markers to the trimmed pymsaviz object to show for highly non-conserved regions
high_variability = []
identity_list = msa_trimmed._get_consensus_identity_list()
for position, identity in enumerate(identity_list, 1):
if identity < variable_consensus:
high_variability.append(position)
msa_trimmed.add_markers(high_variability, marker=variable_char, color="red")
msa_trimmed.savefig(outfile + "_gap_trimmed.png")
return msa, msa_trimmed
Thank you in advance for your help.
Hi, @thomashjordan
The newly released pyMSAviz v0.4.0 resolves the following two issues presented here.
- Unable to display complete description as label
- Duplicate name error occurs when setting sort options
Below is an example of pyMSAviz v0.4.0 with the issues resolved.
from pymsaviz import MsaViz
# User can display complete description by setting
# label_type="description" (By default, label_type="id")
mv = MsaViz(
"test.fa",
wrap_length=60,
label_type="description",
sort=True,
)
mv.savefig("test.png")
test.png
test.fa
>Genus1 species1 strain1 | accession_number1
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFLRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus1 species2 strain2 | accession_number2
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus2 species3 strain3 | accession_number3
MATPGPVIPEVPFEPSKPPVIEGFSPTVYRNPESFKGKFLRKTRENPVVPIGCLATAAALTYGLYSFHRGDSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRPSAQGLASKAPQK
>Genus3 species4 strain4 | accession_number4
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus3 species5 strain5 | accession_number5
MATPGPVIQEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus4 species6 strain6 | accession_number6
MATPGPVIPKVPFEPSKPPVIEGLSPTVYRNPESFKEKFLRKTRENPVVPIGCLATATALSYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
Thank you for the rapid fix!
Would you mind updating the bioconda repository as well? The current version is still 0.3.0 there so I can't fetch it via update.
Cheers!
I have already made a pull request to Bioconda for an update to pyMSAviz.
However, Bioconda will not update the package until a reviewer reviews and merges it.
The Bioconda team will probably update it within the next two days. Please wait until then.
Perfect, thank you again!