How do you show a more complete description of the aligned entries?

Question

How do you show a more complete description of the aligned entries?

thomashjordan opened this issue 2 years ago · 4 comments

Hello,

First of all I want to say that I really like this python package. I think it is really neat!

I am currently having the problem that the labels for the various sequences are cut off.
According to how I understand the code, if I have a fasta file containing aligned sequence records with the format

>Genus1 species1 strain1 | accession_number1
----M----AD---A
>Genus2 species2 strain2 | accession_number2
----M----SD---A
etc.

Then when I run the package on that particular fasta file, only the Genus is shown on the left of each sequence.
Since I have files with many species of the same genus, I also cannot use the sorted=True setting when creating the MsaViz object, as I then get a "Duplicate values" error.

If I switch the format of my alignment files around in order to have the first thing be the accession_number, the sorted=True setting functions as intended but I am left with a figure showing only the accession_numbers as a description, which makes interpreting the results hard.

Current format:

>accession_number1 | Genus1 species1 strain1
----M----AD---A
>accession_number2 | Genus2 species2 strain2 
----M----SD---A
etc.

Am I doing something wrong currently, or is there some setting to ensure that the full title of the sequence record is shown?

Minimal working example assuming switched fasta files:


def make_pymsaviz_plot(path, name, outname, min_gap_length = 5, gap_fraction=0.05, gap_char="-", variable_consensus=0.4, variable_char="x", show_count=True, show_consensus=True, color_scheme="Clustal", sorted = False):
    # create the input and output file names
    infile = os.path.join(path, name)
    outfile = os.path.join(path, outname)

    # parse the input fasta file into an array of sequences
    sequences = []
    for record in SeqIO.parse(infile, "fasta"):
        sequences.append(record.seq)

    # for every position in the alignment, count the number of gaps and variable characters
    gap_count = np.zeros(len(sequences[0]))

    for sequence in sequences:
        for i, aa in enumerate(sequence):
            if aa == gap_char:
                gap_count[i] += 1


    # get the continuous stretches of gaps
    gap_stretches = []
    start = 0
    end = 0
    for i, count in enumerate(gap_count):
        if(count/len(sequences) > (1-gap_fraction)):
            end = i
        else:
            if end > start:
                gap_stretches.append((start+2, end+1))
            start = i
            end = i
    if end > start:
        gap_stretches.append((start+2, end+1))

    # remove the stretches of gaps that are too short
    gap_stretches = [stretch for stretch in gap_stretches if stretch[1] - stretch[0] > min_gap_length]
    print(gap_stretches)

    # create a new fasta file with the gap_stretches removed
    protein_accessions = []
    genus_species = []

    with open(outfile + "_gap_trimmed.fasta", "w") as f:
        for record in SeqIO.parse(infile, "fasta"):

            # get the protein accession and genus species
            protein_accessions.append(record.description.split("|")[0].strip())
            genus_species.append(record.description.split("|")[1])

            new_seq = ""
            for i, aa in enumerate(record.seq):
                if not any([i >= stretch[0]-1 and i <= stretch[1]-1 for stretch in gap_stretches]):
                    new_seq += aa
            f.write(">" + record.description + "\n")
            f.write(new_seq + "\n")

    # create a pymsaviz object from the both the non-trimmed and trimmed fasta file
    msa = pymsaviz.MsaViz(infile, show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)
    msa_trimmed = pymsaviz.MsaViz(outfile + "_gap_trimmed.fasta", show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)

    # add annotations to the pymsaviz object
    for gap in gap_stretches:
        msa.add_text_annotation(gap, "Gap Region", text_color="black", range_color="red")
    msa.savefig(outfile + "_gap.png")

    # add variable markers to the trimmed pymsaviz object to show for highly non-conserved regions
    high_variability = []
    identity_list = msa_trimmed._get_consensus_identity_list()

    for position, identity in enumerate(identity_list, 1):
        if identity < variable_consensus:
            high_variability.append(position)
    msa_trimmed.add_markers(high_variability, marker=variable_char, color="red")
    msa_trimmed.savefig(outfile + "_gap_trimmed.png")

    return msa, msa_trimmed

Thank you in advance for your help.

Answer 1 · 2023-03-25T11:58:04.000Z

Hi, @thomashjordan

The newly released pyMSAviz v0.4.0 resolves the following two issues presented here.

Unable to display complete description as label
Duplicate name error occurs when setting sort options

Below is an example of pyMSAviz v0.4.0 with the issues resolved.

from pymsaviz import MsaViz

# User can display complete description by setting 
# label_type="description" (By default, label_type="id")
mv = MsaViz(
    "test.fa",
    wrap_length=60,
    label_type="description",
    sort=True,
)
mv.savefig("test.png")

test.png

test.fa

>Genus1 species1 strain1 | accession_number1
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFLRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus1 species2 strain2 | accession_number2
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus2 species3 strain3 | accession_number3
MATPGPVIPEVPFEPSKPPVIEGFSPTVYRNPESFKGKFLRKTRENPVVPIGCLATAAALTYGLYSFHRGDSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRPSAQGLASKAPQK
>Genus3 species4 strain4 | accession_number4
MATPGPVIPEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus3 species5 strain5 | accession_number5
MATPGPVIQEVPFEPSKPPVIEGLSPTVYRNPESFKEKFVRKTRENPVVPIGCLATAAALTYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------
>Genus4 species6 strain6 | accession_number6
MATPGPVIPKVPFEPSKPPVIEGLSPTVYRNPESFKEKFLRKTRENPVVPIGCLATATALSYGLYSFHRGNSQRSQLMMRTRIAAQGFTVAAILLGLAVTAMKSRP------------

Answer 2 · 2023-03-25T15:38:21.000Z

Thank you for the rapid fix!

Would you mind updating the bioconda repository as well? The current version is still 0.3.0 there so I can't fetch it via update.

Cheers!

Answer 3 · 2023-03-25T15:51:35.000Z

I have already made a pull request to Bioconda for an update to pyMSAviz.
However, Bioconda will not update the package until a reviewer reviews and merges it.

The Bioconda team will probably update it within the next two days. Please wait until then.

After the update finished, the badge below will be v0.4.0.

Answer 4 · 2023-03-25T15:52:25.000Z

Perfect, thank you again!