vanheeringen-lab/genomepy

chromosome order

Phlya opened this issue · 10 comments

Phlya commented

Is there a way to specify chromosome order when installing a genome? E.g., when downloading a human genome I want the chromosomes to be sorted in the "normal" way aschr1, chr2, ... chr10, ... chr22, chrX, chrY, chrM. But currently they are sorted as chr1, chr10, ..., chr2, ..., chr22, chr3, ... chr9, chrM, chrX, chrY. Is it just the same order as in UCSC (I downloaded from there)?

genomepy does not (currently) support this. This snippet should do the trick:

from pyfaidx import Fasta
import re 

oldfa = '/path/to/genome.fa'
newfa = '/path/to/sorted_genome.fa'

def sorted_nicely( l ): 
    """ 
    Sort the given iterable in the way that humans expect.
    source: https://stackoverflow.com/questions/2669059/how-to-sort-alpha-numeric-set-in-python
    """ 
    convert = lambda text: int(text) if text.isdigit() else text 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

fa = Fasta(oldfa)
sorted_chr = sorted_nicely(fa.keys())
with open(newfa) as sorted_fa:
    for chr in sorted_chr:
        sorted_fa.write('>'+chr+'\n')
        sorted_fa.write(fa[chr][:])
Phlya commented

Thank you. Is there a way to perform this, but still use genomepy to manage the new sorted genome?

Assuming that with manage, you mean also applies this sorting to the support files (fa.sizes, gaps.bed & fa.fai). If so, you can run this after the previous code:

from genomepy import Genome
import os

g = Genome(oldfa)
os.unlink(g.index_file)
os.unlink(g.sizes_file)
os.unlink(g.gaps_file)

g = Genome(newfa)

if oldfa and newfa were the same it should just overwrite the old stuff (check if this does what you intended though)

Phlya commented

OK, thank you!
Would this feature be something you might be interested in? I might try to make a PR to add it then. The ordering of chromosomes is pretty important, and non-trivial in some organisms.

Thank you for the offer, but I don't think this is a desired feature. Chromosomes are sorted by size, which all tools I know can use, and some tools might not work otherwise (due to assumptions). Natural sorting would also fail for certain genomes (such as those that use roman numerals).

Phlya commented

Are chromosomes sorted by size? That's not what I experienced, for me they were sorted alphabetically, as I indicated in the first post.
I'm not suggesting necessarily natural sorting, but just arbitrary sorting. Natural sorting can be an option, but also the actual desired order of chromosomes could be provided.

My bad, you're right about the default order.
Inserting a sorting function into genomepy (as user) would however be equally complex to running the sorting afterward. And since we can't add sorting methods for every genome, I feel this step is best left to the user.

Phlya commented

Doesn't need to be a function, just the actual list of chromosome names in the right order would work.

Hi @Phlya, I think we're hesitant to add functionality to genomepy where the use-case is not clear as this is functionality that needs to be tested and maintained down the line and adds additional complexity to the command line tool.
However, maybe some further information would help. Can you provide a clear use-case for this, i.e. an example where the order of the chromosomes in a genome FASTA need to be changed? I would expect most tools to expect lexicographical order, but maybe I'm wrong here.

Phlya commented

No worries, I understand if this functionality might be not the most important.

My main use-case is that I want to use genomepy in a pipeline to manage the genome reference. Different tools and steps rely on a chromsizes file, and that needs to align with how the data is processed, including the order of chromosomes, and some tools also need the genome sequence. I need to investigate if the order of chromosomes in the fasta file needs to match the order in the chromsizes file, but even if it's not strictly required for the tools I am using now, I am not keen on the idea of having a chromsizes file with a different order than the fasta - I feel that's a dangerous combination that can lead to errors in some cases.

And of course I'd rather have my chromosomes sorted in the logical order (which matches the natural sorting for human, but could be arbitrarily different in other organisms).

I hope this explains my use case and gives you some context. Thanks!