Format for MY_CLUSTERS_FILE

Question

I want to import clusters from another pangenomic analysis tool (Anvio). I know that I need "a .tsv file listing, in the first column the gene family names, and in the second column the gene ID that is used in the annotation files."

Could you provide an example of this MY_CLUSTERS_FILE? How should the IDs be separated in the second column?

Thanks

Answer 1 · 2021-02-19T07:46:03.000Z

Hello,

There is only one gene ID per line. Basically, one line corresponds to one gene, and you'll have as many lines as there are genes in the pangenome.
With a generic example, it would look like this:

family1\tgene1
family1\tgene2
family2\tgene3
family2\tgene4

I've attached a real example to the issue: gene_families.txt

It follows exactly the same format as the 'gene_families.tsv' file generated by PPanGGOLiN.
I hope this helps!
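
As a minimal sketch of the layout described above, the following Python snippet writes a family-to-genes mapping as one tab-separated family/gene pair per line. The function name `write_clusters_file` and the example identifiers are hypothetical and not part of PPanGGOLiN; the gene IDs would need to match those used in your annotation files.

```python
import csv
import io


def write_clusters_file(families, handle):
    """Write one 'family<TAB>gene' line per gene to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for family, genes in families.items():
        for gene in genes:
            # One row per gene: the family name is repeated on every line.
            writer.writerow([family, gene])


# Hypothetical example data: family name -> gene IDs from the annotation files
families = {"family1": ["gene1", "gene2"], "family2": ["gene3", "gene4"]}

buffer = io.StringIO()
write_clusters_file(families, buffer)
print(buffer.getvalue(), end="")
```

To produce a file on disk, an opened file object (for example `open("my_clusters.tsv", "w", newline="")`, a placeholder file name) can be passed in place of the in-memory buffer.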

Adelme

Answer 2 · 2021-02-19T16:28:31.000Z

Great! I thought it was one family per line like in:

family1\tgene1, gene2
family2\tgene3, gene4

This makes it even easier to import from Anvio
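
If clusters do happen to be stored one family per line, a rough sketch like the one below could expand them into the one-gene-per-line layout. It assumes the gene IDs are comma-separated in the second column; that separator, the helper name `expand_family_lines`, and the sample data are illustrative assumptions and do not describe Anvio's actual export format.

```python
def expand_family_lines(lines):
    """Yield (family, gene) pairs from 'family<TAB>gene1, gene2' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        family, gene_field = line.split("\t", 1)
        for gene in gene_field.split(","):
            gene = gene.strip()
            if gene:
                yield family, gene


# Hypothetical input in the one-family-per-line layout
wide = ["family1\tgene1, gene2\n", "family2\tgene3, gene4\n"]

# One 'family<TAB>gene' line per gene, ready to be written to the clusters file
long_lines = [f"{family}\t{gene}" for family, gene in expand_family_lines(wide)]
print("\n".join(long_lines))
```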

Answer 3 · 2021-02-19T17:03:44.000Z

Awesome then!