How can I simulate reads data when I manually designed an OTU table?
fanqiedantang opened this issue · 4 comments
Hi,
My OTU table has 8 columns. The first column is the OTU number, which I set manually. The second to seventh columns are the samples. The last column is the species classification.
I have also prepared the reference genome file required by the -ref parameter. It has three columns: the first is the taxid, the second is the species classification, and the third is the download address. However, I got an error.
otu table:
-ref :
My command is: python metagenome_from_profile.py -p test/test.biom -c test/test_config.ini -ref test/file_paths.txt -f --seed 100
Is this the right way to do it?
Can you tell me what kind of OTU table can be used for simulation after being converted to BIOM format? It would be best if you could provide an example of an OTU table, thank you!
Most of these are just warnings, some of which come directly from ete3/NCBITaxa; CAMISIM should still simulate a data set.
The "Rank class ..." warning means that CAMISIM did not find a reference genome for the given lineage up to the taxonomic rank of class. The subsequent "Filling up" warning just means that you used the -f option: CAMISIM is showing you the genomes in the profile for which it found no matching genome and instead chose one from the reference genome list. It seems to do so top to bottom instead of randomly, which is why so many entries are matched to tax ID 9, Buchnera aphidicola: these are early in the reference list. I could change the filling up to choose genomes randomly, or add an option to do so, if you prefer.
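To illustrate the difference between the two filling strategies, here is a minimal sketch. It is not CAMISIM's actual code: the function fill_unmatched and its parameters are hypothetical names chosen for this example, and it only assumes that filling up assigns one reference genome to each unmatched profile entry.

```python
import random


def fill_unmatched(unmatched, references, randomize=False, seed=None):
    """Assign one reference genome to each unmatched profile entry.

    Hypothetical helper for illustration only, not part of CAMISIM.
    With randomize=False the references are taken top to bottom, so the
    first entries of the reference list are always used first.
    With randomize=True the list is shuffled first (reproducibly, if a
    seed is given).
    """
    pool = list(references)
    if len(unmatched) > len(pool):
        raise ValueError("not enough reference genomes to fill up")
    if randomize:
        random.Random(seed).shuffle(pool)
    # Pair each unmatched entry with the next genome in the pool.
    return dict(zip(unmatched, pool))
```

With randomize=False, the same leading entries of the reference list are picked on every run, which matches the behaviour described above. Passing a seed keeps a randomized choice reproducible.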
In regard to your last comment: looking at the input files you provide, it seems like you already know which genomes from your BIOM profile should be used. CAMISIM tries to match genomes using ete3/NCBI, so it will not be able to match your genomes with your references. Instead of using the from_profile option, you should do a de novo simulation here.
For this you need a genome_to_id.tsv file: tab-separated, with the genome ID (e.g. the first column of your BIOM profile/OTU table) and the path to the genome (download the genomes listed in your reference file and point to the local paths).
Additionally, you need a metadata.tsv file: tab-separated, with the columns genome_ID, OTU, NCBI_ID and novelty_category. The genome ID needs to be the same as in genome_to_id.tsv. The OTU is not really important; you can just use the NCBI ID or the numbers from 1 to the number of genomes. The NCBI ID is the first column of your reference file, and for novelty_category you can use known_strain.
Finally, you need to provide the abundances for the individual samples with the distribution_file_paths option in the config file. This option should point to one abundance file per sample, each tab-separated with genome ID and abundance. The abundance is the second column of your profile/OTU table for the first sample, the third column for the second sample, and so on.
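The three kinds of files described above could be generated with a small script along these lines. This is only a sketch under some assumptions: the function write_de_novo_inputs, its arguments, and the abundance_<sample>.tsv file names are made up for this example; it assumes metadata.tsv starts with a header line while the other files have none; and it assumes you have already read your OTU table into memory and downloaded the genomes.

```python
from pathlib import Path


def write_de_novo_inputs(otu_rows, sample_names, genome_paths, ncbi_ids, out_dir):
    """Write genome_to_id.tsv, metadata.tsv and one abundance file per sample.

    Hypothetical helper, not part of CAMISIM.
    otu_rows:     list of (genome_id, [abundance for each sample]) tuples
    sample_names: sample names, in the same order as the abundance lists
    genome_paths: dict genome_id -> path to the downloaded genome file
    ncbi_ids:     dict genome_id -> NCBI taxonomy ID
    Returns the abundance file paths, one per sample.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # genome_to_id.tsv: genome ID and path to the genome, tab-separated.
    with open(out / "genome_to_id.tsv", "w") as fh:
        for genome_id, _ in otu_rows:
            fh.write(f"{genome_id}\t{genome_paths[genome_id]}\n")

    # metadata.tsv: genome_ID, OTU, NCBI_ID, novelty_category.
    # The header line is an assumption; the OTU column is just 1..N here.
    with open(out / "metadata.tsv", "w") as fh:
        fh.write("genome_ID\tOTU\tNCBI_ID\tnovelty_category\n")
        for otu_number, (genome_id, _) in enumerate(otu_rows, start=1):
            fh.write(f"{genome_id}\t{otu_number}\t{ncbi_ids[genome_id]}\tknown_strain\n")

    # One abundance file per sample: genome ID and abundance, tab-separated.
    abundance_files = []
    for column, sample in enumerate(sample_names):
        path = out / f"abundance_{sample}.tsv"
        with open(path, "w") as fh:
            for genome_id, abundances in otu_rows:
                fh.write(f"{genome_id}\t{abundances[column]}\n")
        abundance_files.append(str(path))
    return abundance_files
```

The returned paths are the ones that distribution_file_paths in the config file should point to, in sample order.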
I would appreciate it if you could give me an option to randomly select genomes.