Predict coding regions using SVMs on AMI-derived profiles of test sequences
Running the script predictCodingCrossSpecies.m does the following for each taxonomic ID in TaxonomicIDList.csv:
- [If Necessary] Download genome assemblies from NCBI and write summary information to SpeciesList.csv (by calling getGenomes.m)
- [If Necessary] Compile coding and noncoding parent sequences (by calling compileCodingNoncoding.m)
- Generate AMI profiles for a random selection of training sequences drawn from the coding and noncoding parent sequences (by calling getAMI)
- Train an SVM on the profiles in the training set
- Use that SVM to predict whether test sequences drawn from each of the other species are coding or noncoding
- Output prediction metrics to CodingRegion_CrossSpecies.xlsx
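The AMI-profile step above can be sketched as follows. This assumes AMI means the average mutual information between nucleotides separated by a lag k, computed for k = 1..maxLag. It is not the repository's getAMI, which may differ in lag range, normalization, and handling of ambiguous bases. The variable names and the example sequence are illustrative only.

```matlab
% Hypothetical sketch of an AMI (average mutual information) profile.
% Assumption: I(k) = sum over base pairs (x,y) at distance k of
%   p_k(x,y) * log2( p_k(x,y) / (p_k(x) * p_k(y)) ).
seq    = 'ATGGCGTACGTTAGCATGCCGATAGCTAGCTAGGCTAACGTATGCCGTA';  % toy sequence
maxLag = 5;                                                    % illustrative choice

% Map A,C,G,T to 1..4. Other symbols (e.g. N) are dropped, which
% slightly shifts distances around them; acceptable for a sketch.
[~, idx] = ismember(upper(seq), 'ACGT');
idx = idx(idx > 0);
n   = numel(idx);
assert(n > maxLag, 'Sequence must be longer than maxLag.');

amiProfile = zeros(1, maxLag);
for k = 1:maxLag
    x = idx(1:n-k);        % base at position i
    y = idx(k+1:n);        % base at position i + k

    % 4x4 joint distribution of base pairs separated by k
    joint = accumarray([x(:), y(:)], 1, [4 4]);
    joint = joint / sum(joint(:));

    % Marginals taken from the joint table, then their outer product
    px    = sum(joint, 2);     % 4x1
    py    = sum(joint, 1);     % 1x4
    indep = px * py;           % 4x4, distribution under independence

    % Sum only over observed pairs to avoid 0 * log(0)
    nz = joint > 0;
    amiProfile(k) = sum(joint(nz) .* log2(joint(nz) ./ indep(nz)));
end

disp(amiProfile);   % one value per lag; this row is one feature vector
```

Each sequence yields one row of length maxLag, so a set of training sequences becomes a matrix with one profile per row, which is the shape an SVM trainer expects.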
Running the script predictCoding.m does the following for the specified taxonomic ID:
- [If Necessary] Download genome from NCBI (by calling getGenomes.m)
- [If Necessary] Compile coding and noncoding parent sequences (by calling compileCodingNoncoding.m)
- Generate AMI profiles for a random selection of training sequences drawn from the coding and noncoding parent sequences (by calling getAMI)
- Train an SVM on the profiles in the training set
- Use that SVM to predict whether each sequence in the specified multi-FASTA file is coding
- Output prediction scores to the specified output file (CSV, XLSX, or TXT)
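The train, predict, and output steps above might look roughly like the sketch below. It assumes the Statistics and Machine Learning Toolbox (for fitcsvm and predict), uses synthetic numbers in place of real AMI profiles, and picks an RBF kernel arbitrarily. The kernel, the column names, and the output file name are assumptions, not the settings used by predictCoding.m.

```matlab
% Hypothetical sketch: train an SVM on profile rows, score new rows, save results.
rng(1);                          % reproducible synthetic data
nPerClass = 50;                  % illustrative training-set size per class
nLags     = 10;                  % illustrative profile length

% Synthetic stand-ins for AMI profiles (one row per sequence)
codingProfiles    = 0.05 + 0.01 * randn(nPerClass, nLags);
noncodingProfiles = 0.02 + 0.01 * randn(nPerClass, nLags);

X = [codingProfiles; noncodingProfiles];
y = [ones(nPerClass, 1); zeros(nPerClass, 1)];   % 1 = coding, 0 = noncoding

% Kernel and scaling choices here are assumptions for illustration
model = fitcsvm(X, y, ...
    'KernelFunction', 'rbf', ...
    'KernelScale',    'auto', ...
    'Standardize',    true);

% Synthetic stand-ins for the profiles of the test sequences
testProfiles = [0.05 + 0.01 * randn(5, nLags); ...
                0.02 + 0.01 * randn(5, nLags)];

% Columns of scores follow model.ClassNames ([0; 1] here), so column 2
% is the score for the coding class; positive values favor coding.
[labels, scores] = predict(model, testProfiles);

results = table((1:size(testProfiles, 1))', labels, scores(:, 2), ...
    'VariableNames', {'SequenceIndex', 'PredictedCoding', 'CodingScore'});

% writetable infers the format from the extension (.csv, .xlsx, .txt)
outFile = fullfile(tempdir, 'predictions_sketch.csv');
writetable(results, outFile);
```

Changing the extension of outFile to .xlsx or .txt is enough to switch the output format, since writetable selects the writer from the file name.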
Requirements:
- MATLAB R2020a or later
- Internet connection (if downloading files from NCBI)
License: MIT