gbru_planning

short term

  • Salmonella graph

    • determine node coverage of xmfa vs pggb_5 vs pggb_250
      • make table of "node,length,reads,num_reads_covering"
    • Test different SPAdes parameters with simulated reads: k-mer sizes 21, 33, 55
      • Coverage distribution of assembled contigs
      • Get alignments, view in tubemaps - specifically nodes that don't align
    • Compare graph with BWA
      • Take simulated reads of the 4 non-ref genomes -> assemble -> align to the source genome (CP004027.1). Are the uncovered areas comparable to those from the source-genome simulations?
      • How much of CP004027.1 is not covered when you align the same assembled simulated reads to it with BWA? Compare to GraphAligner
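
The "node,length,reads,num_reads_covering" table from the node coverage item could be sketched roughly as below. This is a minimal sketch, not the project's actual script. It assumes the graph is available as GFA with sequences on the S lines, that alignments are in GAF (the format GraphAligner writes) with the path in column 6 as oriented node IDs such as ">12<34", and that the "reads" column means the names of the covering reads. The function names are hypothetical.

```python
import re
from collections import defaultdict

# GAF path column looks like ">12<34>56": an orientation character, then a node ID.
PATH_STEP = re.compile(r"[<>]([^<>]+)")

def node_lengths_from_gfa(gfa_lines):
    """Map node ID -> sequence length from GFA 'S' (segment) lines.

    Assumes the sequence is stored on the S line (not '*' with an LN tag).
    """
    lengths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            lengths[fields[1]] = len(fields[2])
    return lengths

def reads_per_node(gaf_lines):
    """Map node ID -> set of read names whose alignment path visits that node.

    A node counts as covered even if the read only partially overlaps it.
    """
    reads = defaultdict(set)
    for line in gaf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        read_name, path = fields[0], fields[5]
        for node in PATH_STEP.findall(path):
            reads[node].add(read_name)
    return reads

def coverage_table(gfa_lines, gaf_lines):
    """Rows of (node, length, reads, num_reads_covering), one per graph node."""
    lengths = node_lengths_from_gfa(gfa_lines)
    reads = reads_per_node(gaf_lines)
    rows = []
    for node in sorted(lengths):
        names = sorted(reads.get(node, ()))
        rows.append((node, lengths[node], ";".join(names), len(names)))
    return rows
```

Running this once per graph (xmfa, pggb_5, pggb_250) with the same reads would give three tables to compare; nodes with num_reads_covering of 0 are the ones to inspect.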
  • Genesieve

    • Obtain the Arabidopsis QTL data
      1. Get all associations by paging through 25/100/1000 at a time, increasing the offset to get the next page
      2. Get the phenotype description with a JSON parser
      3. Get the SNP chromosome/location with a JSON parser
      4. Write the results out (phenotype, chromosome, location)
      5. For each study...
        • Collect all SNPs
        • If many are on one chromosome, run k-means with k = 2 or 3
        • get cluster boundaries by sorting the clusters, then take each centroid ±50 kbp
        • use those as coordinates, obtain genes from gff
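
The steps above could be sketched as follows. This is a rough sketch under stated assumptions, not the pipeline's code: `fetch_page(offset, limit)` is a hypothetical callable standing in for whatever request the QTL source actually needs, the k-means is a plain one-dimensional Lloyd's loop with a deterministic start, and the GFF lookup assumes standard 9-column GFF with `gene` features. All function names are made up for illustration.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every record by raising the offset until a short page comes back.

    fetch_page(offset, limit) is a hypothetical wrapper around the real request.
    """
    records, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            return records
        offset += page_size

def kmeans_1d(positions, k, max_iter=100):
    """Lloyd's k-means on one chromosome's SNP positions; returns sorted centroids."""
    pts = sorted(positions)
    if not pts:
        return []
    k = min(k, len(set(pts)))
    # Deterministic start: spread the initial centroids across the sorted positions.
    centroids = [float(pts[(2 * i + 1) * len(pts) // (2 * k)]) for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in pts:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def windows(centroids, flank=50_000):
    """Centroid +/- flank as 1-based (start, end) coordinates, clipped at 1."""
    return [(max(1, round(c) - flank), round(c) + flank) for c in centroids]

def genes_in_window(gff_lines, chrom, start, end):
    """Attribute strings of 'gene' features overlapping chrom:start-end."""
    hits = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[0] != chrom or f[2] != "gene":
            continue
        if int(f[3]) <= end and int(f[4]) >= start:
            hits.append(f[8])
    return hits
```

For one study and one chromosome, the flow would be `windows(kmeans_1d(snp_positions, 2))` followed by `genes_in_window` for each window. Whether k = 2 or k = 3 fits better is left open here, as in the notes.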
    • Complete functionality testing on rice test set
      • Fix genesieve env on genesieve server: numpy/gensim incompatibility issue
      • Run full pipeline tests with hardcoded data
    • Test and validate SQL queries
      • incorporate into genesieve.py when finished
    • Create a validation test
      • Using our test set, pull regions (~300 kbp either side?) around "true" genes and generate the distribution of scores. Where does the "true" gene fall? Then generate a random region with a random trait: what is that distribution, and how do the two differ? Candidate metric: "true" score minus the median of the "false" scores
      • Homology: what if there are tandem duplicates? How many validated genes have tandem duplicates/copies conflating results? Count them; if it's around 50%, mask them in the test set
        • Check how often a validated candidate gene has duplicated genes next to or near it in the region ~500 kbp either side.
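
The two validation checks above could be sketched like this. It is only an illustration under assumptions: scores are plain numbers, `genes` maps a gene ID to (chromosome, start, end), and `families` is a hypothetical mapping from gene ID to some homology group label, however that grouping ends up being computed. None of these names come from genesieve.py.

```python
from statistics import median

def true_minus_false_median(true_score, false_scores):
    """'True' gene score minus the median score of the random ('false') regions."""
    return true_score - median(false_scores)

def has_nearby_duplicate(true_gene, genes, families, flank=500_000):
    """True if another gene in the same family lies within flank bp on the same chromosome.

    genes: {gene_id: (chrom, start, end)}
    families: {gene_id: family_label} (hypothetical homology grouping)
    """
    family = families.get(true_gene)
    if family is None:
        return False
    chrom, start, end = genes[true_gene]
    for other, (o_chrom, o_start, o_end) in genes.items():
        if other == true_gene or o_chrom != chrom:
            continue
        if families.get(other) != family:
            continue
        # Overlap test against the true gene's span widened by flank on each side.
        if o_start <= end + flank and o_end >= start - flank:
            return True
    return False

def fraction_with_duplicates(validated, genes, families, flank=500_000):
    """Share of validated genes that have a same-family neighbour within flank bp."""
    if not validated:
        return 0.0
    flagged = sum(has_nearby_duplicate(g, genes, families, flank) for g in validated)
    return flagged / len(validated)
```

If `fraction_with_duplicates` comes out near the 50% mentioned above, the flagged genes are the ones to mask in the test set.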