campbio/musicatk

musicatk::create_musica() from a data frame leads to only NN as variant alleles


Hi,
I ran musicatk::create_musica() on a data frame. This results in a count table with only NN as variant alleles.
I am not sure how to get the correct variant alleles in. It also leads to downstream errors in the signature detection step.

head(dbs.df)
  chr  start    end ref alt               sample
1   1 104017 104018  CC  TT ONCOLEAD_CELL_CAPAN1
2   1 149875 149876  GC  CG ONCOLEAD_CELL_CAPAN1
3   1 232961 232962  TG  CA ONCOLEAD_CELL_CAPAN1
4   1 362904 362905  TT  GG ONCOLEAD_CELL_CAPAN1
g <- select_genome("hg19")
dbs_musica <- create_musica(x = dbs.df, genome = g)
build_standard_table(dbs_musica, g, "DBS78", overwrite = TRUE)
Building count table from DBS with DBS78 schema
head(dbs_musica@count_tables$DBS78@annotation)
            motif mutation context
AC>NN_CA AC>NN_CA    AC>NN      CA
AC>NN_CG AC>NN_CG    AC>NN      CG
AC>NN_CT AC>NN_CT    AC>NN      CT
AC>NN_GA AC>NN_GA    AC>NN      GA
AC>NN_GG AC>NN_GG    AC>NN      GG
AC>NN_GT AC>NN_GT    AC>NN      GT

musica.result <- discover_signatures(musica = dbs_musica, table_name = "DBS78",
                                     num_signatures = 3, algorithm = "lda",
                                     nstart = 10, par_cores = 8)
Error in colSums(counts_table) :
'x' must be an array of at least two dimensions

Hi Thomas,
It looks like your steps should work. To see what might be going on, take a look at the count table:
head(dbs_musica@count_tables$DBS78@count_table)
If the count table data is not sensitive, you can post it here so we can diagnose the problem.

It's possible the values in your chr column need to be of the form chr1 rather than 1. You can try modifying that and see if it solves your issue.
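
If the chromosome naming does turn out to be the problem, the prefix can be added in base R before calling create_musica(). This is only a minimal sketch: the two rows below are copied from the head(dbs.df) output above as stand-in data, and it assumes the selected genome uses names of the form chr1.

```r
# Stand-in data with the same columns as dbs.df in the original post.
dbs.df <- data.frame(
  chr = c("1", "1"),
  start = c(104017, 149875),
  end = c(104018, 149876),
  ref = c("CC", "GC"),
  alt = c("TT", "CG"),
  sample = "ONCOLEAD_CELL_CAPAN1",
  stringsAsFactors = FALSE
)

# Add the "chr" prefix only where it is missing, so the step is safe to re-run.
chr <- as.character(dbs.df$chr)
needs_prefix <- !startsWith(chr, "chr")
dbs.df$chr <- ifelse(needs_prefix, paste0("chr", chr), chr)

print(unique(dbs.df$chr))
```

After this, create_musica() and build_standard_table() would need to be re-run on the modified data frame, as in the original post.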

Please let me know if either of those is informative!

The DBS motifs are defined here:
https://cancer.sanger.ac.uk/signatures/dbs/

But I would highly recommend viewing
dbs_musica@count_tables$DBS78@count_table

as this will enumerate the exact motifs and counts in a human-readable format.
If you have many samples, you may want to try
dbs_musica@count_tables$DBS78@count_table[, 1:3]
to view just a few samples' counts.
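
Beyond eyeballing the table, a quick sanity check is to look at its dimensions and per-sample totals. The sketch below uses a made-up toy matrix: the motif names are taken from the annotation output above, but the counts and sample names are invented, so substitute the real count_table when trying it. It also shows one general way the colSums() error quoted above can arise in R, namely when a matrix has been reduced to a plain vector. Whether that is what happens in this case is not established.

```r
# Toy stand-in for a count table: rows are motifs, columns are samples.
# The counts and sample names are invented for illustration.
counts <- matrix(
  c(0, 2, 1, 0, 0, 3),
  nrow = 3,
  dimnames = list(c("AC>NN_CA", "AC>NN_CG", "AC>NN_CT"),
                  c("sample_A", "sample_B"))
)

print(dim(counts))       # number of motifs, number of samples
print(colSums(counts))   # total mutations per sample

# Selecting a single column drops the matrix to a vector by default,
# and colSums() then fails because the input has no dimensions.
one_sample <- counts[, 1]
msg <- tryCatch(colSums(one_sample), error = function(e) conditionMessage(e))
print(msg)

# drop = FALSE keeps the result a one-column matrix, so colSums() works.
kept <- counts[, 1, drop = FALSE]
print(colSums(kept))
```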