JakobSkouPedersenLab/dreams

Issue with train_dreams_model step

Opened this issue · 12 comments

Hello,
I'm interested in the Dreams package, and I've been trying to use it with my data.
I'm working with deep-sequenced targeted panels of cfDNA.
For the initial step, I used the following code on my normal BAM file:

training_data = get_training_data( 
  bam_paths = "my_sample.bam", 
  reference_path = "hg19.fa",
  verbose = T)

It seems to be working. After that, using the following code:

model = train_dreams_model( 
  training_data = training_data, 
  layers = c(16,8),
  model_features = c("trinucleotide_ctx", "strand", "read_index"),
  lr = 0.01,  
  batch_size = 1000, 
  epochs = 100)

I encountered the following error message:

Error in `select()`:
! `select()` doesn't handle lists.
Backtrace:
    x
 1. +-dreams::train_dreams_model(...)
 2. | \-dreams:::prepare_training_data(...)
 3. |   \-... %>% sample_frac(1)
 4. +-dplyr::sample_frac(., 1)
 5. +-dplyr::select(., all_of(model_features), .data$obs)
 6. \-dplyr:::select.list(., all_of(model_features), .data$obs)
 7.   \-rlang::abort("`select()` doesn't handle lists.")

I'm wondering if I might be doing something wrong with the parameters of the train_dreams_model function. Can you help me with this issue?

Here's some additional information:
R version : 4.2.2
dplyr version : 1.1.3
dreams was installed from the source code (and its version is 0.0.0.9000 with packageVersion("dreams"))

Romain

Dear Mikkel,
It seems that it's not a data frame. I used saveRDS and readRDS to save the output because the first step takes time, but it's not supposed to alter the object structure.
So there is an issue with the first step ?

thank your for your answer

Hi Romain

The problem you encounter arise because the output from get_training_data() is a list with two elements. Instead you need to access the dataframe with $data.
If you go to the branch named dreams_mathilde, you will find help in the README.rm file.
Please note, that arguments to dreams_vc and dreams_cc have been corrected as well.

training_data = get_training_data(
  bam_paths = "/path/bam_file",
  reference_path = "/path/hg38.fa",
  ...)$data

Best Regards
Mathilde

Hello,
Thank you for your response, Mathilde. Unfortunately, I am facing another issue. I executed the following code:

training_data = get_training_data( 
bam_paths = "/path/bam_file", 
reference_path = "/path/hg19.fa",
verbose = T)$data

model = train_dreams_model( 
training_data = training_data, 
layers = c(16,8),
model_features = c("trinucleotide_ctx", "strand", "read_index"),
lr = 0.01,  
batch_size = 10000, 
epochs = 100)

My training_data data frame looks like this:

print(training_data)

# A tibble: 9,094,810 x 20
   qname          chr   genomic_pos obs   ref   strand first_in_pair read_index
   <chr>          <chr>       <dbl> <chr> <chr> <chr>          <int>      <dbl>
 1 Library:120266 chr1     46713278 T     C     fwd                1         67
 2 Library:120276 chr1     46713345 A     T     fwd                1        128
 3 Library:120308 chr1     46713221 T     C     fwd                1          4
 4 Library:120311 chr1     46713221 T     C     fwd                1          4
 5 Library:120204 chr1     46713372 C     G     rev                1          2
 6 Library:120328 chr1     46713375 C     G     rev                1          1
 7 Library:120374 chr1     46713375 C     G     rev                1          1
 8 Library:120427 chr1     46713226 A     C     fwd                1          1
 9 Library:120432 chr1     46713375 C     G     rev                1          1
10 Library:120433 chr1     46713375 T     G     rev                1          2
# i 9,094,800 more rows
# i 12 more variables: fragment_size <int>, ctx_minus1 <chr>, ctx_plus1 <chr>,
#   trinucleotide_ctx <chr>, context11 <chr>, local_complexity_1 <dbl>,
#   local_complexity_2 <dbl>, local_GC <dbl>, n_other_errors <dbl>,
#   n_insertions_in_read <int>, n_deletions_in_read <int>, seq_length <int>

I encountered the following issue:

/usr/local/r-tensorflow/lib/python3.9/site-packages/keras/src/utils/np_utils.py:62: RuntimeWarning: invalid value encountered in cast
  y = np.array(y, dtype="int")
Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  IndexError: index -9223372036854775808 is out of bounds for axis 1 with size 4
Run `reticulate::py_last_error()` for details.
Calls: train_dreams_model ... <Anonymous> -> do.call -> <Anonymous> -> py_call_impl
Execution halted

I'm currently working on resolving this issue, but if you have any ideas about how to resolve it, I would greatly appreciate your assistance.

Best Regards
Romain

Hello,

I finally found the issue with my data. I checked the values ($obs) in my training_data and found the following:

print(unique(training_data$obs))
[1] "A" "T" "C" "G" ""  "N"

In your code for train_model.R, I noticed the following:

training_data_labels <-
    training_data %>%
    select(.data$obs) %>%
    mutate(obs = as.numeric(factor(.data$obs, levels = c("A", "T", "C", "G"))) - 1) %>%
  as.matrix() %>%
    keras::to_categorical()

The issue arises when the value is "" or "N"; in such cases, the assigned value becomes "nan," resulting in the error "-9223372036854775808" (which equals "nan").

If you have an idea on what I can do to fix that thank you :)

I have a question about the mutations_df. Is dreams able to obtain a dataframe of potential mutations from a bam or I have to give a catalogue of mutations to dreams_vc ?

Hi Romain,

Thank you for using DREAMS and reporting the error! The bug should now be fixed if you install DREAMS from the branch named dreams_mathilde. The issue was due to the prepare_training_data function, which only allowed "A", "C", "T", and "G" as levels for the neural network.

Currently, you need to provide a catalog of mutations to dreams_vc.

Please note that the README has been updated as well. To get the training data, you should now use:

training_data = get_training_data(
  bam_paths = "/path/bam_file",
  reference_path = "/path/hg38.fa",
  ...
)

instead of:

training_data = get_training_data(
  bam_paths = "/path/bam_file",
  reference_path = "/path/hg38.fa",
  ...
)$data

Please let me know whether it works for you.

Best regards,
Mathilde

Thank you Mathilde, I'm going to test that !

Best regards,
Romain

Hello Mathilde,
I tried to rerun dreams with your git branch and I had these results/errors:

what I have done:

training_data = get_training_data(
bam_paths = 'path/to/bam',
reference_path = 'hg19.fa',
verbose = True)

print(training_data)

model = train_dreams_model(
  training_data = training_data,
  layers = c(128, 64, 32),
  model_features = c('read_index', 'strand', 'trinucleotide_ctx', 'first_in_pair', 'seq_length', 'fragment_size', 'n_other_errors', 'local_GC'),
  lr = 0.01,
  batch_size = 1000000,
  epochs = 1000,
  model_file_path = './dreams_model.h5',
  log_file_path = './dreams_model.log',
  min_delta = 0, 
  patience = 0, 
  l2_reg = 0, 
  validation_split = 0, 
  ctx3_embed_dim = 3)

print(model)

And what I had:

List of 2
 $ data: tibble [8,953,164 x 20] (S3: tbl_df/tbl/data.frame)
  ..$ qname               : chr [1:8953164] "Library:53453" "Library:53454" "Library:53454" "Library:53457" ...
  ..$ chr                 : chr [1:8953164] "chr1" "chr1" "chr1" "chr1" ...
  ..$ genomic_pos         : num [1:8953164] 46713260 46713217 46713237 46713283 46713298 ...
  ..$ obs                 : chr [1:8953164] "A" "T" "C" "A" ...
  ..$ ref                 : chr [1:8953164] "A" "T" "A" "C" ...
  ..$ strand              : chr [1:8953164] "fwd" "fwd" "fwd" "fwd" ...
  ..$ first_in_pair       : int [1:8953164] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ read_index          : num [1:8953164] 67 23 43 86 96 8 52 71 86 80 ...
  ..$ fragment_size       : int [1:8953164] 178 239 239 204 172 206 159 159 159 148 ...
  ..$ ctx_minus1          : chr [1:8953164] "A" "G" "T" "G" ...
  ..$ ctx_plus1           : chr [1:8953164] "T" "T" "A" "C" ...
  ..$ trinucleotide_ctx   : chr [1:8953164] "AAT" "GTT" "TAA" "GCC" ...
  ..$ context11           : chr [1:8953164] "GGTCAATTTAG" "CTCTGTTGGCC" "CATCTAAGTCA" "CAGAGCCCAGA" ...
  ..$ local_complexity_1  : num [1:8953164] 0.562 0.473 0.562 0.473 0.562 ...
  ..$ local_complexity_2  : num [1:8953164] 0.94 0.88 0.88 0.676 0.88 ...
  ..$ local_GC            : num [1:8953164] 0.364 0.636 0.364 0.636 0.636 ...
  ..$ n_other_errors      : num [1:8953164] 1 2 1 0 0 0 2 3 2 1 ...
  ..$ n_insertions_in_read: int [1:8953164] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ n_deletions_in_read : int [1:8953164] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ seq_length          : int [1:8953164] 102 133 133 116 100 115 91 91 91 87 ...
 $ info:'data.frame':	1 obs. of  4 variables:
  ..$ total_mismatches: int 4476582
  ..$ total_matches   : int 4476582
  ..$ total_coverage  : num 2.17e+09
  ..$ beta            : num 0.00206
Model: "model"
________________________________________________________________________________
 Layer (type)       Output Shape         Para   Connected to         Trainable  
                                         m #                                    
================================================================================
 first_in_pair (In  [(None,)]            0      []                   Y          
 putLayer)                                                                      
 fragment_size (In  [(None,)]            0      []                   Y          
 putLayer)                                                                      
 local_GC (InputLa  [(None,)]            0      []                   Y          
 yer)                                                                           
 n_other_errors (I  [(None,)]            0      []                   Y          
 nputLayer)                                                                     
 read_index (Input  [(None,)]            0      []                   Y          
 Layer)                                                                         
 seq_length (Input  [(None,)]            0      []                   Y          
 Layer)                                                                         
 strand (InputLaye  [(None,)]            0      []                   Y          
 r)                                                                             
 trinucleotide_ctx  [(None,)]            0      []                   Y          
  (InputLayer)                                                                  
 input_numeric (De  (None, 5)            0      ['first_in_pair[0]   Y          
 nseFeatures)                                   [0]',                           
                                                 'fragment_size[0]              
                                                [0]',                           
                                                 'local_GC[0][0]',              
                                                 'n_other_errors[0              
                                                ][0]',                          
                                                 'read_index[0][0]              
                                                ',                              
                                                 'seq_length[0][0]              
                                                ',                              
                                                 'strand[0][0]',                
                                                 'trinucleotide_ct              
                                                x[0][0]']                       
 batch_normalizati  (None, 5)            20     ['input_numeric[0]   Y          
 on (BatchNormaliz                              [0]']                           
 ation)                                                                         
 input_embed_ctx3   (None, 3)            192    ['first_in_pair[0]   Y          
 (DenseFeatures)                                [0]',                           
                                                 'fragment_size[0]              
                                                [0]',                           
                                                 'local_GC[0][0]',              
                                                 'n_other_errors[0              
                                                ][0]',                          
                                                 'read_index[0][0]              
                                                ',                              
                                                 'seq_length[0][0]              
                                                ',                              
                                                 'strand[0][0]',                
                                                 'trinucleotide_ct              
                                                x[0][0]']                       
 input_categorical  (None, 4)            0      ['first_in_pair[0]   Y          
  (DenseFeatures)                               [0]',                           
                                                 'fragment_size[0]              
                                                [0]',                           
                                                 'local_GC[0][0]',              
                                                 'n_other_errors[0              
                                                ][0]',                          
                                                 'read_index[0][0]              
                                                ',                              
                                                 'seq_length[0][0]              
                                                ',                              
                                                 'strand[0][0]',                
                                                 'trinucleotide_ct              
                                                x[0][0]']                       
 concatenate (Conc  (None, 12)           0      ['batch_normalizat   Y          
 atenate)                                       ion[0][0]',                     
                                                 'input_embed_ctx3              
                                                [0][0]',                        
                                                 'input_categorica              
                                                l[0][0]']                       
 hidden_1 (Dense)   (None, 128)          1664   ['concatenate[0][0   Y          
                                                ]']                             
 hidden_2 (Dense)   (None, 64)           8256   ['hidden_1[0][0]']   Y          
 hidden_3 (Dense)   (None, 32)           2080   ['hidden_2[0][0]']   Y          
 output (Dense)     (None, 4)            132    ['hidden_3[0][0]']   Y          
================================================================================
Total params: 12344 (48.22 KB)
Trainable params: 12334 (48.18 KB)
Non-trainable params: 10 (40.00 Byte)
________________________________________________________________________________

And the last line of my log file:

999,23381.6099720001,0.7580134868621826,0.6788825392723083,0.009999999776482582

I don't know whether these results are good or not; perhaps you could provide me with your opinion..

And after that, I attempted to call variants and perform cancer detection with this code:

mutations_df = read.csv('/path/to/mutations.bed', sep = '\t', header = FALSE)
colnames(mutations_df) = c('CHROM','POS','REF','alt')

cancer_calls = dreams_cc(
  mutations_df = mutations_df,
  bam_file_path = '/path/to/another/bam',
  reference_path = './hg19.fa',
  model = model,
  alpha = 0.05, 
  calculate_confidence_intervals = FALSE, 
  use_turboem = TRUE)

parallel_variant_calls = dreams_vc_parallel(
  mutations_df = mutations_df,
  bam_file_path = '/path/to/another/bam',
  reference_path = './hg19.fa',
  model = model,
  alpha = 0.05, 
  use_turboem = TRUE, 
  calculate_confidence_intervals = FALSE, 
  batch_size = 1000000, 
  ncores = 5,
  log_file = './TEST_variant_calling.log')

An extract of my mutation_df :

chr1    46713371        C       G
chr1    46713698        A       C
chr1    46726876        G       GT
chr1    46739188        AT      A

The error message for cancer detection:

....
  1/409 [..............................] - ETA: 20s���������������������������������������������������
 36/409 [=>............................] - ETA: 0s ��������������������������������������������������
 72/409 [====>.........................] - ETA: 0s��������������������������������������������������
108/409 [======>.......................] - ETA: 0s��������������������������������������������������
144/409 [=========>....................] - ETA: 0s��������������������������������������������������
180/409 [============>.................] - ETA: 0s��������������������������������������������������
216/409 [==============>...............] - ETA: 0s��������������������������������������������������
252/409 [=================>............] - ETA: 0s��������������������������������������������������
288/409 [====================>.........] - ETA: 0s��������������������������������������������������
323/409 [======================>.......] - ETA: 0s��������������������������������������������������
359/409 [=========================>....] - ETA: 0s��������������������������������������������������
395/409 [===========================>..] - ETA: 0s��������������������������������������������������
409/409 [==============================] - 1s 1ms/step
Error in `pmap_dbl()`:
i In index: 1.
Caused by error in `log_likelihood()`:
! Dimensions of list inputs are not equal
Backtrace:
     x
  1. +-dreams::dreams_cc(...)
  2. | \-dreams::call_cancer(...)
  3. |   \-dreams:::get_em_parameter_estimates(...)
  4. |     \-dreams:::get_starting_values(...)
  5. |       \-purrr::pmap_dbl(...)
  6. |         \-purrr:::pmap_("double", .l, .f, ..., .progress = .progress)
  7. |           +-purrr:::with_indexed_errors(...)
  8. |           | \-base::withCallingHandlers(...)
  9. |           +-purrr:::call_with_cleanup(...)
 10. |           \-dreams (local) .f(tf = .l[[1L]][[i]], r = .l[[2L]][[i]], ...)
 11. |             \-dreams:::log_likelihood(...)
 12. |               \-base::stop("Dimensions of list inputs are not equal")
 13. \-base::.handleSimpleError(...)
 14.   \-purrr (local) h(simpleError(msg, call))
 15.     \-cli::cli_abort(...)
 16.       \-rlang::abort(...)
Execution halted

And for variant calling step I had no error message but when I print 'parallel_variant_calls', I've got this:

 message                                   call      
result.1 "Dimensions of list inputs are not equal" Expression
result.2 "Dimensions of list inputs are not equal" Expression
result.3 "Dimensions of list inputs are not equal" Expression
result.4 "Dimensions of list inputs are not equal" Expression
result.5 "Dimensions of list inputs are not equal" Expression

So the same message "Dimensions of list inputs are not equal".
If you have any questions, don't hesitate.

thank you in advance
Romain

Hi Romain

Nice that you got your model! I think it looks as it should.
The error you are getting is due to your mutation_df that contains INDELs. At the moment DREAMS is only able to handle SNVs (but will soon be able to handle INDELS as well). Therefore, the columns REF and alt should only contain single nucleotide letters in each row. Thank you for reporting, I will update the code to give the correct error message.

Good luck and don't hesitate to write again!

Best Regards
Mathilde

Hi Mathilde

I removed INDELS from my mutation dataframe (315 SNVs from Vardict caller) and I have results for cancer detection :

$cancer_info
    tf_est tf_min tf_max     r_est r_min r_max    Q_val     ll_A      ll_0
1 1.441766     NA     NA 0.9269841    NA    NA 27577912 -1644345 -15433301
  mutations_tested est_mutations_present total_coverage total_count
1              315                   292        2752995     1936333
  EM_converged EM_steps fpeval objfeval p_val cancer_detected
1         TRUE        3      5        4     0            TRUE

I know from other tumoral fraction estimation methods that the tumoral fraction of this sample is about 45%. With DREAMS, I obtained a tumor fraction estimate of 1.44. Is this result a percentage or a fraction? Can I have some explanations about the values of other parameters, please?

The variant calling step is currently running. My idea was to use DREAMS to eliminate false positives from the results of variant callers like Vardict. What do you think is the best idea:

  • Use a catalog of several mutations well-known to be implicated in cancer (I work on a panel of 37 genes).
  • Use only variants called by a variant caller to remove variants from this list.

I thought that the first idea might be the best for cancer detection, and the second for variant calling?
What do you think about it ?

Best Regards
Romain