PopicLab/cue

Linked reads model

Opened this issue · 6 comments

I guess the current model v2 (https://storage.googleapis.com/cue-models/latest/cue.v2.pt) does not include support for linked-read signals? If so, is there any plan to support this in a coming model? From #14 it seems that you plan on releasing a PacBio model by the end of this summer; does this include releasing a model for linked reads?

viq854 commented

Hi @pontushojer,

Yes, the v2 model is just for short reads. For linked reads, only a proof-of-concept model has been trained so far -- this model was released to reproduce the demo benchmarks in the paper and is not recommended for use on real data. However, the framework can be used to further train this model if needed. Which linked-read technology do you have in mind? Since 10X linked reads have been discontinued, we haven't yet selected another platform to target, although we have seen very promising results for SV calling with that data type. Happy to provide support if there is a specific use case.

pontushojer commented

We mostly work with our own DBS method for generating linked reads, described here. But I have also run some 10X, TELL-Seq and stLFR libraries. There is also the haplotagging method, which I have very little experience with. It would be nice to have a model that works on all of these. If I understand the method correctly, the only extra signal Cue currently uses for linked reads is split molecules. I guess that signal should look fairly similar across the different technologies, even though there are some differences, for example in the number of molecules per barcode. But I guess that would have to be tested. What do you think?
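To make the split-molecule signal concrete: the standard way to infer molecules from barcoded alignments is to group reads by barcode and split a group into separate molecules wherever the gap between consecutive alignments exceeds a distance threshold. The sketch below is illustrative only -- the function name, input format, and the 50 kb gap threshold are assumptions, not Cue's actual implementation.

```python
from collections import defaultdict

def infer_molecules(reads, max_gap=50_000):
    """Group (barcode, chrom, pos) read tuples into inferred molecules.

    Reads sharing a barcode on the same chromosome are assumed to come
    from one molecule unless consecutive positions are more than
    max_gap apart, in which case the molecule is split.
    """
    by_barcode = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_barcode[(barcode, chrom)].append(pos)

    molecules = []  # (barcode, chrom, start, end)
    for (barcode, chrom), positions in by_barcode.items():
        positions.sort()
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos - prev > max_gap:  # gap too large -> new molecule
                molecules.append((barcode, chrom, start, prev))
                start = pos
            prev = pos
        molecules.append((barcode, chrom, start, prev))
    return molecules
```

For example, three reads with the same barcode at positions 100, 10,000, and 200,000 would yield two inferred molecules, since the last read sits beyond the gap threshold; the distant split is exactly the kind of signal an SV caller can exploit.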

viq854 commented

Yep, the only additional signal for linked reads is the split-molecule/barcode information. I have looked into the 10X and stLFR images for HG002 and they do differ enough to confound the model; e.g. a model trained on 10X didn't work very well on stLFR directly. So for best results, we would need to have read simulators for these different platforms and/or a good estimate for the key params (e.g. the barcode and molecule coverage) to use with LRSIM. Haven't looked into stLFR vs TELL-Seq yet; definitely would be good to test. Could reads for your method be simulated reasonably well with existing simulators, and are they similar in params to any of the other three linked-read methods?

pontushojer commented

I guess you could simulate them using LRSIM; I have not tried yet. Our DBS method generates results that are fairly similar to TELL-Seq and stLFR, but there can be big variations depending on how the data was generated. I would say the same holds true for all linked-read datasets. For example, molecule coverage is influenced by the sequencing depth.
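The dependence between these parameters is worth spelling out. A common decomposition in the linked-read literature is C = C_F x C_R: total read coverage equals the coverage of the genome by molecules (C_F) times the read depth within each molecule (C_R). A minimal calculator, with illustrative numbers (the molecule count and length below are assumptions, not measured values for any specific library):

```python
def molecule_coverage(n_molecules, mean_mol_len, genome_size):
    """C_F: coverage of the genome by molecules alone."""
    return n_molecules * mean_mol_len / genome_size

def per_molecule_read_depth(total_read_coverage, c_f):
    """C_R: read depth within a molecule, from C = C_F * C_R."""
    return total_read_coverage / c_f

# Example: ~1.1M molecules of ~50 kb mean length on a ~3.1 Gb genome
c_f = molecule_coverage(1_100_000, 50_000, 3_100_000_000)  # ~17.7x
c_r = per_molecule_read_depth(30, c_f)  # at 30x total depth, ~1.7x
```

This shows why sequencing the same library deeper raises C_R while leaving C_F untouched, whereas changing the partitioning chemistry shifts C_F -- two libraries can share total depth yet look quite different to the model.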

What parameters did you use for generating the 10X training set with LRSIM? I could see that if these were very narrowly defined, it might be hard to use the model on stLFR reads, for example. Would it be a good idea to define bounds for the key parameters and generate a few simulated datasets for training based on those? That way one could maybe avoid overfitting the model to a specific linked-read technology.

viq854 commented

I think the defaults in LRSIM were set based on 10X, so I've primarily used those for benchmarking, plus some variations of depth/key params for exploration and experimental training. There was also a separate simulator for stLFR, which I tried a while back; it seemed to match real data better. Definitely agree on varying key parameters to generate a more diverse training dataset (esp. coverage and molecule lengths, which are expected to vary) -- even if targeting a single platform. What we'd also have to see is how much of the difference within/across these platforms can be explained by tuning these known params, and how much is due to data-generation idiosyncrasies/biases that might require additional modeling. I've been considering a GAN-based approach to address the latter, but that will take a bit of time. It might be easiest to first just take a look at the Cue and IGV images from the real data you have and from an LRSIM setting based on your param ranges. Happy to transition to email/Zoom to talk through more details, and thanks for all the info -- we can definitely commit some resources sooner to address linked reads as well, given community needs.

Hi there, I'm following this thread and would like to join the discussion. I'm part of the team that works on haplotag linked reads and, much like @pontushojer, have a lot of interest in using this platform for variant calling. I think the main difference between haplotag data and other linked-read formats is that the barcode appears in the read headers (rather than in the sequence line) as the terminal tag BX:Z:A**C**B**D**, where each ** is a two-digit number and a 00 segment marks an invalid barcode (e.g. A01C34B11D93 is valid, A12C00B75D32 is not). In the alignment, the reads carry the same BX:Z:A**C**B**D** tag.
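Supporting this format would mainly mean parsing and validating the AxxCxxBxxDxx barcode string described above. A minimal sketch, with function names of my own choosing (the 00-means-invalid rule is from the comment above; everything else is an assumption):

```python
import re

# Haplotag barcode: four segments A, C, B, D, each a two-digit number.
BX_RE = re.compile(r"^A(\d{2})C(\d{2})B(\d{2})D(\d{2})$")

def parse_haplotag_bx(tag):
    """Return the four segment numbers, or None if the tag is malformed."""
    match = BX_RE.match(tag)
    if match is None:
        return None
    return tuple(int(segment) for segment in match.groups())

def is_valid_barcode(tag):
    """A barcode is valid only if it parses and no segment is 00."""
    segments = parse_haplotag_bx(tag)
    return segments is not None and 0 not in segments
```

With this, the two examples from the comment behave as described: A01C34B11D93 is accepted, while A12C00B75D32 is rejected because of its 00 segment.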