Bypassing Checkout and custom read overlap : details needed
cgirardot opened this issue · 7 comments
Dear all,
I am new to this package. I have a somewhat non standard set-up with 3 fastq files : read 1, read 2 and index file. The index.fastq contains both the sample barcode (to perform usual demultiplexing) and a UMI. As far as I understood this is not something I can do with Checkout so I wrote my own code that generates demultiplexed data containing the UMI slot as expected by MIGEC. Moving step 2, MIGEC complained about missing checkout.filelist.txt
; reading the code I created a 4 column file like :
sample_name ; "paired" ; file1.fastq.gz ; file2.fastq.gz
Question 1 : this made the trick for Assembly
but I wonder if this is the right format? What is column 2 supposed to contain ?
Moving fwd to CdrBlastBatch
, I first generated overlapped sequences (using pear) but CdrBlastBatch
now complains about missing assemble.log.txt
Question 2: Could you please let me know how to fake this file ?
Thanks in advance
I now understood that for:
- question 1 : column 2 does expect "paired"
- question 2: I did a mistake in the command line
I am now left with the following question : how do I re-integrate the overlapped sequences I created with PEAR using the Assembly
step output into the pipeline ?
Hi There. I am not the author, so I can just give some opinion there.
Question 1: The column 2 contain the string "paired", without the quote.
Question 2: How is it possible that the Assembly function did not create the assemble.log.txt file? I suspect that you are missing some log files.
The index file is generated by the sequencer, but I believe not needed for your purpose. Try to run the MIGEC workflow using only the read1 and read2 fastq files. If you have the description of the primers with UMI properly done (Master, slave etc...), you should actually be fine.
HI @GildasLepennetier
thank you for the answer. Sorry i guess I wasn't t clear on my comment this morning : I did a mistake in the CdrBlastBatch
step so my question 2 is not irrelevant.
I try running MiGEC on single cell data and I need to use the index file (this is not an Illumina sample index file) ; it is drop-seq data so the sample (==cell) barcode and the UMI are found in this file. But this is OK, I can demultiplex and create an output similar to Checkout.
Then I Assembly
my demultiplexed reads.
Using the results created by Assembly
, I create a consensus using PEAR.
This works great till this point but I am not sure how to continue towards CdrBlastBatch
now that I have:
- a
checkout
folder with the demultiplexed reads iedrop1_1.fastq.gz
&drop1_2.fastq.gz
- an
assembly
folder with the assembled files fromcheckout
iedrop1_1.t32.cf.fastq.gz
&drop1_2.t32.cf.fastq.gz
- an
overlap
folder with egdrop1_12.t32.cf.fastq.gz
Should I create a folder combining the assembly
and overlap
folder, and give this to CdrBlastBatch
?
If yes :
- is it an issue to have assemblies for the same MIG in both the _1/_2.fastq files and the _12.fastq file ? (I think I read somewhere it is OK)
- Should I modify the
assemble.log.txt
to reflect my _12.fastq files ? How ?
If you could share an example of assemble.log.txt
produced with both paired & overlapped inputs, that might help me. Thx!
Hi @cgirardot,
First, I would definitely recommend using MIXCR instead of CdrBlastBatch, as the former is actively updated. MIXCR accepts paired reads, so no need to do a separate overlapping procedure, you just run MIGEC checkout/assemble and then pass R1/2 to MIXCR. You can also try our new tool MINNN that is in beta stage right now and we plan to provide docs for scRNAseq analysis in a couple of weeks.
I think the Assembly function is doing the consensus. No need to use PEAR. In the assemble.log.txt files, you should have count concerning the MIG.
Check in your drop1_1.t32.cf.fastq.gz file. You should have something like
@MIG.857 R1 UMI:AGTTCTTCGTTG:16
In my case, this 16 means that 16 reads sequences were used for the consensus of the following sequence.
use a command like follow, giving both output folder to the function:
CdrBlastBatch
[options] checkout_dir/ assemble_dir/ output_dir/
The assemble.log.txt has 25 column in my case, but I do not think you can (and should) generate it yourself, since it contain several things that are calculated by and for MIGEC (e.g. MIGS_DROPPED_COLLISION_1, MIGS_DROPPED_OVERSEQ_1, READS_DROPPED_WITHIN_MIG_1).
At this point trying to bypass each step seem like a bad idea to me.
Hi @mikessh,
I have now run MIXCR analyze amplicon
and also CdrBlastBatch with _1/_2 only
; and I am a bit puzzled by the results.
Indeed, with MIXCR only a small fraction (<10%) of my drops gave me a *.clonotypes.IGH.txt
with at least one results. The others are either empty (headers only) or fail with e.g :
WARNING: 7 functional genes were excluded, re-run with --verbose option to see the list of excluded genes and exclusion reason.
Alignment: 100%
============= Report ==============
Analysis time: 4.20s
Total sequencing reads: 63
Successfully aligned reads: 59 (93.65%)
Paired-end alignment conflicts eliminated: 2 (3.17%)
Alignment failed, no hits (not TCR/IG?): 3 (4.76%)
Alignment failed because of absence of J hits: 1 (1.59%)
Overlapped: 60 (95.24%)
Overlapped and aligned: 57 (90.48%)
Alignment-aided overlaps: 38 (66.67%)
Overlapped and not aligned: 3 (4.76%)
IGH chains: 59 (100%)
Initialization: progress unknown
Preparing for sorting: progress unknown
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.milaboratory.mixcr.cli.CommandAnalyze$CommandAmplicon@7b2a3ff8): java.lang.IllegalArgumentException: Writing empty block.
at picocli.CommandLine.execute(CommandLine.java:1004)
at picocli.CommandLine.access$900(CommandLine.java:142)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1199)
at com.milaboratory.mixcr.cli.Main$1.handle(Main.java:78)
at com.milaboratory.mixcr.cli.Main$1.handle(Main.java:61)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1075)
at com.milaboratory.mixcr.cli.Main.handleParseResult(Main.java:83)
at com.milaboratory.mixcr.cli.Main.main(Main.java:55)
Caused by: java.lang.IllegalArgumentException: Writing empty block.
at com.milaboratory.mixcr.basictypes.AlignmentsIO.writeBlock(AlignmentsIO.java:139)
at com.milaboratory.mixcr.basictypes.BasicVDJCAlignmentWriterFactory$Writer.writeSync(BasicVDJCAlignmentWriterFactory.java:221)
at com.milaboratory.mixcr.basictypes.ClnAWriter.writeAlignmentsAndIndex(ClnAWriter.java:311)
at com.milaboratory.mixcr.cli.CommandAssemble.run1(CommandAssemble.java:222)
at com.milaboratory.cli.ACommandWithSmartOverwrite.run0(ACommandWithSmartOverwrite.java:118)
at com.milaboratory.cli.ACommand.run(ACommand.java:90)
at com.milaboratory.mixcr.cli.CommandAnalyze.run0(CommandAnalyze.java:649)
at com.milaboratory.cli.ACommand.run(ACommand.java:90)
at picocli.CommandLine.execute(CommandLine.java:996)
... 7 more
Given options were --verbose --receptor-type igh -s mmu --only-productive --adapters adapters-present --5-end v-primers --3-end c-primers --starting-material rna file1.fasq file2.fastq analysis
But if I look at the results from MIGEC CdrBlastBatch
, I have perfectly valid hits for this sample eg for the asm
result file:
Count Fraction CDR3 nucleotide sequence CDR3 amino acid sequence V segments J segments D segments Last V nucleotide position First D nucleotide position Last D nucleotide position First J nucleotide position Good events Total events Good reads Total reads
56 0.9491525424 TGTGCAAGTCAGACGAATTTATTACTATGGCCCTTTGCTTACTGG CASQTNLLLWPFAYW IGHV1S45*01 IGHJ3*01 IGHD1-1*01 7 17 26 33 56 56 77554 77554
2 0.0338983051 TGTGCAAGTCAGACGAATTTATTACTATGGCCTTTTGCTTACTGG CASQTNLLLWPFAYW IGHV1S45*01 IGHJ3*01 IGHD1-1*01 7 17 26 33 2 2 167 167
1 0.0169491525 TGTGCAAGTCAGACGAATTTATTACTATGGCCCTTTGCCTACTGG CASQTNLLLWPFAYW IGHV1S45*01 IGHJ3*01 IGHD1-1*01 7 17 26 33 1 1 1300 1300
Any idea what I am doing wrong ?