mikessh/migec

Bypassing Checkout and custom read overlap : details needed

cgirardot opened this issue · 7 comments

Dear all,

I am new to this package. I have a somewhat non standard set-up with 3 fastq files : read 1, read 2 and index file. The index.fastq contains both the sample barcode (to perform usual demultiplexing) and a UMI. As far as I understood this is not something I can do with Checkout so I wrote my own code that generates demultiplexed data containing the UMI slot as expected by MIGEC. Moving step 2, MIGEC complained about missing checkout.filelist.txt ; reading the code I created a 4 column file like :

sample_name ; "paired" ; file1.fastq.gz ; file2.fastq.gz

Question 1 : this made the trick for Assembly but I wonder if this is the right format? What is column 2 supposed to contain ?

Moving fwd to CdrBlastBatch , I first generated overlapped sequences (using pear) but CdrBlastBatch now complains about missing assemble.log.txt

Question 2: Could you please let me know how to fake this file ?

Thanks in advance

I now understood that for:

  • question 1 : column 2 does expect "paired"
  • question 2: I did a mistake in the command line

I am now left with the following question : how do I re-integrate the overlapped sequences I created with PEAR using the Assembly step output into the pipeline ?

Hi There. I am not the author, so I can just give some opinion there.

Question 1: The column 2 contain the string "paired", without the quote.
Question 2: How is it possible that the Assembly function did not create the assemble.log.txt file? I suspect that you are missing some log files.

The index file is generated by the sequencer, but I believe not needed for your purpose. Try to run the MIGEC workflow using only the read1 and read2 fastq files. If you have the description of the primers with UMI properly done (Master, slave etc...), you should actually be fine.

HI @GildasLepennetier
thank you for the answer. Sorry i guess I wasn't t clear on my comment this morning : I did a mistake in the CdrBlastBatch step so my question 2 is not irrelevant.

I try running MiGEC on single cell data and I need to use the index file (this is not an Illumina sample index file) ; it is drop-seq data so the sample (==cell) barcode and the UMI are found in this file. But this is OK, I can demultiplex and create an output similar to Checkout.
Then I Assembly my demultiplexed reads.
Using the results created by Assembly, I create a consensus using PEAR.

This works great till this point but I am not sure how to continue towards CdrBlastBatch now that I have:

  • a checkout folder with the demultiplexed reads ie drop1_1.fastq.gz & drop1_2.fastq.gz
  • an assembly folder with the assembled files from checkout ie drop1_1.t32.cf.fastq.gz & drop1_2.t32.cf.fastq.gz
  • an overlap folder with eg drop1_12.t32.cf.fastq.gz

Should I create a folder combining the assembly and overlap folder, and give this to CdrBlastBatch ?
If yes :

  • is it an issue to have assemblies for the same MIG in both the _1/_2.fastq files and the _12.fastq file ? (I think I read somewhere it is OK)
  • Should I modify the assemble.log.txt to reflect my _12.fastq files ? How ?

If you could share an example of assemble.log.txt produced with both paired & overlapped inputs, that might help me. Thx!

Hi @cgirardot,

First, I would definitely recommend using MIXCR instead of CdrBlastBatch, as the former is actively updated. MIXCR accepts paired reads, so no need to do a separate overlapping procedure, you just run MIGEC checkout/assemble and then pass R1/2 to MIXCR. You can also try our new tool MINNN that is in beta stage right now and we plan to provide docs for scRNAseq analysis in a couple of weeks.

Hi @mikessh ,

ok Thx! Will move to MIXCR then. Will also check MINNN :-)

I think the Assembly function is doing the consensus. No need to use PEAR. In the assemble.log.txt files, you should have count concerning the MIG.
Check in your drop1_1.t32.cf.fastq.gz file. You should have something like

@MIG.857 R1 UMI:AGTTCTTCGTTG:16

In my case, this 16 means that 16 reads sequences were used for the consensus of the following sequence.
use a command like follow, giving both output folder to the function:

CdrBlastBatch [options] checkout_dir/ assemble_dir/ output_dir/

The assemble.log.txt has 25 column in my case, but I do not think you can (and should) generate it yourself, since it contain several things that are calculated by and for MIGEC (e.g. MIGS_DROPPED_COLLISION_1, MIGS_DROPPED_OVERSEQ_1, READS_DROPPED_WITHIN_MIG_1).
At this point trying to bypass each step seem like a bad idea to me.

Hi @mikessh,

I have now run MIXCR analyze amplicon and also CdrBlastBatch with _1/_2 only ; and I am a bit puzzled by the results.
Indeed, with MIXCR only a small fraction (<10%) of my drops gave me a *.clonotypes.IGH.txt with at least one results. The others are either empty (headers only) or fail with e.g :

WARNING: 7 functional genes were excluded, re-run with --verbose option to see the list of excluded genes and exclusion reason.
Alignment: 100%
============= Report ==============
Analysis time: 4.20s
Total sequencing reads: 63
Successfully aligned reads: 59 (93.65%)
Paired-end alignment conflicts eliminated: 2 (3.17%)
Alignment failed, no hits (not TCR/IG?): 3 (4.76%)
Alignment failed because of absence of J hits: 1 (1.59%)
Overlapped: 60 (95.24%)
Overlapped and aligned: 57 (90.48%)
Alignment-aided overlaps: 38 (66.67%)
Overlapped and not aligned: 3 (4.76%)
IGH chains: 59 (100%)
Initialization: progress unknown
Preparing for sorting: progress unknown
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.milaboratory.mixcr.cli.CommandAnalyze$CommandAmplicon@7b2a3ff8): java.lang.IllegalArgumentException: Writing empty block.
	at picocli.CommandLine.execute(CommandLine.java:1004)
	at picocli.CommandLine.access$900(CommandLine.java:142)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:1199)
	at com.milaboratory.mixcr.cli.Main$1.handle(Main.java:78)
	at com.milaboratory.mixcr.cli.Main$1.handle(Main.java:61)
	at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1075)
	at com.milaboratory.mixcr.cli.Main.handleParseResult(Main.java:83)
	at com.milaboratory.mixcr.cli.Main.main(Main.java:55)
Caused by: java.lang.IllegalArgumentException: Writing empty block.
	at com.milaboratory.mixcr.basictypes.AlignmentsIO.writeBlock(AlignmentsIO.java:139)
	at com.milaboratory.mixcr.basictypes.BasicVDJCAlignmentWriterFactory$Writer.writeSync(BasicVDJCAlignmentWriterFactory.java:221)
	at com.milaboratory.mixcr.basictypes.ClnAWriter.writeAlignmentsAndIndex(ClnAWriter.java:311)
	at com.milaboratory.mixcr.cli.CommandAssemble.run1(CommandAssemble.java:222)
	at com.milaboratory.cli.ACommandWithSmartOverwrite.run0(ACommandWithSmartOverwrite.java:118)
	at com.milaboratory.cli.ACommand.run(ACommand.java:90)
	at com.milaboratory.mixcr.cli.CommandAnalyze.run0(CommandAnalyze.java:649)
	at com.milaboratory.cli.ACommand.run(ACommand.java:90)
	at picocli.CommandLine.execute(CommandLine.java:996)
	... 7 more

Given options were --verbose --receptor-type igh -s mmu --only-productive --adapters adapters-present --5-end v-primers --3-end c-primers --starting-material rna file1.fasq file2.fastq analysis

But if I look at the results from MIGEC CdrBlastBatch, I have perfectly valid hits for this sample eg for the asm result file:

Count	Fraction	CDR3 nucleotide sequence	CDR3 amino acid sequence	V segments	J segments	D segments	Last V nucleotide position	First D nucleotide position	Last D nucleotide position	First J nucleotide position	Good events	Total events	Good reads	Total reads
56	0.9491525424	TGTGCAAGTCAGACGAATTTATTACTATGGCCCTTTGCTTACTGG	CASQTNLLLWPFAYW	IGHV1S45*01	IGHJ3*01	IGHD1-1*01	7	17	26	33	56	56	77554	77554
2	0.0338983051	TGTGCAAGTCAGACGAATTTATTACTATGGCCTTTTGCTTACTGG	CASQTNLLLWPFAYW	IGHV1S45*01	IGHJ3*01	IGHD1-1*01	7	17	26	33	2	2	167	167
1	0.0169491525	TGTGCAAGTCAGACGAATTTATTACTATGGCCCTTTGCCTACTGG	CASQTNLLLWPFAYW	IGHV1S45*01	IGHJ3*01	IGHD1-1*01	7	17	26	33	1	1	1300	1300

Any idea what I am doing wrong ?