SegataLab/preprocessing

Output bigger than input

Closed this issue · 2 comments

Hi!

I want to make sure I'm running the pipeline correctly. I ran an initial test with two paired-end files in the directory /home/Raw_data:
410M SRR224_R1.fastq.gz  
384M SRR224_R2.fastq.gz

I ran the command:
parallel -j $NCPU 'preprocess.new.py -i /home/Raw_data -s SRR224 -f R1 -r R2 -x /home/GRCh38/index_GRCh38'

But I got:
nohup: ignoring input

Then I ran the command:
preprocess.new.py -i /home/Raw_data -s SRR224 -f R1 -r R2 -x /home/GRCh38/index_GRCh38

It seemed to run, but the output was the following:
1.8G Aug  5 19:20 Raw_data.R1_trimmed.fq
6.0G Aug  5 19:22 Raw_data.data.R2_trimmed.fq

The R2_trimmed.fq output is much larger than I expected; it should be similar in size to R1_trimmed.fq. Could you clarify how to run the pipeline with preprocess.new.py? The GitHub page only has an example with preprocess.sh, and I got several errors with that script.

Hi there!

Those are temporary outputs from trim_galore, which runs at the very beginning of the preprocessing pipeline. Also, judging from the difference in size between the R1 and R2 files, R1 is still being processed.

The command line looks good, but I think you just need to wait for the preprocessing to finish.
You can add the --verbose parameter to get more information printed on the console while the preprocessing is running.
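
If you want to confirm that a temporary file is still being written before deciding to wait, a rough check is to compare its size at two points in time. The helper below is only a sketch: `is_growing` is a hypothetical function name, not part of the preprocessing pipeline, and an unchanged size over a short interval does not prove the step has finished.

```shell
# Hypothetical helper (not part of the pipeline): succeeds (exit 0) if the
# file's size changed during the interval, fails (exit 1) if it did not,
# and returns 2 if the file does not exist.
is_growing() {
    file=$1
    interval=${2:-5}                 # seconds to wait between the two checks
    [ -f "$file" ] || return 2
    before=$(wc -c < "$file")        # size in bytes at the first check
    sleep "$interval"
    after=$(wc -c < "$file")         # size in bytes at the second check
    [ "$before" -ne "$after" ]
}

# Example usage, with the file name from the listing above:
#   is_growing Raw_data.R1_trimmed.fq 10 && echo "still being written"
```

A longer interval gives a more reliable answer, since a step may pause briefly between writes.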

I hope this helps.

Regards,
Francesco

Thanks! It worked.