
Error correction for Illumina RNA-seq reads

Described in:

Song, L., Florea, L., Rcorrector: Efficient and accurate error correction for Illumina RNA-seq reads. GigaScience. 2015, 4:48.

Copyright (C) 2012-2013, and GNU GPL, by Li Song and Liliana Florea

Rcorrector includes the program Jellyfish2.

What is Rcorrector?

Rcorrector (RNA-seq error CORRECTOR) is a kmer-based error correction method for RNA-seq data.

Rcorrector can also be applied to other types of sequencing data where the read coverage is non-uniform, such as single-cell sequencing.


  1. Clone the GitHub repo, e.g. with git clone https://github.com/mourisl/rcorrector.git
  2. Run make in the repo directory. During the make procedure, the script checks whether jellyfish2 is in $PATH. If it is not, the script downloads and compiles jellyfish2 from its repository.


Usage: perl run_rcorrector.pl [OPTIONS]
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-maxcorK INT: the maximum number of corrections within a k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate the weak kmer count threshold; lower for more divergent genomes (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-tmpd temp_file_directory: directory for temporary files (default: ./)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from the beginning (storing kmers in the bloom filter);
		1-start from counting the kmers that showed up in the bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.

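As an illustration of how the options above combine, here is a minimal Python sketch that assembles a paired-end command line. The helper name build_rcorrector_command and its parameter names are hypothetical and not part of Rcorrector; only the flags -1, -2, -k, -t and -od come from the usage text, and the script is assumed to be in the current directory.

```python
def build_rcorrector_command(left_files, right_files,
                             kmer_length=23, threads=1, output_dir="./"):
    """Return the argument list for a paired-end run_rcorrector.pl call.

    Hypothetical helper: it only assembles arguments from the documented
    flags and does not run anything.
    """
    # The usage text states that the kmer length must be <= 32.
    if kmer_length > 32:
        raise ValueError("kmer_length must be <= 32")
    return [
        "perl", "run_rcorrector.pl",
        "-1", ",".join(left_files),    # comma separated first-mate files
        "-2", ",".join(right_files),   # comma separated second-mate files
        "-k", str(kmer_length),
        "-t", str(threads),
        "-od", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_rcorrector_command(
        ["Sample/sample_read1.fq"], ["Sample/sample_read2.fq"], threads=4
    )
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains run_rcorrector.pl.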

For each input file, Rcorrector generates a corresponding output file named "*.cor.fq" or "*.cor.fa" in the directory specified by "-od".

Rcorrector appends some information to the header line of each read:

"cor": some bases of the sequence are corrected
"unfixable_error": the errors could not be corrected
"l:INT m:INT h:INT": the lowest, median and highest kmer count of the kmers from the read

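The header annotations listed above can be read back programmatically. The following is a minimal Python sketch, assuming the annotations appear as whitespace-separated tokens on the header line; the exact header layout, the function name parse_rcorrector_header and the example header are assumptions for illustration, not taken from Rcorrector itself.

```python
import re

# Matches the "l:INT m:INT h:INT" annotation (lowest, median, highest kmer count).
KMER_COUNTS = re.compile(r"\bl:(\d+)\s+m:(\d+)\s+h:(\d+)")


def parse_rcorrector_header(header):
    """Extract Rcorrector annotations from one read header line.

    Assumes the annotations are whitespace-separated tokens (hypothetical layout).
    """
    tokens = header.split()
    result = {
        "corrected": "cor" in tokens,              # some bases were corrected
        "unfixable": "unfixable_error" in tokens,  # errors could not be corrected
        "lowest": None,
        "median": None,
        "highest": None,
    }
    match = KMER_COUNTS.search(header)
    if match:
        lowest, median, highest = (int(v) for v in match.groups())
        result.update(lowest=lowest, median=median, highest=highest)
    return result


if __name__ == "__main__":
    # Hypothetical header line, for illustration only.
    print(parse_rcorrector_header("@read1 l:5 m:20 h:35 cor"))
```

A filter built on this could, for example, drop reads flagged "unfixable_error" before assembly.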

A small sample data set is included; you can run Rcorrector on it with:

perl run_rcorrector.pl -1 Sample/sample_read1.fq -2 Sample/sample_read2.fq



