README

Program Parameters:

  • seq_len - Length of the sequence user wishes to simulate
  • Perror - Rate of error in illumuna reading; Perror = 0.2 means there is 20% chance of a nucleotide reading being wrong
  • c - Coverage of the sequence (As given in the problem)
  • paired_end - If data is paired end (True/False; Optional)
  • error_accumulation - From left to right, for each base pair, how much % error is accumulated.

Parameters in the defined problem: c : Genotyping. Every true nucleotide at a homozygous locus is read c times in the c reads covering the locus. paired_end : Paired read ends that overlap. Every true nucleotide in the overlap is read twice.

Logic:

  1. The number of sequences to be simulated is defined; they are equal to the c parameter (Genotyping). If the data is paired end, the iterations are doubled. (iteration 1)
  2. In each iteration (iteration 1), another iteration (iteration2) is run to go through each base in the sequence.
  3. In each iteration2 a random number is generated from uniform distribution, if that number is less than parameter 'Perror' then the base at that position is changed (error occurs).
  4. If the error occurs, the probability is generated randomly for 3 base pairs to be replaced. Base with highest generated probability replaces the original base. The quality score is then generated from 0-10 (Where, 10=X) using beta distributions with parameters alpha=3.1 and beta=0.9 initially as these parameters are derived from a analysis of a FASTQ file (can be extended to big data) and later beta is increased based on the value of 'error_accumulation' parameter to make distribution more left skewed to decrease the quality score. Random number generated is converted to quality score by multiplying by 10 and rounding off to the nearest integer. *Optional: Please run get_parameters() function and refer to Fig1.png in the same folder for the details. (or uncomment the call before main function)
  5. If the error does not occur, the base at that position remains the same and the quality score is generated by using the same procedure mentioned above.
  6. Error is per cent increase in the 'Perror' parameter AND Quality score for each iteration2 with user defined parameter 'error_accumulation'

Output format:

  1. This section informs about the original simulated true sequence (X will be replace by A/T/G/C)

Original Sequence: XXXXX

  1. This section informs about the n (number of iterations= 'c'*2 if paired end; 'c' if not) simulated sequence (X will be replaced by A/T/G/C) With simulated quality score (q will be replaced by a quality score)

SimulatedSequences:

n XXXXXXXXXX qqqqqqqqqq


How to run the program: python assignment1.py seq_len Perror c error_accumulation

Example: python.exe .\assignment1.py 200 0.5 8 1

NOTE: - check output.txt file after running the program for the output. - paired_end is True by default. - Requirements: numpy, scipy, matplotlib (optional to generate Fig1)