phasegenomics/FALCON-Phase

Hi-C Simulator

Closed this issue · 7 comments

Though the current test is pretty fast within Falcon-Phase (1-2min, with most runtime in bwa), for Unzip+Phase we are hoping to create a smaller workflow test based on synthetic data and simulated reads. We can easily generate a fake genome and simulate Sequel (not accurately, but well enough for this purpose). However, we do not really know how to simulate Hi-C read-pairs. Could you point us to a simulator?

Alternatively, if you could run a simulator yourself, we could use your simulated read-pairs. Given the original genome, we could also provide simulated Sequel subreads, or we could use yours.

We are hoping for something quite small. E.g. we have been using a fake genome only 200kbp long for testing Falcon-Unzip. That already takes several minutes in Falcon+Unzip, so we hope not to use anything larger than that.

@skingan, if you have ideas on this, please let us know.

zeeev commented

Hi @pb-cdunn,

I'd be happy to simulate hi-c data for a small test-set (https://github.com/cerebis/sim3C). What I'd need is the following.

  1. A set of minced sequences derived from a single primary contig (easy).

  2. An overlap index file corresponding to the minced sequences. This can also just be pulled from an existing dataset.

  3. The true phase assignment for each minced sequence from parental SNP data.

@skingan should have all of this for the human trio.

Once these things are available I'll write the test harness. It is a small work item.

Best,

Zev

@pb-cdunn @zeeev I understand what you both need. I will generate this from the hbird today.
Sarah

At some point, you both will need to write-up the steps that you took for

  1. Simulating HiC (using cerebis/sim3C?) from reference fasta, and
  2. Combing HiC into something we could use for falcon-phase.
zeeev commented

Hi @pb-cdunn,

Here are the commands to simulate the interleaved FASTQ files:

python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e MluCI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_MluCI ref/synth52k-x-y.fasta sim3c_MluCI
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e Sau3AI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_Sau3AI ref/synth52k-x-y.fasta sim3c_Sau3AI

Still a fork, but renamed: https://github.com/PacificBiosciences/pb-falcon-phase

@skingan, Zev's commands above are necessary, but wouldn't I also need whatever you did to combine them? Explicitly, what did you do, which I then used as input?

@pb-cdunn, also see a slack DM with the file path for a readme with these commands on our cluster

split interleaved files into read1 and read2 for each enzyme

zcat sim3c_MluCI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_MluCI_R1.fastq
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_MluCI_R2.fastq

zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_Sau3AI_R1.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_Sau3AI_R2.fastq

Concatenate the files together

cat sim3c_Sau3AI_R1.fastq sim3c_MluCI_R1.fastq > sim3c_R1.fastq
cat sim3c_Sau3AI_R2.fastq sim3c_MluCI_R2.fastq > sim3c_R2.fastq

# Here are the commands to simulate the interleaved FASTQ files:

python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e MluCI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_MluCI ref/synth52k-x-y.fasta sim3c_MluCI
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e Sau3AI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_Sau3AI ref/synth52k-x-y.fasta sim3c_Sau3AI


# HiC library preparation involves cutting the cross-linked DNA with restriction enzymes
# Cross-linking preserves contacts between pieces of DNA that are on the same chromosome but can be Mb away
# Standard sequencing used paired-end illumina (read1 and read2)
# We have simulated datasets for two different enzymes, MluCI and Sau3AI, R1 and R2 are interleaved
# We can concatenate the read 1 and read 2 sequences together for the two enzymes
# And list the restriction enzyme recognition (or cut) sites in the config file (line 26)

# split interleaved files into read1 and read2 for each enzyme
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_MluCI_R1.fastq
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_MluCI_R2.fastq

zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_Sau3AI_R1.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_Sau3AI_R2.fastq

# Concatenate the files together
cat sim3c_Sau3AI_R1.fastq sim3c_MluCI_R1.fastq > sim3c_R1.fastq
cat sim3c_Sau3AI_R2.fastq sim3c_MluCI_R2.fastq > sim3c_R2.fastq

(Triple backticks for fixed-width font.) Thanks, Sarah and Zev!