Hi-C Simulator
Closed this issue · 7 comments
Though the current test is pretty fast within Falcon-Phase (1-2min, with most runtime in bwa), for Unzip+Phase we are hoping to create a smaller workflow test based on synthetic data and simulated reads. We can easily generate a fake genome and simulate Sequel (not accurately, but well enough for this purpose). However, we do not really know how to simulate Hi-C read-pairs. Could you point us to a simulator?
Alternatively, if you could run a simulator yourself, we could use your simulated read-pairs. Given the original genome, we could also provide simulated Sequel subreads, or we could use yours.
We are hoping for something quite small. E.g. we have been using a fake genome only 200kbp long for testing Falcon-Unzip. That already takes several minutes in Falcon+Unzip, so we hope not to use anything larger than that.
@skingan, if you have ideas on this, please let us know.
Hi @pb-cdunn,
I'd be happy to simulate hi-c data for a small test-set (https://github.com/cerebis/sim3C). What I'd need is the following.
-
A set of minced sequences derived from a single primary contig (easy).
-
An overlap index file corresponding to the minced sequences. This can also just be pulled from an existing dataset.
-
The true phase assignment for each minced sequence from parental SNP data.
@skingan should have all of this for the human trio.
Once these things are available I'll write the test harness. It is a small work item.
Best,
Zev
At some point, you both will need to write-up the steps that you took for
- Simulating HiC (using cerebis/sim3C?) from reference fasta, and
- Combing HiC into something we could use for falcon-phase.
Hi @pb-cdunn,
Here are the commands to simulate the interleaved FASTQ files:
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e MluCI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_MluCI ref/synth52k-x-y.fasta sim3c_MluCI
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e Sau3AI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_Sau3AI ref/synth52k-x-y.fasta sim3c_Sau3AI
Still a fork, but renamed: https://github.com/PacificBiosciences/pb-falcon-phase
@skingan, Zev's commands above are necessary, but wouldn't I also need whatever you did to combine them? Explicitly, what did you do, which I then used as input?
@pb-cdunn, also see a slack DM with the file path for a readme with these commands on our cluster
split interleaved files into read1 and read2 for each enzyme
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_MluCI_R1.fastq
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_MluCI_R2.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_Sau3AI_R1.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_Sau3AI_R2.fastq
Concatenate the files together
cat sim3c_Sau3AI_R1.fastq sim3c_MluCI_R1.fastq > sim3c_R1.fastq
cat sim3c_Sau3AI_R2.fastq sim3c_MluCI_R2.fastq > sim3c_R2.fastq
# Here are the commands to simulate the interleaved FASTQ files:
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e MluCI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_MluCI ref/synth52k-x-y.fasta sim3c_MluCI
python ~/tools/sim3C/sim3C.py -C gzip -r 1984 -n 100000 --machine-profile HiSeqXtruSeqL150 -l 150 -e Sau3AI -m hic --trans-rate 0.01 --dist equal --prefix sim3c_Sau3AI ref/synth52k-x-y.fasta sim3c_Sau3AI
# HiC library preparation involves cutting the cross-linked DNA with restriction enzymes
# Cross-linking preserves contacts between pieces of DNA that are on the same chromosome but can be Mb away
# Standard sequencing used paired-end illumina (read1 and read2)
# We have simulated datasets for two different enzymes, MluCI and Sau3AI, R1 and R2 are interleaved
# We can concatenate the read 1 and read 2 sequences together for the two enzymes
# And list the restriction enzyme recognition (or cut) sites in the config file (line 26)
# split interleaved files into read1 and read2 for each enzyme
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_MluCI_R1.fastq
zcat sim3c_MluCI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_MluCI_R2.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==1) || (NR%8==2) || (NR%8==3) || (NR%8==4)' > sim3c_Sau3AI_R1.fastq
zcat sim3c_Sau3AI.fastq.gz | awk '(NR%8==5) || (NR%8==6) || (NR%8==7) || (NR%8==0)' > sim3c_Sau3AI_R2.fastq
# Concatenate the files together
cat sim3c_Sau3AI_R1.fastq sim3c_MluCI_R1.fastq > sim3c_R1.fastq
cat sim3c_Sau3AI_R2.fastq sim3c_MluCI_R2.fastq > sim3c_R2.fastq
(Triple backticks for fixed-width font.) Thanks, Sarah and Zev!