Obtaining genome wide contact matrix file in .npy/related format?
tarak77 opened this issue · 24 comments
Hello @lh3 @tanlongzhi ,
I know that we can convert .con to .hic file in the dip-c repository, but Is there a way to get the genome wide contact matrix in .npy or related matrix format at different resolutions? say 250kb, using the impute.pairs file as input(be it for my haploid cell or diploid cell). Though juicer tools have dump command to extract chromosome wise contact matrix as .txt file, we cannot extract the entire genowide contact matrix.
Any help will be awesome!
Hi Tarak,
In dip-c, you can generate a genome-wide matrix with dip-c bincon
, and then generate an accompanying bin description with dip-c bincon -i
. However, this function is not thoroughly tested, and can be slow and memory-consuming if the bin size -b INT
is set too small.
Cool, is there a similar command in dip-c repo that I could refer, just to know the formatting and what input file to give?
I was talking about the dip-c repo. Because you're not the only one asking, I will probably update the dip-c README for more instructions.
The dip-c README has been updated with instructions (tanlongzhi/dip-c@8dae2a8).
Hi Dr. Tan,
I used the hickit to process my dip-c data. but I got 6709 contact pairs. Is it enough.
Later, I use the "hickit -i in.pairs -u -o imput.pairs -Sr1m -c1 -r10m -c5 -b4m -b1m -b200k -D5 -b50k -D5 -b20k -O out.3dg" to analyze data and got almost no results in out.3dg file.
I attach the contacts.pairs file that I used and the out.3dg file. Please help me find out what's the issue is.
Hi @wuheng9,
7 k contacts are too few for successful 3D reconstruction. In our original study, we had ~1 m per cell. I imagine the algorithm might work with fewer (e.g. 200 k) for low-resolution reconstruction.
I looked at the files that you sent via email. It seems that your contacts are mostly very short-range -- 90% of your intra-chromosomal contacts are within 100 kb.
Assuming your sequencing data had sufficient depth, the lack of contacts is probably caused by insufficient restriction digestion. We have seen something similar in challenging cell types, in particular mammalian sperm -- do you think that's what happened?
Btw, to know whether your digestion worked at all, you can visually examine a few contact locations in your BAM file, and see if they coincide with the expected restriction site (e.g. GATC for MboI and DpnII).
Best,
Tan
Hi @wuheng9,
To check digestion, I typically take 5-10 uL from the digestion reaction, add PBS to a total of 100 uL, add 5 uL 0.8 U/uL ProtK (20 mg/mL; e.g. NEB P8107S), mix and incubate at 65 C for 1 h. I'll then purify with any DNA purification kit (e.g. Zymo D4013), and run a gel/Bioanalyzer. You can also do the same to check ligation.
Best,
Tan
Hi @wuheng9, there's only a single library for each sample -- only size-selected (by beads) differently.
You can play with different combinations; I found just a single size (0.65X for META, 0.7X for Nextera) to suffice. Basically, the lower the bead ratio, the higher the percentage of reads containing contacts (which should be proportional to the average insert size) but potentially the fewer contacts in total.
For analysis, I analyze everything together.
Best,
Tan
Hi Tan,
I have used the command "hickit -i impute.pairs.gz --out-png impute.png" to generate a 2D contact map in PNG. But I got a black background and no names for each chromosome. How should I change parameters to chart a figures like one that you published in Science.
I also met another issues when I tried to view 3D structure by using a command "hickit -gl -I /THL8/home/liubin/data/t.dip/SRR7226668.cpg.3dg.gz --view".
The issue is listed as following:
hickit: invalid option -- 'g'
hickit: invalid option -- 'l'
[M::main] Version: r291
[M::main] CMD: hickit -gl -I /THL8/home/liubin/data/t.dip/SRR7226668.cpg.3dg.gz --view
So how should I view the 3D structure. And how can I see some specific region of a genome and save a png file on somewhere.
Thank you very much! I am looking forward to your reply.
Best,
Qian Wang
Hi Qian,
(1)
I visualize contact maps with Juicebox:
java -Xmx2g -jar juicer_tools.jar pre -n contacts.pairs.gz contacts.hic mm10
The resulting file contacts.hic
can then be visualized through Juicebox's web browser. Note that you may need to lower the color limit (by clicking the "-" button) to get a desirable color scale.
(2)
To use hickit --view
, you'll need to compile hickit with make gl=1
(see this section of the README). A pre-compiled version is provided in the v0.1 release (the file hickit-gl
; note that -gl
is part of the filename, not an option); however, note that it has not been provided for v0.1.1.
(3)
For detailed visualization with PyMol, please see this section of the dip-c README. It covers basic 3D viewing. Before that, you'll need to first convert hickit .3dg
files to dip-c .3dg
files following dip-c's typical workflow.
To highlight a specific genomic region, you can use dip-c reg3 -i interest.reg 20k.1.clean.3dg > interest.3dg
first, with a region file interest.reg
like this one (for the paternal copy of Chr2:1-2Mb):
2 0 1000000 2000000
You can then generate an additional mmCIF for interest.3dg
, to add to the existing whole-genome 20k.1.clean.3dg
file in your PyMol.
Best,
Tan
Hi Tan,
When I converted the *.3dg file into a *.cif file, there was error.
dip-c color -n home/data/t.dip/hg19.chr.txt /home/data/t.dip/SRR7226668.out.3dg | dip-c vis -c /dev/stdin home/data/t.dip/SRR7226668.out.3dg > /home/data/t.dip/SRR7226668.cif
Traceback (most recent call last):
File "home/software/dip-c-master/dip-c", line 130, in
File "home/software/dip-c-master/dip-c", line 130, in
main()
File "home/software/dip-c-master/dip-c", line 84, in main
main()
return_value = vis.vis(sys.argv[1:])
File "home/software/dip-c-master/dip-c", line 75, in main
File "home/software/dip-c-master/vis.py", line 58, in vis
return_value = color.color(sys.argv[1:])
File "home/software/dip-c-master/color.py", line 149, in color
g3d_data = file_to_g3d_data(open(args[0], "rb"))
File "home/software/dip-c-master/classes.py", line 1427, in file_to_g3d_data
g3d_data = file_to_g3d_data(open(args[0], "rb"))
File "home/software/dip-c-master/classes.py", line 1427, in file_to_g3d_data
g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip()))
g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip()))
File "home/software/dip-c-master/classes.py", line 1214, in string_to_g3d_particle
File "home/software/dip-c-master/classes.py", line 1214, in string_to_g3d_particle
hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t")
ValueError: need more than 1 value to unpack
hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t")
ValueError: need more than 1 value to unpack
I am looking forward to your reply!
Best,
Qian
Hi @wuheng9, can you send me your .3dg
file? You can gzip it to reduce size.
Best,
Tan
Hi @wuheng9, I can't see any files. Perhaps it's easier to attach through the GitHub website rather than email?
Best,
Tan
Hi Tan,
I try to attach the file through Github website. But I do not see a link where I can send the file.
Best,
Qian
Hi @wuheng9, did you use Python 3? The dip-c repo only supports Python 2. With Python 2.7, I have no problem running that line on my computer.
Best,
Tan
Hi Tan,
I stated the pymol software. It shows "Detected 4 CPU cores. Enabled multithreaded rendering.
Error: The requested stereo 3D visualization mode is not available." Is that the problem that no color for each chromosome.
Best,
Qian
Hi @wuheng9, your mmCIF file looks great on my computer, with PyMol 2.3.4. I turned on the color by clicking C->Spectrum->Rainbow on the right. See screenshot: