kishwarshafin/helen

margin docker run fail

lstxmu opened this issue · 27 comments

Hi,
I ran the MarginPolish process (Docker version) and it failed.
root@ecs-9875:/media/datarun/blnanodata/data# tail marginPolish.log
/usr/bin/time -f '\nDEBUG_MAX_MEM:%M\nDEBUG_RUNTIME:%E\n' /opt/MarginPolish/build/marginPolish reads_2_assembly.bam new.fasta allParams.np.human.guppy-ff-233.json -t 32 -o output/marginpolish_images -f

Running OpenMP with 32 threads.

Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:3836
DEBUG_RUNTIME:0:00.00

Can you help me fix it?

Hello @lstxmu

The model file you downloaded, allParams.np.human.guppy-ff-233.json, is corrupted. Please remove that file and download it this way:

wget https://raw.githubusercontent.com/UCSC-nanopore-cgl/MarginPolish/master/params/allParams.np.human.guppy-ff-235.json

This downloads the raw json file and makes sure you don't download html content.

Please run the same command with the newly downloaded model and it should work.
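A quick way to confirm that a downloaded parameter file is raw JSON rather than an HTML page is to try parsing it before starting a long run. The following is a minimal sketch, not part of MarginPolish; the helper name `check_params_file` is hypothetical, and it only tests that the file parses as JSON, not that MarginPolish will accept its contents.

```python
import json
import sys
from pathlib import Path


def check_params_file(path):
    """Return (ok, message) for a downloaded parameter file.

    Hypothetical helper: it only checks that the file is parseable JSON
    and does not look like an HTML page saved by mistake.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    head = text.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return False, "file looks like an HTML page, not raw JSON"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    return True, "file parses as JSON"


if __name__ == "__main__":
    # Usage: python check_params.py allParams.np.human.guppy-ff-235.json
    for p in sys.argv[1:]:
        ok, msg = check_params_file(p)
        print(f"{p}: {'OK' if ok else 'FAILED'} ({msg})")
```

If the check reports an HTML page, re-downloading from the raw.githubusercontent.com URL above is the fix suggested in this thread.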

Hello @kishwarshafin,
Thanks for your suggestion, it works. But about 2 hours later I got another error message:
root@ecs-25a7:/media/datarun/data# tail marginPolish.log
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139675447654144:
#000: ../../../src/H5D.c line 391 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139675447654144:
#000: ../../../src/H5G.c line 777 in H5Gclose(): not a group
major: Invalid arguments to routine
minor: Inappropriate type

Can you tell me what happened?

Hi @lstxmu,

We have seen and solved this error before here. If you can run the command with sudo, it should work.

Alternatively, please run the simple non-sudo walkthrough first and then adapt your own commands the same way; that should work too. The walkthrough uses E. coli data and should take about 15-20 minutes on a 40-CPU machine.

Hi kishwarshafin,
Thanks for your advice, the issue is fixed.
But I found a new issue: the quality of the Shasta assembly (after polishing with MarginPolish and HELEN) is worse than that of the wtdbg2 assembly. I evaluated the quality with BUSCO:
<shasta+marginpolish+helen> C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915
<wtdbg2+minimap2+pilon> C:94.8%[S:93.7%,D:1.1%],F:3.3%,M:1.9%,n:4915

Have you seen the same issue?
Looking forward to your reply.

Best,
Luo

Hi Luo,

What sample/species is this? Which guppy version are you running?

Can you run BUSCO on the unpolished Shasta assembly to see how it performs? Also, by wtdbg2+minimap2, did you mean racon? I believe minimap2 can't polish.

Hi kishwarshafin,
1. The species I assembled is a bird.
2. I don't know the guppy version; I installed the software following your GitHub instructions.
3. I did not run BUSCO on the unpolished assembly. I will run it and report the result in a few hours.
4. I used pilon to polish the wtdbg2 result.
5. I used chicken as the reference species for Augustus (run_BUSCO.py -i Assembly.fasta -c 144 -l /media/database/ncbidb/busco/aves_odb9 -m genome --out shastaraw -t /media/datarun3/temp/ -sp chicken)

I was wondering about the sequencing protocol: which basecaller version was used to basecall the raw reads, and are all the data from ONT? A comparison of the raw/unpolished Shasta and wtdbg2 assemblies would also help answer these questions.

Hi kishwarshafin,
1. I got the sequencing data from Novogene in China, and they used guppy.
2. BUSCO results:
shasta: C:35.6%[S:35.4%,D:0.2%],F:6.2%,M:58.2%,n:4915
shasta+marginpolish: C:61.2%[S:60.5%,D:0.7%],F:13.0%,M:25.8%,n:4915
shasta+marginpolish+helen: C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915
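For reference, the one-line BUSCO summaries quoted in this thread can be parsed and lined up side by side. This is a minimal sketch that assumes the short-summary notation shown above (C/S/D/F/M percentages followed by n); the function name `parse_busco` is hypothetical, and the input strings are copied from the posts above.

```python
import re

# Pattern for the BUSCO one-line summary notation used in this thread.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a BUSCO one-line summary into a dict of percentages plus n."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


# Summary strings as reported in this thread.
results = {
    "shasta": "C:35.6%[S:35.4%,D:0.2%],F:6.2%,M:58.2%,n:4915",
    "shasta+marginpolish": "C:61.2%[S:60.5%,D:0.7%],F:13.0%,M:25.8%,n:4915",
    "shasta+marginpolish+helen": "C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915",
    "wtdbg2+minimap2+pilon": "C:94.8%[S:93.7%,D:1.1%],F:3.3%,M:1.9%,n:4915",
}

for name, line in results.items():
    r = parse_busco(line)
    print(f"{name:28s} complete={r['C']:5.1f}%  "
          f"fragmented={r['F']:5.1f}%  missing={r['M']:5.1f}%")
```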

Hello,

Do you have the raw wtdbg2 BUSCO numbers? You can also polish the wtdbg2 assembly with MarginPolish (MP) and HELEN to see some improvement.

I think the issue could be the average read length. Do you happen to know the read N50 or have a plot of the read length distribution?

Hi kishwarshafin,
1. The raw wtdbg2 BUSCO result: C:94.8%[S:93.9%,D:0.9%],F:3.3%,M:1.9%,n:4915
2. Good suggestion, I will try it.
3. The read length statistics are as follows:
General summary:
Active channels: 2,678.0
Mean read length: 20,802.0
Mean read quality: 7.9
Median read length: 20,648.0
Median read quality: 8.6
Number of reads: 1,719,938.0
Read length N50: 27,841.0
Total bases: 35,778,191,178.0
Number, percentage and megabases of reads above quality cutoffs

Q5: 1460902 (84.9%) 33972.4Mb
Q7: 1243230 (72.3%) 29449.9Mb
Q10: 249370 (14.5%) 5954.4Mb
Q12: 179 (0.0%) 0.8Mb
Q15: 0 (0.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 13.7 (212)
2: 13.4 (290)
3: 13.4 (300)
4: 13.4 (260)
5: 13.3 (357)
Top 5 longest reads and their mean basecall quality score
1: 1292137 (4.3)
2: 513973 (4.2)
3: 438576 (4.1)
4: 346715 (4.0)
5: 257072 (3.0)
[Attached plots: LengthvsQualityScatterPlot_dot, Weighted_LogTransformed_HistogramReadlength]
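As a reminder of what the "Read length N50" figure in the summary above means: it is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that calculation follows; the function name `read_n50` and the toy read lengths are illustrative only and are not taken from this dataset.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are sorted from longest to shortest and accumulated until the
    running total reaches half of all bases; the length of the read that
    crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up read lengths (not from this dataset).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50 of toy reads:", read_n50(toy_lengths))
```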

Hi Luo,

We had a brief discussion in our group about your findings. We want to debug this issue with your help so you can get a proper answer.

We think this might be a coverage issue. As Shasta has a strict cutoff, you may lose coverage if your reads are on the shorter side. You can get coverage information from one of these files in the assembly directory: AssemblySummary.html, ReadLengthHistogram.csv, Binned-ReadLengthHistogram.csv, and also from log output (stdout).

Is there any way you can share these files with us so we can further debug and help you with this issue?

Hi kishwarshafin,
Thanks for your reply.
I will check whether the coverage information files are still on the server; if not, I will rerun the assembly. Please give me some time. Once I have the files I will share them with you as soon as possible.
Best,
Luo

Luo, thank you and take your time. Also, if you get time please run MP+HELEN on the wtdbg2 assembly to make sure it’s not a polishing issue we are seeing here. Thanks a ton for reporting on this.

Hi kishwarshafin,
I checked the raw assembly directory and found the report files you asked for (except the log output). I have compressed them into shastareport.zip.
If you need the log file as well, I will need some time to rerun the assembly.
Best,
Luo
shastareport.zip

Hi Luo,

I'm copying over a comment regarding your run of Shasta. Please let us know if we can help in any way:

There is no coverage issue as Shasta is seeing 76 Gb of coverage and this genome is a bit above 1 Gb, so we are around 70x coverage. I suggest that they run Quast to obtain an estimate of sequence quality over the entire genome. And they should do a comparison of pre-polished quality, otherwise, it is impossible to tell if the accuracy issue is due to the assembly or the polishing. The pre-polished analysis should not use Busco as we know that pre-polished accuracy, for all assemblers, is generally not sufficient to make the Busco analysis meaningful.
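The coverage estimate in the comment above is simply total bases divided by genome size. A minimal sketch of that arithmetic follows; the 76 Gb figure is from the comment, while the 1.1 Gb genome size is an assumed placeholder for "a bit above 1 Gb" and not a measured value.

```python
def estimated_coverage(total_bases, genome_size):
    """Approximate depth of coverage: sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GB = 1e9
shasta_bases = 76 * GB   # bases Shasta reports seeing, per the comment above
genome_size = 1.1 * GB   # ASSUMPTION: stand-in for "a bit above 1 Gb"

coverage = estimated_coverage(shasta_bases, genome_size)
print(f"estimated coverage: {coverage:.0f}x")
```

With a different genome size estimate the result shifts accordingly, so the number should be read as approximate, in line with the "around 70x" stated in the comment.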

I'd greatly appreciate it if you could polish the wtdbg2 assembly with MP+HELEN and share the results; that would clarify whether anything is wrong with the polishing pipeline.

Hi kishwarshafin,
I have just finished polishing the wtdbg2 results with MP+HELEN. The BUSCO run on the polished assembly is in progress. I will post the values as soon as it is done. Please give me some time.
Best,
Luo

Hi kishwarshafin,
The BUSCO results for wtdbg2 polished with MP+HELEN are as follows:
MP: C:63.8%[S:63.3%,D:0.5%],F:12.9%,M:23.3%,n:4915
helen: C:71.4%[S:70.6%,D:0.8%],F:12.1%,M:16.5%,n:4915

@lstxmu ,

This is very surprising. BUSCO analysis, in this case, seems to be very specific. I'm not exactly sure if it truly represents the sequence quality though. I'd suggest running Quast to get a better idea of what exactly the sequence qualities look like.

Hi kishwarshafin,
Thanks for your advice.
I'll run Quast to compare these assemblies. This will take a few days; I will share the results with you once it is done.

Thanks @lstxmu , will wait until you are done with all the analysis.

Hi kishwarshafin,
I have run Quast on wtdbg2, wtdbg2+margin, and wtdbg2+margin+helen. See the attached zip file.
quastreport.zip

Thanks, @lstxmu ,
After seeing the results, I wonder whether the reference you are using is suitable for the species you sequenced. The genome fraction is only 0.33. If everything was sequenced and assembled correctly, you should see a very high match between the reference and the assembly. If this is a non-model organism, I think you don't have the right reference. I can't say for certain, though, as I don't know which reference you are using or which species you sequenced.

Hi @kishwarshafin,
The reference genome was Gallus gallus v6.0 (latest version, https://www.ncbi.nlm.nih.gov/assembly/GCF_000002315.6). To be honest, Gallus gallus is not a close relative of my sample (Egretta garzetta), but most avian genome assemblies are not good enough, and Gallus gallus is the model organism in avian studies (and the most intensively studied species).

Hi @lstxmu ,

I am not an avian expert so really can't help you with this. But it looks like the tools are working fine. If you have any other issues with running the tools, please let us know. I'm closing this issue.

Thanks.

Hi @kishwarshafin,
Thanks for your help.
I still have a question: how many species have you tested with shasta+marginpolish+helen? Can you share some details with me?

As we reported in the paper, we tested extensively on the human genome. Beyond human, different labs have tried Shasta on fish and plant genomes and got satisfying results, most of which were reported on Twitter.