nloyfer/wgbs_tools

wgbstools segment error

Closed this issue · 11 comments

I'm attempting to segment the (hg19) beta files available from here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186458

I'm getting the following error:

[ add_loci ] Failed! exception: [wt add_loci] line 0: endCpG < startCpG [wt segment] found 74,349 blocks (dropped 106,621 short blocks) Traceback (most recent call last): File "./wgbs_tools/wgbstools", line 96, in <module> main() File "./wgbs_tools/wgbstools", line 63, in main importlib.import_module(args.command).main() File "./wgbs_tools/src/python/segment.py", line 310, in main SegmentByChunks(args, betas).run() File "./wgbs_tools/src/python/segment.py", line 147, in run self.dump_result(df.reset_index(drop=True)) File "./wgbs_tools/src/python/segment.py", line 183, in dump_result add_bed_to_cpgs(temp_path, self.genome.genome, self.args.out_path) File "./wgbs_tools/src/python/convert.py", line 245, in add_bed_to_cpgs subprocess.check_call(cmd, shell=True) File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/subprocess.py", line 291, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'cat qfp0iepq | ./wgbs_tools/src/cpg2bed/add_loci ./wgbs_tools/references/hg19/CpG.bed.gz ./wgbs_tools/references/hg19/CpG.chrome.size > ./GSE186458/GSE186458.chr20.blocks.small.bed' returned non-zero exit status 1. Segmenting

Any thoughts on what the problem might be and how to fix it?

Thanks

D

If I attempt to use pre-generated blocks download from the same GSE repository then I get the following error:

Traceback (most recent call last): File "./wgbs_tools/wgbstools", line 96, in <module> main() File "./wgbs_tools/wgbstools", line 63, in main importlib.import_module(args.command).main() File "./wgbs_tools/src/python/beta_to_table.py", line 150, in main for chunk in chunks: File "./wgbs_tools/src/python/beta_to_table.py", line 133, in beta2table_generator yield get_table(subset_blocks, gf, min_cov, threads, verbose) File "./wgbs_tools/src/python/beta_to_table.py", line 68, in get_table is_nice, _ = is_block_file_nice(blocks_df) File "./wgbs_tools/src/python/beta_to_blocks.py", line 28, in is_block_file_nice if df[['startCpG', 'endCpG']].isna().values.sum() > 0: File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__ return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'isna'

So I'm not having much luck, unfortunately.

Hi, can you post the exact command you used? And some example head command output of your input files?
A

Here is the command I've used for segmenting:

${WGBSTOOLS} segment --threads ${NSLOTS} --genome hg19 --betas ${IN}/*.beta --min_cpg 4 --max_bp 5000 -r chr${SGE_TASK_ID} -o ${PREFIX}.chr${SGE_TASK_ID}.blocks.small.bed

And here is an example of some chr20 data from from one of the files:

${WGBSTOOLS} view GSM5652211_Lung-Bronchus-Smooth-Muscle-Z00000421.beta | grep "^chr20" | head chr20 60178 60180 3 33 chr20 60425 60427 28 34 chr20 60431 60433 26 35 chr20 60550 60552 7 24 chr20 60577 60579 3 25 chr20 60721 60723 7 30 chr20 60806 60808 24 31 chr20 60962 60964 28 49 chr20 61069 61071 23 44 chr20 61181 61183 27 35

Edit:

For some reason github isn't liking pasting from linux. The output should be showing as a standard beta file format, but appears to be lacking carriage returns once pasted.

${WGBSTOOLS} view GSM5652211_Lung-Bronchus-Smooth-Muscle-Z00000421.beta | grep "^chr20" | head
chr20 60178 60180 3 33
chr20 60425 60427 28 34
chr20 60431 60433 26 35
chr20 60550 60552 7 24
chr20 60577 60579 3 25
chr20 60721 60723 7 30
chr20 60806 60808 24 31
chr20 60962 60964 28 49
chr20 61069 61071 23 44
chr20 61181 61183 27 35

Just to add to that, I've tried this with just one input beta file and the error is the same. If I use awk to output any lines where field 3 is less than field 2, then no lines are reported. This indicates that none of the lines in the beta file has an end position less than the start position.

${WGBSTOOLS} view GSM5652176_Adipocytes-Z000000T7.beta | awk '{if($3<$2) print $0;}' -

I'm not reproducing the error on my side with some random chromosomes i tried. Which chromosome? chromosome 20? all of them? I downloaded the data from the link you posted and set up a clean wgbstools and ran and did not reproduce. My best guess is that our reference/annotation data is different but I will try to see if I can find other causes.

I've just tried following the tutorial and run into the same problem at the point of segmenting:

${WGBSTOOLS} segment --betas *beta --min_cpg 3 --max_bp 2000 -r $region -o blocks.small.bed
[wt segment] found 9 blocks
(dropped 9 short blocks)
[ add_loci ] Failed! exception:
[wt add_loci] line 0: endCpG < startCpG
Traceback (most recent call last):
File "./wgbs_tools/wgbstools", line 96, in <module>
main()
File "./wgbs_tools/wgbstools", line 63, in main
importlib.import_module(args.command).main()
File "./wgbs_tools/src/python/segment.py", line 310, in main
SegmentByChunks(args, betas).run()
File "./wgbs_tools/src/python/segment.py", line 147, in run
self.dump_result(df.reset_index(drop=True))
File "./wgbs_tools/src/python/segment.py", line 183, in dump_result
add_bed_to_cpgs(temp_path, self.genome.genome, self.args.out_path)
File "./wgbs_tools/src/python/convert.py", line 245, in add_bed_to_cpgs
subprocess.check_call(cmd, shell=True)
File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cat sp68njbk | ./wgbs_tools/src/cpg2bed/add_loci ./wgbs_tools/references/hg19/CpG.bed.gz ./wgbs_tools/references/hg19/CpG.chrome.size > blocks.small.bed' returned non-zero exit status 1.

So I'm assuming its a problem with my software environment. Python version 3.6.3 is loaded, along with samtools 1.13 and BEDtools2.30.0. There were no warnings when installing wgbstools:

$ python setup.py
Compiling stdin2beta...
SUCCESS
Compiling stdin2pairs...
SUCCESS
Compiling pat_sampler...
SUCCESS
Compiling patter...
SUCCESS
Compiling bp_patter...
SUCCESS
Compiling snp_patter...
SUCCESS
Compiling match_maker...
SUCCESS
Compiling segmentor...
SUCCESS
Compiling cview...
SUCCESS
Compiling homog...
SUCCESS
Compiling add_cpg_counts...
SUCCESS
Compiling add_loci...
SUCCESS

Which pandas version are you using?

I'm using the default init_genome command:

$ ./wgbstools init_genome hg19
[wt init] Setting up genome reference files in ./wgbs_tools/references/hg19
[wt init] No reference FASTA provided. Attempting to download from
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 904M 100 904M 0 0 23.8M 0 0:00:38 0:00:38 --:--:-- 25.6M
[wt init] successfully downloaded FASTA. Now gunzip and bgzip it...
[wt init] Indexing ./wgbs_tools/references/hg19/hg19.fa.gz
[wt init] Generated index file: ./wgbs_tools/references/hg19/hg19.fa.gz.fai
[wt init] Processing chromosomes...
[wt init] chromosome: chr1
[wt init] chromosome: chr2
[wt init] chromosome: chr3
[wt init] chromosome: chr4
[wt init] chromosome: chr5
[wt init] chromosome: chr7
[wt init] chromosome: chr6
[wt init] chromosome: chr8
[wt init] chromosome: chr9
[wt init] chromosome: chr10
[wt init] chromosome: chr11
[wt init] chromosome: chr12
[wt init] chromosome: chr13
[wt init] chromosome: chr14
[wt init] chromosome: chr15
[wt init] chromosome: chr16
[wt init] chromosome: chr17
[wt init] chromosome: chr18
[wt init] chromosome: chr19
[wt init] chromosome: chr20
[wt init] chromosome: chr21
[wt init] chromosome: chr22
[wt init] chromosome: chrX
[wt init] chromosome: chrY
[wt init] chromosome: chrM
[wt init] Building CpG-Index dictionary...
[wt init] bgzip and index...
[wt init] Finished initialization of genome hg19

According to pip list I'm using pandas (0.20.3).

I suspect the pandas version is the issue. I have version '1.3.5'. We will try to check what is the minimum version required, but I suspect this is the issue.

I've just installed pandas 1.5.1 and the tutorial segment worked fine, so it looks like that was the problem.

Thanks!

Excellent. Thank you for raising this issue.
I updated the README dependencies section to state pandas version >1 is required