nf-core/hlatyping

Cannot run HLATyping with user-specific BAM file

NTNguyen13 opened this issue · 25 comments

Hi, I have installed nf-core/hlatyping with Nextflow, and I can run the test profile:
nextflow run nf-core/hlatyping -profile docker,test --outdir $PWD/results
However, I want to use my own BAM file as input, so I used this command (adapted from #70):

./nextflow run nf-core/hlatyping -profile docker --bam '/home/thanh/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam' --genome GRCh38 -c igenomes.config --outdir /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/

However I got this error:

N E X T F L O W  ~  version 20.01.0
Launching `nf-core/hlatyping` [tender_tuckerman] - revision: bf5d0c2d46 [master]
Unable to parse config file: '/home/thanh/igenomes.config'

  Compile failed for sources FixedSetSources[name='/groovy/script/Script4943CF089126BD872646A55E3C8F3819/_nf_config_c9b0ddec']. Cause: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
  /groovy/script/Script4943CF089126BD872646A55E3C8F3819/_nf_config_c9b0ddec: 7: unexpected token: < @ line 7, column 1.
     <!DOCTYPE html>
     ^

  1 error

Any help is appreciated
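
A note on the parse error above: the `<!DOCTYPE html>` reported at line 7 suggests that igenomes.config contains an HTML page rather than Nextflow configuration, which can happen when a config file is saved from a rendered web page instead of its raw version. That cause is an assumption and is not confirmed in this thread. A minimal check, using a hypothetical helper `looks_like_html`:

```shell
# Hypothetical helper: succeed if one of the first 20 lines of a file looks like HTML.
looks_like_html() {
  awk 'NR <= 20 && tolower($0) ~ /<!doctype html|<html/ { found = 1 }
       END { exit !found }' "$1"
}

# Default file name taken from the command in this thread; pass another path as $1.
config="${1:-igenomes.config}"
if [ -f "$config" ] && looks_like_html "$config"; then
  echo "$config looks like an HTML page, not a Nextflow config" >&2
fi
```

If the check fires, downloading the raw text of the config file again is probably the simplest fix.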

Hi @NTNguyen13,

Did you try running it without --genome GRCh38 and without -c igenomes.config?

Yes, when I ran it without those two parameters, I got this output:


N E X T F L O W  ~  version 20.01.0
Launching `nf-core/hlatyping` [spontaneous_stone] - revision: bf5d0c2d46 [master]
WARN: The access of `config` object is deprecated
WARN: Access to undefined parameter `genome` -- Initialise it to a default value eg. `params.genome = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
BAM file format detected. Initiate remapping to HLA alleles with yara mapper.
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/hlatyping v1.1.5
----------------------------------------------------
Cannot find any bam file matching: data/test*{1,2}.fq.gz
NB: Path needsto be enclosed in quotes!
Pipeline Release  : master
Run Name          : spontaneous_stone
File Type         : BAM
Seq Type          : dna
Index Location    : /home/thanh/.nextflow/assets/nf-core/hlatyping/data/indices/yara/hla_reference_dna
IP solver         : glpk
Enumerations      : 1
Beta              : 0.009
Prefix            : hla_run
Max Memory        : 128 GB
Max CPUs          : 16
Max Time          : 10d
Output dir        : /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/
Working dir       : /home/thanh/work
Reads             : data/test*{1,2}.fq.gz
Fasta Ref         : null
Max Resources     : 128 GB memory, 16 cpus, 10d time per job
Container         : docker - nfcore/hlatyping:1.1.5
Launch dir        : /home/thanh
Script dir        : /home/thanh/.nextflow/assets/nf-core/hlatyping
User              : thanh
Config Profile    : docker
----------------------------------------------------

Could you post the command you used this time? The lines at the beginning are just warnings because of the undefined parameters.

./nextflow run nf-core/hlatyping -profile docker --bam '/home/thanh/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam' --outdir /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/

This is the command I used; I only removed those two parameters.

Hi @NTNguyen13, thanks for opening the issue!

Could you post the full output from the second time you ran it? What you posted does not seem to be complete: it shows no error, only warnings.

Hi @ggabernet, here it is:

Command:
./nextflow run nf-core/hlatyping -profile docker --bam "/home/thanh/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam" --outdir /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/

Output:

N E X T F L O W  ~  version 20.01.0
Launching `nf-core/hlatyping` [nostalgic_bardeen] - revision: bf5d0c2d46 [master]
WARN: The access of `config` object is deprecated
WARN: Access to undefined parameter `genome` -- Initialise it to a default value eg. `params.genome = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
BAM file format detected. Initiate remapping to HLA alleles with yara mapper.
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/hlatyping v1.1.5
----------------------------------------------------
Cannot find any bam file matching: data/test*{1,2}.fq.gz
NB: Path needsto be enclosed in quotes!
Pipeline Release  : master
Run Name          : nostalgic_bardeen
File Type         : BAM
Seq Type          : dna
Index Location    : /home/thanh/.nextflow/assets/nf-core/hlatyping/data/indices/yara/hla_reference_dna
IP solver         : glpk
Enumerations      : 1
Beta              : 0.009
Prefix            : hla_run
Max Memory        : 128 GB
Max CPUs          : 16
Max Time          : 10d
Output dir        : /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/
Working dir       : /home/thanh/work
Reads             : data/test*{1,2}.fq.gz
Fasta Ref         : null
Max Resources     : 128 GB memory, 16 cpus, 10d time per job
Container         : docker - nfcore/hlatyping:1.1.5
Launch dir        : /home/thanh
Script dir        : /home/thanh/.nextflow/assets/nf-core/hlatyping
User              : thanh
Config Profile    : docker
----------------------------------------------------

I have checked the output folder. It contains only a folder named pipeline_info with a single file, execution_trace.txt, whose content is just the header row:
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar

Sorry @NTNguyen13, I just realised that the parameters are not used correctly.

--bam is a boolean parameter: it only tells the pipeline that the input is in BAM format. Please specify the BAM file itself using the reads parameter (--readPaths) and add --bam in addition. If you have single-end data, you will also need --singleEnd.

Hi, I have changed my command to:

./nextflow run nf-core/hlatyping -profile docker --readPaths "/home/thanh/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam" --bam --outdir /home/thanh/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/

This time I got the output:

BAM file format detected. Initiate remapping to HLA alleles with yara mapper.
[f3/a2888f] process > remap_to_hla          [100%] 1 of 1, failed: 1
[23/ade82d] process > make_ot_config        [100%] 1 of 1, failed: 1
[-        ] process > run_optitype          -
[6a/22117f] process > output_documentation  [100%] 1 of 1, failed: 1
[ed/ede44f] process > get_software_versions [100%] 1 of 1, failed: 1
[-        ] process > multiqc               -
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'output_documentation (1)'

Caused by:
  Process requirement exceed available memory -- req: 8 GB; avail: 7.6 GB

Command executed:

  markdown_to_html.r output.md results_description.html

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /home/thanh/work/6a/22117f7452ad97fbd0784b24a3d962

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Content of execution_trace.txt:

task_id	hash	native_id	name	status	exit	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
3	6a/22117f	-	output_documentation (1)	FAILED	-	-	-	-	-	-	-	-	-
2	23/ade82d	-	make_ot_config	FAILED	-	-	-	-	-	-	-	-	-
1	f3/a2888f	-	remap_to_hla (1)	FAILED	-	-	-	-	-	-	-	-	-
4	ed/ede44f	-	get_software_versions	FAILED	-	-	-	-	-	-	-	-	-

Caused by:
Process requirement exceed available memory -- req: 8 GB; avail: 7.6 GB

As stated here, the machine unfortunately does not have enough memory: the process requests 8 GB, but only 7.6 GB is available.
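
The failing check compares the memory a process requests (8 GB) with what the machine offers (7.6 GB). A rough way to see this in advance on Linux is sketched below with a hypothetical helper `mem_sufficient`. Reading `MemTotal` from /proc/meminfo is an assumption about how to approximate the figure Nextflow reports, not a description of its internals:

```shell
# Hypothetical helper: succeed if AVAILABLE_KB covers REQUIRED_GB (1 GB = 1024 * 1024 kB).
mem_sufficient() {
  avail_kb=$1
  req_gb=$2
  [ "$avail_kb" -ge $(( req_gb * 1024 * 1024 )) ]
}

# Linux only: read the total physical memory in kB and compare it with 8 GB.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  if mem_sufficient "$total_kb" 8; then
    echo "at least 8 GB of memory present"
  else
    echo "less than 8 GB of memory present" >&2
  fi
fi
```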

Hi, I switched to another computer with sufficient memory and ran the same command. This time it gives the following error:


[1e/66f512] process > remap_to_hla          [100%] 1 of 1, failed: 1 ✘
[6a/7e87e3] process > make_ot_config        [100%] 1 of 1 ✔
[-        ] process > run_optitype          -
[4f/c6d245] process > output_documentation  [100%] 1 of 1 ✔
[df/4fccff] process > get_software_versions [100%] 1 of 1 ✔
[35/fc5a21] process > multiqc               [100%] 1 of 1 ✔
Execution cancelled -- Finishing pending tasks before exit
[nf-core/hlatyping] Pipeline completed with errors
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Error executing process > 'remap_to_hla (1)'

Caused by:
  Process `remap_to_hla (1)` terminated with an error exit status (1)

Command executed:

  samtools view -@ 1 -h -f 0x40 h > output_1.bam
  samtools view -@ 1 -h -f 0x80 h > output_2.bam
  samtools bam2fq output_1.bam > output_1.fastq
  samtools bam2fq output_2.bam > output_2.fastq
  yara_mapper -e 3 -t 1 -f bam /home/lucis/.nextflow/assets/nf-core/hlatyping/data/indices/yara/hla_reference_dna output_1.fastq output_2.fastq > output.bam
  samtools view -@ 1 -h -F 4 -f 0x40 -b1 output.bam > mapped_1.bam
  samtools view -@ 1 -h -F 4 -f 0x80 -b1 output.bam > mapped_2.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [E::hts_open_format] Failed to open file h
  samtools view: failed to open "h" for reading: No such file or directory

Work dir:
  /home/lucis/work/1e/66f512fe1ea492f6c081bf0b54b289

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I have checked the path by using

file=/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam

/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam: gzip compressed data, extra field

What has gone wrong this time?

Just to be sure: does your BAM file originate from paired-end data?

Yes, the BAM file is paired-end; it was preprocessed by marking duplicates and sorting.
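
Paired-end input matters here because the remapping step splits the BAM file by SAM FLAG bits: `-f 0x40` keeps reads marked as first in pair and `-f 0x80` keeps reads marked as second in pair, as in the `samtools view` calls shown in the log above. A small sketch of that bit test follows; `mate_of` is a hypothetical helper, and the FLAG values 99 and 147 are just common examples of a properly paired read and its mate:

```shell
# Hypothetical helper: classify a SAM FLAG value by its mate bits.
mate_of() {
  flag=$1
  if [ $(( flag & 0x40 )) -ne 0 ]; then
    echo first      # bit 0x40: first read in the pair
  elif [ $(( flag & 0x80 )) -ne 0 ]; then
    echo second     # bit 0x80: second read in the pair
  else
    echo neither    # e.g. single-end reads, which neither -f 0x40 nor -f 0x80 selects
  fi
}

mate_of 99    # 99  = 0x1 + 0x2 + 0x20 + 0x40
mate_of 147   # 147 = 0x1 + 0x2 + 0x10 + 0x80
```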

I'm currently looking into the last error. Somehow the bam file name ($bams) here

samtools view -@ ${task.cpus} -h -f 0x40 $bams > output_1.bam
samtools view -@ ${task.cpus} -h -f 0x80 $bams > output_2.bam

gets substituted by the letter "h" in your case:

samtools view -@ 1 -h -f 0x40 h > output_1.bam
samtools view -@ 1 -h -f 0x80 h > output_2.bam

Are there maybe some odd characters in the command you used?

I didn't set up Docker on the new computer, so I used conda instead. The full command is:

./nextflow run nf-core/hlatyping -profile conda --readPaths "/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam" --bam --outdir /home/lucis/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/

Could you please try it with --reads '/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam' instead of --readPaths "/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam"?
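
One possible explanation for the stray "h", offered as an assumption rather than something confirmed in this thread: if the pipeline expects --readPaths to be a list and indexes into it, a plain string would be indexed character by character, and the second character of the path is "h". The sketch below only illustrates that string-indexing idea; `second_char` is a hypothetical helper, not pipeline code:

```shell
# Hypothetical helper: print the second character of a string.
second_char() {
  printf '%s' "$1" | cut -c2
}

second_char "/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam"
# prints: h
```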

Hi, sorry for not being able to get back sooner.

I replaced it with --reads, and it seems to run smoothly until it hits the run time limit:
./nextflow run nf-core/hlatyping -profile conda --reads "/home/lucis/IGSR_Project/1000GVN_aln/VN_01_00_0089_01_01.bam" --bam --outdir /home/lucis/IGSR_Project/1000GVN_result/VN_01_00_0089_01_01/hla/


[nf-core/hlatyping] Pipeline completed with errors
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Error executing process > 'remap_to_hla (1)'

Caused by:
  Process exceeded running time limit (2h)

Command executed:

  samtools view -@ 1 -h -f 0x40 VN_01_00_0089_01_01.bam > output_1.bam
  samtools view -@ 1 -h -f 0x80 VN_01_00_0089_01_01.bam > output_2.bam
  samtools bam2fq output_1.bam > output_1.fastq
  samtools bam2fq output_2.bam > output_2.fastq
  yara_mapper -e 3 -t 1 -f bam /home/lucis/.nextflow/assets/nf-core/hlatyping/data/indices/yara/hla_reference_dna output_1.fastq output_2.fastq > output.bam
  samtools view -@ 1 -h -F 4 -f 0x40 -b1 output.bam > mapped_1.bam
  samtools view -@ 1 -h -F 4 -f 0x80 -b1 output.bam > mapped_2.bam

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /home/lucis/work/d9/0ace74a9560295002f8fa42d0ee14d

I checked htop and found that the samtools commands run quite slowly. Maybe my BAM file is too large? It's WGS at 30X.

At least it seems to run now ;). Could you please try the following:

  • Create a file whatevernameyoulike.config
  • Put the following content in the file
// Raise the run time limit for the remap_to_hla process only
process {
  withName: remap_to_hla {
    time = { 5.h }
  }
}
  • Run the previous command again using the following additional parameters:
    -c whatevernameyoulike.config -resume

Hi, it's me again.
I increased the limit to 5h, but the process still did not finish in time.

Error executing process > 'remap_to_hla (1)'

Caused by:
  Process exceeded running time limit (5h)

Command executed:

  samtools view -@ 1 -h -f 0x40 VN_01_00_0089_01_01.bam > output_1.bam
  samtools view -@ 1 -h -f 0x80 VN_01_00_0089_01_01.bam > output_2.bam
  samtools bam2fq output_1.bam > output_1.fastq
  samtools bam2fq output_2.bam > output_2.fastq
  yara_mapper -e 3 -t 1 -f bam /home/lucis/.nextflow/assets/nf-core/hlatyping/data/indices/yara/hla_reference_dna output_1.fastq output_2.fastq > output.bam
  samtools view -@ 1 -h -F 4 -f 0x40 -b1 output.bam > mapped_1.bam
  samtools view -@ 1 -h -F 4 -f 0x80 -b1 output.bam > mapped_2.bam

Could this be solved by providing the FASTQ files instead? If I have paired-end reads from multiple lanes, how should I organize the input?

Hi, if you want to use FASTQ data, please specify it as follows:

--reads

Use this to specify the location of your input FastQ files. For example:

--reads 'path/to/data/sample_*_{1,2}.fastq'
Please note the following requirements:

- The path must be enclosed in quotes
- The path must have at least one * wildcard character
- When using the pipeline with paired end data, the path must use {1,2} notation to specify read pairs.
- If left unspecified, a default pattern is used: data/*{1,2}.fastq.gz 

The pattern has to match your paired-end files.

VN_01_00_0089_01_01_S2_L004_R2_001.fastq
VN_01_00_0089_01_01_S2_L004_R2_001.fastq
VN_01_00_0089_01_01_S2_L004_R2_001.fastq

Did you post this accidentally? :)

In case I have reads from 3 lanes, for example:

path/to/data/sample_L1_R1.fastq
path/to/data/sample_L1_R2.fastq
path/to/data/sample_L2_R1.fastq
path/to/data/sample_L2_R2.fastq
path/to/data/sample_L3_R1.fastq
path/to/data/sample_L3_R2.fastq

then I should input them as
--reads 'path/to/data/sample_L{1,2,3}_{1,2}.fastq'

am I right?

P.S.: Yes, I have deleted it.

Reads from multiple lanes are not combined automatically. You can either run each lane separately or merge the lanes and provide the merged files as input.
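
Merging lanes can be done by concatenating the per-lane FASTQ files for each read direction, keeping the lane order identical for R1 and R2 so that mates stay in sync. A minimal sketch, assuming the file layout from the example above; `merge_lanes` and the `sample_merged_{1,2}.fastq` output names are hypothetical, chosen so that a pattern such as 'path/to/data/sample_*_{1,2}.fastq' matches the merged files but not the per-lane ones:

```shell
# Hypothetical helper: merge_lanes DIR SAMPLE
# Concatenates DIR/SAMPLE_L*_R1.fastq into DIR/SAMPLE_merged_1.fastq
# and DIR/SAMPLE_L*_R2.fastq into DIR/SAMPLE_merged_2.fastq.
merge_lanes() {
  dir=$1
  sample=$2
  for mate in 1 2; do
    # The glob expands in sorted order, so R1 and R2 use the same lane order.
    cat "$dir/${sample}"_L*_R"${mate}".fastq > "$dir/${sample}_merged_${mate}.fastq"
  done
}

# Usage (paths are placeholders):
#   merge_lanes path/to/data sample
#   nextflow run nf-core/hlatyping -profile docker \
#     --reads 'path/to/data/sample_*_{1,2}.fastq' --outdir results
```

Plain concatenation is enough here because FASTQ records are independent of each other; only the relative order of R1 and R2 records has to agree.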

Okay, I will try using the FASTQ files.

Hi, thank you for your support. I have tried both ways, using BAM and FASTQ files:

  • Using the BAM file takes a very long time. The remap_to_hla process took around 16 hours on my sample, followed by 6 more hours in run_optitype. It also creates many intermediate files: from a single 50 GB BAM file, it produces 4-5 intermediate files that take up nearly 1 TB of storage in total.
  • Using the FASTQ files takes only 3 hours in total and also uses less storage.

I'm curious why using BAM files has such drawbacks.

Hi, so did it run through in the end?

Thanks for your feedback. I will take a look at the code again to see if something is wrong there.