popitsch/nanopanel2

Error: file /var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fastq_pass/barcode01 was not found! Exiting...


Hello:
I don't quite understand how to write this configuration file.
I now have FAST5 files split by barcode:

❯ ls
barcode01  barcode15  barcode29  barcode43  barcode57  barcode71  barcode85
barcode02  barcode16  barcode30  barcode44  barcode58  barcode72  barcode86
barcode03  barcode17  barcode31  barcode45  barcode59  barcode73  barcode87
barcode04  barcode18  barcode32  barcode46  barcode60  barcode74  barcode88
barcode05  barcode19  barcode33  barcode47  barcode61  barcode75  barcode89
barcode06  barcode20  barcode34  barcode48  barcode62  barcode76  barcode90
barcode07  barcode21  barcode35  barcode49  barcode63  barcode77  barcode91
barcode08  barcode22  barcode36  barcode50  barcode64  barcode78  barcode92
barcode09  barcode23  barcode37  barcode51  barcode65  barcode79  barcode93
barcode10  barcode24  barcode38  barcode52  barcode66  barcode80  barcode94
barcode11  barcode25  barcode39  barcode53  barcode67  barcode81  barcode95
barcode12  barcode26  barcode40  barcode54  barcode68  barcode82  barcode96
barcode13  barcode27  barcode41  barcode55  barcode69  barcode83  unclassified
barcode14  barcode28  barcode42  barcode56  barcode70  barcode84

How can I configure the software so that it runs correctly?
Command line:

singularity run /home/guangzhoulab001/nanopanel2_1.01.sif call -c /home/guangzhoulab001/nanopanel2-1.01/config.json -o .

json file

{
        "dataset_name": "seegene02-20210722-1",                   # name of this dataset (will be used in the output file names/tables)
        "ref":          "/home/guangzhoulab001/GCF_000195955.2_ASM19595v2_genomic.fna",      # the amplicon reference sequence
        "fast5_dir":    "/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01",      # workspace output dir of guppy that contains basecalled FAST5 files 
        "fastq_dir":    "/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fastq_pass/barcode01",                # output dir of guppy that contains fastq.gz files; needed by porechop and can be omitted if no demultiplexing is configured 
        "basecall_grp": "Basecall_1D_001",              # the used basecall group identifier in the FAST5 files
        "demultiplex": {                                # This section is required only for multiplexed datasets.
                "BC01": "S01",                          # Maps the 1st barcode ('BC01') to a sample identified that will be used in the output files 
                "BC02": "S02",
                "BC03": "S03",
                "BC04": "S04",
                "BC05": "S05",
                "BC06": "S06",
                "BC07": "S07",
                "BC08": "S08"
                },
        "logfile": "nanopanel2.log",                    # name of the log file
        "consensus": "mean",                            # used for consensus calculation (only if multiple mappers are configured)
        "mappers": {                                    # configured long-read mappers. Supported types are 'minimap2', 'ngmlr' and 'last'. 
                "mm2" : {							
                        "type": "minimap2"
                        },
                "ngms": {
                        "type": "ngmlr",
                        "additional_param": [ "--no-smallinv", "--no-lowqualitysplit", "-k", "10", "--match", "3", "--mismatch", "-3", "--bin-size", "2", "--kmer-skip", "1" ] # additional runtime parameters for ngmlr
                        },
                "last": {
                        "type": "last"
                        }
                },
        "roi_intervals": ["chr:100-1000"],              # list of genomic intervals in which variant calling will be done 
        "truth_vcf": {                                  # only required if truth-set data is available. Links sample identifiers to truth set VCF files.
                "S01": "truth_vcf/S01.exp.vcf",
                "S02": "truth_vcf/S02.exp.vcf",
                "S03": "truth_vcf/S03.exp.vcf",
                "S04": "truth_vcf/S04.exp.vcf",
                "S05": "truth_vcf/S05.exp.vcf",
                "S06": "truth_vcf/S06.exp.vcf",
                "S07": "truth_vcf/S07.exp.vcf",
                "S08": "truth_vcf/S08.exp.vcf"
                },
        "threads":      8,                              # number of CPUs/threads used by np2 and 3rd part tools
        "suppress_snv": [],                             # list of filters; SNV calls filtered by those will not be included in the output VCF (but will still be in the output TSV file)
        "suppress_del": ["AF", "DP"],
        "suppress_ins": ["AF", "DP"],
        "max_h5_cache": 500,                            # maximum number of cached H5 files. Setting this to a number >= the number of input FAST5 will greatly speed up the pipeline (at the cost of memory) 
        "exe": {                                        # this section enables users to link to executables for 3rd party tools. Not needed when running via singularity. Supported sections: 'bgzip', 'samtools', 'porechop', 'minimap2', 'ngmlr', 'lastal', 'last-split', 'maf-convert')
                "ngmlr":    "singularity run $SOFTWARE/SIF/ngmlr_0.2.7.sif"  # in this example, ngmlr is called via an (external) singularity image
        }
}

Dear Trandamere
If your data is demultiplexed already then you would have to run np2 for each barcode individually. So you'd need a config file per barcode and configure the respective fast5 dir.

One way to automate this would be to write a "template" config file with name "config_bcXX.json.TEMPLATE":

{
        "dataset_name": "mydataset_bc@BC@",
        "fast5_dir":    "mydir/barcode@BC@/workspace/",
        [ add remaining config here ]
}

and then have a script that replaces @BC@ with the respective barcode (NOTE: this example covers only the first 9 barcodes):

#!/usr/bin/env bash
for bc in 01 02 03 04 05 06 07 08 09
do
        sed "s/@BC@/${bc}/g" config_bcXX.json.TEMPLATE > config_bc${bc}.json
done
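
For a run with all 96 barcodes, as in the directory listing above, the same substitution can be wrapped in a loop over a numeric range. This is only a sketch under assumptions: the function name `make_configs` and its arguments are hypothetical, and it expects a template that uses the `@BC@` placeholder as shown above.

```shell
#!/usr/bin/env bash
# Write one config per barcode by replacing @BC@ in a template.
# Usage: make_configs TEMPLATE [FIRST] [LAST]   (defaults: 1 and 96)
make_configs() {
    local template="$1"
    local i="${2:-1}"
    local last="${3:-96}"
    local bc
    while [ "$i" -le "$last" ]; do
        # Zero-pad to two digits so 7 becomes 07, matching barcode07.
        bc=$(printf '%02d' "$i")
        sed "s/@BC@/${bc}/g" "$template" > "config_bc${bc}.json"
        i=$((i + 1))
    done
}
```

Calling `make_configs config_bcXX.json.TEMPLATE 1 96` in the directory that holds the template writes `config_bc01.json` through `config_bc96.json` there.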

If you are on a SLURM cluster you can then submit, e.g., as array job:

#!/usr/bin/env bash
#SBATCH --job-name=np2
#SBATCH --output=n2_%j.out
#SBATCH --mem=64gb
#SBATCH --cpus-per-task=8
#SBATCH --array=0-8
set -e
# Bash arrays are zero-indexed, so array task IDs 0..8 select barcodes 01..09.
BC=('01' '02' '03' '04' '05' '06' '07' '08' '09')
bc=${BC[$SLURM_ARRAY_TASK_ID]}
singularity run mypath/nanopanel2_1.01.sif call -c config_bc${bc}.json -o .
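
As a variation on the array job, the barcode can also be derived directly from the task ID, which avoids the lookup array and extends to all 96 barcodes. This is a rough sketch, assuming per-barcode config files named `config_bcNN.json` and an array range of 1-96; the helper name `task_to_barcode` is hypothetical, and the `singularity` line is copied from the script above.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=np2
#SBATCH --array=1-96

# Map a 1-based array task ID to a zero-padded barcode, e.g. 7 -> 07.
task_to_barcode() {
    printf '%02d' "$1"
}

# Run the pipeline only when SLURM has set a task ID for this job.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    bc=$(task_to_barcode "$SLURM_ARRAY_TASK_ID")
    singularity run mypath/nanopanel2_1.01.sif call -c "config_bc${bc}.json" -o .
fi
```

The guard means the script does nothing when it is run outside an array job, which makes it safe to source for testing the helper.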

HTH, BW niko

I tried to test with part of the data, but it failed.
Command line:

❯ singularity run nanopanel2_1.01.sif call -c /home/guangzhoulab-001/nanopanel2-1.01/config.json -o .
INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/share/zoneinfo/Etc/UTC required more than 50 (64) bind mounts
Traceback (most recent call last):
  File "/nanopanel2/nanopanel2.py", line 2077, in <module>
    nanopanel2_pipeline(config, outdir)
  File "/nanopanel2/nanopanel2.py", line 2001, in nanopanel2_pipeline
    samples = extract_fastq(config, demux_index, outdir)
  File "/nanopanel2/nanopanel2.py", line 487, in extract_fastq
    for file in os.listdir(config['fast5_dir']):
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/'
INFO:    Cleaning up image...

~ took 6s 
❯ ls /var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/
FAQ61681_pass_barcode01_3142826f_0.fast5

json

{
        "dataset_name": "barcode01",                   # name of this dataset (will be used in the output file names/tables)
        "ref":          "/home/guangzhoulab-001/GCF_000195955.2_ASM19595v2_genomic.fna",      # the amplicon reference sequence
        "fast5_dir":    "/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/",      # workspace output dir of guppy that contains basecalled FAST5 files 
        "fastq_dir":    "/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fastq_pass/barcode01/",                # output dir of guppy that contains fastq.gz files; needed by porechop and can be omitted if no demultiplexing is configured 
        "basecall_grp": "Basecall_1D_001",              # the used basecall group identifier in the FAST5 files
        "logfile": "nanopanel2.log",                    # name of the log file
        "consensus": "mean",                            # used for consensus calculation (only if multiple mappers are configured)
        "mappers": {                                    # configured long-read mappers. Supported types are 'minimap2', 'ngmlr' and 'last'. 
                "mm2" : {							
                        "type": "minimap2"
                        },
                "ngms": {
                        "type": "ngmlr",
                        "additional_param": [ "--no-smallinv", "--no-lowqualitysplit", "-k", "10", "--match", "3", "--mismatch", "-3", "--bin-size", "2", "--kmer-skip", "1" ] # additional runtime parameters for ngmlr
                        },
                "last": {
                        "type": "last"
                        }
                },
        "roi_intervals": ["chr:100-1000"],              # list of genomic intervals in which variant calling will be done 
        "threads":      8,                              # number of CPUs/threads used by np2 and 3rd part tools
        "suppress_snv": [],                             # list of filters; SNV calls filtered by those will not be included in the output VCF (but will still be in the output TSV file)
        "suppress_del": ["AF", "DP"],
        "suppress_ins": ["AF", "DP"],
        "max_h5_cache": 500,                            # maximum number of cached H5 files. Setting this to a number >= the number of input FAST5 will greatly speed up the pipeline (at the cost of memory) 
        "exe": {                                        # this section enables users to link to executables for 3rd party tools. Not needed when running via singularity. Supported sections: 'bgzip', 'samtools', 'porechop', 'minimap2', 'ngmlr', 'lastal', 'last-split', 'maf-convert')
                "ngmlr":    "singularity run $SOFTWARE/SIF/ngmlr_0.2.7.sif"  # in this example, ngmlr is called via an (external) singularity image
        }
}

Hello:
The folder already exists:

❯ ls /var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/
FAQ61681_pass_barcode01_3142826f_0.fast5

Hi
In the config above (and in the error message from np2) you link to
/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/
i.e., with dashes in the 'seegene' directory names, whereas the directory you listed has none.
Maybe this is the problem?

I replaced the dashes with underscores, but the program still reported an error:

❯ singularity run nanopanel2_1.01.sif call -c /home/guangzhoulab-001/nanopanel2-1.01/config.json -o .                             
INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/share/zoneinfo/Etc/UTC required more than 50 (64) bind mounts
Traceback (most recent call last):
  File "/nanopanel2/nanopanel2.py", line 2077, in <module>
    nanopanel2_pipeline(config, outdir)
  File "/nanopanel2/nanopanel2.py", line 2001, in nanopanel2_pipeline
    samples = extract_fastq(config, demux_index, outdir)
  File "/nanopanel2/nanopanel2.py", line 487, in extract_fastq
    for file in os.listdir(config['fast5_dir']):
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/minknow/data/seegene02_20210722_1/seegene02_20210722_1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/'
INFO:    Cleaning up image...


Sorry, that's not what I meant.
Above, you show that the path '/var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/'
exists (containing one fast5 file).
But in your config file you reference the path
'/var/lib/minknow/data/seegene02-20210722-1/seegene02-20210722-1/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/'
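
A quick way to catch this kind of mismatch is to read the two directory entries back out of the config and test each one on the host. This is only a sketch: the function name `check_config_dirs` is hypothetical, it assumes each key sits on its own line in the `"key": "value"` layout shown in the configs above, and it checks the host filesystem only.

```shell
#!/usr/bin/env bash
# Report whether the directories named in an np2 config exist on the host.
check_config_dirs() {
    local config="$1"
    local status=0
    local key dir
    for key in fast5_dir fastq_dir; do
        # Extract the quoted value from a line of the form "key": "value".
        dir=$(sed -n "s/^[[:space:]]*\"${key}\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$config")
        if [ -z "$dir" ]; then
            echo "NOT SET: ${key}"
        elif [ -d "$dir" ]; then
            echo "OK: ${key} -> ${dir}"
        else
            echo "MISSING: ${key} -> ${dir}"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_config_dirs config.json` prints one line per key and returns a non-zero status if a configured directory is missing.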

The path exists now that the directory has been renamed:

❯ ls /var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/
FAQ61681_pass_barcode01_3142826f_0.fast5

The configuration file has been updated to match:

{
        "dataset_name": "barcode01",                   # name of this dataset (will be used in the output file names/tables)
        "ref":          "/home/guangzhoulab-001/GCF_000195955.2_ASM19595v2_genomic.fna",      # the amplicon reference sequence
        "fast5_dir":    "/var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/",      # workspace output dir of guppy that contains basecalled FAST5 files 
        "fastq_dir":    "/var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fastq_pass/barcode01/",                # output dir of guppy that contains fastq.gz files; needed by porechop and can be omitted if no demultiplexing is configured 
        "basecall_grp": "Basecall_1D_001",              # the used basecall group identifier in the FAST5 files
        "logfile": "nanopanel2.log",                    # name of the log file
        "consensus": "mean",                            # used for consensus calculation (only if multiple mappers are configured)
        "mappers": {                                    # configured long-read mappers. Supported types are 'minimap2', 'ngmlr' and 'last'. 
                "mm2" : {							
                        "type": "minimap2"
                        },
                "ngms": {
                        "type": "ngmlr",
                        "additional_param": [ "--no-smallinv", "--no-lowqualitysplit", "-k", "10", "--match", "3", "--mismatch", "-3", "--bin-size", "2", "--kmer-skip", "1" ] # additional runtime parameters for ngmlr
                        },
                "last": {
                        "type": "last"
                        }
                },
        "roi_intervals": ["chr:100-1000"],              # list of genomic intervals in which variant calling will be done 
        "threads":      8,                              # number of CPUs/threads used by np2 and 3rd part tools
        "suppress_snv": [],                             # list of filters; SNV calls filtered by those will not be included in the output VCF (but will still be in the output TSV file)
        "suppress_del": ["AF", "DP"],
        "suppress_ins": ["AF", "DP"],
        "max_h5_cache": 500,                            # maximum number of cached H5 files. Setting this to a number >= the number of input FAST5 will greatly speed up the pipeline (at the cost of memory) 
        "exe": {                                        # this section enables users to link to executables for 3rd party tools. Not needed when running via singularity. Supported sections: 'bgzip', 'samtools', 'porechop', 'minimap2', 'ngmlr', 'lastal', 'last-split', 'maf-convert')
                "ngmlr":    "singularity run $SOFTWARE/SIF/ngmlr_0.2.7.sif"  # in this example, ngmlr is called via an (external) singularity image
        }
}

But it's still not working:

```
❯ singularity run nanopanel2_1.01.sif call -c /home/guangzhoulab-001/nanopanel2-1.01/config.json -o .                             
INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/share/zoneinfo/Etc/UTC required more than 50 (64) bind mounts
Traceback (most recent call last):
  File "/nanopanel2/nanopanel2.py", line 2077, in <module>
    nanopanel2_pipeline(config, outdir)
  File "/nanopanel2/nanopanel2.py", line 2001, in nanopanel2_pipeline
    samples = extract_fastq(config, demux_index, outdir)
  File "/nanopanel2/nanopanel2.py", line 487, in extract_fastq
    for file in os.listdir(config['fast5_dir']):
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/minknow/data/seegene02202107221/seegene02202107221/20210722_1145_MN25814_FAQ61681_404c859e/fast5_pass/barcode01/'
INFO:    Cleaning up image...

```

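The error message means that np2 cannot access the configured directory. np2 runs inside the Singularity container, so a directory that exists on the host is visible to it only if it is bind-mounted into the container. Please refer to the Singularity docs on how to add user-defined bind paths: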
https://sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html
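
As a rough illustration of what the linked page describes, the host data directory can be passed with `--bind` so that it appears at the same path inside the container. This is a sketch rather than a verified fix for this setup: the wrapper name `run_np2` is hypothetical, and the host-side directory check is an added convenience, not part of np2 or Singularity.

```shell
#!/usr/bin/env bash
# Run np2 with a host data directory bind-mounted into the container.
# Usage: run_np2 DATA_DIR SIF_IMAGE CONFIG_JSON
run_np2() {
    local data_dir="$1"
    local sif="$2"
    local config="$3"
    # Fail early if the directory is missing on the host itself.
    if [ ! -d "$data_dir" ]; then
        echo "host directory not found: $data_dir" >&2
        return 1
    fi
    # --bind with a single path mounts it at the same location in the container.
    singularity run --bind "$data_dir" "$sif" call -c "$config" -o .
}
```

For the paths in this thread, that would be `run_np2 /var/lib/minknow/data nanopanel2_1.01.sif /home/guangzhoulab-001/nanopanel2-1.01/config.json`. The config file under the home directory was evidently readable without an explicit bind, which is consistent with Singularity usually mounting the home directory by default while leaving `/var/lib` unbound.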