
Nanopore Basecaller Training

Several folks are interested, so I'm trying to centralize the info here. These are experiments on training the nanopore basecaller: Bonito for now, with Guppy as the eventual target.

Getting Started

Install Bonito

conda update -n base -c defaults conda
conda create -n bonito python=3.8 pip
conda activate --stack bonito
pip install --extra-index-url https://download.pytorch.org/whl/cu116 ont-bonito
pip install -r bonito/requirements.txt

Change cu116 to match your CUDA version (run nvidia-smi to find it; it's shown in the top right of the output).
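
For example (the exact header layout varies a little between driver versions, but the CUDA version is always in it):

# the driver's supported CUDA version is printed in the nvidia-smi header
nvidia-smi | head -n 3
# e.g. "CUDA Version: 11.6" -> use cu116 in the pip index URL above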

Download Models

bonito download --models --show
bonito download --models --all

You can cancel the download once it gets to the training data (I think?). My gpugpu machine has very little hard-drive space, so I did.
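
If you're also short on space, something like this shows how much room the downloaded models take up. The models subdirectory name is my assumption about where bonito keeps them inside the installed package:

# locate the installed bonito package and measure its models directory
du -sh "$(python -c 'import bonito, os; print(os.path.dirname(bonito.__file__))')/models"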

Create a minimap2 index

minimap2 -d index.mmi assembly.fasta
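
One thing to be aware of: minimap2 bakes the k-mer and window sizes into the index, so if you want settings tuned for nanopore reads it may be worth indexing with the map-ont preset. A variation on the above, not what I originally ran:

# build the index with nanopore-tuned k-mer/window parameters baked in
minimap2 -x map-ont -d index.mmi assembly.fasta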

Basecall with --save-ctc

Trying this now

bonito basecaller dna_r9.4.1_e8_sup@v3.3 ~/stonefly/all_fast5/ --batchsize 384 --reference index.mmi --save-ctc --recursive --device "cuda:0" --alignment-threads 16 > basecalled-default-model/basecalls.bam

Did not work.

bonito basecaller dna_r9.4.1_e8_sup@v3.3 ~/stonefly/all_fast5/ --batchsize 384 --reference index.mmi --save-ctc --recursive --device "cuda:0" --alignment-threads 16 | samtools view -S -b - > basecalls.bam
Note
The arguments have to be in the order shown above, or the CTC data will not save! basecaller, model, filepath, THEN options.

The default batch size is 32 (I'm 90% certain). It's best to try increasing it; when it crashes from running out of memory, killall bonito from another shell. I can get a batch size of 384, which cuts a little over half an hour off my dataset (51 GB of fast5 files), and we can work with that number in the following steps.
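
What that tuning loop looks like in practice (a sketch; the watch interval is arbitrary, and killall is what the note above actually uses):

# in a second shell: watch GPU memory while stepping --batchsize up between runs
watch -n 1 nvidia-smi

# if a run hits out-of-memory and wedges, kill it from that shell
killall bonito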

This led to this error:

> reading fast5
> outputting aligned sam
> loading model dna_r9.4.1_e8_sup@v3.3
> loading reference
Exception in thread Thread-27:
Traceback (most recent call last):
  File "/home/josephguhlin/.asdf/installs/python/anaconda3-2021.05/envs/bonito/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/josephguhlin/.asdf/installs/python/anaconda3-2021.05/envs/bonito/lib/python3.8/site-packages/bonito/io.py", line 563, in run
    np.save(os.path.join(output_directory, "chunks.npy"), chunks)
  File "<__array_function__ internals>", line 5, in save
  File "/home/josephguhlin/.asdf/installs/python/anaconda3-2021.05/envs/bonito/lib/python3.8/site-packages/numpy/lib/npyio.py", line 525, in save
    file_ctx = open(file, "wb")
FileNotFoundError: [Errno 2] No such file or directory: '/proc/2487219/fd/chunks.npy'
> completed reads: 2393454 duration: 1:59:37 samples per second 3.3E+06
> done

The problem was samtools, or rather the pipe. Bonito tries to be smart and derive the --save-ctc output directory from where stdout points; when stdout is a pipe instead of a file, that resolves to /proc/<pid>/fd/ (as in the traceback above), which isn't a writable directory. So piping the output to anything other than a plain file breaks --save-ctc.
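
A workaround consistent with that (a sketch; I haven't re-run it this way): write the SAM to a real file so bonito can resolve an actual directory for chunks.npy, then convert to BAM as a second step.

# stdout goes to a plain file, so bonito can resolve a real output directory
bonito basecaller dna_r9.4.1_e8_sup@v3.3 ~/stonefly/all_fast5/ --batchsize 384 --reference index.mmi --save-ctc --recursive --device "cuda:0" --alignment-threads 16 > basecalls.sam

# convert afterwards
samtools view -b basecalls.sam > basecalls.bam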

Update

Crashing, from either memory or a shoddy connection to the server. Copying a subset of the data over and trying again.

This is the type of log I'm getting for training, so a very slight improvement per epoch. Next I'm trying without a pretrained model (training from scratch).
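
For context, the run producing the logs below was kicked off with something like the following. This is a reconstruction, not pulled from my shell history; the epoch count, learning rate, and paths are assumptions:

# fine-tune the pretrained sup model on the chunks written by --save-ctc
bonito train --epochs 12 --lr 5e-4 --pretrained dna_r9.4.1_e8_sup@v3.3 --directory ./ctc-data ./aptera_tuned

For the from-scratch attempt, the same command minus --pretrained should do it (bonito then trains its default model config from random weights).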

[366841/366841]: 100%|######################################################| [1:40:59, loss=0.0226]
[epoch 1] directory=./aptera_tuned loss=0.0231 mean_acc=99.336% median_acc=99.344%
[366841/366841]: 100%|######################################################| [1:40:46, loss=0.0213]
[epoch 2] directory=./aptera_tuned loss=0.0224 mean_acc=99.349% median_acc=99.358%
[366841/366841]: 100%|######################################################| [1:40:53, loss=0.0211]
[epoch 3] directory=./aptera_tuned loss=0.0220 mean_acc=99.355% median_acc=99.364%
[366841/366841]: 100%|######################################################| [1:40:50, loss=0.0209]
[epoch 4] directory=./aptera_tuned loss=0.0217 mean_acc=99.359% median_acc=99.369%
[366841/366841]: 100%|######################################################| [1:40:38, loss=0.0207]
[epoch 5] directory=./aptera_tuned loss=0.0215 mean_acc=99.362% median_acc=99.371%
[366841/366841]: 100%|######################################################| [1:40:26, loss=0.0196]
[epoch 6] directory=./aptera_tuned loss=0.0213 mean_acc=99.366% median_acc=99.372%
[366841/366841]: 100%|######################################################| [1:40:39, loss=0.0195]
[epoch 7] directory=./aptera_tuned loss=0.0212 mean_acc=99.368% median_acc=99.376%
[366841/366841]: 100%|######################################################| [1:40:55, loss=0.0186]
[epoch 8] directory=./aptera_tuned loss=0.0211 mean_acc=99.369% median_acc=99.374%
[366841/366841]: 100%|######################################################| [1:40:20, loss=0.0189]
[epoch 9] directory=./aptera_tuned loss=0.0211 mean_acc=99.371% median_acc=99.375%
[366841/366841]: 100%|######################################################| [1:40:05, loss=0.0182]
[epoch 10] directory=./aptera_tuned loss=0.0210 mean_acc=99.373% median_acc=99.379%
[366841/366841]: 100%|######################################################| [1:39:59, loss=0.0182]
[epoch 11] directory=./aptera_tuned loss=0.0210 mean_acc=99.371% median_acc=99.378%
[63540/366841]:  17%|#########8                                               | [17:20, loss=0.0179]

Added evaluation scripts
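
The scripts are thin wrappers around bonito evaluate; roughly like this (a reconstruction of their contents, with the data directory path assumed):

# evaluate_trained.bash: score the fine-tuned model
# ("loading model 11" below is presumably the epoch-11 weights)
bonito evaluate ./aptera_tuned --directory ./ctc-data

# evaluate_default.bash: same, for the stock model
bonito evaluate dna_r9.4.1_e8_sup@v3.3 --directory ./ctc-data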

❯ bash evaluate_trained.bash
* loading data
[validation set not found: splitting training set]
* loading model 11
* calling
* mean      99.41%
* median    99.42%
* time      1.02
* samples/s 9.78E+06

❯ bash evaluate_default.bash
* loading data
[validation set not found: splitting training set]
* loading model 1
* calling
* mean      99.19%
* median    99.18%
* time      1.03
* samples/s 9.72E+06


So, some small improvement: mean accuracy 99.41% vs 99.19% and median 99.42% vs 99.18% for the tuned model over the default. One caveat: with no separate validation set, both evaluations split the training set, so these numbers may flatter the tuned model a bit.