Unknown error when trying to Train

Question

Unknown error when trying to Train

Closed this issue a year ago · 4 comments

Hi,

For some reason i cannot get the code to work as it is not showing any errors currently. Could you help me figure out what is going on?

i am trying to create a nnue file for a 10x10 variant.

Could it be that the pytorch-lighning and pytorch is incompatible?
Not sure why assertion would fail :(

There are two extra pieces, would i have to code them manually into the code?

sorry if these questions are simple i'm trying my best to learn.

Thank you so much for your dedication, the chess engine world is thankful for all this amazing work

ERROR

(siege) C:\Users\Kosmic\Desktop\variant-nnue-pytorch-master>python train.py --smart-fen-skipping --random-fen-skipping 3 --batch-size 16384 --threads 20 --num-workers 20 --gpus 1 C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil9depth.bin C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil12depth.bin
Feature set: HalfKAv2^
Num real features: 150000
Num virtual features: 1600
Num features: 151600
Training with C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil9depth.bin validating with C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil12depth.bin
Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 3
limiting torch to 20 threads.
Using log dir logs/
C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:487: LightningDeprecationWarning: Argument period in ModelCheckpoint is deprecated in v1.3 and will be removed in v1.5. Please use every_n_epochs instead.
rank_zero_deprecation(
C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:432: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
rank_zero_warn(
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Using c++ data loader
Assertion failed: bits <= 6, file C:/Users/Kosmic/Desktop/variant-nnue-pytorch-master/lib/nnue_trainLOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
ing_datRanger optimizer loaded.
Gradient Centralization usage = False
a_formats.h, line 662
Assertion failed: bits <= 6, file C:/Users/Kosmic/Desktop/variant-nnue-pytorch-master/lib/nnue_training_data_formats.h, line 662

| Name | Type | Params

0 | input | DoubleFeatureTransformerSlice | 78.8 M
1 | layer_stacks | LayerStacks | 152 K

79.0 M Trainable params
0 Non-trainable params
79.0 M Total params
315.939 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\trainer\data_loading.py:105: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument(try 20 which is the number of cpus on this machine) in theDataLoader` init to improve performance.
rank_zero_warn(
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]
(siege) C:\Users\Kosmic\Desktop\variant-nnue-pytorch-master>

env Package!

absl-py 1.4.0 pypi_0 pypi
aiohttp 3.8.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
annotated-types 0.5.0 pypi_0 pypi
ansicon 1.89.0 pypi_0 pypi
anyio 3.7.1 pypi_0 pypi
arrow 1.2.3 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
backoff 2.2.1 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
bzip2 1.0.8 he774522_0
ca-certificates 2023.7.22 h56e8100_0 conda-forge
cachetools 5.3.1 pypi_0 pypi
certifi 2022.12.7 pypi_0 pypi
charset-normalizer 2.1.1 pypi_0 pypi
click 8.1.7 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
contourpy 1.1.0 pypi_0 pypi
croniter 1.4.1 pypi_0 pypi
cuda-version 11.8 h70ddcb2_2 conda-forge
cudatoolkit 11.8.0 h09e9e62_12 conda-forge
cupy 12.2.0 py311h77068d7_0 conda-forge
cycler 0.11.0 pypi_0 pypi
dateutils 0.6.12 pypi_0 pypi
deepdiff 6.3.1 pypi_0 pypi
fastapi 0.103.0 pypi_0 pypi
fastrlock 0.8.2 py311h12c1d0e_0 conda-forge
filelock 3.9.0 pypi_0 pypi
fonttools 4.42.1 pypi_0 pypi
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
future 0.18.3 pypi_0 pypi
google-auth 2.22.0 pypi_0 pypi
google-auth-oauthlib 1.0.0 pypi_0 pypi
grpcio 1.57.0 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
inquirer 3.1.3 pypi_0 pypi
intel-openmp 2023.2.0 h57928b3_49496 conda-forge
itsdangerous 2.1.2 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jinxed 1.2.0 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
libblas 3.9.0 17_win64_mkl conda-forge
libcblas 3.9.0 17_win64_mkl conda-forge
libffi 3.4.4 hd77b12b_0
libhwloc 2.9.1 h51c2c0f_0 conda-forge
libiconv 1.17 h8ffe710_0 conda-forge
liblapack 3.9.0 17_win64_mkl conda-forge
libxml2 2.10.4 h0ad7f3c_1
lightning 2.0.7 pypi_0 pypi
lightning-cloud 0.5.37 pypi_0 pypi
lightning-utilities 0.9.0 pypi_0 pypi
markdown 3.4.4 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
matplotlib 3.7.2 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2022.1.0 h6a75c08_874 conda-forge
mpmath 1.2.1 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
networkx 3.0 pypi_0 pypi
numpy 1.24.1 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
openssl 3.1.2 hcfcfb64_0 conda-forge
ordered-set 4.1.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pillow 9.3.0 pypi_0 pypi
pip 23.2.1 py311haa95532_0
protobuf 4.24.2 pypi_0 pypi
psutil 5.9.5 pypi_0 pypi
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
pyasn1 0.5.0 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
pydantic 2.1.1 pypi_0 pypi
pydantic-core 2.4.0 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pygments 2.16.1 pypi_0 pypi
pyjwt 2.8.0 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
python 3.11.4 he1021f5_0
python-chess 0.31.4 pypi_0 pypi
python-dateutil 2.8.2 pypi_0 pypi
python-editor 1.0.4 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
python_abi 3.11 2_cp311 conda-forge
pytorch-lightning 1.4.9 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readchar 4.0.5 pypi_0 pypi
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rich 13.5.2 pypi_0 pypi
rsa 4.9 pypi_0 pypi
setuptools 68.0.0 py311haa95532_0
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.4.1 pypi_0 pypi
sqlite 3.41.2 h2bbff1b_0
starlette 0.27.0 pypi_0 pypi
starsessions 1.3.0 pypi_0 pypi
sympy 1.11.1 pypi_0 pypi
tbb 2021.9.0 h91493d7_0 conda-forge
tensorboard 2.14.0 pypi_0 pypi
tensorboard-data-server 0.7.1 pypi_0 pypi
tk 8.6.12 h2bbff1b_0
torch 2.0.1+cu118 pypi_0 pypi
torchaudio 2.0.2+cu118 pypi_0 pypi
torchmetrics 0.7.0 pypi_0 pypi
torchvision 0.15.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.9.0 pypi_0 pypi
typing-extensions 4.7.1 pypi_0 pypi
tzdata 2023c h04d1e81_0
ucrt 10.0.22621.0 h57928b3_0 conda-forge
urllib3 1.26.13 pypi_0 pypi
uvicorn 0.23.2 pypi_0 pypi
vc 14.2 h21ff451_1
vc14_runtime 14.36.32532 hfdfe4a8_17 conda-forge
vs2015_runtime 14.36.32532 h05e6639_17 conda-forge
wcwidth 0.2.6 pypi_0 pypi
websocket-client 1.6.2 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 2.3.7 pypi_0 pypi
wheel 0.38.4 py311haa95532_0
xz 5.4.2 h8cc25b3_0
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h8cc25b3_0

Answer 1 · 2023-08-28T07:50:18.000Z

https://github.com/fairy-stockfish/variant-nnue-pytorch/wiki/FAQ#assertion-failed-bits--6

Answer 2 · 2023-08-28T16:00:16.000Z

Hi, Thank you so much for the quick reply! You are amazing. It has started training.

But If i may i would like to ask something i am confused about.

So to generate training data we can either use the classical Eval, or nnue. And it is recommended to use Nnue for better training data evaluations.

But if there is no nnue for the custom variant, i would have to use the classical eval.
Then after that train with the data generated with pytorch nnue to create a nnue file.

And if i use the nnue eval file back to generate more training data with the nnue, and do the whole process again.
Wouldn't that be a perpetual cycle that doesnt improve on itself? since it is using the data from the classival eval file, and not a purely generated data from nnue playing.

Thank you so much for your time.
<3

Answer 3 · 2023-08-28T20:20:56.000Z

Yes, in principle a loop of "take best eval -> generate training data -> train -> get better eval" is working and that basically is what we are doing. There are diminishing returns though, so progress slows down and eventually stops.

Answer 4 · 2023-08-29T04:56:58.000Z

Thank you so much!

| Name | Type | Params

0 | input | DoubleFeatureTransformerSlice | 78.8 M 1 | layer_stacks | LayerStacks | 152 K

0 | input | DoubleFeatureTransformerSlice | 78.8 M
1 | layer_stacks | LayerStacks | 152 K