ThilinaRajapakse/BERT_binary_text_classification

No progress running converter.py

ChrisPalmerNZ opened this issue · 18 comments

I am running converter.py as there didn't seem to be any progress running the process in the Notebook. I set the process count to cpu_count() - 1, which is 7 processes. However, even after 6 hours I see no progress in the tqdm bar, even though the CPU and memory are definitely under load. Windows 10, i7-4770 CPU @ 3.40GHz, 16GB RAM. This is what I see in the IPython console of Spyder, from where I am running:
image

Strange. Can you check whether it works if you don't use multiprocessing?

train_features = [convert_examples_to_features.convert_example_to_feature(example) for example in train_examples_for_processing]

I ran it as you suggested and it completed in less than half an hour.

I find that the result is a 560,000 long array full of objects such as <convert_examples_to_features.InputFeatures at 0x232d4105a58> - is that expected?

BTW I had also tried running it with a single process via Pool for almost six hours, but again, no progress. Should I expect to see the progress bar move reasonably quickly? I wasn't confident about how quickly it would move which is why I left it for so long.

You definitely should be seeing some progress. While Ryzen will have the advantage due to more cores, the time taken to tokenize one example should be relatively comparable. It definitely should not take over six hours for one example!
I would suggest running it on about 10 examples. It really shouldn't take more than a few minutes (if that) for such a small number.
Also, try adding print statements inside the convert_example_to_feature function to see whether it's being called and where it hangs (something like the sketch below).
If you are using a custom dataset, try using the yelp one to check whether it's a data issue.
You could also try running the script directly in the terminal instead of Spyder. It's a long shot but weirder shit has happened!
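
Something like this rough sketch, using the names from the converter script (train_examples_for_processing is assumed to exist from the earlier cells; the 10-example slice and the prints are just for debugging):

from tqdm import tqdm
import convert_examples_to_features

# Tiny slice so a single-process run finishes in seconds rather than hours.
sample = train_examples_for_processing[:10]

features = []
for i, example in enumerate(tqdm(sample, desc="Converting")):
    print(f"Starting example {i}")   # shows whether the call is reached at all
    features.append(convert_examples_to_features.convert_example_to_feature(example))
    print(f"Finished example {i}")   # shows whether it returns or hangs inside

print(f"Converted {len(features)} examples")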

Oh, you answered after I made an edit to show that it's now completed! I just want to confirm that the data is as expected. Also, I note that when the converter opens the file to save to, it should include the .pkl extension.

I have another question - the use of the names 'config.json' and 'pytorch_model.bin' - my actual files which I downloaded directly from the huggingface site are bert-base-cased-config.json and bert-base-cased-pytorch_model.bin - should I use those names instead?

Yes, that is how the features should look.

The Pool issue is strange. I'll check it on my end later to see whether I can reproduce it. For now, please use the list comprehension method.
Good catch on the missing .pkl. I should have been more careful with the converter script, it seems.
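
For reference, a minimal sketch of saving the converted features with the .pkl extension (the path and filename here are illustrative - use whatever location the notebook loads from):

import pickle

# train_features is the list produced by the conversion step above.
with open("data/train_features.pkl", "wb") as f:
    pickle.dump(train_features, f)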

For the names, I suggest just keeping them as is. If you get an error while loading the model, change them to what I've used. If that still causes issues, you can just let the code download the pretrained model automatically. It's only a few hundred MBs if memory serves.

Thanks - re saving the pickle file, the notebook expects to find it in the data subdirectory, so you might want to fix that also. Re the names, are you saying I can try using the full names, or that I should just leave them as is?

BTW, even though I placed these files into the cache subdirectory the files are still getting downloaded directly from the huggingface site - so now my cache directory looks like this:
image

If you pass the name of a default Bert model to the model loading function, it will download it from the cloud.
If you want to use a model that is available locally (pretrained or fine tuned), you need to compress the config file and the bin file into a tar.gz archive and pass the path to the archive to the loading function.
E.g.: 'cache/my_bert.tar.gz'

I think the loading function is from_pretrained() but I'm not at my computer to check and give the exact names, sorry.
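
A rough sketch of what I mean, assuming the pytorch-pretrained-bert style from_pretrained() - the paths are illustrative, and depending on the library version the config inside the archive may need to be named bert_config.json rather than config.json:

import tarfile
from pytorch_pretrained_bert import BertForSequenceClassification

# Package the local config + weights into a tar.gz archive.
with tarfile.open("cache/my_bert.tar.gz", "w:gz") as archive:
    archive.add("cache/config.json", arcname="config.json")
    archive.add("cache/pytorch_model.bin", arcname="pytorch_model.bin")

# Passing a path instead of a name like 'bert-base-cased' loads the local files
# rather than downloading from the cloud.
model = BertForSequenceClassification.from_pretrained("cache/my_bert.tar.gz", num_labels=2)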

OK, thanks - I will experiment with this, but as the downloads are not big I am also OK with the default behaviour...

That works. But you'll need to do it for your fine tuned models.

How long do you expect the training to take? I had to reduce batch size to 16 to fit on my GTX 1080, and I've been running it for an hour now. I am getting a display of losses but no progress on the epoch...
image

It depends on the size of the dataset. For me, the yelp dataset took a little over 3 hours on an RTX 2080. There's a progress bar for each epoch that shows the approximate time remaining.

It's the Yelp dataset. So I'll wait for a few hours, but again the lack of progress showing on the progress bar is disconcerting...

Just found that my tqdm isn't working correctly - it's version 4.32.1 - I will investigate it
image

Oh yeah, I put progress bars on anything that takes longer than a few seconds! If tqdm isn't working properly, you could modify the line that prints the loss to print out the step as well.

Actually, I found that tqdm is working, but tqdm_notebook isn't.
e.g. tqdm at 500:
image
But tqdm_notebook does this at the start:
image
And then by iteration 9836 of my 10,000 it has just locked up my notebook, and in the meantime no progress bar has printed at all. So, it looks like tqdm_notebook is causing a hang...
image

I'm not sure how it is in Spyder, but tqdm_notebook is for Jupyter notebooks and tqdm works for scripts.
That's some weird behaviour with tqdm though. I wonder if it's due to Spyder.
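
If it helps, a minimal sketch of swapping tqdm_notebook for tqdm.auto, which picks the notebook or console frontend automatically (available in recent tqdm versions; the loop body is just a stand-in for the training loop, and the commented print is the step/loss fallback mentioned earlier):

from tqdm.auto import tqdm  # resolves to tqdm_notebook in Jupyter, plain tqdm elsewhere

for step in tqdm(range(10000), desc="Iteration"):
    pass  # training step goes here
    # Fallback if the bar still doesn't render:
    # if step % 500 == 0:
    #     print(f"Step {step} - loss: {loss.item():.4f}")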

It was in the Notebook that I was having the problems. I went to Spyder as a backup.

Thanks for all your help on this. I will try to sort out tqdm and Jupyter Notebook, but so far no luck. The model took 3 hrs 46 min to run, and I got a tqdm progress bar at the very end. It definitely is not a problem with the BERT process...

I've had similar issues with the tqdm progress bar not showing up. It's very hard to trace the issue. I think I ended up just making a new conda environment.

My thoughts too...