WikiExtractor.py - No such file or directory
Tomas0413 opened this issue · 6 comments
@yoheikikuta thanks for creating this repository and sharing the instructions on how to train BERT with Japanese wiki data! I'm trying to reproduce everything from scratch, but I can't find the WikiExtractor.py file.
python3 src/data-download-and-extract.py
100.0% 2906087424 / 2906079739
python3: can't open file '/data/bert-japanese/src/../../wikiextractor/WikiExtractor.py': [Errno 2] No such file or directory
Thank you very much!
I'm going to try the WikiExtractor.py from here:
https://github.com/attardi/wikiextractor
Maybe it's what you used?
@Tomas0413
Yes, this repository uses the wikiextractor you mentioned.
This wikiextractor is installed during the docker build.
Did you properly run the script in a docker container?
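If you prefer to run this step manually outside Docker, a rough sketch would be the following (the dump filename is illustrative; the clone location matches the `../../wikiextractor` relative path in the error message, and the Dockerfile is the authoritative setup):

```shell
# Clone the extractor one level above the bert-japanese checkout,
# so src/../../wikiextractor/WikiExtractor.py resolves.
git clone https://github.com/attardi/wikiextractor.git

# Extract plain text from the downloaded dump (filename is illustrative).
python3 wikiextractor/WikiExtractor.py \
    --output extracted \
    --bytes 100M \
    jawiki-latest-pages-articles.xml.bz2
```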
Hi, @yoheikikuta, thanks for the response!
I should have looked at the Dockerfile indeed. Anyway, I was able to download the data and extract it, so this step worked fine. Now I'll move on to SentencePiece. I initially tried running it in a small VM with 8GB of RAM and ran out of memory, so I'll switch to the instance type you used: n1-standard-8.
n1-standard-8 (8 CPUs, 30 GB memory)
You need to increase the instance's memory to avoid the memory error.
I'm sure 30 GB is enough.
Yep, SentencePiece on 30GB RAM finished without problems.
I'll close this issue.
Please open another issue if you have any other problems.