yoheikikuta/bert-japanese

WikiExtractor.py - No such file or directory

Tomas0413 opened this issue · 6 comments

@yoheikikuta thanks for creating this repository and sharing the instructions on how to train BERT with Japanese wiki data! I'm trying to reproduce everything from scratch, but I can't find the WikiExtractor.py file.

python3 src/data-download-and-extract.py

100.0% 2906087424 / 2906079739
python3: can't open file '/data/bert-japanese/src/../../wikiextractor/WikiExtractor.py': [Errno 2] No such file or directory

Thank you very much!

I'm going to try a WikiExtractor.py from here:

https://github.com/attardi/wikiextractor

Maybe it's what you used?
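For anyone running into the same error outside of Docker, a quick pre-flight check can confirm whether wikiextractor is where the download script expects it. This is only a sketch: the relative path below is taken from the error message above and may differ in your layout.

```shell
# Check for WikiExtractor.py at the path the download script expects
# (path derived from the error message; adjust for your checkout layout).
EXTRACTOR="../wikiextractor/WikiExtractor.py"
if [ -f "$EXTRACTOR" ]; then
    status="found"
else
    # clone https://github.com/attardi/wikiextractor next to the repo
    status="missing"
fi
echo "WikiExtractor.py: $status"
```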

@Tomas0413
Yes, this repository uses the wikiextractor you mentioned.
It is installed during the docker build.
Did you run the script inside the docker container?

Hi, @yoheikikuta, thanks for the response!

I should have looked at the Dockerfile, indeed. Anyway, I was able to download and extract the data, so this step worked fine. Now I'll move on to SentencePiece. I initially tried running it in a small VM with 8 GB of RAM and ran out of memory, so I'll switch to what you used: n1-standard-8.

@Tomas0413

n1-standard-8 (8 CPUs, 30 GB memory)

You need to increase the instance's memory to avoid the out-of-memory error.
I'm sure 30 GB is enough.

Yep, SentencePiece on 30 GB of RAM finished without problems.

I'll close this issue.
Please open another issue if you have any other problems.